Utilities for working with parsed XML trees
14 Nov 2015Groovy’s XML parser is a convenient way to get an in-memory parse tree from an XML source like a file or a string of XML data, but there are a few recursive tasks I find myself constantly writing yet one more time. I’ve released the first version of a package that abstracts one of the most common: extracting all text content from a tree or subtree of the parsed document.
It normalizes white space outside of text nodes, so that extracting the text from a sequence like this
<div>
<l n="1">I met a traveller from an antique land</l>
<l n="2">Who said ... </l>
</div>
yields
I met a traveller from an antique land Who said ...
rather than
I met a traveller from an antique landWho said ...
It also provides a configuration mechanism to suppress white space in specified markup context. For example, a markup sequence like this:
<w>Part<unclear>iall</unclear>y<w>
would by default produce
Part iall y
but if you define w
elements as “tokenizing markup,” extracting text from the same passage yields
Partially
See the project webpage for more information.