Utilities for working with parsed XML trees

14 Nov 2015

Groovy’s XML parser is a convenient way to get an in-memory parse tree from an XML source like a file or a string of XML data, but there are a few recursive tasks that I find myself writing over and over. I’ve released the first version of a package that abstracts one of the most common: extracting all text content from a tree or subtree of the parsed document.

It normalizes white space outside of text nodes, so that extracting the text from a sequence like this

<div>
    <l n="1">I met a traveller from an antique land</l>
    <l n="2">Who said ... </l>
</div>

yields

I met a traveller from an antique land Who said ...

rather than

I met a traveller from an antique landWho said ...
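The idea can be sketched in a few lines. The example below is only an illustration in Java using the JDK's DOM parser, not Groovy's parser and not the package's actual API; the class and method names (`TextCollector`, `extractText`) are made up for this sketch. It walks the tree recursively, drops text nodes that contain only white space, and joins whatever remains with a single space.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not the package's API.
class TextCollector {

    // Gather the non-blank text below `node`, in document order.
    static void collect(Node node, List<String> chunks) {
        if (node instanceof Text) {
            // Trim each text node and collapse internal runs of white space.
            String text = node.getNodeValue().trim().replaceAll("\\s+", " ");
            if (!text.isEmpty()) {
                chunks.add(text);
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), chunks);
        }
    }

    // Parse a string of XML and return its text content, with one space
    // between the chunks that came from separate text nodes.
    static String extractText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> chunks = new ArrayList<>();
        collect(doc.getDocumentElement(), chunks);
        return String.join(" ", chunks);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<div>\n"
                + "    <l n=\"1\">I met a traveller from an antique land</l>\n"
                + "    <l n=\"2\">Who said ... </l>\n"
                + "</div>";
        // prints: I met a traveller from an antique land Who said ...
        System.out.println(extractText(xml));
    }
}
```

Joining the collected chunks with a space, rather than concatenating them, is what keeps "land" and "Who" apart.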

The package also provides a configuration mechanism to suppress white space in specified markup contexts. For example, a markup sequence like this:

<w>Part<unclear>iall</unclear>y</w>

would by default produce

Part iall y

but if you define w elements as “tokenizing markup,” extracting text from the same passage yields

Partially
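One way to picture the difference is the following sketch, again in Java with the JDK's DOM parser rather than the package's own code. The names `TokenizingTextCollector` and `extractText`, and the idea of passing the tokenizing element names as a set, are assumptions made for the example; the real configuration mechanism may look quite different. Text found anywhere inside a tokenizing element is concatenated into a single chunk, and chunks are then joined with spaces as usual.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not the package's API.
class TokenizingTextCollector {

    static void collect(Node node, Set<String> tokenizing, List<String> chunks) {
        if (node instanceof Text) {
            String text = node.getNodeValue().trim().replaceAll("\\s+", " ");
            if (!text.isEmpty()) {
                chunks.add(text);
            }
        } else if (node.getNodeType() == Node.ELEMENT_NODE
                && tokenizing.contains(node.getNodeName())) {
            // A tokenizing element contributes exactly one chunk: all of the
            // text beneath it, concatenated with no white space added.
            List<String> inner = new ArrayList<>();
            collectChildren(node, Collections.<String>emptySet(), inner);
            if (!inner.isEmpty()) {
                chunks.add(String.join("", inner));
            }
        } else {
            collectChildren(node, tokenizing, chunks);
        }
    }

    static void collectChildren(Node node, Set<String> tokenizing, List<String> chunks) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), tokenizing, chunks);
        }
    }

    // `tokenizing` holds the names of elements treated as tokenizing markup.
    static String extractText(String xml, Set<String> tokenizing) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> chunks = new ArrayList<>();
        collect(doc.getDocumentElement(), tokenizing, chunks);
        return String.join(" ", chunks);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w>Part<unclear>iall</unclear>y</w>";
        // prints: Part iall y
        System.out.println(extractText(xml, Collections.<String>emptySet()));
        // prints: Partially
        System.out.println(extractText(xml, Collections.singleton("w")));
    }
}
```

Treating the whole tokenizing element as one chunk means a word wrapped in `w` still gets a space between it and its neighbors, while the markup inside the word no longer splits it.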

See the project webpage for more information.

Neel Smith on github Openly available work in digital classics
