Neel Smith on github Openly available work in digital classics

Utilities for working with parsed XML trees

Groovy’s XML parser is a convenient way to get an in-memory parse tree from an XML source like a file or a string of XML data, but there are a few recursive tasks I find myself constantly writing yet one more time. I’ve released the first version of a package that abstracts one of the most common: extracting all text content from a tree or subtree of the parsed document.

It normalizes white space outside of text nodes, so that extracting the text from a sequence like this

<div>
    <l n="1">I met a traveller from an antique land</l>
    <l n="2">Who said ... </l>
</div>

yields

I met a traveller from an antique land Who said ...

rather than

I met a traveller from an antique landWho said ...

It also provides a configuration mechanism to suppress white space in specified markup context. For example, a markup sequence like this:

<w>Part<unclear>iall</unclear>y<w>

would by default produce

Part iall y

but if you define w elements as “tokenizing markup,” extracting text from the same passage yields

Partially

See the project webpage for more information.