One common task is to collect all text content for a tree or subtree: that is, all text contained in a node and in any of the nodes recursively contained by it. This requires collecting the String value of all contained text nodes in document order, while taking account of XML's treatment of white space outside of text nodes as not significant. The convention of the XmlNode class is to reduce spaces between elements to a single space character.
In this well-formed XML fragment, the first part of the text content is a direct child of the root element, and the text continues in a nested subelement.
<l n="1">Sing, goddess, the rage of <persName n="urn:cite:hmt:pers.pers1">Achilles</persName></l>
When we extract the text from this node, we correctly get the single continuous string Sing, goddess, the rage of Achilles
Extra spaces in the XML document are regularized to a single white space character. In the following well-formed XML fragment, the text contents are identical to the preceding example, but white space is used differently in the markup:
<l n="1">Sing, goddess, the rage of <persName n="urn:cite:hmt:pers.pers1" >Achilles</persName> </l>
When we extract the text contents from this node, we correctly get an identical string Sing, goddess, the rage of Achilles
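The white space handling described above can be approximated outside XmlNode. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree; it is not the XmlNode implementation, and the function name collect_text is a hypothetical stand-in for collectText().

```python
import xml.etree.ElementTree as ET

def collect_text(xml_string):
    """Join all text nodes in document order, regularizing white space."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text node, including text that follows
    # child elements, in document order.
    joined = " ".join(root.itertext())
    # split() with no arguments trims the ends and treats any run of
    # white space as a single separator.
    return " ".join(joined.split())

compact = ('<l n="1">Sing, goddess, the rage of '
           '<persName n="urn:cite:hmt:pers.pers1">Achilles</persName></l>')
spaced = ('<l n="1">Sing, goddess, the rage of '
          '<persName n="urn:cite:hmt:pers.pers1" >Achilles</persName> </l>')

print(collect_text(compact))
print(collect_text(spaced))
```

Under these assumptions, both fragments produce the same string, because the extra white space in the second one is collapsed before the pieces are rejoined.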
This example illustrates how the collectText() method separates strings extracted from distinct elements with a white space. This markup is a possible encoding of (the beginning of) Shelley's Ozymandias. There is no white space at the end of the first poetic line and none at the beginning of the second line.
<div><l n="1">I met a traveller from an antique land</l><l n="2">Who said </l></div>
collectText() correctly separates the last word of line 1 from the first word of line 2:
I met a traveller from an antique land Who said
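To see why the separator matters, the sketch below contrasts plain concatenation of text nodes with a space-separated join. It uses Python's standard xml.etree.ElementTree as an illustration only; the variable names are hypothetical and this is not how XmlNode is implemented.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring(
    '<div><l n="1">I met a traveller from an antique land</l>'
    '<l n="2">Who said </l></div>'
)

# Plain concatenation runs the two poetic lines together ("landWho").
naive = "".join(div.itertext()).strip()

# Joining with a space, then collapsing runs of white space, keeps the
# words from distinct elements apart.
separated = " ".join(" ".join(div.itertext()).split())

print(naive)
print(separated)
```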
Sometimes we want to override the default behavior of separating text content of adjacent elements. The XmlNode class supports optionally defining markup conventions that cluster contained content into a single token with all white space removed. Tokenizing markup may be defined by element name, by attribute name, by a combination of attribute name and value, or by a combination of element name, attribute name and value.
Editors of historical documents might need to identify unclear sections, as illustrated in this example:
<l>Sing, <w>god<unclear>dess</unclear></w></l>
Extracting text content with default settings will yield the string Sing, god dess
If we define the w element as containing a single token, we instead get the string Sing, goddess
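Tokenizing by element name can be sketched as a recursive walk that treats named elements specially. This is a hedged illustration in Python with xml.etree.ElementTree; the function collect_text and its token_tags parameter are assumptions made for the example, not part of XmlNode.

```python
import xml.etree.ElementTree as ET

def collect_text(el, token_tags=frozenset()):
    """Collect text in document order. Elements named in token_tags
    contribute their content as one token with all white space removed."""
    if el.tag in token_tags:
        return "".join("".join(el.itertext()).split())
    pieces = [el.text or ""]
    for child in el:
        pieces.append(collect_text(child, token_tags))
        pieces.append(child.tail or "")  # text following the child
    # Separate pieces with a space, then collapse runs of white space.
    return " ".join(" ".join(pieces).split())

line = ET.fromstring('<l>Sing, <w>god<unclear>dess</unclear></w></l>')

default_text = collect_text(line)
tokenized_text = collect_text(line, token_tags={"w"})
print(default_text)
print(tokenized_text)
```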
We can configure XmlNode to recognize tokenizing markup by attribute name. In the following example, the w element has an ana attribute.
<l>Sing, <w ana="token">god<unclear>dess</unclear></w></l>
If we extract the text with default settings, we again get the string Sing, god dess
If we define the ana attribute as identifying a tokenizing element, we instead get the string Sing, goddess
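The same idea can be keyed on the presence of an attribute rather than on the element name. The Python sketch below is an approximation built on xml.etree.ElementTree; collect_text and its token_attr parameter are hypothetical names, not the XmlNode configuration API.

```python
import xml.etree.ElementTree as ET

def collect_text(el, token_attr=None):
    """Collect text in document order. Any element carrying the attribute
    named token_attr is treated as a single token."""
    if token_attr is not None and token_attr in el.attrib:
        return "".join("".join(el.itertext()).split())
    pieces = [el.text or ""]
    for child in el:
        pieces.append(collect_text(child, token_attr))
        pieces.append(child.tail or "")  # text following the child
    return " ".join(" ".join(pieces).split())

line = ET.fromstring(
    '<l>Sing, <w ana="token">god<unclear>dess</unclear></w></l>'
)

default_text = collect_text(line)
tokenized_text = collect_text(line, token_attr="ana")
print(default_text)
print(tokenized_text)
```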
The following snippet has markup identifying a punctuation character, and uses the ana attribute with two different values.
<l><w ana="multitoken">Sing<punct>,</punct></w> <w ana="token">god<unclear>dess</unclear></w></l>
If we extract the text with default settings, we get the string Sing , god dess
We can configure XmlNode to recognize tokenizing markup by a combination of attribute name and value. Using the markup from the preceding example, we could set the ana attribute to mark tokenizing elements only when it has the value token. When we collect the text from this sample, the contents of the element whose ana attribute is multitoken follow the default behavior of separating subelements, while the contents of the element whose ana attribute is token are grouped in a single string with no white space, producing the string Sing , goddess
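Matching on both attribute name and value only changes the test that decides whether an element is a token. The following Python sketch, based on xml.etree.ElementTree, assumes a hypothetical collect_text function whose token_attr parameter is a (name, value) pair; it does not reflect the actual XmlNode settings.

```python
import xml.etree.ElementTree as ET

def collect_text(el, token_attr=None):
    """Collect text in document order. token_attr is an optional
    (name, value) pair; only elements with that exact attribute value
    are treated as single tokens."""
    if token_attr is not None and el.get(token_attr[0]) == token_attr[1]:
        return "".join("".join(el.itertext()).split())
    pieces = [el.text or ""]
    for child in el:
        pieces.append(collect_text(child, token_attr))
        pieces.append(child.tail or "")  # text following the child
    return " ".join(" ".join(pieces).split())

line = ET.fromstring(
    '<l><w ana="multitoken">Sing<punct>,</punct></w> '
    '<w ana="token">god<unclear>dess</unclear></w></l>'
)

default_text = collect_text(line)
tokenized_text = collect_text(line, token_attr=("ana", "token"))
print(default_text)
print(tokenized_text)
```

The element with ana="multitoken" does not match the pair, so its subelements are still separated by a space.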
In this example, we have an ana attribute with value token on two different elements.
<l><seg ana="token">Sing<punct>,</punct></seg> <w ana="token">god<unclear>dess</unclear></w></l>
Extracting the text with default settings again yields the string Sing , god dess
Now we add the restriction that the token value on ana only identifies tokenizing markup when it appears on the element w. The contents of the seg element will be broken up using the default rules, while the contents of the w element will be grouped as a single token, producing the string Sing , goddess
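The most restrictive convention combines element name, attribute name and attribute value. One way to sketch this in Python (with xml.etree.ElementTree) is to pass the decision in as a predicate; collect_text, is_token and the two example predicates are all hypothetical names chosen for this illustration, not XmlNode's own interface.

```python
import xml.etree.ElementTree as ET

def collect_text(el, is_token=lambda e: False):
    """Collect text in document order. Elements for which is_token(el)
    is true contribute their content as one token without white space."""
    if is_token(el):
        return "".join("".join(el.itertext()).split())
    pieces = [el.text or ""]
    for child in el:
        pieces.append(collect_text(child, is_token))
        pieces.append(child.tail or "")  # text following the child
    return " ".join(" ".join(pieces).split())

line = ET.fromstring(
    '<l><seg ana="token">Sing<punct>,</punct></seg> '
    '<w ana="token">god<unclear>dess</unclear></w></l>'
)

# Attribute name and value only: both seg and w are tokens.
by_value = collect_text(line, lambda e: e.get("ana") == "token")

# Element name, attribute name and value: only w is a token.
by_element = collect_text(
    line, lambda e: e.tag == "w" and e.get("ana") == "token"
)
print(by_value)
print(by_element)
```

Comparing the two results shows the effect of the added element restriction: without it, the comma is fused to the preceding word.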