Greek, sortable, and in XML
16 Nov 2015
The 2.x series of the greeklang library adds a GreekNode class for working with XML data whose text nodes contain valid GreekString values. It extends the XmlNode class in this library (introduced in an earlier post), but verifies that the contents of all text nodes are valid GreekString values. It also offers methods for transcoding the text contents of an XML string to either of the two representations defined by the GreekString class, as illustrated in the following snippet:
String asciiSrc = """
<div n="1">
<l n="1">mh=nin a)/eide qea\\ <persName n="urn:cite:hmt:pers.pers1">phlhi+a/dew a)xilh=os</persName>
</l>
</div>
"""
// Boolean flag means text nodes are NOT in Unicode mapping of Greek:
GreekNode gn = new GreekNode(asciiSrc, false)
// But we can get the same XML in Unicode:
String transcoded = gn.getUnicodeXml()
The value of transcoded will be:
<div n="1">
<l n="1">μῆνιν ἄειδε θεὰ <persName n="urn:cite:hmt:pers.pers1">πηληϊάδεω ἀχιλῆος</persName>
</l>
</div>
In the 2.x release series, the GreekString class implements Java’s Comparable interface, so GreekStrings can be compared and sorted using the logic of the Greek alphabet. The following assertion holds even though “q” would sort after “k” in a plain String, because theta (“q” in the ASCII mapping) precedes kappa (“k”) in the Greek alphabet:
GreekString asciiTheta = new GreekString ("e)/qhke")
GreekString asciiKappa = new GreekString ("e)/kruye")
assert asciiKappa > asciiTheta
Since we’re comparing GreekString objects, it doesn’t matter how we initialize them:
GreekString asciiTheta = new GreekString("e)/qhke")
GreekString uniKappa = new GreekString("ἔκρυψε", true)
// Kappa still sorts after theta, whichever encoding each value came from:
assert uniKappa > asciiTheta
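To make the idea of alphabet-based comparison concrete, here is a minimal, self-contained sketch in plain Java. It is not the greeklang implementation: the class BetaOrderSketch and its helpers lettersOnly and compareBeta are hypothetical names invented for this illustration. It assumes the common beta-code letter assignments (q = theta, c = xi, f = phi, x = chi, y = psi, w = omega) and simply ignores accents and breathings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BetaOrderSketch {
    // The 24 Greek letters in alphabetical order, written in beta code
    // (assumption: q = theta, c = xi, f = phi, x = chi, y = psi, w = omega).
    static final String ALPHABET = "abgdezhqiklmncoprstufxyw";

    // Hypothetical helper: keep only letters, dropping accents,
    // breathings and any other non-letter characters.
    static String lettersOnly(String beta) {
        StringBuilder sb = new StringBuilder();
        for (char c : beta.toLowerCase().toCharArray()) {
            if (ALPHABET.indexOf(c) >= 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Compare two beta-code words letter by letter, by position in the
    // Greek alphabet rather than by ASCII value.
    static int compareBeta(String a, String b) {
        String x = lettersOnly(a);
        String y = lettersOnly(b);
        int n = Math.min(x.length(), y.length());
        for (int i = 0; i < n; i++) {
            int diff = ALPHABET.indexOf(x.charAt(i)) - ALPHABET.indexOf(y.charAt(i));
            if (diff != 0) {
                return diff;
            }
        }
        // A word that is a prefix of another sorts first.
        return x.length() - y.length();
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(Arrays.asList("e)/kruye", "e)/qhke", "a)/eide"));
        words.sort(BetaOrderSketch::compareBeta);
        System.out.println(words); // [a)/eide, e)/qhke, e)/kruye]
    }
}
```

A real implementation would also need a policy for ties between words that differ only in accentuation, which this sketch treats as equal.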
Sortable GreekStrings and simple extraction of text from XML source data complement each other handily for tasks like indexing a text. Consider the following snippet, which collects all the text from an XML source and splits it on white space into a list of words:
String xmlSrc = """
<div n="1">
<l n="1">Μῆνιν ἄειδε θεὰ <persName n="urn:cite:hmt:pers.pers1">Πηληϊάδεω Ἀχιλῆος</persName>
</l>
</div>
"""
// Boolean flag indicates text nodes are in Unicode mapping of Greek
GreekNode gn = new GreekNode(xmlSrc, true)
// so allText is just the raw Unicode: Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος
String allText = gn.collectText()
// Split on runs of white space so that line breaks do not yield empty strings
List<String> wordList = allText.trim().split(/\s+/) as List
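If we sort a raw word list as Strings, ordering by Unicode code point produces a thoroughly useless sequence: words with a plain capital initial come first, and words whose initial vowel carries a breathing or accent land after everything else. (The lists shown here were built from a slightly longer passage than the one-line snippet above.)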
"Μῆνιν", "Πηληϊάδεω", "θεὰ", "μυρί'", "οὐλομένην·","ἄειδε", "ἄλγε'", "Ἀχαιοῖς", "Ἀχιλῆος", "ἔθηκεν·", "ἡ"
If we convert it instead to a list of GreekStrings, say, like this:
def gsList = []
wordList.each {
    gsList.add(new GreekString(it, true))
}
we can then sort gsList directly. Since it is a list of GreekString objects, the results (displayed here in the Unicode mapping) will be far more satisfactory:
"ἄειδε","ἄλγεʼ","ἀχαιοῖς","ἀχιλῆος","ἔθηκεν·","ἡ","θεὰ","μῆνιν","μυρίʼ","οὐλομένην·","πηληϊάδεω"