Greek, sortable, and in XML

16 Nov 2015

The 2.x series of the greeklang library adds a GreekNode class for working with XML whose text nodes contain valid GreekString values. It extends the library's XmlNode class (introduced in this post) and verifies that the contents of every text node are valid GreekString values. It also offers methods for transcoding the text content of an XML string to either of the two representations defined by the GreekString class, as illustrated in the following snippet:

String asciiSrc = """
<div n="1">
 <l n="1">mh=nin a)/eide qea\\ <persName n="urn:cite:hmt:pers.pers1">phlhi+a/dew a)xilh=os</persName>
 </l>
</div>
"""
// Boolean flag means text nodes are NOT in Unicode mapping of Greek:
GreekNode gn = new GreekNode(asciiSrc, false)
// But we can get the same XML in Unicode:
String transcoded = gn.getUnicodeXml()

The value of transcoded will be:

<div n="1">
 <l n="1">μῆνιν ἄειδε θεὰ <persName n="urn:cite:hmt:pers.pers1">πηληϊάδεω ἀχιλῆος</persName>
 </l>
</div>
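
To make the idea of transcoding only the text nodes concrete, here is a minimal sketch that does not use greeklang at all. It walks a DOM tree with the standard Java XML API and applies a function to each text node while leaving element names and attributes alone. The helper `mapTextNodes` and the toy `letters` table are hypothetical names invented for this illustration; the table covers only a few unaccented letters and ignores accents, breathings, and final sigma, so it is in no way a substitute for GreekNode.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Node as DomNode

// Hypothetical helper (not part of greeklang): apply fn to every text node,
// leaving element names and attribute values untouched.
void mapTextNodes(DomNode node, Closure<String> fn) {
    if (node.nodeType == DomNode.TEXT_NODE) {
        node.nodeValue = fn(node.nodeValue)
    }
    def children = node.childNodes
    for (int i = 0; i < children.length; i++) {
        mapTextNodes(children.item(i), fn)
    }
}

// Toy ASCII-to-Unicode table: assumption for illustration only.
def letters = [a: 'α', x: 'χ', i: 'ι', l: 'λ', h: 'η', o: 'ο', s: 'σ', m: 'μ', n: 'ν']
def toyTranscode = { String s -> s.toList().collect { letters[it] ?: it }.join() }

def xml = '<l n="1">mhnin <persName n="p1">axilhos</persName></l>'
def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
def doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))

mapTextNodes(doc.documentElement, toyTranscode)

// Text nodes were rewritten; the attribute values were not.
assert doc.documentElement.textContent == "μηνιν αχιληοσ"
assert doc.getElementsByTagName("persName").item(0).getAttribute("n") == "p1"
```

Working at the level of text nodes is what keeps identifiers such as the URN in the n attribute from being mangled by the transcoding step.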

In the 2.x release series, the GreekString class implements Java's Comparable interface, so GreekStrings can be compared and sorted using the order of the Greek alphabet. The following assertion holds even though "q" sorts after "k" when the same values are compared as plain Strings:

GreekString asciiTheta = new GreekString("e)/qhke")
GreekString asciiKappa = new GreekString("e)/kruye")
assert asciiKappa > asciiTheta
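
For readers curious how a Comparable can follow the Greek alphabet rather than ASCII order, here is a rough sketch. The class `BetaWord` is a hypothetical stand-in written for this post, not the GreekString implementation, and it assumes the ASCII mapping uses the usual beta-code letter assignments (q for theta, c for xi, y for psi, w for omega). It ignores accents and breathings entirely.

```groovy
// Hypothetical stand-in for illustration; NOT the greeklang GreekString class.
class BetaWord implements Comparable<BetaWord> {
    // Greek alphabetical order written in the ASCII transliteration (assumed mapping)
    static final String ORDER = "abgdezhqiklmncoprstufxyw"
    final String ascii

    BetaWord(String ascii) { this.ascii = ascii }

    // Keep letters only and turn each into its position in the Greek alphabet
    List<Integer> sortKey() {
        ascii.toLowerCase().toList()
            .findAll { ORDER.contains(it) }
            .collect { ORDER.indexOf(it) }
    }

    int compareTo(BetaWord other) {
        List<Integer> a = sortKey()
        List<Integer> b = other.sortKey()
        int shared = Math.min(a.size(), b.size())
        for (int i = 0; i < shared; i++) {
            if (a[i] != b[i]) {
                return a[i] <=> b[i]
            }
        }
        // One word is a prefix of the other: the shorter sorts first
        return a.size() <=> b.size()
    }
}

def theta = new BetaWord("e)/qhke")
def kappa = new BetaWord("e)/kruye")
assert kappa > theta             // Greek order: theta precedes kappa
assert "e)/kruye" < "e)/qhke"    // plain String order: "k" precedes "q"
```

Groovy's comparison operators call compareTo on Comparable objects, which is why the > operator works directly on these values.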

Since we’re comparing GreekString objects, it doesn’t matter how we initialize them. A word built from the Unicode mapping compares as equal to the same word built from the ASCII mapping:

GreekString asciiKappa = new GreekString("e)/kruye")
GreekString uniKappa = new GreekString("ἔκρυψε", true)
assert uniKappa == asciiKappa

Sortable GreekStrings and simple extraction of text from XML source data complement each other handily for tasks like indexing a text. Consider the following snippet, which collects all the text from an XML source and splits it on white space into a list of words:

String xmlSrc = """
<div n="1">
 <l n="1">Μῆνιν ἄειδε θεὰ <persName n="urn:cite:hmt:pers.pers1">Πηληϊάδεω Ἀχιλῆος</persName>
 </l>
</div>
"""
// Boolean flag indicates text nodes are in Unicode mapping of Greek
GreekNode gn = new GreekNode(xmlSrc, true)
// so allText is just the raw Unicode: Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος
String allText = gn.collectText()
// Split on runs of white space so line breaks and indentation yield no empty entries
List<String> wordList = allText.trim().split(/\s+/).toList()
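
A small aside on the splitting step: text pulled out of indented XML often contains runs of white space (a line break followed by spaces, say), and splitting on a single white space character then produces empty entries. The sketch below uses a made-up sample string to show the difference; `tokenize()` is the standard Groovy String method that splits on white space and returns a List.

```groovy
// Sample text with a line break followed by indentation (made-up input)
String text = "Μῆνιν ἄειδε\n   θεὰ"

// Splitting on a single white space character keeps the empty strings
// that fall between consecutive white space characters.
assert text.split(/\s/).size() == 6

// Splitting on runs of white space, or using tokenize(), gives just the words.
assert text.split(/\s+/).toList() == ["Μῆνιν", "ἄειδε", "θεὰ"]
assert text.tokenize() == ["Μῆνιν", "ἄειδε", "θεὰ"]
```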

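If we sort the raw word list as Strings, the result follows Unicode code point order, so capitals sort before lower-case letters and letters carrying breathings or accents sort after all the plain ones. For a slightly longer passage than the single line above (the first two lines of the Iliad), that produces the following thoroughly useless sequence:

"Μῆνιν", "Πηληϊάδεω", "θεὰ", "μυρί'", "οὐλομένην·", "ἄειδε", "ἄλγε'", "Ἀχαιοῖς", "Ἀχιλῆος", "ἔθηκεν·", "ἡ"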

If we convert it instead to a list of GreekStrings, say, like this:

def gsList = []
wordList.each {
  gsList.add(new GreekString(it, true))
}

we can then directly sort gsList. Since it is a list of GreekString objects, the results (displayed here in the Unicode mapping) will be far more satisfactory:

"ἄειδε","ἄλγεʼ","ἀχαιοῖς","ἀχιλῆος","ἔθηκεν·","ἡ","θεὰ","μῆνιν","μυρίʼ","οὐλομένην·","πηληϊάδεω"
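
For comparison, a rough approximation of this ordering is possible without greeklang, using only the JDK. The sketch below is an assumption-laden illustration, not how GreekString works: the hypothetical helper `bareKey` decomposes each word with java.text.Normalizer, strips the combining marks, and lower-cases what is left, and the words are then sorted by that key. It does no validation, and it leans on the fact that the basic lower-case Greek letters happen to sit in alphabetical order in Unicode.

```groovy
import java.text.Normalizer

// Hypothetical helper, not part of greeklang: reduce a word to bare lower-case letters
String bareKey(String word) {
    String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD)
    // \p{M} matches the combining accents and breathings exposed by NFD
    return decomposed.replaceAll(/\p{M}/, "").toLowerCase(Locale.ROOT)
}

def words = ["Μῆνιν", "Πηληϊάδεω", "θεὰ", "ἄειδε", "Ἀχιλῆος"]

// sort(false) returns a new sorted list and leaves the original untouched
def sorted = words.sort(false) { bareKey(it) }

assert bareKey("Ἀχιλῆος") == "αχιληος"
assert sorted == ["ἄειδε", "Ἀχιλῆος", "θεὰ", "Μῆνιν", "Πηληϊάδεω"]
```

Unlike a list of GreekStrings, this only reorders plain Strings by a derived key, so the original capitalization is preserved in the output and nothing checks that the input is valid Greek.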

Neel Smith on github Openly available work in digital classics
