Neel Smith on github Openly available work in digital classics

Introducing Kanónes, a system for building Greek morphological parsers

Kanónes is a system for building Greek morphological parsers. It is distinguished from other automated parsing systems by its focus on citable results of corpus-specific parsing (features summarized on the project home page).

Kanónes has an abstract interface representing a system for writing Greek. Implementations of a writing system define an alphabet, and implement methods for accenting Greek words in that writing system. This allows Kanónes to operate on Greek texts in the familiar literary alphabet, or in alphabets like the one used in Athens before 403 BCE.
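One way to picture such an interface is the Python sketch below. It is an illustration only, not Kanónes's actual API: the names `WritingSystem`, `LiteraryGreek`, `OldAttic`, `add_acute` and `is_valid` are hypothetical, and the Old Attic letter inventory is simplified to the literary alphabet minus η, ξ, ψ and ω. Accenting is reduced to placing an acute on a chosen character.

```python
import unicodedata
from abc import ABC, abstractmethod

ACUTE = "\u0301"  # Unicode combining acute accent


class WritingSystem(ABC):
    """Hypothetical interface: an alphabet plus an accenting method."""

    @property
    @abstractmethod
    def alphabet(self) -> frozenset:
        """The set of unaccented letters this writing system allows."""

    @abstractmethod
    def add_acute(self, word: str, index: int) -> str:
        """Return `word` with an acute accent on the character at `index`."""

    def is_valid(self, word: str) -> bool:
        # Decompose, drop accents and breathings, then check each letter.
        decomposed = unicodedata.normalize("NFD", word)
        letters = [c for c in decomposed if not unicodedata.combining(c)]
        return all(c in self.alphabet for c in letters)


class LiteraryGreek(WritingSystem):
    _letters = frozenset("αβγδεζηθικλμνξοπρσςτυφχψω")

    @property
    def alphabet(self) -> frozenset:
        return self._letters

    def add_acute(self, word: str, index: int) -> str:
        accented = word[: index + 1] + ACUTE + word[index + 1 :]
        return unicodedata.normalize("NFC", accented)


class OldAttic(LiteraryGreek):
    # Simplified assumption: no eta, xi, psi or omega.
    # Accent placement is reused from LiteraryGreek for brevity.
    _letters = LiteraryGreek._letters - frozenset("ηξψω")
```

With these definitions, `LiteraryGreek().add_acute("ποιεω", 3)` accents the epsilon, and `OldAttic().is_valid("ποιέω")` is false because ω is outside the simplified Old Attic alphabet.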

Internally, Kanónes defines eight forms of morphological analysis: one each for nouns, adjectives, adverbs, and indeclinable forms; and, for verbs, one each for conjugated forms, infinitives, participles, and verbal adjectives. Kanónes further defines a set of inflectional classes that apply to each form of analysis. To build a parser, you fill out two sets of tables: one assigns morphological stems to an inflectional class; the other assigns inflectional rules to their class.

Complete documentation is still in progress, but an example can serve to illustrate how this works. (Note that the tables in this example use an ASCII-only representation of the literary Greek alphabet, so in the example stem listing below poi represents ποι, and in the inflectional table, ee represents εε.)

Kanónes has a class for epsilon contract verbs, called ew_contract. This one-line entry in the table of stems defines everything needed to work with the regular verb ποιέω.

| Stem ID | Lexical Entity | Stem value | Inflectional class |
| --- | --- | --- | --- |
| smyth.n84234_0 | lexent.n84234 | poi | ew_contract |

In the table of inflectional rules, this line defines an inflectional ending to analyze as third person singular imperfect indicative active.

| Rule ID | Inflectional class | Ending | Person | Number | Tense | Mood | Voice |
| --- | --- | --- | --- | --- | --- | --- | --- |
| verbinfl.ew_impft3 | ew_contract | ee | 3rd | sg | impft | indic | act |

Because both stem and inflectional pattern belong to the class ew_contract, Kanónes will combine them and correctly parse ἐποίεε in a text of Herodotus. If you were parsing a corpus of Attic prose, you could use the same stem entry but replace the inflectional pattern with a line like

| Rule ID | Inflectional class | Ending | Person | Number | Tense | Mood | Voice |
| --- | --- | --- | --- | --- | --- | --- | --- |
| verbinfl.ew_impft3 | ew_contract | ei | 3rd | sg | impft | indic | act |

and the system will correctly parse the Attic form ἐποίει. To analyze a corpus with both contracted and uncontracted forms of -εω contract verbs, you would include both rules in the inflectional tables.
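The matching of stems and rules by inflectional class can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Kanónes is implemented: the `Stem`, `Rule` and `parse` names are hypothetical, tokens are written in the ASCII representation used above, and the augment and accent are ignored, so the token for ἐποίεε is simply `poiee`. The second rule ID is also invented here so that the two rules can coexist in one table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stem:
    stem_id: str
    lexical_entity: str
    value: str
    infl_class: str


@dataclass(frozen=True)
class Rule:
    rule_id: str
    infl_class: str
    ending: str
    person: str
    number: str
    tense: str
    mood: str
    voice: str


def parse(token, stems, rules):
    """Return every (stem, rule) pair of the same inflectional class
    whose stem value plus ending spells `token`."""
    return [
        (s, r)
        for s in stems
        for r in rules
        if s.infl_class == r.infl_class and s.value + r.ending == token
    ]


STEMS = [Stem("smyth.n84234_0", "lexent.n84234", "poi", "ew_contract")]

# Uncontracted ending, as in the Herodotus example.
UNCONTRACTED = Rule("verbinfl.ew_impft3", "ew_contract", "ee",
                    "3rd", "sg", "impft", "indic", "act")
# Contracted Attic ending; this rule ID is hypothetical.
CONTRACTED = Rule("verbinfl.ew_impft3_attic", "ew_contract", "ei",
                  "3rd", "sg", "impft", "indic", "act")

herodotus_only = parse("poiei", STEMS, [UNCONTRACTED])
mixed_corpus = parse("poiei", STEMS, [UNCONTRACTED, CONTRACTED])
```

Here `herodotus_only` is empty, because the rule set built for Herodotus has no contracted ending, while `mixed_corpus` holds one analysis pairing the ποιέω stem with the contracted rule.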

To test the internal workings of the system, I’m currently assembling a set of stems and rules that cover all the paradigms and synopses in Herbert Weir Smyth’s Greek Grammar, and am writing accompanying unit tests that parse every form in those paradigms. (I’ve set up a scorecard, here, if you want to track the progress of this testing.)

Results of the first two tests are encouraging: Smyth 382 is a synopsis of λύω that includes, in addition to conjugated forms in the first singular, all infinitives, and the masculine nominative singular of all participles and verbal adjectives. Smyth 406 is a conjugation of four verbs that undergo euphonic changes in the perfect middle-passive system (λείπω, γράφω, πείθω, πράττω). The 155 forms successfully tested for these two sections of Smyth ensure that the basic patterns of verb formation are correctly parsed.

It will take some time to enter and verify every form in all of Smyth’s paradigms, but I hope to pass enough tests to justify releasing a 1.0 version of Kanónes in May.