Kanones' analyses

Kanones.jl implements the model of the CitableParserBuilder module. Parsing functions (like parsetoken) return a Vector of Analysis objects. In addition to a lexeme and a form, each Analysis also includes a stem and an inflectional rule. Conceptually, the stem and rule provide the rationale for an analysis: the stem explains why a specific lexeme was chosen, and the inflectional rule explains how the token was formed by applying a particular inflectional pattern to the stem. The same pairing works in the other direction: when generating tokens, a stem and a rule together provide enough information to identify a lexeme and a form and to compose a token. Kanones can therefore produce a full Analysis object when generating tokens as well as when parsing them.

Kanones further associates an implementation of an Orthography with each parser. You can use Kanones to build parsers that are tailored not only to specific features of a language (vocabulary or inflectional patterns specific to a particular corpus or dialect), but also to specific orthographic systems and the phonology they represent. The Kanones GitHub repository, for example, includes stems and rules in two completely different orthographies: the standard orthography of printed literary Greek, and the orthography of inscriptions of Athens prior to 403 BCE.

In Kanones, each of the four components of an Analysis is a Cite2Urn value. The identifiers for lexemes and morphological forms are potentially applicable to any parser you build with Kanones, whereas the stems and rules for the same lexeme and form may differ when you parse texts in different orthographies. Because references to lexemes and forms remain meaningful across parsers built for different orthographies, you can analyze a token in one orthography and then generate the corresponding token for the same lexeme and form in another orthography.
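To make the four components concrete, here is a minimal sketch in plain Julia that does not use Kanones at all. It assumes the pipe-delimited layout (token, lexeme, form, stem, rule) visible in the parser output printed later on this page. `ToyAnalysis`, `toy_analysis` and `collection` are hypothetical names introduced only for illustration, and the identifiers are the short forms printed in that output rather than full Cite2Urn values.

```julia
# Toy stand-in for an analysis. This is NOT the CitableParserBuilder.Analysis
# type; it only mirrors the fields described in the text.
struct ToyAnalysis
    token::String
    lexeme::String
    form::String
    stem::String
    rule::String
end

# Hypothetical helper: split one pipe-delimited record into its five fields.
function toy_analysis(record::AbstractString)
    fields = split(record, '|')
    length(fields) == 5 || throw(ArgumentError("expected 5 fields, got $(length(fields))"))
    ToyAnalysis(fields...)
end

# Hypothetical helper: the part of a short identifier before the first '.'
collection(id::AbstractString) = first(split(id, '.'))

# Both records are copied from the parser output shown on this page.
literary = toy_analysis("ἀγαθός|lsj.n260|forms.7010001110|adjstems.n260|adjinfl.os_h_on_pos1")
attic = toy_analysis("βολέ|lsj.n20600|forms.2010002100|atticnounstems.n20600|atticnouninfl.h_hs1")

# Lexemes and forms draw on the same identifier prefixes in both orthographies...
println(collection(literary.lexeme), " / ", collection(attic.lexeme))
println(collection(literary.form), " / ", collection(attic.form))
# ...while stems and rules come from orthography-specific sets.
println(collection(literary.stem), " / ", collection(attic.stem))
println(collection(literary.rule), " / ", collection(attic.rule))
```

The comparison at the end only illustrates the point made above: lexeme and form identifiers are shared across parsers, while stem and rule identifiers belong to a particular orthography.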

Example: transcoding content

First, we build a parser with the conventional orthography of modern printed editions of literary texts.

using Kanones, CitableParserBuilder
using PolytonicGreek
# Assumes `repo` holds the path to the root of a local copy of the Kanones repository.
lgfiles = joinpath(repo, "datasets", "literarygreek-rules")
lg = dataset(lgfiles)
lgparser = stringParser(lg)
StringParser(Any["ἀγαθός|lsj.n260|forms.7010001110|adjstems.n260|adjinfl.os_h_on_pos1", "ἀγαθή|lsj.n260|forms.7010002110|adjstems.n260|adjinfl.os_h_on_pos2", "ἀγαθόν|lsj.n260|forms.7010003110|adjstems.n260|adjinfl.os_h_on_pos3", "ἀγαθοῦ|lsj.n260|forms.7010001210|adjstems.n260|adjinfl.os_h_on_pos4", "ἀγαθῆς|lsj.n260|forms.7010002210|adjstems.n260|adjinfl.os_h_on_pos5", "ἀγαθοῦ|lsj.n260|forms.7010003210|adjstems.n260|adjinfl.os_h_on_pos6", "ἀγαθῷ|lsj.n260|forms.7010001310|adjstems.n260|adjinfl.os_h_on_pos7", "ἀγαθῇ|lsj.n260|forms.7010002310|adjstems.n260|adjinfl.os_h_on_pos8", "ἀγαθῷ|lsj.n260|forms.7010003310|adjstems.n260|adjinfl.os_h_on_pos9", "ἀγαθόν|lsj.n260|forms.7010001410|adjstems.n260|adjinfl.os_h_on_pos10"  …  "συνείησαν|lsj.n99858|forms.3331310000|compounds.n99858|irreginfl.irregular2", "συνείεν|lsj.n99858|forms.3331310000|compounds.n99858|irreginfl.irregular2", "συνῶ|lsj.n99858|forms.3111210000|compounds.n99858|irreginfl.irregular2", "συνῇς|lsj.n99858|forms.3211210000|compounds.n99858|irreginfl.irregular2", "συνῇ|lsj.n99858|forms.3311210000|compounds.n99858|irreginfl.irregular2", "συνῶμεν|lsj.n99858|forms.3131210000|compounds.n99858|irreginfl.irregular2", "συνῆτε|lsj.n99858|forms.3231210000|compounds.n99858|irreginfl.irregular2", "συνῶσι|lsj.n99858|forms.3331210000|compounds.n99858|irreginfl.irregular2", "συνεῖναι|lsj.n99858|forms.4001010000|compounds.n99858|irreginfl.irregular3", "συνέμμεναι|lsj.n99858|forms.4001010000|compounds.n99858|irreginfl.irregular3"])

Next, we build a parser with an orthography for the Attic alphabet used before 403 BCE.

using AtticGreek
atticfiles = joinpath(repo, "datasets", "attic")
attic = dataset(atticfiles)
atticparser = stringParser(attic)
StringParser(Any["άνθροπος|lsj.n8909|forms.2010001100|atticnounstems.n8909|atticnouninfl.os_ou1", "άνθροπο|lsj.n8909|forms.2010001200|atticnounstems.n8909|atticnouninfl.os_ou2", "άνθροποι|lsj.n8909|forms.2010001300|atticnounstems.n8909|atticnouninfl.os_ou3", "άνθροπον|lsj.n8909|forms.2010001400|atticnounstems.n8909|atticnouninfl.os_ou4", "άνθροποι|lsj.n8909|forms.2030001100|atticnounstems.n8909|atticnouninfl.os_ou5", "άνθροπον|lsj.n8909|forms.2030001200|atticnounstems.n8909|atticnouninfl.os_ou6", "άνθροποις|lsj.n8909|forms.2030001300|atticnounstems.n8909|atticnouninfl.os_ou7", "άνθροπος|lsj.n8909|forms.2030001400|atticnounstems.n8909|atticnouninfl.os_ou8", "βολέ|lsj.n20600|forms.2010002100|atticnounstems.n20600|atticnouninfl.h_hs1", "βολnothingς|lsj.n20600|forms.2010002200|atticnounstems.n20600|atticnouninfl.h_hs2", "βολει|lsj.n20600|forms.2010002300|atticnounstems.n20600|atticnouninfl.h_hs3", "βολέν|lsj.n20600|forms.2010002400|atticnounstems.n20600|atticnouninfl.h_hs4", "βολαί|lsj.n20600|forms.2030002100|atticnounstems.n20600|atticnouninfl.h_hs5", "βολnothingν|lsj.n20600|forms.2030002200|atticnounstems.n20600|atticnouninfl.h_hs6", "βολαῖς|lsj.n20600|forms.2030002300|atticnounstems.n20600|atticnouninfl.h_hs7", "βολάς|lsj.n20600|forms.2030002400|atticnounstems.n20600|atticnouninfl.h_hs8", "καί|lsj.n51951|forms.1000000001|atticuninflectedstems.n51951|attic.indeclinable2"])

Now we can analyze a token written in standard orthography.

analysis = parsetoken("βουλῆς", lgparser)[1]
CitableParserBuilder.Analysis("βουλῆς", lsj.n20600, forms.2010002200, nounstems.n20600, nouninfl.h_hs2)

Kanones has lexeme and formurn functions to retrieve the lexeme and form identifiers of an analysis. With those two identifiers in hand, we can now generate the corresponding token using the dataset for Attic orthography.

vocab = lexeme(analysis)
form = formurn(analysis)
generate(vocab, form, attic)
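The idea behind this last step, finding a token in the second orthography from a lexeme and a form alone, can be sketched as a simple lookup. The plain-Julia example below is an illustration, not how Kanones works: Kanones composes tokens from stems and rules, while this sketch substitutes a table lookup over three rows copied from the Attic parser output above. `ATTIC_ROWS` and `toy_generate` are hypothetical names, and the query uses a form identifier taken from one of those rows rather than the one in the analysis above.

```julia
# Rows copied from the Attic parser output: token|lexeme|form|stem|rule
const ATTIC_ROWS = [
    "βολέ|lsj.n20600|forms.2010002100|atticnounstems.n20600|atticnouninfl.h_hs1",
    "βολει|lsj.n20600|forms.2010002300|atticnounstems.n20600|atticnouninfl.h_hs3",
    "βολέν|lsj.n20600|forms.2010002400|atticnounstems.n20600|atticnouninfl.h_hs4",
]

# Hypothetical helper (not part of Kanones): collect every token whose
# lexeme and form identifiers match the requested pair.
function toy_generate(lexeme::AbstractString, form::AbstractString, rows)
    tokens = String[]
    for row in rows
        fields = split(row, '|')
        if fields[2] == lexeme && fields[3] == form
            push!(tokens, fields[1])
        end
    end
    tokens
end

# Look up the Attic token for one lexeme and form pair.
matches = toy_generate("lsj.n20600", "forms.2010002400", ATTIC_ROWS)
println(matches)
```

The helper returns a vector rather than a single string because one lexeme and form can correspond to more than one token, as the literary Greek output above shows for forms.3331310000.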