Organizing and extracting source data

Formatting a data source

  • Use formatentries to read the source XML and build a three-column structure: a unique ID, a labelling lemma, and the full XML text of the entry.
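As a rough illustration of the three-column structure, the sketch below splits one delimited line into its ID, lemma, and XML text. The `|` delimiter, the sample line, and the helper name `splitentry` are assumptions made for illustration; they are not taken from formatentries itself.

```julia
# Hypothetical helper: split one line of three-column data.
# The "|" delimiter is an assumption, not confirmed by the source.
function splitentry(line::AbstractString; delimiter = "|")
    # limit = 3 keeps any later delimiters inside the XML column intact
    cols = split(line, delimiter; limit = 3)
    length(cols) == 3 || throw(ArgumentError("expected 3 columns, got $(length(cols))"))
    return (id = String(cols[1]), lemma = String(cols[2]), xml = String(cols[3]))
end

# Invented sample line, for illustration only
entry = splitentry("n4|ab|<entry>ab | prep.</entry>")
println(entry.id)
println(entry.lemma)
```

Capping the split at three pieces means a delimiter character that happens to occur inside the XML text does not break the row apart.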

Extracting morphological data

  • Use extractmorph to read a file of three-column data and extract the following morphological features for each entry:

    • id
    • label
    • lemma
    • pos
    • itype
    • gen
    • mood
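One way to picture a record with these seven features is as a struct of string fields. The sketch below is hypothetical: `MorphRecord` is an invented name and the field types are assumed, so the package's own MorphData type may be defined differently.

```julia
# Hypothetical record type mirroring the seven features listed above.
# All fields are assumed to be plain strings; empty strings mark missing values.
struct MorphRecord
    id::String
    label::String
    lemma::String
    pos::String
    itype::String
    gen::String
    mood::String
end

# Values copied from the n7 entry shown in the worked example
rec = MorphRecord("n7", "abactor", "a^bactor", "", "o_ris", "m.", "")
println(rec.itype)
println(rec.gen)
```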

Working with morphological data

The morphological data extracted by extractmorph can be read into an object model by applying the morphData function to each entry.

Worked example

The cex/lewis-short directory has a three-column file named mainentries.cex.

sourcefile = joinpath(repo, "cex", "lewis-short", "mainentries.cex")
"/home/runner/work/LexiconMining.jl/LexiconMining.jl/cex/lewis-short/mainentries.cex"

Chain extractmorph and morphData to create a Vector of MorphData objects.

using LexiconMining
extractmorph(sourcefile) .|> morphData
40024-element Vector{MorphData}:
 MorphData("id", "label", "lemma", "pos", "itype", "gen", "mood")
 MorphData("n0", "A", "A1", "", "indecl.", "n.", "")
 MorphData("n1", "a", "a2", "prep.", "", "", "")
 MorphData("n2", "a", "a_3", "interj.", "", "", "")
 MorphData("n3", "Aaron", "A^a^ro_n", "", "indecl.", "m.", "")
 MorphData("n4", "ab", "a^b", "prep.", "", "", "")
 MorphData("n5", "Aba", "Aba", "", "ae", "m.", "")
 MorphData("n7", "abactor", "a^bactor", "", "o_ris", "m.", "")
 MorphData("n8", "abactus", "a^bactus1", "", "a, um", "", "part.")
 MorphData("n12", "Abaddir", "A_baddir", "", "indecl.", "m.", "")
 ⋮
 MorphData("n51572", "zonarius", "zo_na_ri^us", "adj.", "a, um", "m.", "")
 MorphData("n51574", "Zone", "Zo_ne_", "", "e_s", "f.", "")
 MorphData("n51575", "zonula", "zo_nu^la", "", "ae", "f.", "")
 MorphData("n51580", "Zopyrus", "Zo_py^rus", "adv.", "i", "m.", "")
 MorphData("n51581", "zoranisceos", "zoranisceos", "", "i", "m.", "")
 MorphData("n51582", "Zoroastres", "Zo_ro^astres", "adj.", "is", "m.", "")
 MorphData("n51584", "Zoster", "Zoster2", "", "e_ris", "m.", "")
 MorphData("n51586", "zothecula", "zo_the_cu^la", "", "ae", "f.", "")
 MorphData("n51589", "Zygia", "Zygia2", "", "ae", "f.", "")
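
The first element of the result above repeats the column names, which suggests that the header line of the file is parsed like any other row. A minimal sketch of dropping such a row and then filtering on one feature follows. It uses stand-in NamedTuples with assumed field names rather than the package's MorphData type, whose field names are not shown here.

```julia
# Stand-in rows as NamedTuples; the field names are assumptions.
rows = [
    (id = "id", label = "label", gen = "gen"),   # header line parsed as data
    (id = "n5", label = "Aba", gen = "m."),
    (id = "n51574", label = "Zone", gen = "f."),
    (id = "n51575", label = "zonula", gen = "f."),
]

# Drop the row that merely repeats the column names
entries = filter(r -> r.id != "id", rows)

# Keep only the entries marked feminine
feminine = filter(r -> r.gen == "f.", entries)
println([r.label for r in feminine])
```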