Working with vectors of AnalyzedTokens

Same setup as before: read the corpus, tokenize it, and parse the tokens.

parsed = parsecorpus(tc, parser; data = parser.data)
length(parsed)
1506
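The return value is a vector of AnalyzedToken objects, one per token, each carrying zero or more analyses. A minimal sketch of that shape, using hypothetical named tuples and invented identifiers rather than the package's actual types:

```julia
# Mock sketch of the parsed result: each entry pairs a surface token with
# zero or more analyses. These records are simplified stand-ins, not the
# package's AnalyzedToken type.
parsed_sketch = [
    (token = "Four",  analyses = [(lexeme = "gburglex.four",  form = "numeral")]),
    # An ambiguous token can carry more than one analysis:
    (token = "score", analyses = [(lexeme = "gburglex.score", form = "noun"),
                                  (lexeme = "gburglex.score", form = "verb")]),
]
length(parsed_sketch)  # 2
```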

Lexemes

Get a list of unique lexeme identifiers for all parsed tokens.

lexemelist = lexemes(parsed)
length(lexemelist)
154
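Conceptually, the lexeme list is the set of unique lexeme identifiers collected across all analyses. A self-contained sketch with mock data (the records and passage references here are invented for illustration):

```julia
# Collect unique lexeme identifiers from mock analyses
# (named tuples stand in for the package's analysis objects).
analyses = [
    (passage = "sect1.1", token = "and",   lexeme = "gburglex.and"),
    (passage = "sect1.2", token = "score", lexeme = "gburglex.score"),
    (passage = "sect1.9", token = "and",   lexeme = "gburglex.and"),
]
lexemelist = unique(a.lexeme for a in analyses)
length(lexemelist)  # 2
```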

For a given lexeme, find all surface forms appearing in the corpus. The lexeme "gburglex.and" appears in only one surface form, "and".

stringsforlexeme(parsed, "gburglex.and")
1-element Vector{AbstractString}:
 "and"
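The lookup amounts to filtering analyses by lexeme and collecting the distinct surface tokens. A sketch with mock data (records and identifiers invented for illustration; the real function operates on a vector of AnalyzedTokens):

```julia
# Find distinct surface forms for one lexeme in mock analyses.
analyses = [
    (passage = "sect1.1", token = "nation",  lexeme = "gburglex.nation"),
    (passage = "sect2.3", token = "nations", lexeme = "gburglex.nation"),
    (passage = "sect1.5", token = "score",   lexeme = "gburglex.score"),
]
formsfor(lex) = unique(a.token for a in analyses if a.lexeme == lex)
formsfor("gburglex.nation")  # ["nation", "nations"]
```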

Get a dictionary keyed by lexeme that can be used to find all surface forms, and all passages, for a given lexeme. Its keys are the lexeme identifiers, so it has the same length as the list of lexemes.

ortho = simpleAscii()
tokenindex = corpusindex(corpus, ortho)
lexdict = lexemedictionary(parsed, tokenindex)
length(lexdict)
154

Each entry in the dictionary is a further dictionary mapping surface forms to passages.

lexdict["gburglex.and"]
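The nested structure can be sketched with mock data: an outer dictionary keyed by lexeme, whose values map each surface form to the passages where it occurs (the records and passage references below are invented placeholders, not the package's types):

```julia
# Build a nested dictionary lexeme => (surface form => passages)
# from mock analyses.
analyses = [
    (passage = "sect1.1", token = "nation",  lexeme = "gburglex.nation"),
    (passage = "sect2.3", token = "nations", lexeme = "gburglex.nation"),
    (passage = "sect3.2", token = "nation",  lexeme = "gburglex.nation"),
]
lexdict = Dict{String, Dict{String, Vector{String}}}()
for a in analyses
    # get! inserts and returns a default value when the key is missing.
    formmap = get!(lexdict, a.lexeme, Dict{String, Vector{String}}())
    push!(get!(formmap, a.token, String[]), a.passage)
end
lexdict["gburglex.nation"]
# Dict("nations" => ["sect2.3"], "nation" => ["sect1.1", "sect3.2"])
```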