Analyzing a text corpus


We start with a corpus citable by CTS URN. In these examples, we'll work with a citable corpus of the five extant versions of the Gettysburg Address. We will then construct an AnalyticalCorpus that matches this citable corpus with an orthography and a parser. With this in hand, we can create a full, morphologically aware analysis of each token in the corpus with a single function call.

Load the source corpus

We can load the source data into the CitableTextCorpus model from a URL. The fromcex function reads CEX-formatted data; passing UrlReader as the third argument tells it to treat the first argument as a URL and to retrieve the data from there.

using CitableBase, CitableCorpus
corpusurl = ""
corpus = fromcex(corpusurl, CitableTextCorpus, UrlReader)
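For readers unfamiliar with the CEX format that fromcex reads, the following minimal sketch shows the general shape of a `#!ctsdata` block and parses it with plain Julia, using no packages. The URNs, the sample text, and the helper name `parse_ctsdata` are illustrative assumptions; they are not part of CitableCorpus, which does this work for you.

```julia
# A tiny CEX source: a block label, then one `urn|text` pair per line.
# The URNs and text here are made up for illustration.
cexsample = """
#!ctsdata
urn:cts:demo:gburg.sample.v1:1|Four score and seven years ago
urn:cts:demo:gburg.sample.v1:2|Now we are engaged in a great civil war
"""

# Hypothetical helper: collect (urn, text) pairs from the ctsdata block.
function parse_ctsdata(src::AbstractString; delimiter = "|")
    pairs = Tuple{String, String}[]
    inblock = false
    for line in split(src, '\n')
        stripped = strip(line)
        if startswith(stripped, "#!")
            # A new block label: only lines under `#!ctsdata` are read.
            inblock = stripped == "#!ctsdata"
        elseif inblock && !isempty(stripped)
            parts = split(stripped, delimiter; limit = 2)
            length(parts) == 2 || continue   # skip malformed lines
            push!(pairs, (String(parts[1]), String(parts[2])))
        end
    end
    pairs
end

passages = parse_ctsdata(cexsample)
```

Each resulting pair corresponds to one citable passage: a URN identifying the passage and the text it cites.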

Constructing an AnalyticalCorpus


The Orthography module includes a simple ASCII orthography that we can use with our Gettysburg corpus.

using Orthography
orthography = simpleAscii()
typeof(orthography) |> supertype


The CitableParserBuilder module includes an implementation of the CitableParser abstraction that can parse tokens in the Gettysburg Address to their corresponding Penn Treebank part-of-speech (POS) codes. (For details on how the parser was constructed, see the appendix to this documentation.)

using CitableParserBuilder
parser = CitableParserBuilder.gettysburgParser()
typeof(parser) |> supertype

The analytical corpus

Our analytical corpus associates these three components.

using CitableCorpusAnalysis
acorpus = AnalyticalCorpus(corpus, orthography, parser)

The analyses

The analyzecorpus function requires an AnalyticalCorpus as an argument. It first creates a tokenized edition, then analyzes each token. It returns a Vector of AnalyzedTokens, in which each CitablePassage (one per token) is associated with a (possibly empty) Vector of Analysis objects.
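The two-step process described above (tokenize, then analyze each token) can be sketched in plain Julia. Everything here is a simplified stand-in: `SketchToken`, `sketch_analyze`, the whitespace tokenizer, and the four-entry dictionary are assumptions made for illustration, while the real AnalyzedTokens, CitablePassage, and Analysis types carry URNs and richer data. The point is the shape of the result: one entry per token, each with a possibly empty vector of analyses.

```julia
# Hypothetical stand-in for an analyzed token: a token's text paired with
# zero or more analyses (here, just POS codes stored as strings).
struct SketchToken
    text::String
    analyses::Vector{String}
end

# A toy dictionary mapping lowercase tokens to POS codes (illustrative only).
posdict = Dict(
    "four"  => ["CD"],
    "score" => ["NN"],
    "and"   => ["CC"],
    "seven" => ["CD"],
)

# Step 1: tokenize on whitespace. Step 2: look each token up in the dictionary.
# Tokens with no entry get an empty vector rather than being dropped.
function sketch_analyze(passage::AbstractString, dict)
    [SketchToken(String(t), get(dict, lowercase(t), String[])) for t in split(passage)]
end

results = sketch_analyze("Four score and seven years", posdict)

# Tokens the toy dictionary could not analyze.
unanalyzed = [r.text for r in results if isempty(r.analyses)]
```

Keeping unanalyzed tokens in the result, with an empty vector, makes it easy to find gaps in a parser's coverage by filtering on `isempty`.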

Additional arguments

The analyzecorpus function accepts an optional data parameter that will be passed along to the parsing functions it applies. In this example, the GettysburgParser can use a preloaded dictionary of analyses to get better performance, since it otherwise loads the entire dictionary for each individual parse.

analyses = analyzecorpus(acorpus; data =