Analyzing a text corpus
Summary
We start with a corpus citable by CTS URN. In these examples, we'll work with a citable corpus of the five extant versions of the Gettysburg Address. We will then construct an AnalyticalCorpus that matches this citable corpus with an orthography and a parser. With this in hand, we can create a full, morphologically aware analysis of each token in the corpus with a single function call.
Load the source corpus
We can load the source data into the CitableTextCorpus model from a URL. The fromcex function takes the source, the type to instantiate, and a reader type; passing UrlReader tells it to read the CEX data directly from the URL.
using CitableBase, CitableCorpus
corpusurl = "https://raw.githubusercontent.com/neelsmith/CitableCorpusAnalysis.jl/dev/test/data/gettysburg/gettysburgcorpus.cex"
corpus = fromcex(corpusurl, CitableTextCorpus, UrlReader)
typeof(corpus)
CitableCorpus.CitableTextCorpus
Constructing an AnalyticalCorpus
Orthography
The Orthography module includes a simple ASCII orthography that we can use with our Gettysburg corpus.
using Orthography
orthography = simpleAscii()
typeof(orthography) |> supertype
Orthography.OrthographicSystem
Morphology
The CitableParserBuilder module includes an implementation of the CitableParser abstraction that can parse tokens in the Gettysburg Address to their corresponding Penn Treebank part-of-speech (POS) codes. (For details on how the parser was constructed, see the appendix to this documentation.)
using CitableParserBuilder
parser = CitableParserBuilder.gettysburgParser()
typeof(parser) |> supertype
CitableParserBuilder.CitableParser
The analytical corpus
Our analytical corpus associates these three components.
using CitableCorpusAnalysis
acorpus = AnalyticalCorpus(corpus, orthography, parser)
typeof(acorpus)
AnalyticalCorpus
The analyses
The analyzecorpus function requires an AnalyticalCorpus as an argument. It first creates a tokenized edition, then analyzes each token. It returns a Vector of AnalyzedToken objects, each of which associates a CitablePassage (one token) with a (possibly empty) Vector of Analysis objects.
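The two-step shape described above (tokenize, then analyze each token) can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: ToyToken, ToyAnalyzedToken, toytokenize, toyanalyze, the reference strings and the three-word lexicon are stand-ins invented for illustration, not the types or functions of CitableCorpusAnalysis. The point is only that a token with no match still appears in the result, paired with an empty vector.

```julia
# Hypothetical stand-ins: these toy types are NOT the package's own types.
struct ToyToken
    ref::String   # made-up reference string identifying the token
    text::String  # the token's text
end

struct ToyAnalyzedToken
    token::ToyToken
    analyses::Vector{String}  # possibly empty list of POS codes
end

# Step 1: a naive whitespace tokenizer that gives each token its own reference.
function toytokenize(passageref::String, text::String)
    ToyToken[ToyToken("$(passageref).$(i)", t) for (i, t) in enumerate(split(text))]
end

# Step 2: look each token up in a small lexicon; unknown tokens get no analyses.
function toyanalyze(tokens::Vector{ToyToken}, lexicon::Dict{String, Vector{String}})
    ToyAnalyzedToken[
        ToyAnalyzedToken(t, get(lexicon, lowercase(t.text), String[])) for t in tokens
    ]
end

# Illustrative lexicon mapping lower-cased words to Penn Treebank tags.
toylexicon = Dict("four" => ["CD"], "score" => ["NN"], "and" => ["CC"])

toytokens = toytokenize("demo:1", "Four score and seven")
results = toyanalyze(toytokens, toylexicon)

for r in results
    label = isempty(r.analyses) ? "(no analysis)" : join(r.analyses, ", ")
    println(r.token.ref, " ", r.token.text, " => ", label)
end
```

In this sketch "seven" is deliberately missing from the lexicon, so the last element of results carries an empty vector instead of being dropped.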
Additional arguments
The analyzecorpus function accepts an optional data keyword argument that is passed along to the parsing functions it applies. In this example, the GettysburgParser can use a preloaded dictionary of analyses to get better performance, since it otherwise loads the entire dictionary for each individual parse.
analyses = analyzecorpus(acorpus; data = parser.data)
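To see why passing preloaded data can matter, here is a small sketch of the general pattern, assuming a parser that reloads its dictionary whenever no data is supplied. The names loadlexicon, toyparse and LOADS are hypothetical and exist only for this illustration; they say nothing about how the GettysburgParser is implemented internally.

```julia
# Hypothetical sketch: none of these names come from the package.
const LOADS = Ref(0)  # counts how often the "dictionary" is loaded

# Stand-in for an expensive load (for example, reading a file from disk).
function loadlexicon()
    LOADS[] += 1
    Dict("four" => "CD", "score" => "NN", "and" => "CC")
end

# Parse one token. Without `data`, the lexicon is reloaded on every call.
function toyparse(token::AbstractString; data = nothing)
    lexicon = isnothing(data) ? loadlexicon() : data
    get(lexicon, lowercase(token), nothing)
end

words = ["Four", "score", "and"]

# No shared data: one load per token.
slow = [toyparse(w) for w in words]
loadsnodata = LOADS[]

# Shared data: load once, then reuse the same dictionary for every token.
LOADS[] = 0
shared = loadlexicon()
fast = [toyparse(w; data = shared) for w in words]
loadswithdata = LOADS[]

println("loads without data: ", loadsnodata)
println("loads with data:    ", loadswithdata)
```

Both runs produce the same parses; only the number of dictionary loads differs, which is the saving the data argument is meant to provide.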