Using different models of a text corpus

CitableTextCorpus

The base model of a text corpus is the CitableTextCorpus: a citable, human-readable series of text passages.

using CitableBase, CitableCorpus
corpusurl = "https://raw.githubusercontent.com/neelsmith/CitableCorpusAnalysis.jl/dev/test/data/gettysburg/gettysburgcorpus.cex"
corpus = fromcex(corpusurl, CitableTextCorpus, UrlReader)
Corpus with 20 citable passages in 5 documents.

TextAnalysis.Corpus

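You can convert a CitableTextCorpus to the Corpus model of Julia's TextAnalysis package with the tacorpus function. Each citable passage becomes one StringDocument, so the 20 passages above yield a Corpus of 20 documents.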

using CitableCorpusAnalysis
ta_corp = tacorpus(corpus)
A Corpus with 20 documents:
 * 20 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

The TextAnalysis package provides functions for computing basic metrics on a text. For corpora in English, it also includes more advanced tools for tokenizing, part-of-speech tagging, and working with neural models. Note that the lexicon and index of a newly converted Corpus are empty, as the output above shows. See the documentation for TextAnalysis.Corpus.
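As a minimal sketch of why the lexicon and index above report 0 tokens, the following uses TextAnalysis directly, where both are filled in only on request. The two sample documents are hypothetical stand-ins for the result of tacorpus(corpus), so the example runs without downloading anything.

```julia
using TextAnalysis

# Hypothetical stand-in documents (not the Gettysburg corpus itself).
docs = [
    StringDocument("Four score and seven years ago"),
    StringDocument("our fathers brought forth a new nation"),
]
crps = Corpus(docs)

# A freshly built Corpus has an empty lexicon and inverse index.
n_before = length(lexicon(crps))

# Populate the lexicon: a Dict mapping each term to its count in the corpus.
update_lexicon!(crps)
lex = lexicon(crps)

# Populate the inverse index: a Dict mapping each term to the
# indices of the documents that contain it.
update_inverse_index!(crps)
idx = inverse_index(crps)

println("terms before update: ", n_before)
println("count of \"score\": ", lex["score"])
println("documents containing \"nation\": ", idx["nation"])
```

The same two update calls should apply unchanged to ta_corp, since tacorpus returns an ordinary TextAnalysis Corpus.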

Tokenization

If the assumptions of TextAnalysis (which is oriented towards English) are not appropriate for your corpus, you can sometimes work around them by preprocessing your original CitableTextCorpus. For example, you can create a tokenized corpus that follows a specified orthography and use it as the source for a TextAnalysis.Corpus. Because each passage of the tokenized corpus already holds a single token, the default tokenizer's naive assumptions no longer determine the token boundaries.
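The preprocessing step described above might look like the following sketch. It assumes the Orthography package is installed and exports simpleAscii and tokenizedcorpus with the argument order shown; neither name appears earlier on this page, so check them against the Orthography documentation for your installed version. The simpleAscii orthography is used only for illustration, and you would substitute the orthography that fits your texts.

```julia
using CitableBase, CitableCorpus, CitableCorpusAnalysis
using Orthography  # assumption: provides simpleAscii() and tokenizedcorpus()

corpusurl = "https://raw.githubusercontent.com/neelsmith/CitableCorpusAnalysis.jl/dev/test/data/gettysburg/gettysburgcorpus.cex"
corpus = fromcex(corpusurl, CitableTextCorpus, UrlReader)

# Assumed API: build a new citable corpus in which each passage is a
# single token, identified according to the given orthography.
tokencorpus = tokenizedcorpus(corpus, simpleAscii())

# Convert the tokenized corpus to a TextAnalysis.Corpus, as before.
ta_tokens = tacorpus(tokencorpus)
```

If these assumptions hold, ta_tokens contains one document per token rather than one per passage, so later TextAnalysis metrics work with token boundaries chosen by your orthography.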