Using different models of a text corpus
CitableTextCorpus
The base model of a text corpus is the CitableTextCorpus: a citable, human-readable series of text passages.
using CitableBase, CitableCorpus
corpusurl = "https://raw.githubusercontent.com/neelsmith/CitableCorpusAnalysis.jl/dev/test/data/gettysburg/gettysburgcorpus.cex"
corpus = fromcex(corpusurl, CitableTextCorpus, UrlReader)
Corpus with 20 citable passages in 5 documents.
TextAnalysis.Corpus
You can convert this to the Corpus model of Julia's TextAnalysis module.
using CitableCorpusAnalysis
ta_corp = tacorpus(corpus)
A Corpus with 20 documents:
* 20 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
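The lexicon and index are reported as empty because TextAnalysis does not populate them until asked. As a minimal sketch, assuming a small stand-in corpus built directly from strings (the passages and variable names here are illustrative, not part of the Gettysburg example), they can be filled in with update_lexicon! and update_inverse_index!:

```julia
using TextAnalysis

# Illustrative stand-in corpus; in the walkthrough above, `ta_corp` plays this role.
docs = [
    StringDocument("Four score and seven years ago"),
    StringDocument("our fathers brought forth"),
]
crps = Corpus(docs)

# A freshly built Corpus has an empty lexicon and inverse index;
# both must be computed explicitly.
update_lexicon!(crps)
update_inverse_index!(crps)

lex = lexicon(crps)        # term => number of occurrences in the corpus
idx = inverse_index(crps)  # term => indices of the documents containing it
```

The same two calls should apply unchanged to a Corpus produced by tacorpus.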
The TextAnalysis module has a variety of functions for basic metrics on a text. For corpora in English, it also includes more advanced tools for tokenizing, tagging parts of speech, and working with neural models. See the documentation for TextAnalysis.Corpus.
If the assumptions of TextAnalysis (oriented towards English) are not appropriate for your corpus, you can sometimes work around this by preprocessing your original CitableCorpus. For example, by creating a tokenized corpus that takes account of a specified orthography, and using this as the source for a TextAnalysis.Corpus, you can protect your corpus from naive assumptions about tokenization.
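One way to picture this workaround, as a rough sketch rather than this package's own method: tokenize each passage yourself and hand the resulting tokens to TextAnalysis as TokenDocuments, so that its default tokenizer is never applied. The function mytokenize below is a hypothetical placeholder (a plain whitespace split) standing in for an orthography-aware tokenizer, and the sample passages are illustrative only.

```julia
using TextAnalysis

# Hypothetical placeholder for an orthography-aware tokenizer.
# Here it only splits on whitespace; a real one would apply the
# rules of the corpus's orthography.
mytokenize(s::AbstractString) = String.(split(s))

# Illustrative passages in a non-English orthography.
passages = ["μῆνιν ἄειδε θεά", "Πηληϊάδεω Ἀχιλῆος"]

# TokenDocument stores the tokens exactly as supplied,
# so TextAnalysis does not re-tokenize the text.
tokdocs = [TokenDocument(mytokenize(p)) for p in passages]
tokcorpus = Corpus(tokdocs)

update_lexicon!(tokcorpus)
lex = lexicon(tokcorpus)   # counts are based on the tokens supplied above
```

Because the tokens are fixed before the Corpus is built, later metrics such as the lexicon reflect your tokenization rather than one designed for English.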