Using different models of a text corpus
The base model of a text corpus is the CitableTextCorpus: a citable, human-readable series of text passages.
using CitableBase, CitableCorpus
corpusurl = ""
corpus = fromcex(corpusurl, CitableTextCorpus, UrlReader)
Corpus with 20 citable passages in 5 documents.
You can convert this to the Corpus model of Julia's TextAnalysis.
using CitableCorpusAnalysis
ta_corp = tacorpus(corpus)
A Corpus with 20 documents:
* 20 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
The TextAnalysis module has a variety of functions for basic metrics on a text and, for corpora in English, includes more advanced tools for tokenizing, part-of-speech tagging, and working with neural models. See the documentation for TextAnalysis.Corpus.
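A Corpus that reports an empty lexicon, like the one above, has not had its lexicon built yet: TextAnalysis populates it on request. The following is a minimal sketch of that step, assuming the TextAnalysis package is installed. It uses two made-up documents rather than the corpus loaded above, and the variable names are illustrative only.

```julia
using TextAnalysis

# Two made-up documents standing in for the passages of a real corpus.
docs = [
    StringDocument("To be or not to be"),
    StringDocument("To become or not to become"),
]
crps = Corpus(docs)

# A newly created Corpus has an empty lexicon; build it explicitly.
update_lexicon!(crps)

# The lexicon maps each token to its count across the whole corpus.
lex = lexicon(crps)
println(length(lex), " distinct tokens in the lexicon")
```

The same call should apply to a corpus produced by tacorpus, since the output above shows it is made up of StringDocuments.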
If the assumptions of TextAnalysis (oriented towards English) are not appropriate for your corpus, you can sometimes work around this by preprocessing your original CitableTextCorpus. For example, by creating a tokenized corpus that takes account of a specified orthography, and using this as the source for a TextAnalysis.Corpus, you can protect your corpus from naive assumptions about tokenization.
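To make the idea concrete, here is a rough sketch in base Julia only. It does not use the CitableCorpus or Orthography APIs: the alphabet, the function name orthotokens, and the sample passage are all hypothetical. It only shows how tokenizing against an explicitly defined character set differs from naive splitting on whitespace.

```julia
# Hypothetical orthography: the characters allowed inside a token.
# The apostrophe is included so that contracted forms stay in one token.
const ALPHABET = union(Set('a':'z'), Set('A':'Z'), Set(['\'']))

# Split a passage into tokens, where a token is a maximal run of
# characters from `alphabet`. Any other character (whitespace,
# punctuation) acts as a separator and is discarded.
function orthotokens(text::AbstractString, alphabet = ALPHABET)
    tokens = String[]
    buffer = Char[]
    for ch in text
        if ch in alphabet
            push!(buffer, ch)
        elseif !isempty(buffer)
            push!(tokens, join(buffer))
            empty!(buffer)
        end
    end
    # Flush a token that runs to the end of the passage.
    isempty(buffer) || push!(tokens, join(buffer))
    return tokens
end

passage = "Don't panic: it's fine."

# Whitespace splitting leaves punctuation attached to words.
naive = split(passage)

# Orthography-aware splitting keeps only characters in the alphabet.
aware = orthotokens(passage)

println(naive)
println(aware)
```

Tokens produced this way could then serve as the source for a TextAnalysis corpus, for instance through its TokenDocument type, so that TextAnalysis does not need to apply its own tokenizer.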