Topic modeling
The lda_tm function creates a topic model for a citable corpus. It treats each citable passage as a document and uses the lda function of the TextAnalysis package to create the model. The only required parameters are a citable corpus and the number of topics to create. You can use the following optional parameters to tweak the settings of the underlying lda function:
alpha
The TextAnalysis package's α. From its documentation: "The hyperparameter for topic distribution per document. α<1 yields a sparse topic mixture for each document. α>1 yields a more uniform topic mixture for each document."
beta
The TextAnalysis package's β. From its documentation: "The hyperparameter for word distribution per topic. β<1 yields a sparse word mixture for each topic. β>1 yields a more uniform word mixture for each topic."
iters
Number of iterations.
stopwords
List of terms to omit from the model.
In addition, you can optionally supply a vector of string values to identify each document in the doclabels parameter. By default, this parameter is an empty Vector; in that case, lda_tm uses the string value of each passage's URN to identify the topic modeling document. Note that if you choose to supply a doclabels parameter, its values must be unique, and the length of doclabels must equal the number of citable passages in the source citable corpus.
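The two constraints on doclabels can be sketched as a small check. This is a minimal illustration only: check_doclabels is a hypothetical helper, not a function of CitableCorpusAnalysis, and the package may validate its input differently.

```julia
# Hypothetical helper (not part of CitableCorpusAnalysis): checks the two
# constraints on `doclabels` described above.
function check_doclabels(doclabels::Vector{<:AbstractString}, npassages::Integer)
    # An empty vector stands for the default, URN-based identifiers.
    isempty(doclabels) && return true
    length(doclabels) == npassages ||
        throw(ArgumentError("expected $npassages labels, got $(length(doclabels))"))
    allunique(doclabels) ||
        throw(ArgumentError("document labels must be unique"))
    return true
end

# Three made-up labels for a corpus of three passages.
labels = ["sotu-$(i)" for i in 1:3]
check_doclabels(labels, 3)
```

Running a check like this before calling lda_tm makes a mismatch between labels and passages visible early.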
The following example uses the famous corpus of State of the Union addresses from 1914 through 2009, included in a citable format in the test/data directory of this repository.
using CitableBase, CitableCorpus
corpusurl = "https://raw.githubusercontent.com/neelsmith/CitableCorpusAnalysis.jl/dev/test/data/sotu.cex"
corpus = fromcex(corpusurl, CitableTextCorpus, UrlReader)
Corpus with 9743 citable passages in 93 documents.
In the same test/data directory, there is a brief stop-word list for this corpus.
using Downloads
stopurl = "https://raw.githubusercontent.com/neelsmith/CitableCorpusAnalysis.jl/dev/test/data/stopwords-sotu.txt"
stopfile = Downloads.download(stopurl)
stopwords = readlines(stopfile)
rm(stopfile)
length(stopwords)
62
We can now create a model for the corpus. We'll iterate 50 times and create 20 topics.
using CitableCorpusAnalysis
tm = lda_tm(corpus, 20; stopwords = stopwords, iters = 50)
LDA topic model for 18633 terms in 9743 documents.
The resulting TopicModel object has four fields.
terms
List of terms in the model.
docids
Identifiers for documents in the model. These are either the values provided in the doclabels parameter or the string value of each passage's URN.
topic_terms
The topic-term matrix (ϕ of the TextAnalysis package's lda output).
topic_docs
The topic-document matrix (θ of the TextAnalysis package's lda output).
The following functions provide convenient access to several simple operations on these values.
Number of topics computed (traditionally k, equivalent to the second parameter provided to the lda_tm function):
k(tm)
20
A labelling string for a given topic number composed from the most common terms in the topic:
topiclabel(tm, 1)
"new_Administration_health_program"
Labelling strings for all topics in the model:
topiclabels(tm)
20-element Vector{Any}:
"new_Administration_health_program"
"I_Federal_local_States"
"000_year_million_years"
"world_We_nations_peace"
"men_forces_war_military"
"production_Government_business_prices"
"new_s_program_jobs"
"We_people_world_government"
"tax_budget_I_taxes"
"war_We_any_if"
"I_Congress_program_shall"
"Government_I_Congress_upon"
"I_s_people_know"
"America_We_years_new"
"children_s_We_care"
"world_people_We_peace"
"We_new_energy_defense"
",_._'_THE"
"United_war_States_Nations"
"I_President_Congress_Mr"
String label and score for the n highest-scoring terms in a given topic:
topterms(tm, 1; n = 5)
5-element Vector{Any}:
("new", 0.016611649451602598)
("Administration", 0.010116068576296454)
("health", 0.009051219252475774)
("program", 0.009051219252475774)
("medical", 0.008944734320093706)
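To make the relationship between the terms list and the topic-term matrix concrete, here is a minimal sketch on toy data. It assumes that rows of the matrix are topics and columns are terms; top_terms_sketch and label_sketch are hypothetical helpers written for this illustration, not the package's implementation of topterms or topiclabel.

```julia
# Toy data: 2 topics × 4 terms. Rows are topics, columns are terms
# (an assumed layout for this illustration).
terms = ["health", "war", "program", "peace"]
phi = [0.40 0.05 0.35 0.20;
       0.10 0.50 0.05 0.35]

# Hypothetical helper: (term, score) pairs for the n highest-scoring
# terms of one topic.
function top_terms_sketch(terms, phi, topic::Integer; n::Integer = 2)
    scores = phi[topic, :]
    order = sortperm(scores; rev = true)   # indices from highest to lowest score
    [(terms[i], scores[i]) for i in first(order, min(n, length(order)))]
end

# Hypothetical helper: join the top terms with underscores, in the style
# of the topic labels shown above.
label_sketch(terms, phi, topic; n = 2) =
    join(first.(top_terms_sketch(terms, phi, topic; n = n)), "_")

top_terms_sketch(terms, phi, 1)
label_sketch(terms, phi, 2)
```

Sorting one row of the matrix is enough to recover both the ranked terms and a readable label for a topic.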
Index of a given term in the terms field:
termindex(tm, "health")
Index of a given document in the docids field:
documentindex(tm, tm.docids[1])
Scores of each topic for a given document:
topicsfordoc(tm, 1)
20-element Vector{Float64}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
1.0
0.0
0.0
Label and score of the highest-scored topic for a given document:
topicfordoc(tm, 1)
(",_._'_THE", 1.0)
The function that returns the label and score of the n highest-scoring topics for a given document is currently broken:
#topdocs(tm, 1; n = 5)
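The per-document lookups in this section can be illustrated on a toy topic-document matrix. This sketch assumes that rows are topics and columns are documents; top_topic_sketch and top_topics_sketch are hypothetical helpers written for this illustration and are not a fix for, or a copy of, the package's own functions.

```julia
# Toy data: 3 topics × 2 documents. Rows are topics, columns are documents
# (an assumed layout for this illustration).
topic_names = ["health_program", "war_peace", "tax_budget"]
theta = [0.0 0.2;
         1.0 0.5;
         0.0 0.3]

# Hypothetical helper: label and score of the highest-scoring topic
# for one document.
function top_topic_sketch(topic_names, theta, doc::Integer)
    score, idx = findmax(theta[:, doc])
    (topic_names[idx], score)
end

# Hypothetical helper: labels and scores of the n highest-scoring topics
# for one document.
function top_topics_sketch(topic_names, theta, doc::Integer; n::Integer = 2)
    scores = theta[:, doc]
    order = sortperm(scores; rev = true)   # indices from highest to lowest score
    [(topic_names[i], scores[i]) for i in first(order, min(n, length(order)))]
end

top_topic_sketch(topic_names, theta, 1)
top_topics_sketch(topic_names, theta, 2)
```

Until the package function is repaired, a similar approach could be applied by hand to the topic_docs field together with the output of topiclabels, provided the matrix orientation assumed here is checked first.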