Topic modeling

The lda_tm function creates a topic model for a citable corpus. It treats each citable passage as a document and uses the lda function of the TextAnalysis package to build the model. The only required parameters are a citable corpus and the number of topics to create. You can use the following optional parameters to tweak the settings of the underlying lda function (a minimal call is sketched after the list):

  • alpha is the TextAnalysis package's α. From its documentation: "The hyperparameter for topic distribution per document. α<1 yields a sparse topic mixture for each document. α>1 yields a more uniform topic mixture for each document."
  • beta is the TextAnalysis package's β. "The hyperparameter for word distribution per topic. β<1 yields a sparse word mixture for each topic. β>1 yields a more uniform word mixture for each topic."
  • iters Number of iterations.
  • stopwords List of terms to omit from the model.
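
A minimal sketch of these options follows. The values are purely illustrative, corpus stands for any citable corpus like the one loaded in the worked example below, and alpha and beta are assumed to be keyword arguments like iters and stopwords.

using CitableCorpusAnalysis
# Required arguments only: a citable corpus and the number of topics.
tm = lda_tm(corpus, 20)
# Optional keyword settings passed through to the underlying lda function.
tm = lda_tm(corpus, 20; alpha = 0.1, beta = 0.1, iters = 100, stopwords = ["the", "of"])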

In addition, you can optionally supply a vector of string values to identify each document in the doclabels parameter. By default, this parameter is an empty Vector; in that case, lda_tm uses the string value of each passage's URN to identify each document in the model. Note that if you choose to supply a doclabels parameter, its values must be unique and its length must equal the number of citable passages in the source corpus.
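
For example, you could construct one label per passage yourself. The following is a minimal sketch, assuming corpus is a CitableTextCorpus whose passages field holds the citable passages, and assuming doclabels is accepted as a keyword argument:

# Hypothetical labels: one unique string per citable passage.
doclabels = ["passage_" * string(i) for i in 1:length(corpus.passages)]
tm = lda_tm(corpus, 20; doclabels = doclabels)

Because there is one label per passage and each label is unique, this satisfies the two requirements noted above.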

The following example uses the famous corpus of State of the Union addresses from 1914 through 2009, included in a citable format in the test/data directory of this repository.

using CitableBase, CitableCorpus
corpusurl = "https://raw.githubusercontent.com/neelsmith/CitableCorpusAnalysis.jl/dev/test/data/sotu.cex"
corpus = fromcex(corpusurl, CitableTextCorpus, UrlReader)
Corpus with 9743 citable passages in 93 documents.

In the same test/data directory, there is a brief stop-word list for this corpus.

using Downloads
stopurl = "https://raw.githubusercontent.com/neelsmith/CitableCorpusAnalysis.jl/dev/test/data/stopwords-sotu.txt"
stopfile = Downloads.download(stopurl)
stopwords = readlines(stopfile)
rm(stopfile)
length(stopwords)
62

We can now create a model for the corpus. We'll iterate 50 times and create 20 topics.

using CitableCorpusAnalysis
tm = lda_tm(corpus, 20; stopwords = stopwords, iters = 50)
LDA topic model for 18633 terms in 9743 documents.

The resulting TopicModel object has four fields.

  • terms List of terms in the model.
  • docids Identifiers for documents in the model. These will be either the values provided in the doclabels parameter or the string value of each passage's URN.
  • topic_terms The topic-term matrix (ϕ in the output of the TextAnalysis package's lda function).
  • topic_docs The topic-document matrix (θ in the output of the TextAnalysis package's lda function).
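
You can read the model's dimensions directly from these fields. A quick sketch, assuming the fields are accessed as ordinary properties of the TopicModel object:

length(tm.terms)      # number of distinct terms in the model
length(tm.docids)     # number of documents (citable passages)
size(tm.topic_terms)  # dimensions of the topic-term matrix (ϕ)
size(tm.topic_docs)   # dimensions of the topic-document matrix (θ)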

The following functions provide convenient access to several simple operations on these values.

Number of topics computed (traditionally k; this equals the second parameter provided to the lda_tm function):

k(tm)
20

A labelling string for a given topic number, composed from the most common terms in the topic:

topiclabel(tm, 1)
"new_Administration_health_program"

A labelling string for all topics in the model:

topiclabels(tm)
20-element Vector{Any}:
 "new_Administration_health_program"
 "I_Federal_local_States"
 "000_year_million_years"
 "world_We_nations_peace"
 "men_forces_war_military"
 "production_Government_business_prices"
 "new_s_program_jobs"
 "We_people_world_government"
 "tax_budget_I_taxes"
 "war_We_any_if"
 "I_Congress_program_shall"
 "Government_I_Congress_upon"
 "I_s_people_know"
 "America_We_years_new"
 "children_s_We_care"
 "world_people_We_peace"
 "We_new_energy_defense"
 ",_._'_THE"
 "United_war_States_Nations"
 "I_President_Congress_Mr"

String labels and scores for the top n scoring terms in a given topic:

topterms(tm, 1; n = 5)
5-element Vector{Any}:
 ("new", 0.016611649451602598)
 ("Administration", 0.010116068576296454)
 ("health", 0.009051219252475774)
 ("program", 0.009051219252475774)
 ("medical", 0.008944734320093706)

Index of a given term in the terms field:

termindex(tm, "et")
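
That index can in turn be used to look up a term's weight in the topic_terms matrix. A sketch, assuming each row of topic_terms corresponds to one topic and each column to one term:

# Hypothetical: weight of the term "health" in topic 1.
tm.topic_terms[1, termindex(tm, "health")]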

Index of a given document in the docids field:

documentindex(tm, "urn:cts:latinLit:stoa1263.stoa001.hc:t.1")
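
The resulting index can be passed to the per-document functions described next. A sketch using one of the model's own document identifiers:

# Hypothetical round trip: find a document's index, then score its topics.
i = documentindex(tm, tm.docids[1])
topicsfordoc(tm, i)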

Scores of each topic for a given document:

topicsfordoc(tm, 1)
20-element Vector{Float64}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 1.0
 0.0
 0.0

Label and score of the highest-scored topic for a given document:

topicfordoc(tm, 1)
(",_._'_THE", 1.0)

The function to find labels and scores for the n highest-scoring topics for a given document is currently broken!

#topdocs(tm, 1; n = 5)