Users' guide: using a CitableParser

Any implementation of a CitableParser works in basically the same way. The parsing functions all have a common pair of signatures:

  • function(textcontent, parser)
  • function(content, parser, parserdata)

The sample parser used here requires the data parameter, which the examples below supply as the keyword argument data. Check the documentation for your specific parser to see how it works.

Here we instantiate the sample parser, and verify that it is indeed a subtype of CitableParser.

using CitableParserBuilder
parser = CitableParserBuilder.gettysburgParser()
typeof(parser) |> supertype
CitableParser

Parsing string values

When we parse a string token, the result is a Vector of Analysis objects. For the token "score", our parser produces exactly one analysis.

scoreparses = parsetoken("score", parser; data = parser.data)
length(scoreparses)
1
typeof(scoreparses[1])
Analysis

The Analysis object pairs the token with a URN value, in abbreviated format, for each of the four properties of an analysis.

scoreparses[1].token
"score"
scoreparses[1].form
gburgform.NN

NN is the Penn Treebank tag for "noun, singular or mass".

We can also parse a list of words. Here, parsing four words produces a Vector containing four Vectors of Analysis objects.

wordsparsed = parselist(split("Four score and seven"), parser; data = parser.data)
length(wordsparsed)
4
Tip

You can use an OrthographicSystem to generate a list of unique lexical tokens for an entire citable corpus. See the documentation for the Orthography module.
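
The nested shape that parselist returns (one inner Vector per word) can be walked with ordinary Julia iteration. The sketch below is only an illustration of that shape: it uses plain strings as stand-ins for Analysis objects, and the names standin_results, counts and unparsed are invented here. The placeholder values are not actual output of the sample parser.

```julia
# Stand-in data: each inner vector plays the role of the analyses for one word.
# Plain strings replace real Analysis objects; the values are placeholders only.
words = split("Four score and seven")
standin_results = [["CD"], ["NN"], ["CC"], ["CD"]]

# Pair each word with the number of analyses found for it.
counts = [(String(w), length(analyses)) for (w, analyses) in zip(words, standin_results)]

# Collect words that received no analysis at all.
unparsed = [w for (w, n) in counts if n == 0]
```

With real parser output, the same loop would let you spot tokens that are unanalyzed (an empty inner Vector) or ambiguous (more than one analysis).
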

Parsing citable text content

You can also parse citable text structures: passages, documents and corpora. Here we illustrate parsing a citable passage.

using CitableText, CitableCorpus
urn = CtsUrn("urn:cts:demo:gburg.hays.v2:1.2")
psg = CitablePassage(urn, "score")
psg_analysis = parsepassage(psg, parser; data = parser.data)
typeof(psg_analysis)
AnalyzedToken

The result is a new kind of object, the AnalyzedToken, which associates a Vector of Analysis objects with a citable passage.

psg_analysis.passage
psg_analysis.analyses == scoreparses
true

Exporting to CEX format

When we export analyses to CEX format, we want to use full CITE2 URNs rather than the abbreviated URNs of the Analysis structure. To expand them, you need a dictionary mapping each collection name to the full CITE2 URN value for that collection.

registry = Dict(
    "gburglex" => "urn:cite2:citedemo:gburglex.v1:",
    "gburgform" => "urn:cite2:citedemo:gburgform.v1:",
    "gburgrule" => "urn:cite2:citedemo:gburgrule.v1:",
    "gburgstem" => "urn:cite2:citedemo:gburgstem.v1:"
)
length(registry)
4
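
To make the role of the registry concrete, here is a minimal sketch of how an abbreviated URN such as gburgform.NN could be expanded with a dictionary like the one above. The helper expand_abbr is hypothetical and written only for illustration. It is not part of the package's API, and the package's own expansion may differ in its details.

```julia
# A one-entry registry, copied from the example above.
registry = Dict(
    "gburgform" => "urn:cite2:citedemo:gburgform.v1:",
)

# Hypothetical helper (an assumption, not the package's API):
# split "collection.objectid" and prepend the registered URN prefix.
function expand_abbr(abbr::AbstractString, registry::AbstractDict)
    parts = split(abbr, "."; limit = 2)
    length(parts) == 2 || throw(ArgumentError("expected collection.objectid, got $abbr"))
    collection, objectid = String(parts[1]), String(parts[2])
    haskey(registry, collection) || throw(ArgumentError("no registry entry for $collection"))
    return registry[collection] * objectid
end

full = expand_abbr("gburgform.NN", registry)
```

Under these assumptions, the collection name before the period selects the registry entry and the object identifier after it is appended to the registered prefix.
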

Use the cex function (from CitableBase) to format your analyses as delimited text. To expand abbreviated URNs to full CTS and CITE2 URNs while formatting as delimited text, use the delimited function instead. You can then use normal Julia IO to write the results to a file.

cex_output = delimited(psg_analysis, registry = registry)
open("outfile.cex", "w") do io
    write(io, cex_output)
end
211