Identification with URNs

Key points

  • all identification is by URN value
  • the CitableParserBuilder package defines four subtypes of the abstract type AbbreviatedUrn, one for each of the four URNs that make up an analysis in Tabulae:
    • the FormUrn
    • the LexemeUrn
    • the StemUrn
    • the RuleUrn
  • the FormUrn is generated by the parsing system
  • you use the other three types of URN to identify content in a Tabulae dataset
  • you record each collection in a URN registry that supports round-trip conversion between AbbreviatedUrns and Cite2Urns

The URN registry

  • organized in three subdirectories of the dataset's urnregistry directory:
    • lexemes
    • rules
    • stems
  • each subdirectory holds delimited text files in the same three-column format: collection ID, collection URN, label
  • use as many files as you like, with names ending in .cex; empty lines are OK

Example:

CollectionId|CollectionUrn|Label
ls|urn:cite2:shot:ls.v1:|Latin lexical entities appearing in Lewis-Short's Latin Dictionary.
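
To make the file format concrete, here is a minimal sketch in plain Julia of how lines like these could be turned into a dictionary of collection IDs to collection URNs. The function name parse_registry and the rule of recognizing the header row by its first column are assumptions made for illustration; this is not Tabulae's own reader.

```julia
# Hypothetical helper, not part of Tabulae: parse registry lines into a
# dictionary mapping collection IDs to collection URNs.
function parse_registry(lines; delimiter = "|")
    registry = Dict{String, String}()
    for line in lines
        isempty(strip(line)) && continue        # empty lines are allowed
        cols = split(line, delimiter)
        length(cols) == 3 || error("Expected 3 columns in: $(line)")
        id, urn, _label = strip.(cols)
        id == "CollectionId" && continue        # assumption: skip the header row by name
        registry[String(id)] = String(urn)
    end
    registry
end

sample = """
CollectionId|CollectionUrn|Label
ls|urn:cite2:shot:ls.v1:|Latin lexical entities appearing in Lewis-Short's Latin Dictionary.

"""
parsed = parse_registry(split(sample, "\n"))
```

The label column is read but not stored, since only the ID and the URN are needed to convert between abbreviated and full URNs.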

You can get a dictionary of collection IDs to full URNs for your dataset with the registry function.

using Tabulae
# `repo` is assumed to hold the path to a directory containing datasets/core-infl-shared
srcdata = joinpath(repo, "datasets", "core-infl-shared")
tabds = Tabulae.Dataset([srcdata])

abbrdict = registry(tabds)
Dict{String, String} with 3 entries:
  "nounstems" => "urn:cite2:tabulae:nounstems.v1:"
  "ls"        => "urn:cite2:shot:ls.v1:"
  "nouninfl"  => "urn:cite2:tabulae:nouninfl.v1:"

Working with URNs

abbreviate returns an abbreviated string value for a Cite2Urn:

using CitableParserBuilder
using CitableObject
longurn = Cite2Urn("urn:cite2:shot:ls.v1:n14736")
shortform = abbreviate(longurn)
"ls.n14736"

You can use that string to create the appropriate type of abbreviated URN:

lex = LexemeUrn(shortform)
ls.n14736

To convert an AbbreviatedUrn to a Cite2Urn, you need to supply the dataset's URN registry.

expanded = expand(lex, abbrdict)
urn:cite2:shot:ls.v1:n14736
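
The round trip above can be illustrated at the level of plain strings. The following is a sketch only: abbreviate_sketch and expand_sketch are hypothetical names, and the string handling is an assumption about what the conversion amounts to, not the CitableParserBuilder implementation.

```julia
# Hypothetical sketches, not the CitableParserBuilder implementation.

# Reduce a full Cite2Urn string to "collection.object".
function abbreviate_sketch(urn::AbstractString)
    # Expected parts: "urn", "cite2", namespace, collection.version, object
    parts = split(urn, ":")
    length(parts) == 5 || error("Not a five-part Cite2Urn string: $(urn)")
    collection = first(split(parts[4], "."))   # drop the version
    string(collection, ".", parts[5])
end

# Rebuild the full URN string by looking up the collection ID in a registry.
function expand_sketch(abbr::AbstractString, registry::Dict{String, String})
    pieces = String.(split(abbr, "."; limit = 2))
    length(pieces) == 2 || error("Expected collection.object, got: $(abbr)")
    collection, object = pieces
    haskey(registry, collection) || error("Unregistered collection: $(collection)")
    string(registry[collection], object)
end

sketchregistry = Dict("ls" => "urn:cite2:shot:ls.v1:")
short = abbreviate_sketch("urn:cite2:shot:ls.v1:n14736")
long = expand_sketch(short, sketchregistry)
```

The sketch shows why expansion needs the registry: the abbreviated form keeps only the collection ID and the object ID, so the namespace and version have to be recovered from the collection's registered URN.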