Identification with URNs

Key points

all identification is by URN value
the CitableParserBuilder package defines four subtypes of the abstract AbbreviatedUrn for the four URNs comprising an analysis in Tabulae:
- the FormUrn
- the LexemeUrn
- the StemUrn
- the RuleUrn
the FormUrn is generated by the parsing system
you use the other three types of URN to identify content in a Tabulae dataset
you record each collection in a URN registry that supports round-trip conversion of AbbreviatedUrns and Cite2Urns.

The URN registry

the registry is organized in three subdirectories of the dataset's urnregistry directory:
- lexemes
- rules
- stems
each subdirectory holds delimited text files with the same three columns: collection ID, collection URN, label
you can use as many files as you like, as long as their names end in .cex; empty lines are OK

Example:

CollectionId|CollectionUrn|Label
ls|urn:cite2:shot:ls.v1:|Latin lexical entities appearing in Lewis-Short's Latin Dictionary.
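
To make the file format concrete, here is a minimal sketch in plain Julia of how lines in this format could be reduced to a mapping from collection ID to collection URN. It is not Tabulae's implementation: the function name sketch_registry is hypothetical, and the pipe delimiter and the header row named CollectionId are assumptions taken from the example above.

```julia
# Hypothetical helper, not part of Tabulae: map collection IDs to
# collection URNs from the lines of a registry file.
function sketch_registry(lines)
    dict = Dict{String, String}()
    for line in lines
        stripped = strip(line)
        # Empty lines are allowed in registry files, so skip them.
        isempty(stripped) && continue
        cols = split(stripped, "|")
        # Skip the header row and any line that lacks three columns.
        (length(cols) == 3 && cols[1] != "CollectionId") || continue
        dict[String(cols[1])] = String(cols[2])
    end
    dict
end

sample = [
    "CollectionId|CollectionUrn|Label",
    "",
    "ls|urn:cite2:shot:ls.v1:|Latin lexical entities appearing in Lewis-Short's Latin Dictionary.",
]
sketch_registry(sample)
```

The label column is ignored here because only the ID and the URN are needed to convert between abbreviated and full URNs.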

You can get a dictionary of collection IDs to full URNs for your dataset with the registry function.

using Tabulae
# `repo` is assumed to hold the path to a local copy of the Tabulae repository
srcdata = joinpath(repo, "datasets", "core-infl-shared")
tabds = Tabulae.Dataset([srcdata])

abbrdict = registry(tabds)

Dict{String, String} with 3 entries:
  "nounstems" => "urn:cite2:tabulae:nounstems.v1:"
  "ls"        => "urn:cite2:shot:ls.v1:"
  "nouninfl"  => "urn:cite2:tabulae:nouninfl.v1:"

Working with URNs

abbreviate returns an abbreviated string value for a Cite2Urn:

using CitableParserBuilder
using CitableObject
longurn = Cite2Urn("urn:cite2:shot:ls.v1:n14736")
shortform = abbreviate(longurn)

"ls.n14736"

You can use that string to create the appropriate type of abbreviated URN:

lex = LexemeUrn(shortform)

ls.n14736

To convert an AbbreviatedUrn to a Cite2Urn, you need to supply the dataset's URN registry.

expanded = expand(lex, abbrdict)

urn:cite2:shot:ls.v1:n14736
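
The round trip shown above can be pictured as plain string manipulation. The following is only a sketch under stated assumptions, not the code in CitableParserBuilder: the names sketch_abbreviate and sketch_expand are hypothetical, URNs are handled as bare strings rather than as Cite2Urn or AbbreviatedUrn values, and the registry is assumed to be a Dict like the one returned by the registry function.

```julia
# Hypothetical helpers for illustration only; the package's own functions
# are `abbreviate` and `expand`.

# Reduce a full CITE2 URN string to "collection.objectid".
function sketch_abbreviate(urn::AbstractString)
    # Components: "urn", "cite2", namespace, "collection.version", object ID
    parts = split(urn, ":")
    collection = first(split(parts[4], "."))
    string(collection, ".", parts[5])
end

# Look up the collection ID in the registry and append the object ID.
function sketch_expand(abbr::AbstractString, reg::Dict{String, String})
    collection, objectid = split(abbr, "."; limit = 2)
    string(reg[String(collection)], objectid)
end

reg = Dict("ls" => "urn:cite2:shot:ls.v1:")
short = sketch_abbreviate("urn:cite2:shot:ls.v1:n14736")  # "ls.n14736"
full = sketch_expand(short, reg)  # "urn:cite2:shot:ls.v1:n14736"
```

The sketch shows why expansion needs the registry while abbreviation does not: the abbreviated form drops the namespace and version, and only the registry entry for the collection ID can restore them.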