Neel Smith on github Openly available work in digital classics

The data/acceptor/analyzer pattern for parsing morphology

Finite state transducers (FSTs) are the core of a system for parsing Latin morphology. This post illustrates my fuller discussion of “Morphological analysis of historical languages” (forthcoming in BICS, December 2016) with a small but complete example of an important pattern in designing a morphological system, which I call the “data/acceptor/analyzer” pattern. The example parses regular second-declension Latin nouns in -us or -um using the Stuttgart FST toolkit, SFST. The syntax and semantics of the SFST notation, SFST-PL, are fully documented in the documentation for the SFST toolkit, which includes a user manual, a tutorial, and example parsers for German and English.

The “data/acceptor/analyzer” pattern is implemented in a chain of three transducers. The first transducer assembles a data set by combining a lexicon of stems with a set of inflectional rules. This is the input to a second transducer, the “acceptor,” which filters the data for appropriate combinations of stem and rule. It keeps a combination only when the rule has the same inflectional class and gender as the stem. The third transducer takes these valid records and strips out the analytical information so that only a valid Latin form remains. The resulting bi-directional transducer can then produce the full analysis with morphological data when given a form, or generate the form when given the morphological data.
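To make the three stages concrete, here is a minimal sketch in Python rather than SFST-PL. It is only a simulation under stated assumptions: the stem and rule strings borrow the tag notation used in this post, the `<decl1>` rule is a hypothetical entry added so the acceptor has something to reject, and a regular expression with backreferences stands in for the agreement check. It does not reproduce how SFST works internally.

```python
import re
from itertools import product

# Hypothetical miniature data in the tag notation used in this post.
STEMS = ["don<neut><decl2>"]
RULES = [
    "<decl2>um<nom><sg><neut>",
    "<decl2>us<nom><sg><masc>",   # wrong gender for "don"
    "<decl2>i<gen><sg><neut>",
    "<decl1>a<nom><sg><fem>",     # wrong class and gender (illustrative only)
]

def morph_data(stems, rules):
    """Stage 1 (data): pair every stem with every rule, joined by ':'."""
    return [f"{stem}:{rule}" for stem, rule in product(stems, rules)]

# Stage 2 (acceptor): group 1 captures the stem's gender and group 2 its
# noun class; the backreferences require the rule to repeat both.
ACCEPTOR = re.compile(
    r"[a-z]+(<(?:masc|fem|neut)>)(<decl[1-5]>)"   # stem, gender, class
    r":\2[a-z]*<[a-z]+><(?:sg|pl)>\1"             # class, ending, case, number, gender
)

def accept(records):
    """Keep only records whose stem and rule agree in class and gender."""
    return [rec for rec in records if ACCEPTOR.fullmatch(rec)]

def surface(record):
    """Stage 3 (analyzer): delete tags and the separator, leaving the form."""
    return re.sub(r"<[^>]*>|:", "", record)

valid = accept(morph_data(STEMS, RULES))

# Map each surface form to the full records that produce it.
analyses = {}
for rec in valid:
    analyses.setdefault(surface(rec), []).append(rec)

print([surface(rec) for rec in valid])  # ['donum', 'doni']
```

Of the four stem-plus-rule pairs, only the two whose class and gender match the stem survive the acceptor, and stripping the tags leaves the forms donum and doni.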

The parser is organized in three files, written in SFST-PL. The first, lexicon.fst, contains a single entry for the neuter noun donum, doni. The second, inflection.fst, specifies rules for masculine, feminine or neuter nouns of the second declension with nominative singular forms in -us or -um. The third, demo.fst, contains the parsing system.

To run the parser with the SFST tools, first compile inflection.fst (fst-compiler inflection.fst inflection.a), then compile demo.fst (fst-compiler demo.fst demo.a). The order matters because demo.fst reads the compiled inflection.a. You can use the resulting binary demo.a with any of the SFST utilities, e.g., fst-mor demo.a to submit forms for morphological analysis.

1. Complete text of the lexicon (lexicon.fst)

don<neut><decl2>

2. The rule set (inflection.fst)

% For nouns in -us or -um

$decl2_endings$ = <decl2>( \
  us<nom><sg>[<masc><fem>] |\
  um<nom><sg><neut> |\
  i<gen><sg>[<masc><fem><neut>] |\
  o<dat><sg>[<masc><fem><neut>] |\
  um<acc><sg>[<masc><fem><neut>] |\
  e<voc><sg>[<masc><fem>] |\
  um<voc><sg><neut> |\
  o<abl><sg>[<masc><fem><neut>] |\
  i<nom><pl>[<masc><fem>] |\
  a<nom><pl><neut> |\
  orum<gen><pl>[<masc><fem><neut>] |\
  is<dat><pl>[<masc><fem><neut>] |\
  os<acc><pl>[<masc><fem>] |\
  a<acc><pl><neut> |\
  is<abl><pl>[<masc><fem><neut>] |\
  i<voc><pl>[<masc><fem>] |\
  a<voc><pl><neut> )

$inflection$ = $decl2_endings$

$inflection$
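Each line of the rule set is compact: a bracketed class such as [<masc><fem>] stands for one rule per listed gender, so the seventeen lines above encode 36 rules in all (six cases, two numbers, three genders). The following Python sketch is an illustration only, not part of the SFST parser. It expands three of the lines into fully specified rule strings; the helper name `expand` and the list `COMPACT_RULES` are hypothetical.

```python
import re

# Three rules copied from inflection.fst, without the "|\" continuations.
COMPACT_RULES = [
    "us<nom><sg>[<masc><fem>]",
    "um<nom><sg><neut>",
    "i<gen><sg>[<masc><fem><neut>]",
]

TAG = re.compile(r"<[a-z0-9]+>")

def expand(rule, nounclass="<decl2>"):
    """Return one fully specified rule string per gender alternative."""
    head, bracket, tail = rule.partition("[")
    if not bracket:
        # No bracketed class: the line already names a single gender.
        return [nounclass + rule]
    # The tags inside [...] are alternatives; emit one rule for each.
    return [nounclass + head + gender for gender in TAG.findall(tail)]

expanded = [full for rule in COMPACT_RULES for full in expand(rule)]

for full in expanded:
    print(full)
```

The three compact lines yield six rule strings, each beginning with the class tag <decl2> and ending with exactly one gender tag, which is the shape the acceptor in demo.fst expects on the rule side of a record.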

3. The parser (demo.fst)

% Analytical symbols:
#gender# = <masc><fem><neut>
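% <voc> is included because inflection.fst defines vocative rules;
% without it the acceptor would silently reject them.
#case# = <nom><gen><dat><acc><voc><abl>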
#num# = <sg><pl>
#nounclass# = <decl1><decl2><decl3><decl4><decl5>


$separator$ = [\:]

% First transducer (data phase).
% Assemble morphological data set by reading
% lexicon and inflection rules from external files
% and generating their cross product.
$stems$ = "lexicon.fst"
$inflection$ = "<inflection.a>"
$morphdata$ = $stems$ $separator$ $inflection$


% Acceptor pattern for Latin nouns.
% Noun class and gender must match in stem and inflectional pattern.
$=nounclass$ = [#nounclass#]
$=gender$ = [#gender#]
$nounacceptor$ = [a-z]+$=gender$$=nounclass$ $separator$ $=nounclass$[a-z]*[#case#][#num#]$=gender$


% Parser pattern. Map analytical symbols to null value.
#analysissymbol# = #gender# #case# #num# #nounclass#
ALPHABET = [a-z] [#analysissymbol#]:<> [\:]:<>

$analyzer$ = .*

% Final cascade.
$morphdata$ || $nounacceptor$ || $analyzer$
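Because the final cascade relates full analysis strings to surface forms, it can be read in either direction. The Python sketch below imitates that with two dictionaries built from the neuter endings of the rule set (the vocative is omitted for brevity). It is a hand-built illustration under assumptions, not the output of fst-mor: the names `generate`, `analyze` and `analysis_string` are hypothetical, and the record layout simply follows the stem, separator and rule order used in demo.fst.

```python
from collections import defaultdict

STEM = "don<neut><decl2>"   # the single lexicon entry
STEM_TEXT = "don"           # its surface letters

# Neuter endings from the rule set, keyed by case and number tags.
NEUTER_ENDINGS = {
    "<nom><sg>": "um", "<gen><sg>": "i", "<dat><sg>": "o",
    "<acc><sg>": "um", "<abl><sg>": "o",
    "<nom><pl>": "a", "<gen><pl>": "orum", "<dat><pl>": "is",
    "<acc><pl>": "a", "<abl><pl>": "is",
}

def analysis_string(ending, tags):
    """Build a full record: stem, separator, class, ending, tags, gender."""
    return f"{STEM}:<decl2>{ending}{tags}<neut>"

# Generation direction: full analysis -> surface form.
generate = {
    analysis_string(ending, tags): STEM_TEXT + ending
    for tags, ending in NEUTER_ENDINGS.items()
}

# Analysis direction: surface form -> every analysis that yields it.
analyze = defaultdict(list)
for analysis, form in generate.items():
    analyze[form].append(analysis)

print(generate["don<neut><decl2>:<decl2>orum<gen><pl><neut>"])  # donorum
for analysis in analyze["donum"]:
    print(analysis)
```

The generation direction is one-to-one, while the analysis direction is one-to-many: donum comes back with both its nominative and accusative singular analyses, which is the kind of ambiguity a morphological parser is expected to report rather than resolve.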