Comparing ordered lists
26 Mar 2016Scholarly discussions of texts often depend on readings or analyses that can be organized in lists (passages of the Iliad where Achilles speaks, chapters of Moby Dick that are heavy with cetological vocabulary,…) Since texts have an inherent document order, the lists of anlayses, too, can be be ordered, and the generic operation of comparing ordered lists can be applied to a wide range of problems, concerned with content (how does the chronologically ordered list of Sicilian colonies in Thucydides, book 6, compare to the list in Jerome’s Chronicles?), structure (what lines of the Iliad appear in this manuscript of vs. another manuscript?), or other aspects of the text.
Strings, too, can be thought of as ordered lists of characters, as was brought home to me recently in my current work on a system for analyzing Greek morphology. Consider the advantages of a generic library for comparing lists when working with the following two strings (with Greek represented in an ASCII-only transliteration).
One string is a surface form appearing in a Greek text, lusai
(λυσαι). A morphological analysis proposes the hypothesis that this is an aorist form pairing a stem lu_
with an ending ai
(where the underscore explicitly indicates that the u
character is phonetically long).
In this real-world example, it would be particularly nice to “merge” the surface form lusai
and the analytical string lu_ai
, in order to “project” the vowel quantity from the underlying lexicon onto the surface form. This short example may look trivial, but in its general form, this is in fact the problem of finding the Shortest Common Supersequence (SCS) of two ordered lists.
Fortunately, this generic problem is of great importance to many people outside of literary studies, perhaps most notably biologists studying gene sequences. As a result, SCS and the closely related problem of finding the Longest Common Subsequence (LCS) have become the sort of material you might be introduced to in a first-semester algorithms course.
Yesterday, I released a 1.0 version of a library for comparing ordered lists. I had already implemented the LCS; it was literally a matter of minutes to add the computation of SCS as well. As a result, a single line can compute the SCS of lu_ai
and lusai
to yield lu_sai
– exactly what I wanted.
Computing the LCS, the unique characters in each string, and the differences between the two strings produces identical results for either ordering of the two. That is, the results are identical whether you compare string 1 with string 2, or string 2 with string 1. Computing SCS, on the other hand, can result in ambiguous “ties” in the ordering when elements unique to each list appear at the same point in the list sequence.
Here, for example, both strings have unique characters at exactly the same point in the sequence (the _
character in string 1 and the s
in string 2). I give priority to the first string in case of ties, so if we compare lu_ai
with lusai
, we get the desired lu_sai
. If, on the other hand, we compare lusai
with lu_ai
, we get lus_ai
which is nonsensical in this context (since consonants cannot have a vowel quanitity marker). As always, you need to understand both your tools and your data. (Never let a plumber with a sawzall work in your house without supervision.)
I’m finding the SCS computation useful enough, that I’ve made a gist with a groovy script to find the SCS of two strings: https://gist.github.com/neelsmith/7fc5ae04ae58bcf3ac8a. Give the script two string arguments, and it prints the SCS to standard output, e.g.,
groovy scs.groovy 'Dr Rock-and-Roll Star' 'Doctor Rock Star'
will print out
Doctor Rock-and-Roll Star
I’m immediately imagining potential applications like projecting lexical information about vowel quantity onto Latin texts as an aid to automated identification of metrical patterns, but perhaps the most useful feature of the listutils
library is that it can work with lists of any type of object, so I’m certain that I have not yet recognized many other familiar problems that can be resolved to comparisons of various kinds of ordered lists.