Specification for listutils, version 1.1.0 >

Specification for listutils, version 1.1.0

Overview

The ListDiff classes compares two ordered lists of objects of any type, and computes five ordered lists:

  1. the longest common subsequence (LCS) of the two lists (an ordered intersection of l1 and l2)
  2. the shortest common supersequence (SCS) of the two lists (an ordered union of l1 and l2)
  3. the ordered list of items contained only in l1
  4. the ordered list of items contained only in l2
  5. differences between l1 and l2

Comparing lists

Construct a ListDiff object with two lists, l1 and l2.

Computation of LCS, unique elements in each list, and differences between the two lists produces identical results for either ordering of the two lists. That is, the results are identical whether you compare l1 with l2, or l2 with l1.

Computation of SCS, on the other hand, can result in ambiguous "ties" in the ordering when elements unique to each list appear at the same point in the list sequence. In these cases, listutils gives priority to l1. (See example 2 below.)

Example 1: comparing lists of integers

Consider the following two lists. List 1 has seven elements:

1,2,3,4,5,6,7

and list 2 has six:

0,1,2,3,6,7

The longest common subsequence of the two integer lists is:

Item
1
2
3
6
7

The shortest common supersequence of the two integer lists is:

Item
0
1
2
3
4
5
6
7

The elements unique to list 1 are

Item
4
5

and the elements unique to list 2 are

Item
0

Example 2: comparing lists of characters

A String can be considered a list of characters. The following real example comes from work on a system for parsing ancient Greek morphology. The first string is part of a morphological analysis that pairs a stem lu_ with an ending ai:

lu_-ai

(This is an ASCII representation of the Greek characters λυ and αι, with an explicit marker that u character is phonetically long.)

The second string is an actual surface form (λυσαι). In addition to the stem and ending, it shows a modification to the stem (the s character):

lusai

If we split these into lists of characters, finding the LCS and unique elements is straightforward. The longest commons subsequence is

Item
l
u
a
i

the characters unique to string 1 are

Item
_
-

and the only character unique to string 2 is

Item
s

SCS: order of comparison matters

In this real-world example, what would be particularly valuable is to find the the shortest common supersequence. We would like, in effect, to merge the two strings, by projecting information from the analysis about vowel quantity onto the surface form.

But note that the two strings have unique characters at exactly the same point in the sequence (the "_" character in string 1 and the "s" in string 2). Consequently, if we ask for a comparison of string 1 with string 2, we obtain the following results

Item
l
u
_
-
s
a
i

which is exactly what we want.

If, on the other hand, we reverse the order, and compare string 2 with string 1, we get the following SCS:

Item
l
u
s
_
-
a
i

The "_" is no longer adjacent to the "u" to mark it as long, and the results are nonsensical in this context (since consonants like "s" do not have quantity values of long or short).

All this really means is that it is important to understand the data you are comparing when you ask for a shortest common supersequence.

Comparing strings

While it is always possible to submit lists of characters to ListDiff by splitting strings into lists of characters, ListDiff also directly supports a second form of constructor that compares two Unicode strings. The strings are automatically split into ordered lists by Unicode code point, and then compared as normal. The resultings lists (LCS, SCS, unique elements to each list, and differences between lists) are lists of strings where each string is composed of a single Unicode code point.

It is the responsibility of the calling program to perform any desired Unicode normalization before creating the ListDiff object.