Neel Smith on github Openly available work in digital classics

Comparing Unicode strings with listutils 1.1.0

Yesterday, I pushed two gists that use the listutils package to compare strings. Since a Unicode string can by definition be considered a sequence of Unicode code points, I’ve added direct support for comparing Unicode strings with version 1.1.0 of listutils. Instead of supplying two lists of objects, give the ListDiff class two Unicode strings. It creates a list of Unicode code points from each String. The resulting comparisons (including SCS and LCS) are lists of Strings, where each String is one code point long.
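
To make "a list of Strings, where each String is one code point long" concrete, here is a minimal sketch in plain Groovy. It is not taken from listutils: `codePointList` is a hypothetical helper name, and the library's own conversion may well be written differently. The point is only that a character outside the Basic Multilingual Plane, which Java and Groovy store as two `char` values, ends up as a single list element.

```groovy
// Hypothetical helper (not part of listutils): split a String into a list of
// Strings, one per Unicode code point.
List<String> codePointList(String s) {
    List<String> result = []
    int i = 0
    while (i < s.length()) {
        int cp = s.codePointAt(i)                   // full code point starting at index i
        result << new String(Character.toChars(cp)) // one or two chars, one code point
        i += Character.charCount(cp)                // advance 2 for a surrogate pair, else 1
    }
    return result
}

// U+1D538 is written here as the surrogate pair \uD835\uDD38.
String sample = "a\uD835\uDD38b"
println sample.length()               // counts UTF-16 chars
println codePointList(sample).size()  // counts code points
```

Comparing lists built this way means a surrogate pair is never split in the middle of a diff.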

I’ve updated the gists to use this new version of listutils, and the result is pleasantly concise.

For lcs.groovy, the complete computation is

ListDiff ldiff = new ListDiff(args[0],args[1])
println ldiff.lcs.join("")

For scs.groovy, the parallel lines are

ListDiff ldiff = new ListDiff(args[0],args[1])
println ldiff.scs.join("")

If you favor conciseness over legibility, you could of course combine either of those into a single line of Groovy, e.g.,

println new ListDiff("listutils package v 1.1", "listutils version 1.1").scs.join("")

will print

listutils package version 1.1
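
As a sanity check on that output, the following sketch verifies two properties any shortest common supersequence must have: both inputs are subsequences of it, and its length equals the sum of the input lengths minus the length of their longest common subsequence. It does not call listutils. `lcsLength` and `isSubsequence` are hypothetical helpers written for this illustration, using the textbook dynamic-programming approach rather than whatever algorithm the library uses. Because the example strings are plain ASCII, comparing `char` values here is equivalent to comparing code points.

```groovy
// Hypothetical helper: length of the longest common subsequence of a and b,
// computed with the standard dynamic-programming table.
int lcsLength(String a, String b) {
    int[][] table = new int[a.length() + 1][b.length() + 1]
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            if (a.charAt(i - 1) == b.charAt(j - 1)) {
                table[i][j] = table[i - 1][j - 1] + 1
            } else {
                table[i][j] = Math.max(table[i - 1][j], table[i][j - 1])
            }
        }
    }
    return table[a.length()][b.length()]
}

// Hypothetical helper: true if every character of sub occurs in sup, in order.
boolean isSubsequence(String sub, String sup) {
    int i = 0
    for (int j = 0; j < sup.length() && i < sub.length(); j++) {
        if (sub.charAt(i) == sup.charAt(j)) {
            i++
        }
    }
    return i == sub.length()
}

String a = "listutils package v 1.1"
String b = "listutils version 1.1"
String printed = "listutils package version 1.1"

// Both inputs must be recoverable from the supersequence...
assert isSubsequence(a, printed)
assert isSubsequence(b, printed)
// ...and it must be as short as possible: |SCS| = |a| + |b| - |LCS|.
assert printed.length() == a.length() + b.length() - lcsLength(a, b)
```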

listutils works on whatever you give it: the code points are compared exactly as supplied. If you want any form of Unicode normalization, apply it to both strings before passing them to ListDiff.
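
One way to do that normalization, sketched here with the JDK's standard `java.text.Normalizer` class, is shown below. The helper name `nfc` and the choice of NFC are assumptions made for illustration; which normalization form is appropriate depends on your texts. The example shows why it matters: the same visible word can be encoded with different code point sequences, which a code-point comparison would report as a difference.

```groovy
import java.text.Normalizer

// Hypothetical helper: normalize a String to Unicode Normalization Form C.
String nfc(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFC)
}

String precomposed = "caf\u00E9"   // e-acute as a single code point
String decomposed  = "cafe\u0301"  // plain e followed by a combining acute accent

// Without normalization the two strings differ code point by code point.
assert precomposed != decomposed

// After normalization they are identical, so a comparison would find no difference.
assert nfc(precomposed) == nfc(decomposed)

// The normalized strings are what you would then hand to ListDiff,
// as in the examples earlier in this post.
```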