Parameters and return values

lcs, scs, align and featurematrix all accept a variable number of vector parameters. lcs (longest common subsequence) and scs (shortest common supersequence) always return a single vector.

using VectorAlignments
s1 = split("Now is the time")
s2 = split("Now is not the time")
s3 = split("Now might be the time")
lcs(s1, s2, s3)
3-element Vector{Any}:
 "Now"
 "the"
 "time"
scs(s1, s2, s3)
7-element Vector{Any}:
 "Now"
 "is"
 "not"
 "might"
 "be"
 "the"
 "time"

align returns a vector of vectors, with one output vector for each input vector. Each output vector has the same length as the complete SCS of the input vectors; positions where an input has no corresponding element are filled with nothing.

align(s1, s2, s3)
3-element Vector{Any}:
 Any["Now", "is", nothing, nothing, nothing, "the", "time"]
 Any["Now", "is", "not", nothing, nothing, "the", "time"]
 Any["Now", nothing, nothing, "might", "be", "the", "time"]

featurematrix returns a matrix with one row for each feature (each position in the SCS) and one column for each input vector.

featurematrix(s1, s2, s3)
7×3 Matrix{Any}:
 "Now"    "Now"    "Now"
 "is"     "is"     nothing
 nothing  "not"    nothing
 nothing  nothing  "might"
 nothing  nothing  "be"
 "the"    "the"    "the"
 "time"   "time"   "time"

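The two output shapes carry the same information. As a minimal sketch that does not call the package, the aligned vectors shown in the align output can be placed side by side as columns to give a matrix of the same shape as the featurematrix output. Whether featurematrix is built this way internally is an assumption; the snippet only illustrates how the two results relate.

```julia
# Aligned vectors copied from the align(s1, s2, s3) output above.
aligned = [
    Any["Now", "is", nothing, nothing, nothing, "the", "time"],
    Any["Now", "is", "not", nothing, nothing, "the", "time"],
    Any["Now", nothing, nothing, "might", "be", "the", "time"],
]

# Each aligned vector becomes one column; each row is one feature.
m = reduce(hcat, aligned)

size(m)    # (7, 3)
m[3, :]    # the row for "not": only the second input has a value
```

Reading across a row shows which inputs contain a given feature; reading down a column recovers the aligned form of one input.
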
Order of alignment in gaps

When sequences have different content at the same position, the order of parameters determines the order of elements in the resulting alignment. For example, in the following comparison, all three sequences match on d; the alignment adds the elements preceding d by taking the vectors in the order they are given.

scs("ad", "bd", "cd")
4-element Vector{Any}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)

You can therefore control how gaps are filled by changing the order of the parameters.

scs("cd", "bd", "ad")
4-element Vector{Any}:
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
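
The ordering rule in these two examples can be mimicked with a toy sketch. merge_in_order is a hypothetical helper written for this illustration, not a function of the package, and it assumes the special case shown here: every input ends in the same shared element and has nothing else in common.

```julia
# Hypothetical helper for this example only: assumes every input ends in
# the same shared element and has nothing else in common.
function merge_in_order(inputs...)
    # Elements preceding the shared final element, one chunk per input
    gaps = [collect(s)[1:end-1] for s in inputs]
    shared = last(collect(first(inputs)))
    # Concatenate the chunks in argument order, then append the shared element
    return vcat(gaps..., [shared])
end

merge_in_order("ad", "bd", "cd") |> join   # "abcd"
merge_in_order("cd", "bd", "ad") |> join   # "cbad"
```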

Optional parameters

lcs, scs, align and featurematrix use the Needleman–Wunsch algorithm to align sequences. Its dynamic-programming approach constructs a matrix of scores comparing two lists element by element. All four functions accept optional keyword arguments for two functions that are applied in this process:

  • norm: a function to normalize values before comparing them. The default is to use the unaltered value of the element (x -> x).
  • cf: a function for comparing the elements of the two vectors. The default is to compare for equality (==).
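
How the two functions enter the dynamic-programming step can be illustrated with a minimal two-sequence sketch. This is not the package's implementation: pairwise_lcs is a hypothetical helper written for this example, and it covers only two inputs and only the LCS case. It assumes that norm is applied to each element before cf compares the normalized values, and that raw values from the first sequence are kept in the result.

```julia
# Hypothetical helper, not part of VectorAlignments: LCS of two sequences
# with pluggable normalization (norm) and comparison (cf).
function pairwise_lcs(a, b; norm = identity, cf = (==))
    xs, ys = collect(a), collect(b)
    n, m = length(xs), length(ys)
    # score[i + 1, j + 1] holds the LCS length of xs[1:i] and ys[1:j]
    score = zeros(Int, n + 1, m + 1)
    for i in 1:n, j in 1:m
        if cf(norm(xs[i]), norm(ys[j]))
            score[i + 1, j + 1] = score[i, j] + 1
        else
            score[i + 1, j + 1] = max(score[i, j + 1], score[i + 1, j])
        end
    end
    # Trace back from the bottom-right corner, keeping raw values from `a`
    result = eltype(xs)[]
    i, j = n, m
    while i > 0 && j > 0
        if cf(norm(xs[i]), norm(ys[j]))
            pushfirst!(result, xs[i])
            i -= 1
            j -= 1
        elseif score[i, j + 1] >= score[i + 1, j]
            i -= 1
        else
            j -= 1
        end
    end
    return result
end

pairwise_lcs("ab", "Bc"; norm = lowercase)   # ['b']
pairwise_lcs([1, 2, 3, 4], [2, 4, 5])        # [2, 4]
```

In the first call the elements 'b' and 'B' compare equal only after normalization, and the value returned is the unnormalized 'b' from the first argument.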

Examples: normalization

The following alignment normalizes characters to lowercase before comparing them, so that 'b' and 'B' are aligned. Note that the raw values (before normalization) are retained in the aligned vectors.

featurematrix("ab", "Bc", "cd"; norm = lowercase)
4×3 Matrix{Any}:
 'a'      nothing  nothing
 'b'      'B'      nothing
 nothing  'c'      'c'
 nothing  nothing  'd'

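When elements match after normalization but differ in their raw values, lcs and scs follow the order of parameters in selecting a value for their single output vector: the raw value from the earliest parameter is kept. Compare the following two examples.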

scs("ab", "Bc", "cd"; norm = lowercase)   |> join
"abcd"
scs("aB", "bc", "cd"; norm = lowercase)   |> join
"aBcd"

Examples: comparison

In the following example, we compare elements using Julia's isapprox function (written ≈ as an operator) with a relative tolerance of 0.1:

a = [0.95, 1.1, 0.98]
b = [0.93, 0.99, 0.96]

f = (x,y) -> ≈(x,y; rtol = 0.1)

featurematrix(a,b; cf = f)
4×2 Matrix{Any}:
 0.95      0.93
 1.1        nothing
 0.98      0.99
  nothing  0.96
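
To see why some values pair up and others do not, it can help to evaluate the comparison function directly on individual elements. The sketch below uses only Base Julia and repeats the definitions from the example; the name matches is introduced here for illustration. It relies on the documented behaviour of isapprox, which for numbers tests abs(x - y) <= rtol * max(abs(x), abs(y)) when no absolute tolerance is given.

```julia
a = [0.95, 1.1, 0.98]
b = [0.93, 0.99, 0.96]

# Same comparison function as in the example above
f = (x, y) -> isapprox(x, y; rtol = 0.1)

f(0.95, 0.93)   # true:  |0.95 - 0.93| = 0.02 is within 0.1 * 0.95
f(1.1, 0.93)    # false: |1.1 - 0.93| = 0.17 exceeds 0.1 * 1.1

# Full table of comparisons: entry [i, j] is f(a[i], b[j])
matches = [f(x, y) for x in a, y in b]
```

A row of matches with no true entry corresponds to an element of a that cannot be paired with anything in b, which is why it appears next to nothing in the aligned output.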

Compare this result with an alignment of the same vectors based on normalizing the floating point values by rounding them to the nearest integer:

featurematrix(a,b; norm = round)
3×2 Matrix{Any}:
 0.95  0.93
 1.1   0.99
 0.98  0.96
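
The reason every row lines up in this case can be checked with Base Julia alone: after rounding, all six values become 1.0, so each element of a compares equal to the element of b in the same position.

```julia
a = [0.95, 1.1, 0.98]
b = [0.93, 0.99, 0.96]

round.(a)                # [1.0, 1.0, 1.0]
round.(b)                # [1.0, 1.0, 1.0]
round.(a) == round.(b)   # true
```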