Parameters and return values
lcs, scs, align and featurematrix all accept a variable number of vector parameters. lcs and scs always return a single vector.
using VectorAlignments
s1 = split("Now is the time")
s2 = split("Now is not the time")
s3 = split("Now might be the time")
lcs(s1, s2, s3)
3-element Vector{Any}:
"Now"
"the"
"time"
scs(s1, s2, s3)
7-element Vector{Any}:
"Now"
"is"
"not"
"might"
"be"
"the"
"time"
align returns a vector of vectors, with one output vector for each input vector. The length of each of the output vectors equals the length of the complete SCS for the input vectors.
align(s1, s2, s3)
3-element Vector{Any}:
Any["Now", "is", nothing, nothing, nothing, "the", "time"]
Any["Now", "is", "not", nothing, nothing, "the", "time"]
Any["Now", nothing, nothing, "might", "be", "the", "time"]
featurematrix returns a two-dimensional matrix, with one row for each feature and one column for each input vector.
featurematrix(s1, s2, s3)
7×3 Matrix{Any}:
"Now"    "Now"    "Now"
"is"     "is"     nothing
nothing  "not"    nothing
nothing  nothing  "might"
nothing  nothing  "be"
"the"    "the"    "the"
"time"   "time"   "time"
Order of alignment in gaps
When sequences have different content at the same position, the order of the parameters determines the order of elements in the resulting alignment. For example, in the following comparison, all three sequences match on d; the alignment adds the elements preceding d by taking the vectors in the order they are given.
scs("ad", "bd", "cd")
4-element Vector{Any}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
This means that you can control the alignment of gaps by your ordering of parameters.
scs("cd", "bd", "ad")
4-element Vector{Any}:
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
Optional parameters
lcs, scs, align and featurematrix use the Needleman–Wunsch algorithm to align sequences. Its dynamic-programming approach constructs a matrix of scores comparing two lists element by element. All four functions accept optional keyword parameters for two functions that are applied in this process:
norm: a function to normalize values before comparing them. The default is to use the unaltered value of the element (x -> x).
cf: a function for comparing the elements of the two vectors. The default is to compare for equality (==).
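To make the roles of norm and cf concrete, here is a minimal sketch of a two-sequence longest-common-subsequence table written in plain Julia. The name lcs_sketch is hypothetical and this is not the package's implementation; it only assumes that a score matrix is filled by comparing normalized elements with cf, as described above, and that raw values are kept in the result.

```julia
# Hypothetical helper, NOT part of VectorAlignments: a two-sequence
# dynamic-programming sketch showing where `norm` and `cf` enter.
function lcs_sketch(a, b; norm = identity, cf = (==))
    xs, ys = collect(a), collect(b)
    m, n = length(xs), length(ys)
    # score[i + 1, j + 1] holds the LCS length of xs[1:i] and ys[1:j].
    score = zeros(Int, m + 1, n + 1)
    for i in 1:m, j in 1:n
        if cf(norm(xs[i]), norm(ys[j]))
            score[i + 1, j + 1] = score[i, j] + 1
        else
            score[i + 1, j + 1] = max(score[i, j + 1], score[i + 1, j])
        end
    end
    # Trace back from the bottom-right corner, keeping the raw
    # (un-normalized) element of the first sequence on each match.
    result = eltype(xs)[]
    i, j = m, n
    while i > 0 && j > 0
        if cf(norm(xs[i]), norm(ys[j]))
            pushfirst!(result, xs[i])
            i -= 1
            j -= 1
        elseif score[i, j + 1] >= score[i + 1, j]
            i -= 1
        else
            j -= 1
        end
    end
    return result
end

lcs_sketch("ab", "Bc")                    # empty: 'b' and 'B' differ
lcs_sketch("ab", "Bc"; norm = lowercase)  # ['b']: matched after normalization
```

Only the comparison step sees the normalized values; the result is built from the raw elements, which is why the un-normalized 'b' survives in the second call.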
Examples: normalization
The following alignment normalizes characters to lowercase before comparing them, so that 'b' and 'B' are aligned. Note that the raw values (before normalization) are retained in the aligned vectors.
featurematrix("ab", "Bc", "cd"; norm = lowercase)
4×3 Matrix{Any}:
'a'      nothing  nothing
'b'      'B'      nothing
nothing  'c'      'c'
nothing  nothing  'd'
lcs and scs follow the order of parameters in selecting a value for their single output vector. Compare the following two examples.
scs("ab", "Bc", "cd"; norm = lowercase) |> join
"abcd"
scs("aB", "bc", "cd"; norm = lowercase) |> join
"aBcd"
Examples: comparison
In the following example, we compare elements using Julia's isapprox function with a value of 0.1 for relative tolerance:
a = [0.95, 1.1, 0.98]
b = [0.93, 0.99, 0.96]
f = (x,y) -> ≈(x,y; rtol = 0.1)
featurematrix(a,b; cf = f)
4×2 Matrix{Any}:
0.95     0.93
1.1      nothing
0.98     0.99
nothing  0.96
Compare this result with an alignment of the same vectors based on normalizing the floating-point values by rounding them to the nearest whole number:
featurematrix(a,b; norm = round)
3×2 Matrix{Any}:
0.95  0.93
1.1   0.99
0.98  0.96
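The difference between the two results is easier to see by inspecting the normalized values directly. The following sketch uses only Base Julia and the vectors from the example; it does not call the package.

```julia
a = [0.95, 1.1, 0.98]
b = [0.93, 0.99, 0.96]

# `round` maps every value in both vectors to 1.0 (still a Float64),
# so after normalization every element compares equal to every other.
round.(a)   # [1.0, 1.0, 1.0]
round.(b)   # [1.0, 1.0, 1.0]

# An approximate comparison, by contrast, is decided pair by pair.
f = (x, y) -> isapprox(x, y; rtol = 0.1)
f(0.95, 0.93)   # true: the difference of 0.02 is well within 10% of 0.95
```

With norm = round, every pair of elements matches, so the two vectors align position by position. With cf, a match depends on the specific pair being compared, which is what allows a gap to appear.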