Parameters and return values
lcs, scs, align and featurematrix all accept a variable number of vector parameters. lcs and scs always return a single vector.
using VectorAlignments
s1 = split("Now is the time")
s2 = split("Now is not the time")
s3 = split("Now might be the time")
lcs(s1, s2, s3)
3-element Vector{Any}:
"Now"
"the"
"time"
scs(s1, s2, s3)
7-element Vector{Any}:
"Now"
"is"
"not"
"might"
"be"
"the"
"time"
align returns a vector of vectors, with one output vector for each input vector. The length of each output vector equals the length of the complete SCS for the input vectors.
align(s1, s2, s3)
3-element Vector{Any}:
Any["Now", "is", nothing, nothing, nothing, "the", "time"]
Any["Now", "is", "not", nothing, nothing, "the", "time"]
Any["Now", nothing, nothing, "might", "be", "the", "time"]
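As a quick check of these two properties, the following sketch uses only Base Julia on the output printed above. The result of align is copied in as a literal, so VectorAlignments is not needed to run it. Gaps are assumed to be marked with nothing, as in the printed output.

```julia
# Output of align(s1, s2, s3), copied from the example above.
aligned = [
    Any["Now", "is", nothing, nothing, nothing, "the", "time"],
    Any["Now", "is", "not", nothing, nothing, "the", "time"],
    Any["Now", nothing, nothing, "might", "be", "the", "time"],
]

# Every output vector has the length of the SCS (7 elements here).
same_length = all(length(v) == 7 for v in aligned)

# Dropping the gap markers recovers the original first input.
recovered = filter(!isnothing, aligned[1])
recovered == split("Now is the time")
```

Both checks evaluate to true for this example.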
featurematrix returns a two-dimensional matrix, with one row for each feature and one column for each input vector.
featurematrix(s1, s2, s3)
7×3 Matrix{Any}:
"Now" "Now" "Now"
"is" "is" nothing
nothing "not" nothing
nothing nothing "might"
nothing nothing "be"
"the" "the" "the"
"time" "time" "time"
Order of alignment in gaps
When sequences have different content at the same position, the order of parameters determines the order of elements in the resulting alignment. For example, in the following comparison, all three sequences match on d; the alignment adds the elements preceding d by taking the vectors in the order they are given.
scs("ad", "bd", "cd")
4-element Vector{Any}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
This means that you can control the alignment of gaps by your ordering of parameters.
scs("cd", "bd", "ad")
4-element Vector{Any}:
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
Optional parameters
lcs, scs, align and featurematrix use the Needleman–Wunsch algorithm to align sequences. Its dynamic-programming approach constructs a matrix of scores comparing two lists element by element. All four functions accept optional named parameters for two functions that are applied in this process:
norm: a function to normalize values before comparing them. The default is to use the unaltered value of the element (x -> x).
cf: a function for comparing the elements of the two vectors. The default is to compare for equality (==).
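To illustrate where these two functions enter a dynamic-programming alignment, here is a minimal sketch of a pairwise LCS in plain Julia. It is not the package's implementation: the name pairwise_lcs is hypothetical, and the recurrence is the simple LCS one rather than a full Needleman–Wunsch scoring scheme.

```julia
# Hypothetical helper, not part of VectorAlignments: pairwise LCS using a
# dynamic-programming score matrix, with `norm` and `cf` keyword arguments.
function pairwise_lcs(a, b; norm = identity, cf = (==))
    xs, ys = collect(a), collect(b)
    m, n = length(xs), length(ys)
    # score[i + 1, j + 1] holds the LCS length of xs[1:i] and ys[1:j].
    score = zeros(Int, m + 1, n + 1)
    for i in 1:m, j in 1:n
        if cf(norm(xs[i]), norm(ys[j]))
            score[i + 1, j + 1] = score[i, j] + 1
        else
            score[i + 1, j + 1] = max(score[i, j + 1], score[i + 1, j])
        end
    end
    # Trace back from the bottom-right corner, keeping raw values from `a`.
    out = eltype(xs)[]
    i, j = m, n
    while i > 0 && j > 0
        if cf(norm(xs[i]), norm(ys[j]))
            pushfirst!(out, xs[i])
            i -= 1
            j -= 1
        elseif score[i, j + 1] >= score[i + 1, j]
            i -= 1
        else
            j -= 1
        end
    end
    return out
end

pairwise_lcs("ab", "Bc"; norm = lowercase)    # ['b']
pairwise_lcs("ab", "Bc")                      # empty: 'b' and 'B' differ
```

Normalized values are used only for the comparison; the result is built from the raw elements of the first argument.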
Examples: normalization
The following alignment normalizes characters to lowercase before comparing them, so that 'b' and 'B' are aligned. Note that the raw values (before normalization) are retained in the aligned vectors.
featurematrix("ab", "Bc", "cd"; norm = lowercase)
4×3 Matrix{Any}:
'a' nothing nothing
'b' 'B' nothing
nothing 'c' 'c'
nothing nothing 'd'
lcs and scs follow the order of parameters in selecting a value for their single output vector. Compare the following two examples.
scs("ab", "Bc", "cd"; norm = lowercase) |> join
"abcd"
scs("aB", "bc", "cd"; norm = lowercase) |> join
"aBcd"
Examples: comparison
In the following example, we compare elements using Julia's isapprox function with a value of 0.1 for relative tolerance:
a = [0.95, 1.1, 0.98]
b = [0.93, 0.99, 0.96]
f = (x,y) -> ≈(x,y; rtol = 0.1)
featurematrix(a,b; cf = f)
4×2 Matrix{Any}:
0.95 0.93
1.1 nothing
0.98 0.99
nothing 0.96
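To see how the comparison function treats individual pairs, it can be called directly. This sketch uses only Base Julia; with no absolute tolerance given, isapprox(x, y; rtol = r) tests whether |x - y| ≤ r * max(|x|, |y|).

```julia
# Same comparison function as above, written with isapprox instead of ≈.
f = (x, y) -> isapprox(x, y; rtol = 0.1)

f(0.95, 0.93)   # true:  |0.95 - 0.93| = 0.02 ≤ 0.1 * 0.95
f(0.98, 0.99)   # true:  |0.98 - 0.99| = 0.01 ≤ 0.1 * 0.99
f(1.1, 0.93)    # false: |1.1 - 0.93| = 0.17 > 0.1 * 1.1
```

Because several pairs satisfy the tolerance, more than one alignment is compatible with this comparison; the matrix above is the one the package selects.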
Compare this result with an alignment of the same vectors based on normalizing the floating point values by rounding them to the nearest whole number:
featurematrix(a,b; norm = round)
3×2 Matrix{Any}:
0.95 0.93
1.1 0.99
0.98 0.96
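The reason every row aligns here can be checked in Base Julia: after rounding, all six values normalize to 1.0, so the default equality comparison matches each pair position by position.

```julia
a = [0.95, 1.1, 0.98]
b = [0.93, 0.99, 0.96]

round.(a)                 # [1.0, 1.0, 1.0]
round.(b)                 # [1.0, 1.0, 1.0]
round.(a) == round.(b)    # true
```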