For example, the sets may contain:
A = ZBANICOT B = ACNTBZIO C = ANICOTZB D = ZIANCOTB
... and it should output
ANT, since these are the longest subsequences in all 4 sets (as far as I can tell from manually examining these sets).
Since none of the letters repeat, is there a more efficient algorithm I should consider rather than the one described in the Wikipedia article? If not, is there anywhere that describes how to convert the algorithm to having N sets instead of 2?
Not sure how efficient an adaption of the normal LCS algorithm would be, so this may be inefficient, but since there aren't many eligible letters, it's not too horrible.
Make a successor matrix. For each string
s, for each pair of letters
i < j, increment
matrix[s[i]][s[j]]. A pair of letters can only be part of a longest common subsequence if
matrix[a][b] = n = number of strings. Build a directed graph from the letters participating in such a pair, with an edge from
matrix[a][b] = n. Find a longest path in that graph.
What you are trying to do is called multiple sequence alignment. The algorithm 'could' be rewritten to be n-dimensional, but that would increase the space complexity to
Length ^ Dimensions.
Since none of the letters repeat, is there a more efficient algorithm I should consider rather than the one described in the Wikipedia article?
Perhaps. First create a mapping (a,b) such that a < b. For example the mappings for the first three characters of each sets is as follows
--- Mapping for character Z - B A - C A - N Z - I A N I A N I C N I B O C C Z T O O I Z T T O B B --- Mapping for character B - A C - N N - I I - A N I C N I B O C C Z T O O I Z T T O B B --- Mapping for character A - N N - I N - C I - N I B O C C Z T O O I Z T T O B B
Then retain only mappings that are common on all sets. Here's an example of following this process, traversing Set[A] (
ZBANICOT), which will give us a set of relations that are common to all sets.
So with the letter
Z you will observe that in Set[C], the only character that follows
B, so you can eliminate just about every mapping from
Z that is not to B. But to cut a long story short, every mapping for
Z is removed. Note, only mappings from Z are removed, not mappings to
Also note that since Set[D] ends with
B, we can safely eliminate all mappings from
B. By following the process one can observe that
A maps to
T. Next is N, which maps to
I maps to
O (honestly, "I/O") while
C maps to
T (how cute). And funnily enough
O maps to nothing!!! How befitting.
Finally there is one more character, and
T maps to nothing too (since
ZBANICOT ends with
In the end you are left with the following mappings:
A - C C - O I - O N - O N T T O T
Now all you have to do is start from each of the keys (
N) and search for the longest sequence without any repeating nodes (characters in this case).
Here are all of the common subsequences (all using the simple algorithm):
ACO ACT ANO ANT AO AT CO CT IO NT NO
What if I stored character mapping for each set of characters and then incremented this count each time...
For the 1st set (
ZBANICOT), this would create the following character pairs (28 of them):
ZB, ZA, ZN, ZI, ZC, ZO, ZT BA, BN, BI, BC, BO, BT AN, AI, AC, AO, AT NI, NC, NO, NT IC, IO, IT CO, CT OT
For the 2nd set (
ACNTBZIO), I would have 28 pairs again:
AC, AN, AT, AB, AZ, AI, AO CN, CT, CB, CZ, CI, CO NT, NB, NZ, NI, NO TB, TZ, TI, TO BZ, BI, BO ZI, ZO IO
For the last 2 sets, I would have:
AN, AI, AC, AO, AT, AZ, AB NI, NC, NO, NT, NZ, NB IC, IO, IT, IZ, IB CO, CT, CZ, CB OT, OZ, OB TZ, TB ZB ZI, ZA, ZN, ZC, ZO, ZT, ZB IA, IN, IC, IO, IT, IB AN, AC, AO, AT, AB NC, NO, NT, NB CO, CT, CB OT, OB TB
(So to figure out the iteration count so far, let
N equal the number of sets, and
M be the number of characters in each set, we have
4*28=112 in this example)
Next, I would count the number of times each pair exists. For example,
PairCounts["ZB"] == 3 PairCounts["ZA"] == 2 // ... PairCounts["AO"] == 4 PairCounts["AT"] == 4 // ... PairCounts["AN"] == 4 PairCounts["AC"] == 4 PairCounts["CT"] == 4 PairCounts["NO"] == 4 PairCounts["NT"] == 4
Then, since we have 4 pairs, I would discard all pairs which don't have a count of 4. And then I would try concatting each pair in order to form the longest string possible.
©2020 All rights reserved.