Micro-Project: Supersequences
As a search problem, we can consider the problem of stringing overlapping sequences together to form a larger sequence. We will refer to this as the supersequence problem. A good place to start is a special case: superpermutations.
Superpermutations
There has been a lot of interest lately in the problem of
constructing "superpermutations." The idea is this: given
a set of n letters starting at A
, find a sequence that
contains all permutations of that set.
For example, if n=2, the set will be {A, B} and the
possible permutations of that set are the two permutations
AB
and BA
. Thus, the sequence ABA
is a
superpermutation: it has as subsequences both AB
and
BA
. Of course, the sequence BAB
will work as well. Note
that these are shortest superpermutations: one can see that
no shorter sequence can possibly work.
For a superpermutation of size 3, we must construct a sequence containing all 6 of the permutations of {A, B, C}: these are
ABC ACB BAC BCA CAB CBA
One such superpermutation is
ABCABACBA
which can be seen to contain all six permutations
ABCABACBA
ABC
BCA
CAB
BAC
ACB
CBA
It is less obvious whether this is a shortest superpermutation, although it surely has to be close. It turns out that it is optimal.
For superpermutations up to size 5, there is a simple recursive algorithm for generating a shortest superpermutation. Greg Egan's Superpermutations Page delves deeply into the highly mathematical theory of superpermutations, including the discoveries over the last decade that have sparked new interest in the problem. The Wikipedia page describes the 2011 "Haruhi Problem" proof that started it: a lower bound on minimal superpermutation length published on 4chan by an anonymous poster.
This is well-trodden ground, and we won't cover it further.
Supersequences
A generalization of superpermutations goes like this: as before, take an alphabet of n letters. However, instead of trying to make our sequence cover all permutations of that alphabet, we can list a specific set of subsequences that we would like to cover.
For example, given the alphabet {A,B,C} we might try to cover these three subsequences with a shortest sequence:
ABBA BAACC ACAB CB
One such supersequence would be
ABBACABAACCB
ABBA
ACAB
BAACC
CB
A shorter supersequence would be
ACABBAACCB
ACAB
ABBA
BAACC
CB
There's a lot of elegant math and computer science here, but let's not use any of it. Instead, let's do state-space search to try to find short supersequences covering a given set of subsequences!
The Input
Inputs will be in a file formatted as follows:
The first line is an integer denoting the alphabet size.
The second line is a known upper bound on the supersequence length. This bound is there for convenience, and can be used or ignored. It is just a bound: there may be a shorter supersequence.
Subsequent lines are subsequences that must be covered.
For our previous example, the input might look like
3
10
ABBA
BAACC
ACAB
CB
You should run your program against all the instances in http://github.com/pdx-cs-ai/superseq-instances. You may choose to run against additional instances.
The Output
The output of your program should be a supersequence that covers the given subsequences using the given alphabet, as in the examples above. We are looking for "short" sequences, not necessarily a provably shortest one: you may use complete search or local search. Your program must report a supersequence of some length within 10 seconds of CPU time on your local box. (Note that merely joining the subsequences will produce a supersequence, although it will be longer than necessary.)
The Program
Your program should be named something like superseq
(for
example, superseq.py
). It should accept an input instance
file in the specified format as a command-line argument, and
produce the specified output.
Hints
The best state space I have found seems to be to treat the ordering as the variable, and the subsequence as the value. For DFS, pick a subsequence to go first, then one to go second, etc. For local search, pick an ordering of the subsequences, then swap subsequences in the ordering.
You will probably want a function that determines the amount of overlap between the end of one sequence and the start of another. This should be carefully debugged, as it is a bit tricky and must work correctly.
Your search needs a method to stop after 10 seconds and print the best answer it has found so far. This normally involves getting the start time (in Python use
time.time()
) and then checking every so often to see if the current time is more than 10 seconds later. Getting the time is expensive, so you might want to let the thing run for a while before checking again: I've found that checking every 1000 search steps or so is a good compromise.
Submission
The usual submission rules for programs in this course
apply. Submit a single ZIP archive named
superseq.zip
that contains the source code for your
program, a Makefile
if the program needs to be compiled,
and a README.md
containing a writeup describing your
solution to the problem. Your writeup should include
sequence length and runtime for each of the test instances
given in the test repo
http://github.com/pdx-cs-ai/superseq-instances.