CS 441/541 F20: Micro-Project: Supersequences

As a search problem, we can consider the problem of stringing overlapping sequences together to form a larger sequence. We will refer to this as the supersequence problem. A good place to start is a special case: superpermutations.

Superpermutations

There has been a lot of interest lately in the problem of constructing "superpermutations." The idea is this: given a set of n letters starting at A, find a sequence that contains all permutations of that set.

For example, if n=2, the set will be {A, B} and the possible permutations of that set are the two permutations AB and BA. Thus, the sequence ABA is a superpermutation: it has as subsequences both AB and BA. Of course, the sequence BAB will work as well. Note that these are shortest superpermutations: one can see that no shorter sequence can possibly work.

For a superpermutation of size 3, we must construct a sequence containing all 6 of the permutations of {A, B, C}: these are

ABC ACB BAC BCA CAB CBA

One such superpermutation is

ABCABACBA

which can be seen to contain all six permutations

ABCABACBA

ABC
 BCA
  CAB
    BAC
     ACB
      CBA

It is less obvious whether this is a shortest superpermutation, although it surely has to be close. It turns out that it is optimal.

For superpermutations up to size 5, there is a simple recursive algorithm for generating a shortest superpermutation. Greg Egan's Superpermutations Page delves deeply into the highly mathematical theory of superpermutations, including the discoveries over the last decade that have sparked new interest in the problem. The Wikipedia page describes the 2011 "Haruhi Problem" proof that started it: a lower bound on minimal superpermutation length published on 4chan by an anonymous poster.

This is well-trodden ground, and we won't cover it further.

Supersequences

A generalization of superpermutations goes like this: as before, take an alphabet of n letters. However, instead of trying to make our sequence cover all permutations of that alphabet, we can list a specific set of subsequences that we would like to cover.

For example, given the alphabet {A,B,C} we might try to cover these three subsequences with a shortest sequence:

ABBA BAACC ACAB CB

One such supersequence would be

ABBACABAACCB

ABBA
   ACAB
      BAACC
          CB

A shorter supersequence would be

ACABBAACCB
ACAB
  ABBA
    BAACC
        CB

There's a lot of elegant math and computer science here, but let's not use any of it. Instead, let's do state-space search to try to find short supersequences covering a given set of subsequences!

The Input

Inputs will be in a file formatted as follows:

The first line is an integer denoting the alphabet size.
The second line is a known upper bound on the supersequence length. This bound is there for convenience, and can be used or ignored. It is just a bound: there may be a shorter supersequence.
Subsequent lines are subsequences that must be covered.

For our previous example, the input might look like

3
10
ABBA
BAACC
ACAB
CB

You should run your program against all the instances in http://github.com/pdx-cs-ai/superseq-instances. You may choose to run against additional instances.

The Output

The output of your program should be a supersequence that covers the given subsequences using the given alphabet, as in the examples above. We are looking for "short" sequences, not necessarily a provably shortest one: you may use complete search or local search. Your program must report a supersequence of some length within 10 seconds of CPU time on your local box. (Note that merely joining the subsequences will produce a supersequence, although it will be longer than necessary.)

The Program

Your program should be named something like superseq (for example, superseq.py). It should accept an input instance file in the specified format as a command-line argument, and produce the specified output.

Hints

The best state space I have found seems to be to treat the ordering as the variable, and the subsequence as the value. For DFS, pick a subsequence to go first, then one to go second, etc. For local search, pick an ordering of the subsequences, then swap subsequences in the ordering.
You will probably want a function that determines the amount of overlap between the end of one sequence and the start of another. This should be carefully debugged, as it is a bit tricky and must work correctly.
Your search needs a method to stop after 10 seconds and print the best answer it has found so far. This normally involves getting the start time (in Python use time.time()) and then checking every so often to see if the current time is more than 10 seconds later. Getting the time is expensive, so you might want to let the thing run for a while before checking again: I've found that checking every 1000 search steps or so is a good compromise.

Submission

The usual submission rules for programs in this course apply. Submit a single ZIP archive named superseq.zip that contains the source code for your program, a Makefile if the program needs to be compiled, and a README.md containing a writeup describing your solution to the problem. Your writeup should include sequence length and runtime for each of the test instances given in the test repo http://github.com/pdx-cs-ai/superseq-instances.

Last modified: Monday, 30 November 2020, 9:36 PM