My Python implementation of the dna muncher was much too slow (25 iterations/sec). The full run requied 2 million iterations so that just wouldn't cut it. A first try at a C-based implementation was even worse (10 it/sec). I knew that using ropes was the way to go but I couldn't _find_ a library. It turns out that "crope" is a more common term than you think so the google search is very cluttered. Mainly I was disappointed that my C chops were so rusty. Thankfully my Python interpreter patches get peer reviewed by people who have forgotten more about compilers than I ever knew (rarely do they have to change my stuff but when they do it is magic).
Some lessons I [re]learned: any call to str* is deadly and the C compiler has no memory. If you repeatedly call strcat(a_string, b_string) it will get slower and slower because it will have to find the end of a_string every time even if it just found it the line before. Assert()s with a call to strlen are equally deadly. You wouldn't write a loop with 10 million iterations but a call to strlen on a 10 meg string is just as expensive.
Even without using ropes I was able to get 500 it/sec out of my C version using regular string by fixing obvious ineffeciencies and using a linked list of char * pointers. At the end of every iteration the new DNA sequence was assembled by smartly NOT moving the biggest unchanged part and memmoving() the smaller chunks around it.
I spent 50+ hours after the contest on a truly decent version which can be downloaded here. It runs at a blistering 200k it/sec - three orders of magnitude faster than my first try. The speed is very compiler (and platform?) dependent. My 1.4Ghz Celeron linux laptop with gcc 4.x runs the full endo.dna sequence in 10 seconds but my 2Ghz Althlon X2 windows desktop with gcc 3.x takes 35 seconds - the same as the cheesy laptop with profiling enabled!
The balls of the program is a refcounted string "chunk" struct coupled with a plain linked list "cat" struct which is just a concatenization of pointers to chunks (I picked refcounting over genrational GC because it fit the problem. The debugging output does a mark and sweep to ensure the numbers match so if that floats your boat turn it on). This actually took me some time to figure out. My first attempt tried to use a single structure to hold both kinds of data but the problem really has two kinds of data. One is a list of "I care about chars X:Y in this chunk and see the next guy for more info" and the other is "I am a string of len X and Y people use me." In the source you'll see the refcount is actually in a variable called "usecnt." Easier to show than to say so here they are:
typedef struct chunk {
char *chars;
int len;
unsigned int usecnt; // how many ->uses pointers point to us, any number
char *purpose;
} chunk;
typedef struct cat {
struct cat *next;
struct chunk *uses;
char *chars;
int len;
char *purpose;
} cat;
The "purpose" attribute is in there for debugging. It is the name of the function that created the cat or chunk. This was very helpful for debugging.
Each cat points to one chunk and usually to the middle of one so we need the "len" attribute to know when to stop reading. New chunks come into existence from the nat() and protect() functions which do transformations on the DNA.
Using these structures catapulted my speed into the 100k it/sec range and took the time for a full run down to 20 seconds. This was after the contest and I can be quite obstinate so I wanted more. Almost none of the time was taken up by strstr(haystack, needle) searches that contestants said needed KPM and other fancy stuff ("premature optimization is the root of all evil"). Most of the time was being taken up by traversing long cat lists, accessing the Nth char of the string, and doing simple bookkeeping in newing and deleting structs.
Keeping a reuse list of already malloc'ed sturctures helped greatly with the bookkeeping overhead (this I learned from reading Hettinger's code). Instead of free'ing structs keep the last N around and just reinitialize them and return them like they were brand new.
Performance really suffered when the linked list of cats grew ("real" ropes use binary trees so I was paying a heavier penalty). From the beginning I had capped the length of cats by calling "do over" whenever the list grew too big: just copy everything into one big chunk referenced by one cat and then move on. But "new_cat" was getting called 80 million times in a full run so keeping the number of cats small or the speed of "new_cat" low was a big priority.
Trimming time for a call to "new_cat" was a bad deal. You can do it easily by not setting everything to NULL/0 or leaving out the "purpose" entry but that greatly hurts assertions and debugabilty (we are re-using structs so not initializing them means that most still point to valid structures). So the answer is to drop the number of calls to "new_cat" which is a little trickier.
Operations on big strings are expensive and correspondingly operations on small strings are cheap. To keep the number of cats down it is cheaper to copy a one byte string onto an existing chunk that it is to malloc a whole new cat that references a single byte in another existing chunk. By trial and error I found that anything in the size 10k to 100k is cheaper to strcat() than add an item to the cat linked list. This somewhat complicated optimization brought the cost of new_cat down from 80M calls to less than 50M.
Finally I changed my "do over" routine to only attack the small chunks instead of all of them. Because it only composited small cats and chunks it was much cheaper to run. Because of that I could run it more often to keep the list of cats even smaller. In a standard endo run of 1.9M iterations it gets called only 101 times but makes the overall speed much faster (and, umm, not coredump).
This approach is pointed at a particular problem and will have drawbacks if used more generally. It never frees a chunk even if only 1 byte of it is used and it weighs 10 megs. For this task which has a high chunk turnover the wastage memory never gets above three times over the minimum (60M for 20M worth of real data). You can keep the ratio tighter by tuning the cat concatenation but it wasn't needed for this problem so I didn't do it (it hurt speed).
My final and fastest product would not have been ideal for the contest even if I had finished it in time. The optimizations meant that I didn't keep track of the original position of DNA in the trace so it is less useful in investigating the problem. My only goal was to go into gear head mode and make an interpreter that would be slick as shit. I'm frankly thrilled that it can do a 10 second run on my Celeron laptop but unfortunately I know five guys that could cut that number in half with a few tweaks. And I'm adult enough to know I will never match them. Such is life.
Errata: Valgrind is your friend. I cannot overemphasize this enough. gprof is your friend, sometimes. In order to get library calls included in the output you will need to write thin wrapper functions. My source includes a bunch of wrap_str* and wrap_mem* functions for that purpose.
See also ICFP social stuff.