Back to data compression


One of my long time fascinations has been with data compression. I've read a lot about it, tried to keep up with modern techniques, but never really written much usable code.

Recently I decided, partly as an exercise in remaining fluent in C, I'd try to re-implement one of the simplest algorithms I know. I used this method almost 20 years ago to compress payloads for a tiny embedded board which had 256kB RAM and 128kB ROM.

The algorithm.

This one works by trying to predict the next byte. If it gets it right, it emits a single 1 bit to say it got it right. If not, it emits a 0 bit, followed by the actual byte.

Now, to do this directly, it would be slow, as we'd have to work to deal with writing bits unaligned with bytes.

Instead, what we'll do is process 8 bytes at a time, emit those match flags as a byte, followed by however many literal bytes are required.

In this case, the algorithm is extremely well suited to embedded systems with tiny memory budgets [<64kB].

The prediction.

We start with an array initialised to 0s, and a state variable. For each character we:

  1. check if table[state] == char
  2. emit match / no-match accordingly
  3. set table[state] = char
  4. update state to hash(state, char)

Obviously, it helps if the table is a power of 2 in size [I started with 2^^14]. And the hash function can be as simple as state = (state << 3) ^ char;


So once I had the code working and debugged, I tested the compression.

Firstly, unsurprisingly, it's very fast. It compressed the enwik8 reference file of 100000000 bytes in about 4 seconds. And can decompress again in about 3.5s.

The compression ratio is, in fact, not too bad, given the simplicity of the algorithm.

size filename
42,265,732 enwik8.gz
57,281,515 enwik8.p4_20
57,285,984 enwik8.lz4
61,657,971 enwik8.p4_16
61,719,932 enwik8.p3_20
66,232,007 enwik8.p3_16
67,714,233 enwik8.p4_13
71,055,303 enwik8.p3_13
100,000,000 enwik8.txt

What you see here are the "gzip -1" of the file, the "lz4 -1" of the file, and various combinations of 3 or 4 bit hash, and 13, 16, or 20 bit state size [and thus table size].


It turns out that changing the shift from 3 to 4 bits improves results for "mostly text" data, whilst slightly harming "mostly binary" data.

Also, I find it impressive that with a modest 1MB memory investment, we can achieve slightly better compression than LZ4, which uses the more sophisticated Lempel-Ziv 77 compression algorithm, and a lot more memory.

Choosing the table size is interesting. A larger table allows you to predict from a larger range of possible contexts [state], but also takes longer to fill. A smaller table/state will fill quicker, and likely be good for highly repetitive data, but clearly doesn't compress as well in our sample case.