Back to data compression
One of my long-time fascinations has been data compression. I've read a lot about it and tried to keep up with modern techniques, but never really written much usable code.
Recently I decided, partly as an exercise in remaining fluent in C, that I'd try to re-implement one of the simplest algorithms I know. I used this method almost 20 years ago to compress payloads for a tiny embedded board which had 256kB RAM and 128kB ROM.
The algorithm
This one works by trying to predict the next byte. If it gets it right, it emits a single 1 bit to say so. If not, it emits a 0 bit, followed by the actual byte.
Now, doing this directly would be slow, as we'd have to deal with writing bits that aren't aligned to byte boundaries.
Instead, we process 8 bytes at a time, emitting the eight match flags as a single byte, followed by however many literal bytes are required.
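As a quick worked example of the format: a block where all 8 bytes are predicted correctly costs just the single flag byte, while a block with no correct predictions costs 9 bytes [1 flag byte plus 8 literals]. So the output ranges from 1/8 of the input size in the best case to 9/8 in the worst.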
This makes the algorithm extremely well suited to embedded systems with tiny memory budgets [<64kB].
The prediction
We start with an array initialised to 0s, and a state variable. For each character we:
- check if table[state] == char
- emit match / no-match accordingly
- set table[state] = char
- update state to hash(state, char)
Obviously, it helps if the table is a power of 2 in size [I started with 2^14]. And the hash function can be as simple as `state = (state << 3) ^ char;`
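In outline, the compression loop looks something like the sketch below [a simplified illustration rather than my actual code; the flag-bit ordering is an arbitrary choice here, and the decompressor is assumed to know the uncompressed length so it can handle a final block of fewer than 8 bytes]:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define STATE_BITS 16                    /* 2^16 entries -> 64kB table */
#define HASH_SHIFT 3                     /* the shift used in the hash */
#define TABLE_SIZE (1u << STATE_BITS)
#define STATE_MASK (TABLE_SIZE - 1)

/* Compress src[0..len) into dst, returning the number of bytes written.
 * Worst case output is len + (len + 7) / 8 bytes: one flag byte per
 * 8 input bytes, plus every input byte emitted as a literal. */
size_t predictor_compress(const uint8_t *src, size_t len, uint8_t *dst)
{
    static uint8_t table[TABLE_SIZE];    /* prediction table, reset per call */
    uint32_t state = 0;
    size_t out = 0;

    memset(table, 0, sizeof table);

    for (size_t i = 0; i < len; ) {
        size_t flag_pos = out++;         /* reserve room for the 8 match flags */
        uint8_t flags = 0;

        for (int bit = 0; bit < 8 && i < len; bit++, i++) {
            uint8_t c = src[i];
            if (table[state] == c) {
                flags |= (uint8_t)(1u << bit);  /* predicted: 1 bit, no literal */
            } else {
                table[state] = c;               /* learn the byte we just saw */
                dst[out++] = c;                 /* and emit it as a literal */
            }
            state = ((state << HASH_SHIFT) ^ c) & STATE_MASK;
        }
        dst[flag_pos] = flags;
    }
    return out;
}
```

Decompression mirrors this: read a flag byte, then for each 1 bit output `table[state]`, and for each 0 bit read a literal and store it into `table[state]`, updating `state` in exactly the same way in both cases.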
Results
So once I had the code working and debugged, I tested the compression.
Firstly, unsurprisingly, it's very fast. It compressed the enwik8 reference file of 100,000,000 bytes in about 4 seconds, and can decompress again in about 3.5s.
The compression ratio is, in fact, not too bad, given the simplicity of the algorithm.
| size (bytes) | filename |
|---|---|
| 42,265,732 | enwik8.gz |
| 57,281,515 | enwik8.p4_20 |
| 57,285,984 | enwik8.lz4 |
| 61,657,971 | enwik8.p4_16 |
| 61,719,932 | enwik8.p3_20 |
| 66,232,007 | enwik8.p3_16 |
| 67,714,233 | enwik8.p4_13 |
| 71,055,303 | enwik8.p3_13 |
| 100,000,000 | enwik8.txt |
What you see here are the output of "gzip -1" on the file, the output of "lz4 -1", and various combinations of a 3- or 4-bit hash shift with a 13-, 16-, or 20-bit state size [and thus table size].
Findings
It turns out that changing the shift from 3 to 4 bits improves results for "mostly text" data, whilst slightly harming "mostly binary" data.
Also, I find it impressive that with a modest 1MB memory investment, we can achieve slightly better compression than LZ4, which uses the more sophisticated Lempel-Ziv 77 compression algorithm, and a lot more memory.
Choosing the table size is interesting. A larger table allows you to predict from a larger range of possible contexts [state], but also takes longer to fill. A smaller table/state fills more quickly, and is likely good for highly repetitive data, but clearly doesn't compress as well in our sample case.