Data compression is fun! I’ve written a new compression scheme that’s designed to be as fast as possible to decompress on a 68K CPU, while still maintaining a decent compression density. I’m calling it FC8, and you can get the generic C implementation and optimized 68K decompressor from the project’s GitHub repository. I’ve probably reinvented the wheel with this, and in a non-optimal way too, but once I started I found that I couldn’t stop. My intended platform for FC8 is 68020 and 68030-based vintage Macintosh computers, but it should be easily portable to other classic computers, microcontrollers, and similar minimal systems.
The main loop of the 68K decompressor is exactly 256 bytes, so it fits entirely within the instruction cache of the 68020/030. Decompression speed on a 68030 is about 25% as fast as an optimized memcpy of uncompressed data, which is essentially an unrolled loop of 4-byte move.l instructions with no computation involved. Compared to that, I think 25% is pretty good, but I can always hope for more.
In the previous post, I described how I was using compression to squeeze a larger ROM disk image into a custom replacement Macintosh ROM that I’m designing. I began with a compression algorithm called LZG, written by Marcus Geelnard. It worked well, but the 68K decompression seemed disappointingly slow. I tried to contact the author to discuss it, but couldn’t find any email address or other contact info, so I eventually drifted towards creating my own compression method loosely based on LZG. This became FC8. On a 68030 CPU, FC8 compresses data just as tightly as LZG and decompresses 1.5x to 2x faster. FC8 retains much of the compression acceleration code from LZG, as well as the idea of quantizing lengths, but the encoding and decompressor are new.
The algorithm is based on the classic LZ77 compression scheme, with a 128K sliding history window and with duplicated data replaced by (distance,length) backref markers pointing to previous instances of the same data. No extra RAM is required during decompression, aside from the input and output buffers. The compressed data is a series of tokens in this format:
- LIT = 00aaaaaa = next aaaaaa+1 bytes are literals
- BR0 = 01baaaaa = backref to offset aaaaa, length b+3
- EOF = 01x00000 = end of file
- BR1 = 10bbbaaa’aaaaaaaa = backref to offset aaa’aaaaaaaa, length bbb+3
- BR2 = 11bbbbba’aaaaaaaa’aaaaaaaa = backref to offset a’aaaaaaaa’aaaaaaaa, length lookup_table[bbbbb]
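To make the token layout concrete, here’s a minimal, unoptimized C sketch of a decoder for this format. This is not the real FC8 decompressor from the repository, there’s no bounds checking, and the BR2 length table values below are placeholders for illustration; the actual table is defined in the FC8 source.

```c
#include <stdint.h>
#include <stddef.h>

/* Placeholder values for illustration only; the real 32-entry BR2 length
   table is defined in the FC8 source and tops out at 256. */
static const uint16_t br2_len[32] = {
      3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,  16,  18,  20,
     24,  28,  32,  40,  48,  64,  80,  96, 112, 128, 160, 192, 224, 240, 248, 256
};

/* Decode src into dst, returning the number of bytes written. */
size_t fc8_style_decode(const uint8_t *src, uint8_t *dst)
{
    uint8_t *out = dst;
    for (;;) {
        uint8_t tok = *src++;
        switch (tok >> 6) {
        case 0: {                                /* LIT: 00aaaaaa */
            size_t n = (size_t)(tok & 0x3F) + 1; /* aaaaaa+1 literal bytes follow */
            while (n--) *out++ = *src++;
            break;
        }
        case 1: {                                /* BR0: 01baaaaa (offset 0 = EOF) */
            size_t dist = tok & 0x1F;
            if (dist == 0)
                return (size_t)(out - dst);      /* EOF */
            size_t len = ((tok >> 5) & 1) + 3;
            const uint8_t *ref = out - dist;
            while (len--) *out++ = *ref++;       /* byte copy: source may overlap dest */
            break;
        }
        case 2: {                                /* BR1: 10bbbaaa'aaaaaaaa */
            size_t dist = ((size_t)(tok & 0x07) << 8) | *src++;
            size_t len = ((tok >> 3) & 0x07) + 3;
            const uint8_t *ref = out - dist;
            while (len--) *out++ = *ref++;
            break;
        }
        default: {                               /* BR2: 11bbbbba'aaaaaaaa'aaaaaaaa */
            size_t dist = (size_t)(tok & 0x01) << 16;
            dist |= (size_t)(*src++) << 8;
            dist |= *src++;
            size_t len = br2_len[(tok >> 1) & 0x1F];
            const uint8_t *ref = out - dist;
            while (len--) *out++ = *ref++;
            break;
        }
        }
    }
}
```

The backref copies go byte by byte on purpose: the source can overlap the bytes being written when the offset is smaller than the length, which is what lets runs of repeated bytes compress well.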
The encoding may look slightly strange, such as only a single bit for the backref length in BR0, but this produced the best results in my testing with sample data. The length lookup table enables encoding of backrefs up to 256 bytes in length using only 5 bits, though some longer lengths can’t be encoded directly. These are encoded as two successive backrefs, each with a smaller length.
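For a match longer than the table’s 256-byte maximum, the split into two backrefs can reuse the same distance for both pieces, because the match source and the output position advance together. Here’s a hypothetical sketch of that idea, not code from the FC8 compressor: emit_backref() stands in for whatever routine writes a BR0/BR1/BR2 token, and the sketch glosses over the fact that each piece must itself land on an encodable length.

```c
#include <stdint.h>

#define MIN_MATCH  3     /* shortest encodable backref */
#define MAX_BR_LEN 256   /* longest length the 5-bit table can represent */

/* Hypothetical stand-in for the routine that writes one backref token. */
extern void emit_backref(uint32_t dist, uint32_t len);

/* Emit a match of any length as one or more backrefs at the same distance. */
void emit_match(uint32_t dist, uint32_t len)
{
    while (len > 0) {
        uint32_t piece = (len > MAX_BR_LEN) ? MAX_BR_LEN : len;
        /* Never strand a remainder shorter than the minimum match size,
           since a 1- or 2-byte backref can't be encoded. */
        if (len - piece != 0 && len - piece < MIN_MATCH)
            piece = len - MIN_MATCH;
        emit_backref(dist, piece);
        len -= piece;    /* the rest still sits `dist` bytes behind the output */
    }
}
```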
The biggest conceptual changes vs LZG were the introduction of the LIT and EOF tokens. EOF eliminates the need to check the input pointer after decoding each token to determine if decompression is complete, and speeds things up slightly. LIT enables a whole block of literals to be quickly copied to the output buffer, instead of checking each one to see if it’s a backref token. This speeds things up substantially, but also swells the size of the data. In the worst case, a single literal would encode as 1 byte in LZG but 2 bytes in FC8, making it twice as expensive! All the other changes were needed to cancel out the compression bloat introduced by the LIT token, with the end result that FC8 compresses just as compactly as LZG. Both compressed my sample data to about 63% of original size.
The 68K decompressor code can be viewed here.
Decompression on the Fly
Several people mentioned the possibility of on-the-fly decompression, since the intended use is a compressed disk image. That’s something I plan to explore, but it’s not as simple as it might seem at first. Disk sectors are 512 bytes, but there’s no way to decompress a specific 512 byte range from the compressed data, since the whole compression scheme depends on having 128K of prior data to draw on for backref matches. You could compress the entire disk image as a series of separate 512 byte blocks, but then the compression density would go to hell. A better solution would compress the entire disk image as a series of larger blocks, maybe 128K or a little smaller, and then design a caching scheme to keep track of whether the block containing a particular sector was already decompressed and available. This would still have a negative impact on the compression density, and it would make disk I/O slower, but it would probably still be OK.
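As a very rough illustration of what that caching scheme might look like, here’s a sketch in C. Nothing like this exists in the repository yet; the 64K block size, the four-slot cache, the block offset table, and the fc8_decode() signature are all assumptions made up for the example.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512
#define BLOCK_SIZE  (64 * 1024)   /* assumed size of each independently compressed block */
#define CACHE_SLOTS 4             /* assumed cache budget: 256K of decompressed blocks */

/* Assumed to exist elsewhere: decompress one FC8 block into dst. */
extern void fc8_decode(const uint8_t *src, uint8_t *dst);

/* Assumed layout: the compressed image plus a table of each block's offset within it. */
extern const uint8_t  *compressed_image;
extern const uint32_t  block_offset[];

static uint8_t  cache_data[CACHE_SLOTS][BLOCK_SIZE];
static int32_t  cache_block[CACHE_SLOTS] = { -1, -1, -1, -1 };
static uint32_t next_victim;      /* trivial round-robin replacement */

/* Copy one 512-byte sector of the logical (decompressed) disk image into buf. */
void read_sector(uint32_t sector, uint8_t *buf)
{
    uint32_t block  = (sector * SECTOR_SIZE) / BLOCK_SIZE;
    uint32_t offset = (sector * SECTOR_SIZE) % BLOCK_SIZE;
    uint32_t slot, found = CACHE_SLOTS;

    /* Is the block containing this sector already decompressed and cached? */
    for (slot = 0; slot < CACHE_SLOTS; slot++) {
        if (cache_block[slot] == (int32_t)block) {
            found = slot;
            break;
        }
    }

    if (found == CACHE_SLOTS) {
        /* Miss: evict a slot and decompress the whole block into it. */
        found = next_victim;
        next_victim = (next_victim + 1) % CACHE_SLOTS;
        fc8_decode(compressed_image + block_offset[block], cache_data[found]);
        cache_block[found] = (int32_t)block;
    }

    memcpy(buf, cache_data[found] + offset, SECTOR_SIZE);
}
```

The RAM requirement is just the number of cache slots times the block size, which is where the configurability mentioned below comes from.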
Ultimately I think the two decompression approaches each have strengths and weaknesses, so the best choice depends on the requirements.
Boot-Time Decompression:
Pros: Best compression density, fastest I/O speeds once the disk image is decompressed
Cons: 5-10 second wait for decompression at boot time, requires enough RAM to hold the entire disk image
On-the-Fly Decompression:
Pros: No wait at boot time, required amount of RAM is configurable (size of the decompressed block cache)
Cons: Worse compression density, slower I/O speeds, more complex implementation
Hardware Tests
I discovered that a Macintosh IIci in 8-bit color mode decompresses about 20% slower than in 1-bit color mode. But a IIsi decompresses at the same speed regardless of the color settings. Both machines are using the built-in graphics hardware, which steals some memory cycles from the CPU in order to refresh the display. I’m not sure why only the IIci showed a dependence on the color depth. Both machines should be faster when using a discrete graphics card, though I didn’t test this.
The original LZG compression showed a much bigger speed difference between the IIci and IIsi, closer to a 50% difference, which I assumed was due to the 32K cache card in the IIci as well as its faster CPU. It’s not clear why the discrepancy is smaller with FC8, or whether it means the IIci has gotten worse or the IIsi has gotten better, relatively speaking. Compared to the same machine with the LZG compression, FC8 is 1.57x faster on the IIci and 1.99x faster on the IIsi. Based on tests under emulation with MESS, I was expecting a 1.78x speedup.
Tradeoffs
While working on this, I discovered many places where compression compactness could be traded for decompression speed. My first attempt at FC8 had a minimum match size of 2 bytes instead of 3, which compressed about 0.7% smaller but was 13% slower to decompress due to the larger number of backrefs. At the other extreme, the introduction of a LIT token without any other changes resulted in the fastest decompression speed of all, about 7% faster than FC8, but the compressed files were about 6% larger, and I decided the tradeoff wasn’t worth it.
I explored many other ideas to improve the compression density, but everything I thought of proved to have only a tiny benefit at best, not enough to justify the impact on decompression speed. An algorithm based on something other than LZ77, or a combination of LZ77 and Huffman coding, would likely have compressed substantially more densely. But decompression of LZ77-based methods is far easier and faster to implement.
Compression Heuristics
It eventually became obvious to me that defining the token format doesn’t tell you much about how to best encode the data in that format. A greedy algorithm seemed to work fairly well, so that’s what I used. At each point in the uncompressed data, the compressor substitutes the best match it can make (if any) between that data and previous data in the history window.
However, there are some examples where choosing a non-optimal match would allow for an even better match later, resulting in better overall compression. This can happen due to quirks in the quantizing of match lengths, or with long runs of repeated bytes which are only partially matched in the previous data. It’s a bit like sacrificing your queen in chess, and sometimes you need to accept a short-term penalty in order to realize a long-term benefit. Better compression heuristics that took this into account could probably squeeze another 1% out of the data, without changing the compression format or the decompressor at all.
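To make the heuristic concrete, here’s a simplified sketch of a greedy encoder loop with an optional one-step lazy lookahead of the kind described above. This is not the FC8 compressor’s actual code: find_longest_match(), emit_literal(), and emit_match() are hypothetical helpers, the real match finder (inherited from LZG’s acceleration code) is much more elaborate than this suggests, and the lookahead threshold would need tuning against real data.

```c
#include <stdint.h>
#include <stddef.h>

#define MIN_MATCH 3

/* Hypothetical stand-ins for the real compressor's routines. */
extern size_t find_longest_match(const uint8_t *data, size_t pos, size_t len,
                                 size_t *match_dist);
extern void emit_literal(uint8_t byte);
extern void emit_match(size_t dist, size_t len);

void compress_greedy(const uint8_t *data, size_t len)
{
    size_t pos = 0;
    while (pos < len) {
        size_t dist, next_dist;
        size_t match = find_longest_match(data, pos, len, &dist);

        if (match < MIN_MATCH) {
            emit_literal(data[pos++]);   /* no usable match: pass a literal through */
            continue;
        }

        /* Lazy lookahead: if deferring by one byte yields a clearly longer match,
           emit a literal now and let the better match win on the next iteration. */
        if (pos + 1 < len) {
            size_t next = find_longest_match(data, pos + 1, len, &next_dist);
            if (next > match + 1) {
                emit_literal(data[pos++]);
                continue;
            }
        }

        emit_match(dist, match);         /* greedy choice: take the match now */
        pos += match;
    }
}
```

In practice the literals would be accumulated and flushed as LIT runs rather than emitted one at a time, and a smarter heuristic would also account for the quantized lengths and the long runs of repeated bytes mentioned above.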