
Optimizing Assembly (Fast 68K Decompression)


Are you a 68K assembly language guru, or just good at optimizing code? I’m working on a project that’s a replacement ROM for old 68K-based Macintosh computers, part of which involves decompressing a ~5MB disk image from ROM into RAM to boot the machine. This needs to happen fast, to avoid a lengthy wait whenever the computer boots up. I selected liblzg specifically for its simplicity and speed of decompression, even though it doesn’t compress as well as some alternatives. And the whole thing works! But I want it to be faster.

The compressor is a regular Windows/Mac/Linux program, and the decompressor is a hand-written 68000 assembly routine from the lzg authors. It works well, but decompression on a Mac IIsi or IIci only runs at about 600K/sec of decompressed data, so it takes around 5-20 seconds to decompress the whole ROM disk, depending on its size and the Mac's CPU speed.

The meat of the 68000 decompression routine isn’t too long. It’s a fairly simple Lempel-Ziv algorithm that encodes repeated data as (distance,length) references to the first appearance of the data. There’s a brief summary of lzg’s specific algorithm here. Anyone see any obvious ways to substantially optimize this code? It was written for a vanilla 68000, but for this Mac ROM it’ll always be running on a 68020 or ‘030. Maybe there are some ‘030-specific instructions that could be used to help speed it up? Or some kind of cache prefetch? There’s also some bounds-checking code that could be removed, though the liblzg web site says this provides only a ~12% improvement.

The meat-of-the-meat where data gets copied from source to dest is a two-instruction dbf loop:

_loop1:	move.b	(a4)+,(a1)+
	dbf	d6,_loop1

If any ‘030-specific tricks could improve that, it would help the most. One improvement would be to copy 4 bytes at a time with move.l instead of move.b. But the additional instructions needed to handle 4-byte alignment and 1-3 extra bytes might outweigh the savings for smaller blocks being copied. I think the average block size is around 10 bytes, though some are up to 127 bytes.
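Here's a minimal, untested sketch of what a longword version might look like, slotting in just before _loop1 after the existing _copy bounds checks have set up a4 (d5 = offset, d6 = length-1, a4 = copy source, a1 = dest, as in the full listing below). It falls back to the byte loop when the offset is under 4, since a longword read would then overlap bytes not yet written, and it leans on the fact that the '020 and '030 allow misaligned longword accesses (slower than aligned ones, but legal, unlike on the 68000):

_copy_long:
	cmp.w	#4,d5				// offset < 4: longword reads would
	bcs.s	_loop1				//  overlap unwritten bytes, use byte loop
	move.w	d6,d7				// d7 is free here, symbol no longer needed
	addq.w	#1,d7				// d7 = byte count (d6 holds length-1)
	lsr.w	#2,d7				// d7 = number of whole longwords
	beq.s	_loop1				// under 4 bytes: byte loop handles it all
	subq.w	#1,d7
_loopl:	move.l	(a4)+,(a1)+			// 4 bytes per iteration; misaligned
	dbf	d7,_loopl			//  accesses are OK on '020/'030
	addq.w	#1,d6
	and.w	#3,d6				// d6 = 0-3 leftover bytes
	beq.s	_copied
	subq.w	#1,d6
_loopb:	move.b	(a4)+,(a1)+			// copy the stragglers
	dbf	d6,_loopb
_copied:
	cmp.l	a2,a0				// then continue as the existing code does
	bcs	_mainloop
	bra	_done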

The loop might also be unrolled, for certain pre-defined block sizes.
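Since each move.b (a4)+,(a1)+ assembles to a single 2-byte word, one way to unroll is a computed jump that lands partway into a run of identical moves, executing exactly as many as the block needs. A hedged, untested sketch for blocks of 8 bytes or fewer, falling back to the plain loop for anything longer:

	cmp.w	#8,d6				// length-1 >= 8? too big for the
	bcc.s	_loop1				//  unrolled run, use the plain loop
	moveq	#7,d7
	sub.w	d6,d7				// d7 = 8 - length
	add.w	d7,d7				// 2 bytes per move.b instruction
	jmp	_moves(pc,d7.w)			// skip the moves we don't need
_moves:
	move.b	(a4)+,(a1)+			// 8 identical one-word moves
	move.b	(a4)+,(a1)+
	move.b	(a4)+,(a1)+
	move.b	(a4)+,(a1)+
	move.b	(a4)+,(a1)+
	move.b	(a4)+,(a1)+
	move.b	(a4)+,(a1)+
	move.b	(a4)+,(a1)+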

Here’s the entirety of the decompressor’s main loop:

_LZG_LENGTH_DECODE_LUT:
	dc.b	1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
	dc.b	17,18,19,20,21,22,23,24,25,26,27,28,34,47,71,127
	
_lzg_decompress:
	// a0 = src
	// a1 = dst
	// a2 = inEnd = in + inSize
	// a3 = outEnd = out + decodedSize
	// a6 = out
	move.b	(a0)+,d1			// d1 = marker1
	move.b	(a0)+,d2			// d2 = marker2
	move.b	(a0)+,d3			// d3 = marker3
	move.b	(a0)+,d4			// d4 = marker4
 
	lea		_LZG_LENGTH_DECODE_LUT(pc),a5	// a5 = _LZG_LENGTH_DECODE_LUT
	
	// Main decompression loop
	move.l	#2056,d0			// Keep the constant 2056 in d0 (for marker1)
	cmp.l	a2,a0
_mainloop:
	bcc.s	_fail				// Note: cmp.l a2,a0 must be performed prior to this!
	move.b	(a0)+,d7			// d7 = symbol
 
	cmp.b	d1,d7				// marker1?
	beq.s	_marker1
	cmp.b	d2,d7				// marker2?
	beq.s	_marker2
	cmp.b	d3,d7				// marker3?
	beq.s	_marker3
	cmp.b	d4,d7				// marker4?
	beq.s	_marker4
 
_literal:
	cmp.l	a3,a1
	bcc.s	_fail
	move.b	d7,(a1)+
	cmp.l	a2,a0
	bcs.s	_mainloop
 
	// We're done
_done:
	// irrelevant code removed
	bra _exit
 
	// marker4 - "Near copy (incl. RLE)"
_marker4:
	cmp.l	a2,a0
	bcc.s	_fail
	moveq	#0,d5
	move.b	(a0)+,d5
	beq.s	_literal			// Single occurrence of the marker symbol (rare)
	move.l	d5,d6
	and.b	#0x1f,d6
	move.b	(a5,d6.w),d6		// length-1 = _LZG_LENGTH_DECODE_LUT[b & 0x1f]
	lsr.b	#5,d5
	addq.w	#1,d5				// offset = (b >> 5) + 1
	bra.s	_copy
 
	// marker3 - "Short copy"
_marker3:
	cmp.l	a2,a0
	bcc.s	_fail
	moveq	#0,d5
	move.b	(a0)+,d5
	beq.s	_literal			// Single occurrence of the marker symbol (rare)
	move.l	d5,d6
	lsr.b	#6,d6
	addq.w	#2,d6				// length-1 = (b >> 6) + 2
	and.b	#0x3f,d5
	addq.w	#8,d5				// offset = (b & 0x3f) + 8
	bra.s	_copy
 
	// marker2 - "Medium copy"
_marker2:
	cmp.l	a2,a0
	bcc.s	_fail
	moveq	#0,d5
	move.b	(a0)+,d5
	beq.s	_literal			// Single occurrence of the marker symbol (rare)
	cmp.l	a2,a0
	bcc.s	_fail
	move.l	d5,d6
	and.b	#0x1f,d6
	move.b	(a5,d6.w),d6		// length-1 = _LZG_LENGTH_DECODE_LUT[b & 0x1f]
	lsl.w	#3,d5
	move.b	(a0)+,d5
	addq.w	#8,d5				// offset = (((b & 0xe0) << 3) | b2) + 8
	bra.s	_copy
 
	// marker1 - "Distant copy"
_marker1:
	cmp.l	a2,a0
	bcc.s	_fail
	moveq	#0,d5
	move.b	(a0)+,d5
	beq.s	_literal			// Single occurrence of the marker symbol (rare)
	lea		1(a0),a4
	cmp.l	a2,a4
	bcc.s	_fail
	move.l	d5,d6
	and.b	#0x1f,d6
	move.b	(a5,d6.w),d6		// length-1 = _LZG_LENGTH_DECODE_LUT[b & 0x1f]
	lsr.w	#5,d5
	swap	d5
	move.b	(a0)+,d5
	lsl.w	#8,d5
	move.b	(a0)+,d5
	add.l	d0,d5				// offset = (((b & 0xe0) << 11) | (b2 << 8) | (*src++)) + 2056
 
	// Copy corresponding data from history window
	// d5 = offset
	// d6 = length-1
_copy:
	lea		(a1,d6.l),a4
	cmp.l	a3,a4
	bcc	_fail
 
	move.l	a1,a4
	sub.l	d5,a4
 
	cmp.l	a6,a4
	bcs	_fail
 
_loop1:	move.b	(a4)+,(a1)+
	dbf	d6,_loop1
 
	cmp.l	a2,a0
	bcs	_mainloop
	bra	_done

Another thing to note is that about half of all the data is literals rather than (distance,length) markers, and goes to the _literal branch above. That involves an awful lot of instructions to copy a single byte. A faster method of determining whether a byte is a marker or a literal would help - I plan to try a 256-entry lookup table instead of the four compare and branch instructions.
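A hedged sketch of how that might work. It assumes a hypothetical 288-byte scratch buffer in RAM (labeled _scratch here) holding a copy of the 32-byte length LUT followed by a 256-byte marker table, so the existing a5 base register can serve both tables; the marker bytes are only known at runtime, which is why the table has to live in RAM rather than ROM. None of this is tested:

	// One-time setup before _mainloop
	lea	_LZG_LENGTH_DECODE_LUT(pc),a4
	lea	_scratch,a5			// hypothetical 288-byte RAM buffer
	moveq	#31,d7
_cpy:	move.b	(a4)+,(a5)+			// copy the length LUT to RAM
	dbf	d7,_cpy
	move.w	#255,d7
_clr:	clr.b	(a5)+				// zero the 256 marker entries
	dbf	d7,_clr
	lea	_scratch,a5			// a5 = table base, as before
	moveq	#0,d7
	move.b	d1,d7
	st	32(a5,d7.w)			// flag marker1 (st writes 0xff)
	move.b	d2,d7
	st	32(a5,d7.w)			// flag marker2
	move.b	d3,d7
	st	32(a5,d7.w)			// flag marker3
	move.b	d4,d7
	st	32(a5,d7.w)			// flag marker4

	// Main loop: one lookup instead of four compares for literals
_mainloop:
	bcc.s	_fail
	moveq	#0,d7
	move.b	(a0)+,d7			// d7 = symbol, zero-extended for indexing
	tst.b	32(a5,d7.w)			// zero means literal (the common case)
	beq.s	_literal
	cmp.b	d1,d7				// rare path: which marker was it?
	beq.s	_marker1
	cmp.b	d2,d7
	beq.s	_marker2
	cmp.b	d3,d7
	beq.s	_marker3
	bra.s	_marker4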

My final idea would involve changing the lzg algorithm itself, and making the compression slightly worse. For longish sequences of literals, the decompressor just copies bytes from input to output, but it goes through the whole _literal loop for each byte. I'm thinking of introducing a 5th marker byte that means "copy the next N bytes directly to output", for some hand-tuned value of N. Then those N bytes could be copied using a much higher performance loop.
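On the decoder side, a hedged sketch of what the handler for this hypothetical 5th marker might look like, mirroring the structure of the existing marker handlers (the marker byte itself would be chosen by the compressor and dispatched like the other four, or via the lookup table above):

	// marker5 - "Raw literal run" (hypothetical)
_marker5:
	cmp.l	a2,a0
	bcc.s	_fail
	moveq	#0,d5
	move.b	(a0)+,d5			// d5 = N, the raw run length
	beq.s	_literal			// zero = literal marker byte, as above
	lea	(a0,d5.l),a4
	cmp.l	a2,a4				// input must still hold N bytes
	bhi.s	_fail
	lea	(a1,d5.l),a4
	cmp.l	a3,a4				// and output must have room for them
	bhi.s	_fail
	subq.w	#1,d5
_loopr:	move.b	(a0)+,(a1)+			// straight input-to-output copy;
	dbf	d5,_loopr			//  this loop too could move longwords
	cmp.l	a2,a0
	bcs	_mainloop
	bra	_done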

