Are you a 68K assembly language guru, or just good at optimizing code? I’m working on a project that’s a replacement ROM for old 68K-based Macintosh computers, part of which involves decompressing a ~5MB disk image from ROM into RAM to boot the machine. This needs to happen fast, to avoid a lengthy wait whenever the computer boots up. I selected liblzg specifically for its simplicity and speed of decompression, even though it doesn’t compress as well as some alternatives. And the whole thing works! But I want it to be faster.
The compressor is a regular Windows/Mac/Linux program, and the decompressor is a hand-written 68000 assembly routine from the liblzg authors. It works well, but decompression on a Mac IIsi or IIci only runs at about 600 KB/sec of decompressed data, so it takes around 5-20 seconds to decompress the whole ROM disk, depending on its size and the Mac’s CPU speed.
The meat of the 68000 decompression routine isn’t too long. It’s a fairly simple Lempel-Ziv algorithm that encodes repeated data as (distance,length) references to the first appearance of the data. There’s a brief summary of lzg’s specific algorithm here. Anyone see any obvious ways to substantially optimize this code? It was written for a vanilla 68000, but for this Mac ROM it’ll always be running on a 68020 or ‘030. Maybe there are some ‘030-specific instructions that could be used to help speed it up? Or some kind of cache prefetch? There’s also some bounds-checking code that could be removed, though the liblzg web site says this provides only a ~12% improvement.
The meat-of-the-meat where data gets copied from source to dest is a two-instruction dbf loop:
_loop1:
        move.b  (a4)+,(a1)+
        dbf     d6,_loop1
If any ‘030-specific tricks could improve that, it would help the most. One improvement would be to copy 4 bytes at a time with move.l instead of move.b. But the additional instructions needed to handle 4-byte alignment and 1-3 extra bytes might outweigh the savings for smaller blocks being copied. I think the average block size is around 10 bytes, though some are up to 127 bytes.
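For illustration, here's a rough, untested sketch of what a longword-based inner copy might look like. The labels are made up, and it assumes d5 is free to clobber at this point (the offset has already been subtracted into a4). It also skips the alignment fixup entirely, since the '020/'030 allow misaligned longword accesses, just with a cycle penalty. One important caveat: this is only safe when the copy offset is at least 4, because overlapping copies (the RLE case) still have to be done a byte at a time, so the byte loop can't simply be deleted.

// Sketch only, not tested.  On entry: d6 = length-1, a4 = source, a1 = dest.
// Requires offset >= 4 so source and destination never overlap within a longword.
        addq.w  #1,d6           // d6 = length in bytes (1..128)
        move.w  d6,d5
        lsr.w   #2,d6           // d6 = number of whole longwords
        and.w   #3,d5           // d5 = leftover bytes (0..3)
        bra.s   _lentry
_lloop: move.l  (a4)+,(a1)+     // copy 4 bytes per iteration
_lentry:
        dbf     d6,_lloop
        bra.s   _bentry
_bloop: move.b  (a4)+,(a1)+     // copy the remaining 0-3 bytes
_bentry:
        dbf     d5,_bloop

Whether that actually beats the plain byte loop for a ~10 byte average copy is something I'd want to measure rather than assume.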
The loop might also be unrolled, for certain pre-defined block sizes.
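Taking unrolling to its extreme, the dbf loop could be replaced by a computed jump into a long run of move.b instructions, so each copy executes exactly "length" moves with no loop overhead at all. Here's a very rough sketch of the idea (untested; it needs the '020/'030 full-format indexed addressing, because jumping back into a 256-byte run exceeds the 68000's 8-bit index displacement, and the assembler may need to be coaxed into emitting it):

// Sketch only.  d6 = length-1 (max 127), so the run needs 128 copies of
// move.b (a4)+,(a1)+, each of which assembles to 2 bytes.  d5 assumed free.
        move.w  d6,d5
        addq.w  #1,d5           // d5 = length
        add.w   d5,d5           // d5 = 2 * length
        neg.w   d5              // d5 = -(2 * length)
        jmp     _run_end(pc,d5.w)   // land exactly 'length' moves before _run_end
        // ... 128 consecutive copies of the next instruction go here ...
        move.b  (a4)+,(a1)+
        move.b  (a4)+,(a1)+
_run_end:
        // fall through to the end-of-block checks

The same overlap caveat applies if the moves inside the run were ever widened beyond byte size.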
Here’s the entirety of the decompressor’s main loop:
_LZG_LENGTH_DECODE_LUT:
        dc.b    1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
        dc.b    17,18,19,20,21,22,23,24,25,26,27,28,34,47,71,127

_lzg_decompress:
        // a0 = src
        // a1 = dst
        // a2 = inEnd = in + inSize
        // a3 = outEnd = out + decodedSize
        // a6 = out
        move.b  (a0)+,d1        // d1 = marker1
        move.b  (a0)+,d2        // d2 = marker2
        move.b  (a0)+,d3        // d3 = marker3
        move.b  (a0)+,d4        // d4 = marker4

        lea     _LZG_LENGTH_DECODE_LUT(pc),a5  // a5 = _LZG_LENGTH_DECODE_LUT

        // Main decompression loop
        move.l  #2056,d0        // Keep the constant 2056 in d0 (for marker1)
        cmp.l   a2,a0
_mainloop:
        bcc.s   _fail           // Note: cmp.l a2,a0 must be performed prior to this!
        move.b  (a0)+,d7        // d7 = symbol

        cmp.b   d1,d7           // marker1?
        beq.s   _marker1
        cmp.b   d2,d7           // marker2?
        beq.s   _marker2
        cmp.b   d3,d7           // marker3?
        beq.s   _marker3
        cmp.b   d4,d7           // marker4?
        beq.s   _marker4

_literal:
        cmp.l   a3,a1
        bcc.s   _fail
        move.b  d7,(a1)+
        cmp.l   a2,a0
        bcs.s   _mainloop

// We're done
_done:
        // irrelevant code removed
        bra     _exit

// marker4 - "Near copy (incl. RLE)"
_marker4:
        cmp.l   a2,a0
        bcc.s   _fail
        moveq   #0,d5
        move.b  (a0)+,d5
        beq.s   _literal        // Single occurance of the marker symbol (rare)
        move.l  d5,d6
        and.b   #0x1f,d6
        move.b  (a5,d6.w),d6    // length-1 = _LZG_LENGTH_DECODE_LUT[b & 0x1f]
        lsr.b   #5,d5
        addq.w  #1,d5           // offset = (b >> 5) + 1
        bra.s   _copy

// marker3 - "Short copy"
_marker3:
        cmp.l   a2,a0
        bcc.s   _fail
        moveq   #0,d5
        move.b  (a0)+,d5
        beq.s   _literal        // Single occurance of the marker symbol (rare)
        move.l  d5,d6
        lsr.b   #6,d6
        addq.w  #2,d6           // length-1 = (b >> 6) + 2
        and.b   #0x3f,d5
        addq.w  #8,d5           // offset = (b & 0x3f) + 8
        bra.s   _copy

// marker2 - "Medium copy"
_marker2:
        cmp.l   a2,a0
        bcc.s   _fail
        moveq   #0,d5
        move.b  (a0)+,d5
        beq.s   _literal        // Single occurance of the marker symbol (rare)
        cmp.l   a2,a0
        bcc.s   _fail
        move.l  d5,d6
        and.b   #0x1f,d6
        move.b  (a5,d6.w),d6    // length-1 = _LZG_LENGTH_DECODE_LUT[b & 0x1f]
        lsl.w   #3,d5
        move.b  (a0)+,d5
        addq.w  #8,d5           // offset = (((b & 0xe0) << 3) | b2) + 8
        bra.s   _copy

// marker1 - "Distant copy"
_marker1:
        cmp.l   a2,a0
        bcc.s   _fail
        moveq   #0,d5
        move.b  (a0)+,d5
        beq.s   _literal        // Single occurance of the marker symbol (rare)
        lea     1(a0),a4
        cmp.l   a2,a4
        bcc.s   _fail
        move.l  d5,d6
        and.b   #0x1f,d6
        move.b  (a5,d6.w),d6    // length-1 = _LZG_LENGTH_DECODE_LUT[b & 0x1f]
        lsr.w   #5,d5
        swap    d5
        move.b  (a0)+,d5
        lsl.w   #8,d5
        move.b  (a0)+,d5
        add.l   d0,d5           // offset = (((b & 0xe0) << 11) | (b2 << 8) | (*src++)) + 2056

// Copy corresponding data from history window
// d5 = offset
// d6 = length-1
_copy:
        lea     (a1,d6.l),a4
        cmp.l   a3,a4
        bcc     _fail
        move.l  a1,a4
        sub.l   d5,a4
        cmp.l   a6,a4
        bcs     _fail
_loop1:
        move.b  (a4)+,(a1)+
        dbf     d6,_loop1
        cmp.l   a2,a0
        bcs     _mainloop
        bra     _done
Another thing to note is that about half of all the data is literals rather than (distance,length) markers, and goes to the _literal branch above. That involves an awful lot of instructions to copy a single byte. A faster method of determining whether a byte is a marker or a literal would help - I plan to try a 256-entry lookup table instead of the four compare and branch instructions.
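As a sketch of how that lookup-table dispatch might work (untested, and the table name is made up): the 256-entry table would hold handler addresses, filled in once before the main loop, with every entry pointing at _literal except the four marker bytes, whose entries point at _marker1 through _marker4.

// Sketch only: the four cmp/beq pairs replaced by a jump through a
// 256-entry x 4-byte table of handler addresses.  _dispatch_tab is a
// hypothetical label; a4 is free to use as scratch at this point.
_mainloop:
        bcc.s   _fail
        moveq   #0,d7
        move.b  (a0)+,d7        // d7 = symbol, zero-extended for indexing
        lea     _dispatch_tab(pc),a4
        move.l  (a4,d7.w*4),a4  // '020+ scaled index: fetch handler address
        jmp     (a4)            // goes to _literal or _marker1.._marker4

Whether one lookup and an indirect jump really beats four byte compares is another thing I'll have to measure; the win, if any, would mostly be on literals, which currently fail all four compares.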
My final idea would involve changing the lzg algorithm itself, making the compression ratio slightly worse. For longish sequences of literals, the decompressor just copies bytes from input to output, but it goes all the way around the main loop, through the _literal path, for each byte. I'm thinking of introducing a 5th marker byte that means "copy the next N bytes directly to output", for some hand-tuned value of N. Then those N bytes could be copied using a much higher-performance loop.
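The decoder side of that hypothetical marker might look something like this (a sketch only: this marker doesn't exist in standard LZG, the compressor would have to be modified to emit it, and I've left the bounds checks out for brevity):

// Hypothetical "literal run" marker: the next byte holds N-1, followed by
// N raw bytes to copy straight from input to output.  Bounds checks omitted.
_marker5:
        moveq   #0,d6
        move.b  (a0)+,d6        // d6 = N-1
_litrun:
        move.b  (a0)+,(a1)+     // copy literals input -> output
        dbf     d6,_litrun
        cmp.l   a2,a0
        bcs     _mainloop
        bra     _done

That _litrun loop could then use the same move.l or unrolling tricks as the history copy, and unlike the history copy it can never overlap with its destination.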