Most people believe that modern compilers generate better-optimized assembly code than humans, but look at this example from AVR-GCC 5.4.0 at the -O2 optimization level:
    7b96: 10 92 34 37   sts  0x3734, r1     ; 0x803734 <tachFlutter>
    7b9a: e0 e0         ldi  r30, 0x00      ; 0
    7b9c: f0 e0         ldi  r31, 0x00      ; 0
    7b9e: a0 91 35 37   lds  r26, 0x3735    ; 0x803735 <driveTachHalfPeriod>
    7ba2: b0 91 36 37   lds  r27, 0x3736    ; 0x803736 <driveTachHalfPeriod+0x1>
    7ba6: ae 1b         sub  r26, r30
    7ba8: bf 0b         sbc  r27, r31
    7baa: b0 93 89 00   sts  0x0089, r27    ; 0x800089 <OCR1AH>
    7bae: a0 93 88 00   sts  0x0088, r26    ; 0x800088 <OCR1AL>
    7bb2: 10 92 95 00   sts  0x0095, r1     ; 0x800095 <TCNT3H>
    7bb6: 10 92 94 00   sts  0x0094, r1     ; 0x800094 <TCNT3L>
    7bba: 32 2d         mov  r19, r2
    7bbc: e0 e0         ldi  r30, 0x00      ; 0
    7bbe: f0 e0         ldi  r31, 0x00      ; 0
    7bc0: f0 93 e3 33   sts  0x33E3, r31    ; 0x8033e3 <currentTrackBytePos+0x1>
    7bc4: e0 93 e2 33   sts  0x33E2, r30    ; 0x8033e2 <currentTrackBytePos>
This is straight-line code with no branching. All registers and memory references are 8-bit. With AVR-GCC, the register r1 always holds the value 0, so the code is doing this: set tachFlutter to 0, load driveTachHalfPeriod, set OCR1A to driveTachHalfPeriod minus 0, set TCNT3 to 0, and set currentTrackBytePos to 0. There’s also a move of r2 to r19, which is used later, and I’m not sure why the compiler placed that instruction here. There are at least three glaring inefficiencies:
- the compiler wastes time loading 0 into r30 and r31, when it could have just used r1
- it does this TWICE, when we know r30 and r31 were already zero after the first time
- it subtracts a constant 0 from driveTachHalfPeriod
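A rough C equivalent of the sequence described above is sketched below. It is an illustration rather than the actual source: the function names `resetTachState` and `checkReset` and the zero-valued `speedAdjust` are hypothetical, the variable widths are inferred from the stores in the listing, and the timer registers are plain variables so the snippet builds on a PC.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-ins so this compiles off-target. On the AVR these
   would be the real timer registers from <avr/io.h>. */
static volatile uint16_t OCR1A, TCNT3;

/* Widths inferred from the listing: one 8-bit store, two 16-bit ones. */
static volatile uint8_t  tachFlutter;
static volatile uint16_t driveTachHalfPeriod = 1234; /* arbitrary demo value */
static volatile uint16_t currentTrackBytePos;

/* Hypothetical: stands for whatever expression ended up as the
   constant 0 that the compiler loads into r30 and r31. */
static const uint16_t speedAdjust = 0;

/* Hypothetical function name; the body mirrors the five steps. */
static void resetTachState(void)
{
    tachFlutter = 0;
    OCR1A = driveTachHalfPeriod - speedAdjust; /* the "minus 0" */
    TCNT3 = 0;
    currentTrackBytePos = 0;
}

/* Quick host-side check of the intended effect. */
static void checkReset(void)
{
    tachFlutter = 5;
    TCNT3 = 99;
    currentTrackBytePos = 77;

    resetTachState();

    assert(tachFlutter == 0);
    assert(OCR1A == 1234);
    assert(TCNT3 == 0);
    assert(currentTrackBytePos == 0);
}
```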
I can maybe understand the subtraction of constant 0, if there’s another code path that jumps to 7ba6 where the value in r30:r31 isn’t 0. But why wouldn’t the compiler make a completely separate path for that, with faster execution speed when the subtracted value is known to be 0, even if the code size is greater? After all, this is -O2, not -Os.
It also appears there’s no optimization for setting multi-byte variables like currentTrackBytePos to zero. Instead of just storing r1 twice for the low and high bytes, the compiler first creates an unnamed 16-bit temporary variable in r30:r31 and sets its value to 0, then stores the unnamed variable at currentTrackBytePos.
This whole block of code could easily be rewritten:
    sts  0x3734, r1     ; 0x803734 <tachFlutter>
    lds  r26, 0x3736    ; 0x803736 <driveTachHalfPeriod+0x1>
    sts  0x0089, r26    ; 0x800089 <OCR1AH>
    lds  r26, 0x3735    ; 0x803735 <driveTachHalfPeriod>
    sts  0x0088, r26    ; 0x800088 <OCR1AL>
    sts  0x0095, r1     ; 0x800095 <TCNT3H>
    sts  0x0094, r1     ; 0x800094 <TCNT3L>
    mov  r19, r2
    sts  0x33E3, r1     ; 0x8033e3 <currentTrackBytePos+0x1>
    sts  0x33E2, r1     ; 0x8033e2 <currentTrackBytePos>
This is much shorter: 10 instructions and 38 bytes instead of 16 instructions and 50 bytes. It also avoids using r27, r30, and r31, so there are more free registers available for other purposes.
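To sanity-check a rewrite like this, the two sequences can be modeled on a PC. The sketch below is a toy model, not an AVR simulator: it implements only the six instructions that appear here, the `Avr` struct and all helper names are made up for illustration, and the cycle counts assume a classic AVR core where lds and sts take 2 cycles and the other instructions take 1. It compares only memory and r19, which assumes r26, r27, r30, and r31 are not needed afterward.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MEM_SIZE = 0x4000 };   /* large enough to cover address 0x3736 */

typedef struct {
    uint8_t  reg[32];
    uint8_t  mem[MEM_SIZE];
    int      carry;
    unsigned insns, bytes, cycles;   /* running cost tally */
} Avr;

static void tally(Avr *a, unsigned bytes, unsigned cycles)
{ a->insns++; a->bytes += bytes; a->cycles += cycles; }

/* Assumed timings: 2-byte ops take 1 cycle, lds/sts are 4 bytes, 2 cycles. */
static void ldi(Avr *a, int rd, uint8_t k) { a->reg[rd] = k; tally(a, 2, 1); }
static void mov(Avr *a, int rd, int rr) { a->reg[rd] = a->reg[rr]; tally(a, 2, 1); }
static void lds(Avr *a, int rd, uint16_t addr)
{ assert(addr < MEM_SIZE); a->reg[rd] = a->mem[addr]; tally(a, 4, 2); }
static void sts(Avr *a, uint16_t addr, int rr)
{ assert(addr < MEM_SIZE); a->mem[addr] = a->reg[rr]; tally(a, 4, 2); }
static void sub(Avr *a, int rd, int rr)
{ int r = a->reg[rd] - a->reg[rr];
  a->carry = r < 0; a->reg[rd] = (uint8_t)r; tally(a, 2, 1); }
static void sbc(Avr *a, int rd, int rr)
{ int r = a->reg[rd] - a->reg[rr] - a->carry;
  a->carry = r < 0; a->reg[rd] = (uint8_t)r; tally(a, 2, 1); }

/* The sequence as emitted by the compiler. */
static void compilerVersion(Avr *a)
{
    sts(a, 0x3734, 1);
    ldi(a, 30, 0);      ldi(a, 31, 0);
    lds(a, 26, 0x3735); lds(a, 27, 0x3736);
    sub(a, 26, 30);     sbc(a, 27, 31);
    sts(a, 0x0089, 27); sts(a, 0x0088, 26);
    sts(a, 0x0095, 1);  sts(a, 0x0094, 1);
    mov(a, 19, 2);
    ldi(a, 30, 0);      ldi(a, 31, 0);
    sts(a, 0x33E3, 31); sts(a, 0x33E2, 30);
}

/* The hand-written replacement. */
static void handVersion(Avr *a)
{
    sts(a, 0x3734, 1);
    lds(a, 26, 0x3736); sts(a, 0x0089, 26);
    lds(a, 26, 0x3735); sts(a, 0x0088, 26);
    sts(a, 0x0095, 1);  sts(a, 0x0094, 1);
    mov(a, 19, 2);
    sts(a, 0x33E3, 1);  sts(a, 0x33E2, 1);
}

/* Same arbitrary starting state for both runs; r1 is left at 0. */
static void initState(Avr *a)
{
    memset(a, 0, sizeof *a);
    memset(a->mem, 0xAA, sizeof a->mem);   /* non-zero filler */
    a->reg[2] = 0x5C;                      /* arbitrary value for r2 */
    a->mem[0x3735] = 0x34;                 /* driveTachHalfPeriod = 0x1234 */
    a->mem[0x3736] = 0x12;
}

static Avr gccRun, handRun;   /* static: each holds a 16 KB memory image */

/* Returns 1 if both sequences leave the same memory and the same r19. */
static int sequencesMatch(void)
{
    initState(&gccRun);
    initState(&handRun);
    compilerVersion(&gccRun);
    handVersion(&handRun);
    return memcmp(gccRun.mem, handRun.mem, MEM_SIZE) == 0
        && gccRun.reg[19] == handRun.reg[19];
}
```

Calling `sequencesMatch()` returns 1 for this starting state, and the tallies come out to 16 instructions, 50 bytes, and 25 cycles for the compiler's version against 10 instructions, 38 bytes, and 19 cycles for the rewrite, under the timing assumption above.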