Can a fast microcontroller replace external glue logic, while also continuing to run application code? This is the third in a series of posts considering the question. It’s part of a potential simplification of my Floppy Emu disk emulator hardware, whose present design combines an MCU and a CPLD for glue logic. For readers that haven’t seen the first two parts, you can find them here. Read these first, including the comments discussion after the post body. Go ahead, I’ll wait.
Thoughts on Floppy Emu Redesign
Thoughts on Low Latency Interrupt Handling
There are several pieces of CPLD glue logic that I’m hoping to replace with interrupt handlers on a Cortex M4 microcontroller, specifically the 120 MHz Atmel SAMD51 Cortex M4. The most challenging is a piece of logic that behaves like a 16:1 mux, and must respond within 500 ns to any change on its address inputs. There’s also a write function that behaves a little like a 4-bit latch, as well as some enable logic. I haven’t yet done any real hardware testing, but I’ve spent many hours reading datasheets, writing code, and examining compiler output. I’ll save you the suspense: I don’t think it’s going to work. But it’s close enough to keep it interesting.
Coding an Interrupt Handler
A 120 MHz MCU means there are 120 clock cycles per microsecond. To meet the 500 ns (half a microsecond) timing requirement for the mux logic, the MCU needs to do its work in 60 clock cycles. Cortex M4 has a built-in interrupt latency of 12 clock cycles before the interrupt handler begins to run, so that leaves just 48 clock cycles to do the actual work. At best that’s enough time for 48 instructions. In reality it will be fewer than 48, due to pipeline issues, cache misses, branches, flash memory wait states, and the fact that some instructions just inherently take more than one clock cycle. But 48 is the theoretical upper bound.
I spent a while digging through the heavily-abstracted (or should I say obfuscated) code of Atmel Start, the hardware abstraction library provided for the SAMD51. Peeling back the layers of Start, I wrote a minimal interrupt handler that directly manipulates the MCU configuration registers for maximum speed, rather than using the Start API. I ignored the write latch and the enable logic for the moment, and just wrote an interrupt handler for the 16:1 mux function. Bearing in mind this code has never been run on real hardware, here it is:
volatile uint32_t selectedDriveRegister; volatile uint32_t driveRegisters[16]; void EIC_Handler(void) { // a shared interrupt handler for changes on five different external pins: // EXTINT0 = PA00 = SEL - interrupt on rising or falling edge // EXTINT1 = PA01 = PH0 - interrupt on rising or falling edge // EXTINT2 = PA02 = PH1 - interrupt on rising or falling edge // EXTINT3 = PA03 = PH2 - interrupt on rising or falling edge // EXTINT4 = PA04 = PH3 - interrupt on rising edge // PA11 = output uint32_t flags = EIC->INTFLAG.reg; // a 1 bit means a change was detected on that pin // clear EXTINT0-4 flags, if they were set. EIC->INTFLAG.reg = (flags & 0x1F); // writing a 1 bit clears the interrupt flags // mask the 4 lowest bits and use them as the address of the desired drive register selectedDriveRegister = PORT->Group[GPIO_PORTA].IN.reg & 0xF; // don't need to check if drive is enabled. // output enable will be handled externally in a level shifter. switch (selectedDriveRegister) { case 7: // motor tachometer // enable peripheral multiplexer selection PORT->Group[GPIO_PORTA].PINCFG[11].bit.PMUXEN = 1; // choose TIMER/COUNTER1 peripheral PORT->Group[GPIO_PORTA].PMUX[11>>1].bit.PMUXO = MUX_PA11E_TC1_WO1; break; case 8: // disk data side 0 // enable peripheral multiplexer selection PORT->Group[GPIO_PORTA].PINCFG[11].bit.PMUXEN = 1; // choose SERCOM0 peripheral PORT->Group[GPIO_PORTA].PMUX[11>>1].bit.PMUXO = MUX_PA11C_SERCOM0_PAD3; // TODO: main loop must check selectedDriveRegister to see if it's 8 or 9 when adding // new bytes to SPI break; case 9: // disk data side 1 // enable peripheral multiplexer selection PORT->Group[GPIO_PORTA].PINCFG[11].bit.PMUXEN = 1; // choose SERCOM0 peripheral PORT->Group[GPIO_PORTA].PMUX[11>>1].bit.PMUXO = MUX_PA11C_SERCOM0_PAD3; // TODO: main loop must check selectedDriveRegister to see if it's 8 or 9 when adding // new bytes to SPI break; default: // disk state flags and configuration constants // disable peripheral multiplexer selection, return to standard GPIO PORT->Group[GPIO_PORTA].PINCFG[11].bit.PMUXEN = 0; // set the output pin high or low, according to the register state if (driveRegisters[selectedDriveRegister]) PORT->Group[GPIO_PORTA].OUTSET.reg = (1 << 11); else PORT->Group[GPIO_PORTA].OUTCLR.reg = (1 << 11); // TODO: also change the PA11 output in the main loop, if the selected register // changes its value break; } }
You'll notice that EXTINT4 (the PH3 signal on the disk interface) isn't actually used in this code, but it will be needed later for the write latch.
The default of the switch statement is about what you'd expect: it uses four of the inputs to construct a 4-bit address, then uses that address to access an array of 16 internal drive registers. Then it sets the output pin high or low, depending on the internal register value.
Addresses 7, 8, and 9 get special handling. These aren't really registers, but are pass-throughs of the drive motor tachometer signal or of the instantaneous read head data from the top or bottom of the disk. They're not static values, but rather are constantly changing streams of data. I plan to implement the tachometer using the timer/counter peripheral, and the read head data using the SPI peripheral. All of these functions share the same pin, PA11. The code must enable and disable the peripheral pin remapping functions as needed.
After finishing this speculative interrupt handler code, I compiled it in Atmel Studio, using gcc with -O2 optimization. Then I viewed the .lss to see what code the compiler generated:
00000c70 <EIC_Handler>: c70: 481f ldr r0, [pc, #124] ; (cf0 <EIC_Handler+0x80>) c72: 4b20 ldr r3, [pc, #128] ; (cf4 <EIC_Handler+0x84>) c74: 6942 ldr r2, [r0, #20] c76: 4920 ldr r1, [pc, #128] ; (cf8 <EIC_Handler+0x88>) c78: f002 021f and.w r2, r2, #31 c7c: 6142 str r2, [r0, #20] c7e: 6a1a ldr r2, [r3, #32] c80: f002 020f and.w r2, r2, #15 c84: 600a str r2, [r1, #0] c86: 680a ldr r2, [r1, #0] c88: 2a08 cmp r2, #8 c8a: d012 beq.n cb2 <EIC_Handler+0x42> c8c: 2a09 cmp r2, #9 c8e: d010 beq.n cb2 <EIC_Handler+0x42> c90: 2a07 cmp r2, #7 c92: f893 204b ldrb.w r2, [r3, #75] ; 0x4b c96: d01a beq.n cce <EIC_Handler+0x5e> c98: f36f 0200 bfc r2, #0, #1 c9c: f883 204b strb.w r2, [r3, #75] ; 0x4b ca0: 4816 ldr r0, [pc, #88] ; (cfc <EIC_Handler+0x8c>) ca2: 680a ldr r2, [r1, #0] ca4: f850 2022 ldr.w r2, [r0, r2, lsl #2] ca8: b9ea cbnz r2, ce6 <EIC_Handler+0x76> caa: f44f 6200 mov.w r2, #2048 ; 0x800 cae: 615a str r2, [r3, #20] cb0: 4770 bx lr cb2: f893 204b ldrb.w r2, [r3, #75] ; 0x4b cb6: f042 0201 orr.w r2, r2, #1 cba: f883 204b strb.w r2, [r3, #75] ; 0x4b cbe: f893 2035 ldrb.w r2, [r3, #53] ; 0x35 cc2: 2102 movs r1, #2 cc4: f361 1207 bfi r2, r1, #4, #4 cc8: f883 2035 strb.w r2, [r3, #53] ; 0x35 ccc: 4770 bx lr cce: f042 0201 orr.w r2, r2, #1 cd2: f883 204b strb.w r2, [r3, #75] ; 0x4b cd6: f893 2035 ldrb.w r2, [r3, #53] ; 0x35 cda: 2104 movs r1, #4 cdc: f361 1207 bfi r2, r1, #4, #4 ce0: f883 2035 strb.w r2, [r3, #53] ; 0x35 ce4: 4770 bx lr ce6: f44f 6200 mov.w r2, #2048 ; 0x800 cea: 619a str r2, [r3, #24] cec: 4770 bx lr cee: bf00 nop cf0: 40002800 .word 0x40002800 cf4: 41008000 .word 0x41008000 cf8: 2000063c .word 0x2000063c cfc: 200005f8 .word 0x200005f8
I don't know much about ARM assembly, but I can count 44 instructions. Already that looks pretty dubious for execution in 48 clock cycles. A couple of cache misses, or multi-cycle branches, or anything else that requires more than one clock per instruction, and the interrupt handler will be too slow to work. And if I attempt to add the missing write latch logic, the code will almost certainly be too slow. Even just an if() test to see whether the write latch was written would probably be too much extra code.
Meanwhile the microcontroller will be running the main application, responding to user input, updating the display, and streaming disk data. Occasionally the main loop will need to do an atomic operation, requiring interrupts to be disabled for a few clock cycles. If an external pin changes state during that time, the interrupt handler will be delayed by a few clock cycles.
The interrupt handler shown above is appropriate for one of the Floppy Emu's many disk emulation modes. In other modes, a different behavior is needed. A real interrupt handler would need some more if() checks at the beginning to perform different actions depending on the current emulation mode. This would add a few clock cycles more.
Even reaching this "almost fast enough" level would require some minor heroics. I'm fairly certain the interrupt handler code would need to be in RAM, not flash, to minimize or eliminate flash wait states. Even RAM might not be enough - it might need to be placed in the special "tightly coupled memory" region. The vector table itself probably also needs to be relocated from flash to RAM or TCM. This should be theoretically possible, but it's the sort of uncommon thing that's often difficult to find good documentation or examples about, and that eats up lots of development time.
To make a long story short - it doesn't look like it's going to work. And even if it did work, it might be such a pain in the ass that it negates any gain I'd get by eliminating the CPLD. And yet it looks pretty close to working, at least within a factor of two if not less. If the timing requirement were 1000 ns instead of 500 ns, I think I could make it work.
Other Interrupt Oddities
According to the docs I've read, interrupt handlers on ARM are just like any other function. There's no special interrupt prologue or epilogue, and there's no RTI return from interrupt instruction. And yet gcc does specify an interrupt attribute for ARM functions:
__attribute__ ((interrupt))
The code in Atmel Start doesn't appear to use that attribute for its interrupt handlers. So is it needed or not? What does it do? As best as I can tell, it adds some extra code that aligns the stack pointer upon entry to the interrupt handler, but why? If I add the interrupt attribute to my EIC_Handler(), it gets many instructions longer.
Another unanswered question is how to handle nested interrupts. EIC_Handler wouldn't be the only interrupt handler in the firmware, but it should be the highest priority. If another interrupt handler is running when an external pin changes state, that handler should be pre-empted and EIC_Handler should be started. The Cortex M4 supports nested interrupts, but is there any extra code needed in the interrupt handlers to make it work correctly? Extra registers that must be pushed and popped? I'm not sure, but this discussion suggests the answer is yes. If so, that would add still more instructions to the interrupt handler, making it even slower.