Because nothing is actually done in the interrupt handlers (they're just a wake-up call), the interrupt routines should be as tiny as possible.
Here's our best attempt:
ISR( TIMER1_COMPA_vect, ISR_NAKED) { asm("reti"); }
This simply returns as soon as it arrives.
Theoretically, we could do better by inserting the RETI instruction directly into the interrupt vector table, but the runes needed to do this are beyond our wit and would probably frighten the compiler.
A function lookup table is a method by which a particular function is chosen for calling (possibly later). It has replaced large switch statements in a couple of places.
Advantages to function lookup tables:
Disadvantages to function lookup tables:
Example:
The following function lookup table is used to handle control characters. It decides which character processing routine should be called (later on, during the sync routine). It is used within character polling routines, hence a fixed lookup overhead is essential.
// array to lookup which character handler should be called for which control codes.
process_char * PROGMEM f_control_lookup[ 32 ] = {
    nul_process,     // 0 [does housekeeping]
    control_default, // 1
    control_default, // 2
    control_default, // 3
    control_default, // 4
    control_default, // 5
    control_default, // 6
    control_BEL,     // 7
    control_BS,      // 8
    control_TAB,     // 9
    control_LF,      // 10
    control_default, // 11
    control_FF,      // 12
    control_CR,      // 13
    control_default, // 14
    control_default, // 15
    control_DLE,     // 16
    control_default, // 17
    control_default, // 18
    control_default, // 19
    control_default, // 20
    control_default, // 21
    control_default, // 22
    control_default, // 23
    control_CAN,     // 24
    control_default, // 25
    control_default, // 26
    control_ESC,     // 27
    control_default, // 28
    control_default, // 29
    control_default, // 30
    control_default, // 31
};
and the lookup is carried out as follows:
g_char_process = (char_process *) pgm_read_word( &f_control_lookup[ g_char ] ) ; // 16 cycle lookup time.
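For readers without an AVR toolchain to hand, the same idea can be sketched in portable C. This is only an illustration under assumptions: the table lives in ordinary RAM rather than PROGMEM (so no pgm_read_word is needed), and the names char_handler, handle_bel, handle_default, control_table_init and lookup_control are hypothetical and not taken from the project.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handler type: takes nothing, returns nothing. */
typedef void (*char_handler)(void);

/* Hypothetical counters so the effect of each handler can be observed. */
static int bell_count;
static int default_count;

static void handle_default(void) { default_count++; }
static void handle_bel(void)     { bell_count++; }

/* One slot per control code 0..31. Only code 7 (BEL) is given a
   specific handler here; the remaining slots are filled at start-up. */
static char_handler control_table[32] = {
    [7] = handle_bel,
};

/* Fill every empty slot with the default handler (call once). */
static void control_table_init(void)
{
    for (uint8_t i = 0; i < 32; i++) {
        if (control_table[i] == NULL) {
            control_table[i] = handle_default;
        }
    }
}

/* Choose the handler now; the caller may invoke it later.
   The cost is one indexed load, whichever code is looked up. */
static char_handler lookup_control(uint8_t c)
{
    return control_table[c & 0x1F]; /* mask keeps the index inside the table */
}
```

A caller would run control_table_init() once, store the result of lookup_control(c) in a variable, and call it when convenient. The point of the design is that the lookup takes the same time for every code, unlike a chain of comparisons.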
Loop unrolling is a commonly used method by which speed of execution is bought at the expense of code size. When repeating a very small segment of code (possibly even a single instruction), the overhead of checking whether you've reached the end of the loop can become significant.
Example:
To clear 38 bytes to 0, the following loop might be used:
for( uint8_t i = 38 ; i > 0 ; i-- ) { *char_ptr++ = 0; }
This would compile nicely to a loop with the following attributes:
If the loop is 'unrolled' to:
*char_ptr++ = 0; //clear byte 1
*char_ptr++ = 0; //clear byte 2
*char_ptr++ = 0; //clear byte 3
*char_ptr++ = 0; //clear byte 4
...
*char_ptr++ = 0; //clear byte 37
*char_ptr++ = 0; //clear byte 38
then the attributes are:
This execution-speed improvement is needed in several routines (for clearing the screen, row contents etc.).
The problem with loop unrolling is that we don't always want to clear 38 bytes every time. What if we only want to clear 10 bytes? Do we need to write a new routine?
All we need to do is jump into the correct point in the unrolled loop. If we only want to clear 10 bytes, we need to jump to the last 10 unrolled instructions.
This can be achieved in C by using the default 'drop through' behaviour of switch statements...
switch ( num_to_clear )
{
    case 38: *char_ptr++ = 0; // 37 left to clear after this
    case 37: *char_ptr++ = 0; // 36 left to clear after this
    ...
    case 3: *char_ptr++ = 0; // 2 left to clear after this
    case 2: *char_ptr++ = 0; // 1 left to clear after this
    case 1: *char_ptr++ = 0; // none left to clear after this
    case 0: ; // nothing to clear
}
It works because there are no 'break' statements after each case.
If case 38 is chosen, then all subsequent cases are executed as well - 38 bytes are cleared (case 38, case 37, case 36 etc.).
If case 10 is chosen, then 10 bytes are cleared (case 10, case 9, case 8 etc.).
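The fall-through behaviour can be tried on any machine with a shortened version. The sketch below is an illustration only: it uses 8 cases rather than 38, and the names MAX_CLEAR and clear_bytes, the range guard, and the returned pointer are all assumptions added for the example.

```c
#include <stdint.h>

#define MAX_CLEAR 8 /* hypothetical; the routine described above uses 38 */

/* Clears n bytes starting at p by jumping into an unrolled run of stores.
   Returns a pointer just past the last byte cleared. */
static uint8_t *clear_bytes(uint8_t *p, uint8_t n)
{
    if (n > MAX_CLEAR) {
        n = MAX_CLEAR; /* guard added for this sketch */
    }
    switch (n) {
    case 8: *p++ = 0; /* fall through */
    case 7: *p++ = 0; /* fall through */
    case 6: *p++ = 0; /* fall through */
    case 5: *p++ = 0; /* fall through */
    case 4: *p++ = 0; /* fall through */
    case 3: *p++ = 0; /* fall through */
    case 2: *p++ = 0; /* fall through */
    case 1: *p++ = 0; /* fall through */
    case 0: break;    /* nothing left to clear */
    }
    return p;
}
```

Calling clear_bytes(buf, 3) enters at case 3 and executes exactly three stores, so buf[0] to buf[2] are zeroed and buf[3] is left alone.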
Unfortunately, the code produced by the GCC compiler is not quite optimal. Rather than jumping directly to the correct line of code, GCC generates a lookup table (which contains a jump to the correct starting address for each of the 39 cases). This double-jumping, along with the overhead of 'bounds checking' (i.e. it does the 'right thing' for numbers outside 0-38), means it can't quite compete with assembler.
Note: This isn't a complaint against the GCC compiler/optimiser - far from it. It does an incredibly good job. The very fact that only a couple of places in this timing-sensitive project have warranted assembler is testament to that.
The (inline) assembler equivalent is as follows:
asm (
    " ldi r30, lo8(pm(_mem_clear_end_));load pointer to the end of function into Z\n\t"
    " ldi r31, hi8(pm(_mem_clear_end_));ditto\n\t"
    " sub r30, %1;move backwards however many instructions needed\n\t"
    " sbc r31, __zero_reg__\n\t"
    " ijmp;\n\t"
    " st %a0+,__zero_reg__ ;38\n\t"
    " st %a0+,__zero_reg__ ;37\n\t"
    " st %a0+,__zero_reg__ ;36\n\t"
    " st %a0+,__zero_reg__ ;35\n\t"
    ...
    " st %a0+,__zero_reg__ ;4\n\t"
    " st %a0+,__zero_reg__ ;3\n\t"
    " st %a0+,__zero_reg__ ;2\n\t"
    " st %a0+,__zero_reg__ ;1\n\t"
    "_mem_clear_end_:\n\t"
    // parameters:
    // %0 is char_ptr (put into X), %1 is len (put anywhere).
    :: "x" (char_ptr),"r" (num_to_clear)
    // clobbers:
    // R30 and r31 (Z) are clobbered.
    : "r30","r31"
) ;
This works nicely, as long as num_to_clear is in the range 0-38. If num_to_clear is outside that range, it will jump to incorrect code locations. This will cause civilisations to crumble and whole worlds to end. Be careful.
In tight areas of code, you need any help you can get.
A carefully aligned font table means that the start of the slices are always at 256-byte boundaries. This means that all 256 characters through a particular slice can be individually addressed by only changing the lower half of the pointer.
This technique is used in the font rendering code.
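The addressing trick can be checked on a desktop compiler. The following is a sketch under assumptions: it uses C11 _Alignas in place of whatever linker or section placement the project relies on, and SLICE_COUNT, font and glyph_byte_addr are hypothetical names chosen for the example.

```c
#include <stdint.h>

#define SLICE_COUNT 4 /* hypothetical number of slices */

/* One 256-byte slice per row of the glyphs, one byte per character.
   Aligning the table to 256 bytes makes every slice start at an
   address whose low 8 bits are zero. */
static _Alignas(256) uint8_t font[SLICE_COUNT][256];

/* Builds the address of a character's byte within a slice.
   The slice fixes the upper part of the pointer; the character
   code simply becomes the low byte, with no addition needed. */
static const uint8_t *glyph_byte_addr(uint8_t slice, uint8_t ch)
{
    uintptr_t base = (uintptr_t)&font[slice][0]; /* low 8 bits are zero */
    return (const uint8_t *)(base | ch);         /* only the low byte changes */
}
```

On the AVR this corresponds to loading the high byte of Z once per scanline and writing each character code straight into the low byte, so no 16-bit addition is needed per character.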
The font rendering code is a particularly tight spot. A very specific 18-cycle loop is required to ensure accurate pixel placement. Inline assembler makes it feasible to include the ability to invert a particular character (for cursor rendering) as well as the '9th pixel' handling. It actually has 2 'spare' cycles. Does anyone have any ideas what can be done with them?
asm("\n\t"
    // initialise registers
    // r21: which character position to invert (for cursor)
    // r22: DDRB setting for "enable pixel output"
    // r23: DDRB setting for "disable pixel output"
    // r24: bit-pattern of previous character
    //      This is stored so that the 9th bit can duplicate the 8th bit.
    // r25: count of characters left to display
    // X  : (r26,r27) address of next character to output
    // r30: (z-lo) lo-byte of font lookup table (e.g. the character to lookup).
    // r31: (z-hi) hi-byte of font lookup table (256-byte aligned - determines which slice)
    " lds r21, g_render_InvertedColumn \n\t"
    " ldi r22, %[enable_pixel] \n\t"
    " ldi r23, %[disable_pixel] \n\t"
    " ldi r24, 0x00 \n\t"
    " ldi r25, %[visible_column_count] \n\t"
    " lds r31, g_render_FontPtrHi \n\t"
    "loop: \n\t"
    " ld r30, %a[char_ptr]+ ; straight into z-lo\n\t"
    " lpm __tmp_reg__,Z \n\t"
    " cp r21, r25 \n\t"
    " brne .+2 ; invert if this is the current cursor position\n\t"
    " com __tmp_reg__ \n\t"
    " sbrs r24, 0 ; skip turning off the pixel output if we want pixel 9 to be white\n\t"
    " out %[DDR], r23 \n\t"
    " mov r24, __tmp_reg__ \n\t"
    " out %[_SPDR], __tmp_reg__ \n\t"
    " out %[DDR], r22 ; switch MOSI pin to output\n\t"
    " rjmp .+0 ; 2 cycle nop \n\t"
    " subi r25, 0x01 \n\t"
    " brne loop \n\t"
    :
    : [char_ptr] "x" (char_ptr),
      [visible_column_count] "M" (COL_COUNT_VISIBLE),
      [enable_pixel] "M" ((1<<SIG_PIXEL_PIN)|(1<<SIG_SYNC_PIN)),
      [disable_pixel] "M" ((0<<SIG_PIXEL_PIN)|(1<<SIG_SYNC_PIN)),
      [DDR] "I" (_SFR_IO_ADDR(DDRB)),
      [_SPDR] "I" (_SFR_IO_ADDR(SPDR))
    : "r21","r22","r23","r24","r25","r30","r31"
);