Batsocks - Text on TV - Tricks and Tips

Text on TV

Tricks and Tips

Empty Interrupt Handler

Because nothing is actually done in the interrupt handlers (they're just a wake-up call), the interrupt routines should be as tiny as possible.

Here's our best attempt:

    ISR( TIMER1_COMPA_vect, ISR_NAKED)
    {
        asm("reti");
    }

This simply returns as soon as it arrives.
Theoretically, we could do better by inserting the RETI instruction directly into the interrupt vector table, but the runes needed to do this are beyond our wit and would probably frighten the compiler.

Function lookup tables

This is a method by which a particular function is chosen for calling (possibly later). It has replaced large switch statements in a couple of places.

Advantages to function lookup tables:

A constant lookup overhead.
Encapsulation.
The same lookups can be re-used.

Disadvantages to function lookup tables:

Common lookups can't be sped up or short-cut.
Sparse lookups are wasteful.

Example:

The following function lookup table is used to handle control characters. It decides which character processing routine should be called (later on, during the sync routine). It is used within character polling routines, hence a fixed lookup overhead is essential.


    // Array to look up which character handler should be called for each control code.
    char_process * PROGMEM f_control_lookup[ 32 ] =
    {
        nul_process,     // 0  [does housekeeping]
        control_default ,// 1
        control_default ,// 2
        control_default ,// 3
        control_default ,// 4
        control_default ,// 5
        control_default ,// 6
        control_BEL ,    // 7
        control_BS ,     // 8
        control_TAB ,    // 9
        control_LF ,     // 10
        control_default ,// 11
        control_FF ,     // 12
        control_CR ,     // 13
        control_default ,// 14
        control_default ,// 15
        control_DLE ,    // 16
        control_default ,// 17
        control_default ,// 18
        control_default ,// 19
        control_default ,// 20
        control_default ,// 21
        control_default ,// 22
        control_default ,// 23
        control_CAN ,    // 24
        control_default ,// 25
        control_default ,// 26
        control_ESC ,    // 27
        control_default ,// 28
        control_default ,// 29
        control_default ,// 30
        control_default ,// 31
    };

and the lookup is carried out as follows:

    g_char_process = (char_process *) pgm_read_word( &f_control_lookup[ g_char ] ) ; // 16 cycle lookup time.
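For readers who want to try the idea away from the hardware, here is a minimal sketch using a reduced 8-entry table. The names handler_fn, demo_lookup, READ_HANDLER, dispatch and the three handlers are hypothetical and not part of the project. The non-AVR branch is an assumption that simply reads the table from ordinary memory.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical handler type: each handler returns a small id so the
   choice made by the lookup can be observed. */
typedef uint8_t (*handler_fn)(void);

#ifdef __AVR__
#include <avr/pgmspace.h>
/* On AVR the table lives in flash, so it is read with pgm_read_word. */
#define READ_HANDLER(p) ((handler_fn)pgm_read_word(p))
#else
/* Host fallback (assumption): no separate flash address space. */
#define PROGMEM
#define READ_HANDLER(p) (*(p))
#endif

#define TABLE_SIZE 8

static uint8_t handle_default(void) { return 0; }
static uint8_t handle_nul(void)     { return 1; }
static uint8_t handle_bel(void)     { return 7; }

/* One entry per control code; every code costs the same to look up. */
static const handler_fn demo_lookup[TABLE_SIZE] PROGMEM =
{
    handle_nul,     /* 0 */
    handle_default, /* 1 */
    handle_default, /* 2 */
    handle_default, /* 3 */
    handle_default, /* 4 */
    handle_default, /* 5 */
    handle_default, /* 6 */
    handle_bel,     /* 7 */
};

/* Look up the handler for a control code and call it. */
static uint8_t dispatch(uint8_t code)
{
    assert(code < TABLE_SIZE);  /* an out-of-range index would read past the table */
    handler_fn h = READ_HANDLER(&demo_lookup[code]);
    return h();
}
```

Note that most of the table is taken up by handle_default entries, which is the "sparse lookups are wasteful" trade-off in miniature.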

Loop Unrolling

This is a commonly used method by which speed of execution is bought at the expense of code-size. When repeating a very small segment of code (possibly even a single instruction), the speed-overhead of checking to see if you've reached the end of the loop can become significant.

Example:

To clear 38 bytes to 0, the following loop might be used:


    for( uint8_t i = 38 ; i > 0 ; i-- )
    {
        *char_ptr++ = 0;
    }

This would compile nicely to a loop with the following attributes:

Code size: about 4 or 5 instructions
Code speed: 5 * 38 = 190 cycles

If the loop is 'unrolled' to:


    *char_ptr++ = 0; //clear byte 1
    *char_ptr++ = 0; //clear byte 2
    *char_ptr++ = 0; //clear byte 3
    *char_ptr++ = 0; //clear byte 4
    ...
    *char_ptr++ = 0; //clear byte 37
    *char_ptr++ = 0; //clear byte 38

then the attributes are:

Code size: 38 instructions (+ a couple for setup)
Code speed: 2 * 38 = 76 cycles

This execution-speed improvement is needed in several routines (for clearing the screen, row contents etc.).
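Writing out 38 identical lines by hand is tedious. One possible way to keep the source short while still getting straight-line code is to build the unrolled sequence from doubling macros. This is only a sketch: the CLR macros and clear_row_38 are hypothetical names, and whether the compiler emits exactly one store per byte depends on the optimisation settings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each macro expands to twice the stores of the one before it. */
#define CLR1(p)   (*(p)++ = 0)
#define CLR2(p)   do { CLR1(p);  CLR1(p);  } while (0)
#define CLR4(p)   do { CLR2(p);  CLR2(p);  } while (0)
#define CLR8(p)   do { CLR4(p);  CLR4(p);  } while (0)
#define CLR16(p)  do { CLR8(p);  CLR8(p);  } while (0)
#define CLR32(p)  do { CLR16(p); CLR16(p); } while (0)

/* Clears exactly 38 bytes with no loop counter or end-of-loop test. */
static void clear_row_38(uint8_t *p)
{
    assert(p != NULL);
    CLR32(p);
    CLR4(p);
    CLR2(p);   /* 32 + 4 + 2 = 38 stores */
}
```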

Fast array clearing

The problem with loop unrolling is that we don't always want to clear all 38 bytes. What if we only want to clear 10 bytes? Do we need to write a new routine?

All we need to do is jump into the correct point in the unrolled loop. If we only want to clear 10 bytes, we need to jump to the last 10 unrolled instructions.

This can be achieved in C by using the default 'drop through' behaviour of switch statements...


    switch ( num_to_clear )
    {
        case 38: *char_ptr++ = 0 ; // 37 left to clear after this
        case 37: *char_ptr++ = 0 ; // 36 left to clear after this
        ...
        case 3:  *char_ptr++ = 0 ; // 2 left to clear after this
        case 2:  *char_ptr++ = 0 ; // 1 left to clear after this
        case 1:  *char_ptr++ = 0 ; // none left to clear after this
        case 0:  break ;
    }

It works because there are no 'break' statements after each case.

If case 38 is chosen, then all subsequent cases are executed as well - 38 bytes are cleared (case 38, case 37, case 36 etc.).

If case 10 is chosen, then 10 bytes are cleared (case 10, case 9, case 8 etc.).
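The drop-through behaviour is easy to check on a desktop compiler. The following is a cut-down sketch with 8 cases instead of 38. The names clear_tail and MAX_CLEAR are hypothetical, and the range assertion is an addition for safety rather than something the original routine is stated to have.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CLEAR 8

/* Clears n bytes starting at p by jumping into an unrolled sequence.
   Returns the pointer just past the last byte cleared. */
static uint8_t *clear_tail(uint8_t *p, uint8_t n)
{
    assert(n <= MAX_CLEAR);
    switch (n)
    {
        case 8: *p++ = 0; /* fall through */
        case 7: *p++ = 0; /* fall through */
        case 6: *p++ = 0; /* fall through */
        case 5: *p++ = 0; /* fall through */
        case 4: *p++ = 0; /* fall through */
        case 3: *p++ = 0; /* fall through */
        case 2: *p++ = 0; /* fall through */
        case 1: *p++ = 0; /* fall through */
        case 0: break;
    }
    return p;
}
```

Calling clear_tail(buf, 3) enters at case 3 and runs cases 3, 2 and 1, so exactly three bytes are written.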

Unfortunately, the code produced by the GCC compiler is not quite optimal. Rather than jumping directly to the correct line of code, GCC generates a lookup table (which contains a jump to the correct starting address for each of the 39 cases). This double-jumping, along with the overhead of 'bounds checking' (i.e. it does the 'right thing' for numbers outside 0-38), means it can't quite compete with assembler.

Note: This isn't a complaint against the GCC compiler/optimiser - far from it. It does an incredibly good job. The very fact that only a couple of places in this timing-sensitive project have warranted assembler is testament to that.

The (inline) assembler equivalent is as follows:


    asm (
        "    ldi    r30, lo8(pm(_mem_clear_end_)) ; load pointer to the end of function into Z\n\t"
        "    ldi    r31, hi8(pm(_mem_clear_end_)) ; ditto\n\t"
        "    sub    r30, %1                       ; move backwards however many instructions needed\n\t"
        "    sbc    r31, __zero_reg__\n\t"
        "    ijmp\n\t"
        "    st     %a0+, __zero_reg__ ; 38\n\t"
        "    st     %a0+, __zero_reg__ ; 37\n\t"
        "    st     %a0+, __zero_reg__ ; 36\n\t"
        "    st     %a0+, __zero_reg__ ; 35\n\t"
        ...
        "    st     %a0+, __zero_reg__ ; 4\n\t"
        "    st     %a0+, __zero_reg__ ; 3\n\t"
        "    st     %a0+, __zero_reg__ ; 2\n\t"
        "    st     %a0+, __zero_reg__ ; 1\n\t"
        "_mem_clear_end_:\n\t"
        // parameters:
        // %0 is char_ptr (put into X), %1 is len (put anywhere).
        :: "x" (char_ptr), "r" (num_to_clear)
        // clobbers:
        // r30 and r31 (Z) are clobbered.
        : "r30", "r31" ) ;

This works nicely, as long as num_to_clear is in the range 0-38. If num_to_clear is outside that range, it will jump to incorrect code locations. This will cause civilisations to crumble and whole worlds to end. Be careful.

256-byte aligned font table

In tight areas of code, you need any help you can get.

A carefully aligned font table means that each slice always starts on a 256-byte boundary. This means that all 256 characters in a particular slice can be individually addressed by changing only the lower half of the pointer.

This technique is used in the font rendering code.
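The address arithmetic can be sketched in plain C. This assumes the slices are stored one after another, 256 bytes each, starting at an aligned base address. That layout is an assumption for illustration, and slice_hi_for and glyph_address are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* High byte of the pointer for a given slice. Computed once per
   scanline, not once per character. */
static uint8_t slice_hi_for(uint16_t table_base, uint8_t slice)
{
    assert((table_base & 0xFF) == 0);  /* table must start on a 256-byte boundary */
    return (uint8_t)((table_base >> 8) + slice);
}

/* Full address of one character's byte within the slice: the character
   code itself is the low byte, so no add or multiply is needed. */
static uint16_t glyph_address(uint8_t slice_hi, uint8_t ch)
{
    return (uint16_t)(((uint16_t)slice_hi << 8) | ch);
}
```

With a table at 0x1200, slice 3 and character 'A' (0x41), the high byte is 0x15 and the address is 0x1541, the same result as 0x1200 + 3 * 256 + 0x41 but without any arithmetic inside the per-character loop.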

Font Rendering Assembler

The font rendering code is a particularly tight spot. A very specific 18-cycle loop is required to ensure accurate pixel placement. Inline assembler makes it feasible to include the ability to invert a particular character (for cursor rendering) as well as the '9th pixel' handling. It actually has a 'spare' 2 cycles - does anyone have any ideas what can be done with them?



    asm("\n\t"
            // initialise registers
            // r21: which character position to invert (for cursor)
            // r22: DDRB setting for "enable pixel output"
            // r23: DDRB setting for "disable pixel output"
            // r24: bit-pattern of previous character
            //        This is stored so that the 9th bit can duplicate the 8th bit.
            // r25: count of characters left to display
            // X  : (r26,r27) address of next character to output
            // r30: (z-lo) lo-byte of font lookup table (e.g. the character to lookup).
            // r31: (z-hi) hi-byte of font lookup table (256-byte aligned - determines which slice)
            "    lds        r21, g_render_InvertedColumn     \n\t"
            "    ldi        r22, %[enable_pixel]             \n\t"
            "    ldi        r23, %[disable_pixel]            \n\t"
            "    ldi        r24, 0x00                        \n\t"
            "    ldi        r25, %[visible_column_count]     \n\t"
            "    lds        r31, g_render_FontPtrHi          \n\t"
            "loop:                                           \n\t"
            "    ld         r30, %a[char_ptr]+        ; straight into z-lo\n\t"
            "    lpm        __tmp_reg__,Z                    \n\t"
            "    cp         r21, r25                         \n\t"
            "    brne       .+2                       ; invert if this is the current cursor position\n\t"
            "    com        __tmp_reg__                      \n\t"
            "    sbrs       r24, 0                    ; skip turning off the pixel output if we want pixel 9 to be white\n\t"
            "    out        %[DDR],    r23                   \n\t"
            "    mov        r24, __tmp_reg__                 \n\t"
            "    out        %[_SPDR], __tmp_reg__            \n\t"
            "    out        %[DDR], r22               ; switch MOSI pin to output\n\t"
            "    rjmp       .+0                       ; 2 cycle nop      \n\t"
            "    subi       r25, 0x01                        \n\t"
            "    brne       loop                             \n\t"
            :
            :
            [char_ptr]        "x" (char_ptr),
            [visible_column_count] "M" (COL_COUNT_VISIBLE),
            [enable_pixel]    "M" ((1<<SIG_PIXEL_PIN)|(1<<SIG_SYNC_PIN)),
            [disable_pixel]   "M" ((0<<SIG_PIXEL_PIN)|(1<<SIG_SYNC_PIN)),
            [DDR]             "I" (_SFR_IO_ADDR(DDRB)),
            [_SPDR]           "I" (_SFR_IO_ADDR(SPDR))
            :
            "r21","r22","r23","r24","r25","r30","r31"
        );
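The per-character decisions made inside that loop can be modelled in plain C, which may make the assembler easier to follow. This is only a reading of the listing and its register comments, not the project's code: column_result and render_column are hypothetical names, and the model ignores timing entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t spi_byte;        /* value written to SPDR for this character */
    bool    output_blanked;  /* true if pixel output was disabled first  */
} column_result;

/* Models one pass through the loop.
   font_bits:       byte fetched from the font table (lpm)
   prev_bits:       pattern sent for the previous character (r24)
   columns_left:    characters still to display (r25)
   inverted_column: count value at which the cursor sits (r21) */
static column_result render_column(uint8_t font_bits, uint8_t prev_bits,
                                   uint8_t columns_left, uint8_t inverted_column)
{
    column_result r;
    assert(columns_left > 0);  /* the loop body never runs with a zero count */

    /* cp r21, r25 / brne / com: invert the pattern at the cursor position */
    r.spi_byte = (columns_left == inverted_column) ? (uint8_t)~font_bits
                                                   : font_bits;

    /* sbrs r24, 0 / out DDR, r23: leave the output enabled only when
       bit 0 of the previous pattern was set, so the 9th pixel repeats it */
    r.output_blanked = (prev_bits & 0x01u) == 0;
    return r;
}
```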