6502 assembly optimisations

From NESdev Wiki
Revision as of 13:35, 28 April 2012

This page covers optimisations that are possible in 6502 assembly language, and things a programmer should keep in mind to make code as efficient as possible.

There are two major kinds of optimisation: optimisation for speed (the code executes in fewer cycles) and optimisation for size (the code takes fewer bytes).

There are also other kinds of optimisation, such as constant-execution-time optimisation (the code executes in a constant number of cycles no matter what it has to do) and RAM usage optimisation (using as few variables as possible). Because those optimisations have more to do with the algorithm than with its implementation in assembly, only speed and size optimisations are discussed in this article.

Optimise both speed and size of the code

Avoid a jsr + rts chain

When a subroutine finishes its work by calling another subroutine, use a jmp instruction instead:

MySubroutine
  lda Foo
  sta Bar
  jsr SomeRandomRoutine
  rts

becomes :

MySubroutine
  lda Foo
  sta Bar
  jmp SomeRandomRoutine

Savings : 9 cycles, 1 byte (jmp takes 3 cycles instead of 6 for jsr, and the 6-cycle, 1-byte rts is gone)

Split word tables into high and low components

This optimisation is not human friendly and makes the source code much bigger, but the assembled code is both smaller and faster:

Example
  lda FooBar
  asl A
  tax
  lda PointerTable,X
  sta Temp
  lda PointerTable+1,X
  sta Temp+1
  ....
PointerTable
  .dw Pointer1, Pointer2, ....

Becomes :

Example
  ldx FooBar
  lda PointerTableL,X
  sta Temp
  lda PointerTableH,X
  sta Temp+1
  ....
PointerTableL
  .db <Pointer1, <Pointer2, ....
PointerTableH
  .db >Pointer1, >Pointer2, ....

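Savings : 2 bytes, 4 cycles (the asl A and tax are no longer needed). The split tables can also hold 256 entries instead of 128.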

Use jump tables with the RTS instruction instead of the indirect JMP instruction

 Example
   ldx JumpEntry
   lda PointerTableL,X
   sta Temp
   lda PointerTableH,X
   sta Temp+1
   jmp (Temp)

becomes :

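 Example
   ldx JumpEntry
   lda PointerTableH,X   ; rts pulls the low byte first, so push the high byte first
   pha
   lda PointerTableL,X
   pha
   rts                   ; continues at the pushed address + 1

Because rts adds 1 to the address it pulls, each table entry must hold the target address minus 1.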

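Savings : 4 bytes when Temp is in zero page, at a cost of 1 cycle (rts takes 6 cycles, the indirect jmp takes 5). When Temp is outside zero page, the saving is 6 bytes and 1 cycle. In both cases the Temp variable is no longer needed.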
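
As a sketch of how the tables for the rts version might be laid out: rts resumes at the pulled address plus one, so every entry stores a handler address minus 1. The handler names below are hypothetical, and the exact syntax of the < and > operators and of expressions depends on the assembler.

```asm
; Hypothetical handlers; each entry is the target address minus 1
PointerTableL
  .db <(Handler0-1), <(Handler1-1), <(Handler2-1)
PointerTableH
  .db >(Handler0-1), >(Handler1-1), >(Handler2-1)

Handler0
  ; ... code for entry 0 ...
  rts          ; returns to whoever called Example with jsr
```

Since the dispatching rts consumes the two pushed bytes, a handler that ends with its own rts returns to the caller of Example, just as it would after an indirect jmp.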

Use a macro instead of a subroutine which is only called once

What is the point of calling a subroutine that is only called from a single place? It is more efficient to insert the code directly where the subroutine is called. However, this makes the code less structured and harder to understand; wrapping the code in a macro keeps the source readable while still inlining it.

How macros are used depends on the assembler, so no code examples are given here to avoid confusion.

Savings : 4 bytes, 12 cycles (the jsr and the rts are removed).

Arithmetic shift right

A compact way to divide a signed variable by 2 while keeping its sign:

  cmp #$80      ; C = bit 7 (the sign bit)
  ror A         ; shift right, the sign bit is rotated back into bit 7
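
For example, $FE (-2) becomes $FF (-1) and $06 becomes $03; odd negative values round toward negative infinity, so $FB (-5) becomes $FD (-3). The same trick extends to multi-byte values. A minimal sketch, assuming a hypothetical 16-bit signed variable stored in ValueLo and ValueHi:

```asm
  lda ValueHi
  cmp #$80      ; C = sign bit of the 16-bit value
  ror ValueHi   ; sign goes back into bit 15, old bit 8 goes into C
  ror ValueLo   ; old bit 8 goes into bit 7 of the low byte
```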

Easily test the 2 upper bits of a variable

   lda FooBar
   asl A         ;C = b7, N = b6
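
A sketch of how the two flags might be consumed after the shift; the branch targets are hypothetical labels. Branch instructions do not change the flags, so N is still valid after the bcs. If A must be preserved, the bit instruction is an alternative that copies bit 7 to N and bit 6 to V without modifying A.

```asm
  lda FooBar
  asl A            ; C = b7, N = b6
  bcs Bit7Set      ; hypothetical label
  bmi Bit6Set      ; hypothetical label: b7 clear, b6 set
                   ; falls through when both bits are clear

  ; Alternative that leaves A untouched
  bit FooBar       ; N = b7, V = b6
  bmi Bit7Set
  bvs Bit6Set
```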

Test bits in decreasing order

  lda foobar 
  bmi bit7_set 
  cmp #$40  ; we know that bit 7 wasn't set 
  bcs bit6_set 
  cmp #$20 
  bcs bit5_set 
            ; and so on

Optimise speed at the expense of size

These optimisations make code faster to execute, but use more ROM.

Use an identity look-up table instead of a temp variable

 Example
    ldx Foo
    lda Bar
    stx Temp
    clc
    adc Temp    ;A = Foo + Bar

becomes :

 Example
    ldx Foo
    lda Bar
    clc
    adc Identity,X    ;A = Foo + Bar

Identity
    .db $00, $01, $02, $03, .....

Savings : 2 cycles, at the cost of a table of up to 256 bytes in ROM
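
The same table can stand in for other register-to-register operations that the 6502 lacks. A sketch, assuming the Identity table covers all 256 values:

```asm
  ldy Identity,X    ; Y = X (there is no txy instruction)
  ldx Identity,Y    ; X = Y (there is no tyx instruction)
  cmp Identity,X    ; compare A with X
  and Identity,Y    ; A = A AND Y
```

Each of these takes 4 cycles as long as the table is aligned to a page boundary, so that indexing never crosses a page.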

Use look-up table to shift left 4 times

Provided that the high nibble is already cleared, you can shift left by 4 by making a multiplication look-up table.

Example:
  lda rownum
  asl A
  asl A
  asl A
  asl A
  rts

becomes

Example:
  ldx rownum
  lda times_sixteen,x
  rts

times_sixteen:
  .byt $00, $10, $20, $30, $40, $50, $60, $70
  .byt $80, $90, $A0, $B0, $C0, $D0, $E0, $F0

Savings: 4 cycles
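
The table also combines well with ora to pack two nibbles into one byte. A sketch, assuming hypothetical variables HighNibble and LowNibble that both hold values from $00 to $0F:

```asm
  ldx HighNibble
  lda LowNibble
  ora times_sixteen,x   ; A = HighNibble * 16 + LowNibble
```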

Optimise code size at the expense of cycles

These optimisations produce code that is smaller but takes more cycles to execute.

Use the stack instead of a temp variable

Example
   lda Foo
   sta Temp
   lda Bar
   ....
   ....
   lda Temp   ;Restores Foo
   .....

becomes:

Example
   lda Foo
   pha
   lda Bar
   ....
   ....
   pla   ;Restores Foo
   .....

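Savings : 2 bytes with Temp in zero page, at a cost of 1 cycle (pha and pla take 3 and 4 cycles; sta and lda in zero page take 3 each).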

Use an "intelligent" argument system

When a routine needs more than three bytes of arguments, they no longer fit in A, X and Y, and setting them up at every call site wastes a lot of bytes.

 Example
    lda Argument1
    sta Temp
    lda Argument2
    ldx Argument3
    ldy Argument4
    jsr RoutineWhichNeeds4Args
    .....

Becomes something like :

 Example
    jsr PassArguments
    .dw RoutineWhichNeeds4Args
    .db Argument1, Argument2, Argument3, Argument4
    .db $00
    ....
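PassArguments
 pla                    ; low byte of the return address (last byte of the jsr)
 tay                    ; Y indexes the inline data; pointer's low byte is assumed to be $00
 pla                    ; high byte of the return address
 pha                    ; put the high byte back
 sta pointer+1
 ldx #$00
 beq SKIP               ; always taken
LOOP
 sta parameters,x       ; copy one inline byte
 inx
SKIP
 iny                    ; pointing one short first pass here fixes that
 lda (pointer),y
 bne LOOP               ; non-zero bytes are copied as they are
 iny
 lda (pointer),y
 beq LOOP               ; $00 followed by $00 stores a single zero byte
 dey                    ; $00 followed by non-zero ends the list: Y is back on the
                        ;  terminator, which is the return address rts expects.
                        ;  The next opcode therefore cannot be a brk ($00)
 tya
 pha                    ; push the new low byte of the return address
 jmp (parameters)       ; the first two copied bytes are the routine's address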

Savings : Hard to estimate. The trick only saves bytes if it is used often enough across the program to compensate for the size of the PassArguments routine.