6502 assembly optimisations
Revision as of 13:35, 28 April 2012
This page covers optimisations that are possible in assembly language, and various things a programmer has to keep in mind to make code as efficient as possible.
There are two major kinds of optimisation: optimisation for speed (the code executes in fewer cycles) and optimisation for size (the code takes fewer bytes).
There are also other kinds of optimisation, such as constant-execution-time optimisation (the code executes in a constant number of cycles no matter what it has to do) or RAM usage optimisation (use as few variables as possible). Because those optimisations have more to do with the algorithm than with its implementation in assembly, only speed and size optimisations are discussed in this article.
Optimise both speed and size of the code
Avoid a jsr + rts chain
When a subroutine finishes its work by calling another subroutine, use a jmp instruction instead. The rts of the called subroutine then returns directly to the original caller :
 MySubroutine
   lda Foo
   sta Bar
   jsr SomeRandomRoutine
   rts
becomes :
 MySubroutine
   lda Foo
   sta Bar
   jmp SomeRandomRoutine
Savings : 9 cycles, 1 byte
Split word tables into high and low components
This optimisation is not human-friendly and makes the source code much bigger, but the assembled code is smaller and faster :
 Example
   lda FooBar
   asl A
   tax
   lda PointerTable,X
   sta Temp
   lda PointerTable+1,X
   sta Temp+1
   ....
 
 PointerTable
   .dw Pointer1, Pointer2, ....
Becomes :
 Example
   ldx FooBar
   lda PointerTableL,X
   sta Temp
   lda PointerTableH,X
   sta Temp+1
   ....
 
 PointerTableL
   .db <Pointer1, <Pointer2, ....
 
 PointerTableH
   .db >Pointer1, >Pointer2, ....
Savings : 2 bytes, 4 cycles
Use Jump tables with RTS instruction instead of JMP indirect instruction
 Example
   ldx JumpEntry
   lda PointerTableL,X
   sta Temp
   lda PointerTableH,X
   sta Temp+1
   jmp (Temp)
becomes :
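 Example
   ldx JumpEntry
   lda PointerTableH,X   ; push the high byte first: rts pulls the low byte first
   pha
   lda PointerTableL,X
   pha
   rts                   ; rts adds 1 to the pulled address, so each table entry
                         ; must hold its target address minus 1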
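Savings : 4 bytes, at the cost of 1 extra cycle, if Temp is in zero page; 6 bytes and 1 cycle if Temp is at an absolute address.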
Use a macro instead of a subroutine which is only called once
What is the point of calling a subroutine if it is only called from a single place ? It is more efficient to insert the code directly where the subroutine would be called. However, doing this by hand makes the code less structured and harder to understand, whereas a macro keeps the source readable while still inlining the code.
How macros are used depends on the assembler, so no code examples are given here to avoid confusion.
Savings : 4 bytes, 12 cycles.
Arithmetic shift right
A compact way to divide a signed variable by 2 while keeping its sign :
   cmp #$80
   ror A
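As a worked illustration of why this pair of instructions keeps the sign: cmp #$80 sets the carry exactly when A is $80 or above, that is, when bit 7 is set, and ror A then rotates that carry back into bit 7. The values below are arbitrary examples chosen for this sketch, not taken from the original page.

```asm
  ; Negative input: A = $FC (-4 as a signed byte)
  lda #$FC
  cmp #$80    ; A >= $80, so C = 1 (C is now a copy of bit 7)
  ror A       ; A = $FE (-2): C moved into bit 7, old bit 0 moved into C

  ; Positive input: A = $06 (+6)
  lda #$06
  cmp #$80    ; A < $80, so C = 0
  ror A       ; A = $03 (+3)
```

Odd negative values round towards negative infinity rather than towards zero: $FF (-1) stays $FF (-1).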
Easily test 2 upper bits of a variable
   lda FooBar
   asl A          ; C = b7, N = b6
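A minimal sketch of how the two flags might then be used. The labels Bit7Set and Bit6Set are hypothetical names introduced for this illustration only.

```asm
  lda FooBar
  asl A          ; C = old bit 7, N = old bit 6
  bcs Bit7Set    ; taken when bit 7 of FooBar was set
  bmi Bit6Set    ; bit 7 was clear; taken when bit 6 was set
  ; execution falls through to here when both bits were clear
```

Branches do not change the flags, so the code at Bit7Set can still test N with bmi or bpl to learn the state of bit 6. Note that the asl changes A, so reload FooBar if the original value is still needed.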
Test bits in decreasing order
   lda foobar
   bmi bit7_set
   cmp #$40       ; we know that bit 7 wasn't set
   bcs bit6_set
   cmp #$20
   bcs bit5_set
   ; and so on
Optimise speed at the expense of size
These optimisations make code faster to execute, but use more ROM.
Use identity look-up table instead of temp variable
 Example
   ldx Foo
   lda Bar
   stx Temp
   clc
   adc Temp        ;A = Foo + Bar
becomes :
 Example
   ldx Foo
   lda Bar
   clc
   adc Identity,X  ;A = Foo + Bar
 
 Identity
   .db $00, $01, $02, $03, .....
Savings : 2 cycles
Use look-up table to shift left 4 times
Provided that the high nibble is already cleared, you can shift left by 4 by making a multiplication look-up table.
 Example:
   lda rownum
   asl A
   asl A
   asl A
   asl A
   rts
becomes
 Example:
   ldx rownum
   lda times_sixteen,x
   rts
 
 times_sixteen:
   .byt $00, $10, $20, $30, $40, $50, $60, $70
   .byt $80, $90, $A0, $B0, $C0, $D0, $E0, $F0
Savings: 4 cycles
Optimise code size at the expense of cycles
These optimisations produce code that is smaller but takes more cycles to execute.
Use the stack instead of a temp variable
 Example
   lda Foo
   sta Temp
   lda Bar
   ....
   ....
   lda Temp    ;Restores Foo
   .....
becomes:
 Example
   lda Foo
   pha
   lda Bar
   ....
   ....
   pla         ;Restores Foo
   .....
Savings : 2 bytes.
Use an "intelligent" argument system
Whenever a routine needs several bytes of arguments (more than 3), it is hard to pass them without wasting a lot of bytes.
 Example
   lda Argument1
   sta Temp
   lda Argument2
   ldx Argument3
   ldy Argument4
   jsr RoutineWhichNeeds4Args
   .....
Becomes something like :
 Example
   jsr PassArguments
   .dw RoutineWhichNeeds4Args
   .db Argument1, Argument2, Argument3, Argument4
   .db $00
   ....
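 PassArguments
   pla
   tay               ; Y = low byte of the return address
   pla
   pha               ; put the high byte back
   sta pointer+1
   ldx #$00
   stx pointer       ; assumed fix: clear the low byte, since Y carries the offset
   beq SKIP          ; always taken (Z is still set by ldx #$00)
 LOOP
   sta parameters,x
   inx
 SKIP
   iny               ; pointing one short on the first pass; here fixes that
   lda (pointer),y
   bne LOOP
   iny
   lda (pointer),y
   beq LOOP
 
   dey               ; fix the return address; guess we can't return to a break ($00)
   tya
   pha
   jmp (parameters)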
Savings : Complicated to estimate - only saves bytes if the trick is used fairly often across the program, in order to compensate for the size of the PassArguments routine.