6502 assembly optimisations: Difference between revisions
No edit summary |
(Removal of syntax highlighter) |
||
Line 12: | Line 12: | ||
<span id="Tail call"></span>A [[wikipedia:tail call|tail call]] occurs when a subroutine finishes by calling another subroutine. This can be optimised into a JMP instruction: | <span id="Tail call"></span>A [[wikipedia:tail call|tail call]] occurs when a subroutine finishes by calling another subroutine. This can be optimised into a JMP instruction: | ||
< | <pre> | ||
MySubroutine | MySubroutine | ||
lda Foo | lda Foo | ||
Line 18: | Line 18: | ||
jsr SomeRandomRoutine | jsr SomeRandomRoutine | ||
rts | rts | ||
</ | </pre> | ||
becomes : | becomes : | ||
< | <pre> | ||
MySubroutine | MySubroutine | ||
lda Foo | lda Foo | ||
sta Bar | sta Bar | ||
jmp SomeRandomRoutine | jmp SomeRandomRoutine | ||
</ | </pre> | ||
Savings : 9 cycles, 1 byte | Savings : 9 cycles, 1 byte | ||
Line 35: | Line 35: | ||
This optimisation is not human friendly, makes the source code much bigger, but still makes the compiled<ref name="compile_pedantry">Pedants may complain that "compile" is incorrect terminology for "translate a program written in assembly language into object code". But it is the most familiar term meaning "translate a program, no matter the language, into object code", and the same issues apply to code generators within a compiler that targets the 6502 as to programs written in 6502 assembly language.</ref> size smaller and faster: | This optimisation is not human friendly, makes the source code much bigger, but still makes the compiled<ref name="compile_pedantry">Pedants may complain that "compile" is incorrect terminology for "translate a program written in assembly language into object code". But it is the most familiar term meaning "translate a program, no matter the language, into object code", and the same issues apply to code generators within a compiler that targets the 6502 as to programs written in 6502 assembly language.</ref> size smaller and faster: | ||
< | <pre> | ||
Example | Example | ||
lda FooBar | lda FooBar | ||
Line 48: | Line 48: | ||
PointerTable | PointerTable | ||
.dw Pointer1, Pointer2, .... | .dw Pointer1, Pointer2, .... | ||
</ | </pre> | ||
Becomes : | Becomes : | ||
< | <pre> | ||
Example | Example | ||
ldx FooBar | ldx FooBar | ||
Line 66: | Line 66: | ||
PointerTableH | PointerTableH | ||
.byt >Pointer1, >Pointer2, .... | .byt >Pointer1, >Pointer2, .... | ||
</ | </pre> | ||
Some assemblers may have a way to implement a macro to automatically make the table coded like this (Unofficial MagicKit Assembler is one such program). | Some assemblers may have a way to implement a macro to automatically make the table coded like this (Unofficial MagicKit Assembler is one such program). | ||
Line 76: | Line 76: | ||
<span id="RTS Trick"></span>The so-called [[RTS Trick]] is a method of implementing jump tables by pushing a subroutine's entry point to the stack. | <span id="RTS Trick"></span>The so-called [[RTS Trick]] is a method of implementing jump tables by pushing a subroutine's entry point to the stack. | ||
< | <pre> | ||
; This: | ; This: | ||
Line 94: | Line 94: | ||
pha | pha | ||
rts | rts | ||
</ | </pre> | ||
Note that PointerTable entries must point to one byte ''before'' the intended target when the RTS trick is used, because RTS will add 1 to the offset. | Note that PointerTable entries must point to one byte ''before'' the intended target when the RTS trick is used, because RTS will add 1 to the offset. | ||
Line 102: | Line 102: | ||
To squeeze slightly more, it's possible to combine this with the tail call optimization: | To squeeze slightly more, it's possible to combine this with the tail call optimization: | ||
< | <pre> | ||
; This: | ; This: | ||
Line 121: | Line 121: | ||
pha | pha | ||
jmp SomeOtherFunction | jmp SomeOtherFunction | ||
</ | </pre> | ||
Here, the CPU runs SomeOtherFunction, then returns to the function from the jump table, then returns to what called this dispatcher. | Here, the CPU runs SomeOtherFunction, then returns to the function from the jump table, then returns to what called this dispatcher. | ||
Line 141: | Line 142: | ||
Compact way to divide a variable by 2 but keep its sign: | Compact way to divide a variable by 2 but keep its sign: | ||
< | <pre> | ||
cmp #$80 | cmp #$80 | ||
ror A | ror A | ||
</ | </pre> | ||
=== Easily test 2 upper bits of a variable === | === Easily test 2 upper bits of a variable === | ||
< | <pre> | ||
lda FooBar | lda FooBar | ||
asl A ;C = b7, N = b6 | asl A ;C = b7, N = b6 | ||
</ | </pre> | ||
Alternative: | Alternative: | ||
< | <pre> | ||
bit Foobar ;N = b7, V = b6, regardless of the value of A. | bit Foobar ;N = b7, V = b6, regardless of the value of A. | ||
</ | </pre> | ||
This can be e.g. used to poll the sprite-0-hit flag in $2002. | This can be e.g. used to poll the sprite-0-hit flag in $2002. | ||
Line 163: | Line 164: | ||
=== Negating a value without temporaries === | === Negating a value without temporaries === | ||
< | <pre> | ||
eor #$FF | eor #$FF | ||
clc | clc | ||
adc #1 | adc #1 | ||
</ | </pre> | ||
=== Avoiding the need for CLC/SEC with ADC/SBC === | === Avoiding the need for CLC/SEC with ADC/SBC === | ||
Line 182: | Line 183: | ||
=== Test bits in decreasing order === | === Test bits in decreasing order === | ||
< | <pre> | ||
lda foobar | lda foobar | ||
bmi bit7_set | bmi bit7_set | ||
Line 190: | Line 191: | ||
bcs bit5_set | bcs bit5_set | ||
; and so on | ; and so on | ||
</ | </pre> | ||
Or if you do not need to preserve the value of A: | Or if you do not need to preserve the value of A: | ||
< | <pre> | ||
lda foobar | lda foobar | ||
asl | asl | ||
Line 203: | Line 204: | ||
bcs bit5_set | bcs bit5_set | ||
; and so on | ; and so on | ||
</ | </pre> | ||
This saves one byte per comparison, but 2 cycles more are used because of the extra ASL. | This saves one byte per comparison, but 2 cycles more are used because of the extra ASL. | ||
Line 209: | Line 210: | ||
=== Test bits in increasing order === | === Test bits in increasing order === | ||
< | <pre> | ||
lda foobar | lda foobar | ||
lsr | lsr | ||
Line 218: | Line 219: | ||
bcs bit2_set | bcs bit2_set | ||
; and so on | ; and so on | ||
</ | </pre> | ||
Note: This does not preserve the value of A. | Note: This does not preserve the value of A. | ||
Line 227: | Line 228: | ||
A way to do it is to look for an opcode that has the bits you want to test, and use bit $xxxx on this opcode. | A way to do it is to look for an opcode that has the bits you want to test, and use bit $xxxx on this opcode. | ||
< | <pre> | ||
Example | Example | ||
lda foobar | lda foobar | ||
Line 238: | Line 239: | ||
lda foobar | lda foobar | ||
..... | ..... | ||
</ | </pre> | ||
becomes : | becomes : | ||
< | <pre> | ||
Example | Example | ||
lda foobar | lda foobar | ||
Line 256: | Line 257: | ||
_bmi_instruction ;The BMI opcode = $30 | _bmi_instruction ;The BMI opcode = $30 | ||
bmi somewhere | bmi somewhere | ||
</ | </pre> | ||
Savings : 2 cycles, 3 bytes | Savings : 2 cycles, 3 bytes | ||
Line 263: | Line 264: | ||
To retrieve the 3 highest bits of a value in the low positions, you might be tempted to do 5 LSRs in a row. However, if you do not need the 5 top bits to be cleared, this is more efficient: | To retrieve the 3 highest bits of a value in the low positions, you might be tempted to do 5 LSRs in a row. However, if you do not need the 5 top bits to be cleared, this is more efficient: | ||
< | |||
<pre> | |||
lda value ; got: 76543210 c | lda value ; got: 76543210 c | ||
rol ; got: 6543210c 7 | rol ; got: 6543210c 7 | ||
Line 270: | Line 272: | ||
rol ; got: 3210c765 4 | rol ; got: 3210c765 4 | ||
; Only care about these ^^^ | ; Only care about these ^^^ | ||
</ | </pre> | ||
It works the same for replacing 5 ASLs with 4 RORs. | It works the same for replacing 5 ASLs with 4 RORs. | ||
To replace 6 ASLs you can use 3 RORs: | To replace 6 ASLs you can use 3 RORs: | ||
< | |||
<pre> | |||
lda value ; got: 76453210 c | lda value ; got: 76453210 c | ||
ror ; got: c7654321 0 | ror ; got: c7654321 0 | ||
Line 281: | Line 284: | ||
ror ; got: 10c76543 2 | ror ; got: 10c76543 2 | ||
and #$C0 ; got: 10------ | and #$C0 ; got: 10------ | ||
</ | </pre> | ||
== Optimise speed at the expense of size == | == Optimise speed at the expense of size == | ||
Line 289: | Line 292: | ||
=== Use identity look-up table instead of temp variable === | === Use identity look-up table instead of temp variable === | ||
< | <pre> | ||
Example | Example | ||
ldx Foo | ldx Foo | ||
Line 296: | Line 299: | ||
clc | clc | ||
adc Temp ;A = Foo + Bar | adc Temp ;A = Foo + Bar | ||
</ | </pre> | ||
becomes : | becomes : | ||
< | <pre> | ||
Example | Example | ||
ldx Foo | ldx Foo | ||
Line 309: | Line 312: | ||
Identity | Identity | ||
.byt $00, $01, $02, $03, ..... | .byt $00, $01, $02, $03, ..... | ||
</ | </pre> | ||
If your program is very large, it is possible that this way uses slightly less ROM; also, it might save one byte of RAM in some circumstances. | If your program is very large, it is possible that this way uses slightly less ROM; also, it might save one byte of RAM in some circumstances. | ||
Line 317: | Line 320: | ||
=== Use look-up table to shift left 4 times === | === Use look-up table to shift left 4 times === | ||
Provided that the high nibble is already cleared, you can shift left by 4 by making a multiplication look-up table. | Provided that the high nibble is already cleared, you can shift left by 4 by making a multiplication look-up table. | ||
< | |||
<pre> | |||
Example: | Example: | ||
lda rownum | lda rownum | ||
Line 325: | Line 329: | ||
asl A | asl A | ||
rts | rts | ||
</ | </pre> | ||
becomes | becomes | ||
< | <pre> | ||
Example: | Example: | ||
ldx rownum | ldx rownum | ||
Line 338: | Line 342: | ||
.byt $00, $10, $20, $30, $40, $50, $60, $70 | .byt $00, $10, $20, $30, $40, $50, $60, $70 | ||
.byt $80, $90, $A0, $B0, $C0, $D0, $E0, $F0 | .byt $80, $90, $A0, $B0, $C0, $D0, $E0, $F0 | ||
</ | </pre> | ||
In very large programs, this might save some ROM space. However, it will use up the X register, so it might not be ideal. | In very large programs, this might save some ROM space. However, it will use up the X register, so it might not be ideal. | ||
Line 350: | Line 354: | ||
=== Use the stack instead of a temp variable === | === Use the stack instead of a temp variable === | ||
< | <pre> | ||
Example | Example | ||
lda Foo | lda Foo | ||
Line 359: | Line 363: | ||
lda Temp ;Restores Foo | lda Temp ;Restores Foo | ||
..... | ..... | ||
</ | </pre> | ||
becomes: | becomes: | ||
< | <pre> | ||
Example | Example | ||
lda Foo | lda Foo | ||
Line 372: | Line 376: | ||
pla ;Restores Foo | pla ;Restores Foo | ||
..... | ..... | ||
</ | </pre> | ||
Savings : 2 bytes. | Savings : 2 bytes. | ||
Line 380: | Line 384: | ||
Each time a routine needs multiple bytes of arguments (>3) it's hard to code it without wasting a lot of bytes. | Each time a routine needs multiple bytes of arguments (>3) it's hard to code it without wasting a lot of bytes. | ||
< | <pre> | ||
Example | Example | ||
lda Argument1 | lda Argument1 | ||
Line 389: | Line 393: | ||
jsr RoutineWhichNeeds4Args | jsr RoutineWhichNeeds4Args | ||
..... | ..... | ||
</ | </pre> | ||
Becomes something like: | Becomes something like: | ||
< | <pre> | ||
Example | Example | ||
jsr PassArguments | jsr PassArguments | ||
Line 425: | Line 429: | ||
pha | pha | ||
jmp (parameters) | jmp (parameters) | ||
</ | </pre> | ||
Syscalls in Apple ProDOS<ref>''ProDOS 8 Technical Reference Manual''</ref> and FDS BIOS work this way. | Syscalls in Apple ProDOS<ref>''ProDOS 8 Technical Reference Manual''</ref> and FDS BIOS work this way. |
Revision as of 04:07, 11 September 2014
This page is about optimisations that are possible in assembly language, or various things one programmer has to keep in mind to make his code as optimal as possible.
There is two major kind of optimisations: Optimisation for speed (code executes in fewer cycles) and optimisation for size (the code takes fewer bytes).
There is also some other kinds of optimisations, such as constant-executing-time optimisation (code execute in a constant number of cycle no matter what it has to do), or RAM usage optimisation (use as few variables as possible). Because those optimisations have more to do with the algorithm than with its implementation in assembly, only speed and size optimisations will be discussed in this article.
Optimise both speed and size of the code
Avoid a jsr + rts chain
A tail call occurs when a subroutine finishes by calling another subroutine. This can be optimised into a JMP instruction:
MySubroutine lda Foo sta Bar jsr SomeRandomRoutine rts
becomes :
MySubroutine lda Foo sta Bar jmp SomeRandomRoutine
Savings : 9 cycles, 1 byte
Split word tables in high and low components
This optimisation is not human friendly, makes the source code much bigger, but still makes the compiled[1] size smaller and faster:
Example lda FooBar asl A tax lda PointerTable,X sta Temp lda PointerTable+1,X sta Temp+1 .... PointerTable .dw Pointer1, Pointer2, ....
Becomes :
Example ldx FooBar lda PointerTableL,X sta Temp lda PointerTableH,X sta Temp+1 .... PointerTableL .byt <Pointer1, <Pointer2, .... PointerTableH .byt >Pointer1, >Pointer2, ....
Some assemblers may have a way to implement a macro to automatically make the table coded like this (Unofficial MagicKit Assembler is one such program).
Savings : 2 bytes, 4 cycles
Use Jump tables with RTS instruction instead of JMP indirect instruction
The so-called RTS Trick is a method of implementing jump tables by pushing a subroutine's entry point to the stack.
; This: ldx JumpEntry lda PointerTableL,X sta Temp lda PointerTableH,X sta Temp+1 jmp [Temp] ; becomes this: ldx JumpEntry lda PointerTableH,X pha lda PointerTableL,X pha rts
Note that PointerTable entries must point to one byte before the intended target when the RTS trick is used, because RTS will add 1 to the offset.
Savings : 4 bytes, 1 cycle.
To squeeze slightly more, it's possible to combine this with the tail call optimization:
; This: jsr SomeOtherFunction ; MUST NOT modify JumpEntry ldx JumpEntry lda PointerTableH,X pha lda PointerTableL,X pha rts ; Becomes this: ldx JumpEntry lda PointerTableH,X pha lda PointerTableL,X pha jmp SomeOtherFunction
Here, the CPU runs SomeOtherFunction, then returns to the function from the jump table, then returns to what called this dispatcher.
Use a macro instead of a subroutine which is only called once
What is the point to call a subroutine if you only call it at a single place? It would be more optimal to just insert the code where the subroutine is called. However this makes the code less structured and harder to understand. Inline expansion of a subroutine into another subroutine can be done with a macro. One drawback is that if the subroutine is called in a loop, it may require some JMPing to work around the 128-byte limit on branch length.
How macros are used depends on the assembler so no code examples will be placed here to avoid further confusion.
In C, the static inline
keyword acts as a hint to expand a function as a macro.
Savings : 4 bytes, 12 cycles.
Arithmetic shift right
Compact way to divide a variable by 2 but keep its sign:
cmp #$80 ror A
Easily test 2 upper bits of a variable
lda FooBar asl A ;C = b7, N = b6
Alternative:
bit Foobar ;N = b7, V = b6, regardless of the value of A.
This can be e.g. used to poll the sprite-0-hit flag in $2002.
Negating a value without temporaries
eor #$FF clc adc #1
Avoiding the need for CLC/SEC with ADC/SBC
If you are using ADC #imm, and you know your carry is already cleared,
you do not need to do CLC. However, if you know that carry is set
(for example, your code is located in a branch that is only ever entered with a BCS instruction),
you can still avoid using CLC by just doing ADC #(value-1).
The PLOT
subroutine in the Apple II Monitor uses this.
Similarly for SBC #imm: If you know know that carry is clear, you can still avoid using SEC by just doing SBC #(value+1).
Test bits in decreasing order
lda foobar bmi bit7_set cmp #$40 ; we know that bit 7 wasn't set bcs bit6_set cmp #$20 bcs bit5_set ; and so on
Or if you do not need to preserve the value of A:
lda foobar asl bcs bit7_set asl bcs bit6_set asl bcs bit5_set ; and so on
This saves one byte per comparison, but 2 cycles more are used because of the extra ASL.
Test bits in increasing order
lda foobar lsr bcs bit0_set lsr bcs bit1_set lsr bcs bit2_set ; and so on
Note: This does not preserve the value of A.
Test bits without destroying the accumulator
The AND instruction can be used to test bits, but this destroy the value in the accumulator. The BIT can do this but it has no immediate adressing mode. A way to do it is to look for an opcode that has the bits you want to test, and use bit $xxxx on this opcode.
Example lda foobar and #$30 beq bits_clear lda foobar .... bits_clear lda foobar .....
becomes :
Example lda foobar bit _bmi_instruction ;equivalent to and #$30 but preserves A beq bits_clear .... bits_clear ..... anywhere_in_the_code .... _bmi_instruction ;The BMI opcode = $30 bmi somewhere
Savings : 2 cycles, 3 bytes
Use opposite rotate instead of a great number of shifts
To retrieve the 3 highest bits of a value in the low positions, you might be tempted to do 5 LSRs in a row. However, if you do not need the 5 top bits to be cleared, this is more efficient:
lda value ; got: 76543210 c rol ; got: 6543210c 7 rol ; got: 543210c7 6 rol ; got: 43210c76 5 rol ; got: 3210c765 4 ; Only care about these ^^^
It works the same for replacing 5 ASLs with 4 RORs.
To replace 6 ASLs you can use 3 RORs:
lda value ; got: 76453210 c ror ; got: c7654321 0 ror ; got: 0c765432 1 ror ; got: 10c76543 2 and #$C0 ; got: 10------
Optimise speed at the expense of size
Those optimisations will make code faster to execute, but use more ROM. Therefore, it is useful in NMI routines and other things that need to run fast.
Use identity look-up table instead of temp variable
Example ldx Foo lda Bar stx Temp clc adc Temp ;A = Foo + Bar
becomes :
Example ldx Foo lda Bar clc adc Identity,X ;A = Foo + Bar Identity .byt $00, $01, $02, $03, .....
If your program is very large, it is possible that this way uses slightly less ROM; also, it might save one byte of RAM in some circumstances.
Savings : 2 cycles
Use look-up table to shift left 4 times
Provided that the high nibble is already cleared, you can shift left by 4 by making a multiplication look-up table.
Example: lda rownum asl A asl A asl A asl A rts
becomes
Example: ldx rownum lda times_sixteen,x rts times_sixteen: .byt $00, $10, $20, $30, $40, $50, $60, $70 .byt $80, $90, $A0, $B0, $C0, $D0, $E0, $F0
In very large programs, this might save some ROM space. However, it will use up the X register, so it might not be ideal.
Savings: 4 cycles
Optimise code size at the expense of cycles
Those optimisations will produce code that is smaller but takes more cycles to execute. Therefore, it can be used in the parts of the program that do not have to be fast.
Use the stack instead of a temp variable
Example lda Foo sta Temp lda Bar .... .... lda Temp ;Restores Foo .....
becomes:
Example lda Foo pha lda Bar .... .... pla ;Restores Foo .....
Savings : 2 bytes.
Use an "intelligent" argument system
Each time a routine needs multiple bytes of arguments (>3) it's hard to code it without wasting a lot of bytes.
Example lda Argument1 sta Temp lda Argument2 ldx Argument3 ldy Argument4 jsr RoutineWhichNeeds4Args .....
Becomes something like:
Example jsr PassArguments .dw RoutineWhichNeeds4Args .db Argument1, Argument2, Argument3, Argument4 .db $00 .... PassArguments pla tay pla pha ; put the high byte back sta pointer+1 ldx #$00 beq SKIP LOOP sta parameters,x inx SKIP iny ; pointing one short first pass here fixes that lda (pointer),y bne LOOP iny lda (pointer),y beq LOOP dey ; fix the return address guess we can't return to a ; break tya pha jmp (parameters)
Syscalls in Apple ProDOS[2] and FDS BIOS work this way.
Savings : Complicated to estimate - only saves bytes if the trick is used fairly often across the program, in order to compensate for the size of the PassArguments routine.
Using relative branch instruction instead of absolute
If the state of one of the processor flags is already known at this point and the branch target is not too far away, then you can use a condition branch instruction.
Savings : 1 byte.
See also
Notes
- ↑ Pedants may complain that "compile" is incorrect terminology for "translate a program written in assembly language into object code". But it is the most familiar term meaning "translate a program, no matter the language, into object code", and the same issues apply to code generators within a compiler that targets the 6502 as to programs written in 6502 assembly language.
- ↑ ProDOS 8 Technical Reference Manual