8-bit Multiply: Difference between revisions
Rainwarrior (talk | contribs) (replacing lead which was orphaned text) |
Rainwarrior (talk | contribs) m (Thwaite moved to userspace) |
||
(19 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
Since the 6502 CPU has no multiplication instruction, this feature has to be written in software. | Since the 6502 CPU has no multiplication instruction, this feature has to be written in software. | ||
See also: [[Fast signed multiply]] | |||
== tepples == | |||
This 8x8 multiply routine from ''[[User:Tepples/Thwaite|Thwaite]]'' does binary long multiplication (a.k.a. "the [[wikipedia:Ancient Egyptian multiplication#Russian_peasant_multiplication|Russian Peasant method]]"). | |||
<pre> | |||
;; | |||
; Multiplies two 8-bit factors to produce a 16-bit product | |||
; in about 153 cycles. | |||
; @param A one factor | |||
; @param Y another factor | |||
; @return high 8 bits in A; low 8 bits in $0000 | |||
; Y and $0001 are trashed; X is untouched | |||
.proc mul8 | |||
prodlo = $0000 | |||
factor2 = $0001 | |||
; Factor 1 is stored in the lower bits of prodlo; the low byte of | |||
; the product is stored in the upper bits. | |||
lsr a ; prime the carry bit for the loop | |||
sta prodlo | |||
sty factor2 | |||
lda #0 | |||
ldy #8 | |||
loop: | |||
; At the start of the loop, one bit of prodlo has already been | |||
; shifted out into the carry. | |||
bcc noadd | |||
clc | |||
adc factor2 | |||
noadd: | |||
ror a | |||
ror prodlo ; pull another bit out for the next iteration | |||
dey ; inc/dec don't modify carry; only shifts and adds do | |||
bne loop | |||
rts | |||
.endproc | |||
</pre> | |||
This is a similar technique to Bregalad's method below, but is further optimized to keep part of the running total in the accumulator. | |||
Assuming no page crossings and zero page, this routine takes 137-169 cycles, not counting the JSR to call it. (Each non-zero bit in A adds 4 cycles.) | |||
== tepples unrolled == | |||
The above code can be made to run slightly faster by both unrolling the loop and pre-decrementing factor2 so that CLC isn't required. Note that the low byte is now returned in A, the high byte in Y, and that CA65 syntax is used. | |||
<pre> | |||
; @param A one factor | |||
; @param Y another factor | |||
; @return low 8 bits in A; high 8 bits in Y | |||
mul8_multiply: | |||
lsr | |||
sta prodlo | |||
tya | |||
beq mul8_early_return | |||
dey | |||
sty factor2 | |||
lda #0 | |||
.repeat 8, i | |||
.if i > 0 | |||
ror prodlo | |||
.endif | |||
bcc :+ | |||
adc factor2 | |||
: | |||
ror | |||
.endrepeat | |||
tay | |||
lda prodlo | |||
ror | |||
mul8_early_return: | |||
rts | |||
</pre> | |||
The code takes up 70 bytes and runs in 120 cycles or less. | |||
== Bregalad == | == Bregalad == | ||
This routine by Bregalad | This routine by Bregalad is similar to tepples' method above, but is less optimized. | ||
<pre> | <pre> | ||
Line 35: | Line 110: | ||
An optimization for efficiency is made here; binary long multiplication requires adding one multiplicand to the result at various bit-shifts (i.e. multiply by each power of 2). The naive approach might maintain the value to add as a 16-bit value, left shifting it once each iteration to reach the next power of 2. This one, however, takes advantage of the input being only 8-bits wide, and instead pre-multiplies the result by 256 (8 bits), and each iteration instead right-shifts the result. After 8 iterations the pre-multiply is undone, and the advantage gained is that only the shift is 16-bit; adding the multiplicand remains an efficient 8-bit add. | An optimization for efficiency is made here; binary long multiplication requires adding one multiplicand to the result at various bit-shifts (i.e. multiply by each power of 2). The naive approach might maintain the value to add as a 16-bit value, left shifting it once each iteration to reach the next power of 2. This one, however, takes advantage of the input being only 8-bits wide, and instead pre-multiplies the result by 256 (8 bits), and each iteration instead right-shifts the result. After 8 iterations the pre-multiply is undone, and the advantage gained is that only the shift is 16-bit; adding the multiplicand remains an efficient 8-bit add. | ||
Assuming no page crossings and zero page, this routine takes 184-320 cycles | Assuming no page crossings and zero page, this routine takes 184-320 cycles, not counting the JSR to call it. (Each non-zero bit in A adds 17 cycles.) | ||
== | == frantik == | ||
This routine by | This routine by frantik is another binary long multiplication. | ||
<pre> | <pre> | ||
Line 45: | Line 120: | ||
; by frantik | ; by frantik | ||
; Accepts: | ; Accepts: value1 and value2, labels for bytes in memory | ||
; | ; value2 should ideally be the lesser of the two input values | ||
; for increased efficiency | ; for increased efficiency | ||
; Uses: $00, $01, $02 for temporary variables | ; Uses: $00, $01, $02 for temporary variables | ||
; Returns: 16 bit value in $00 and $01 | ; Returns: 16 bit value in $00 and $01 | ||
.macro multiply | .macro multiply value1, value2 | ||
ret = $00 ; return value | ret = $00 ; return value | ||
Line 60: | Line 135: | ||
sta ret+1 | sta ret+1 | ||
sta temp | sta temp | ||
jmp start: | lda value2 | ||
bne +end" | |||
jmp +start: | |||
-loop: | -loop: | ||
asl | asl value1 ; double first value | ||
rol temp ; using 16bit precision | rol temp ; using 16bit precision | ||
lsr | lsr value2 ; halve second vale | ||
start: | +start: | ||
lda | lda value2 ; | ||
and #01 ; is new 2nd value an odd number? | and #01 ; is new 2nd value an odd number? | ||
beq -loop: ; | beq -loop: ; | ||
clc ; if so, add new 1st value to running total | clc ; if so, add new 1st value to running total | ||
lda ret ; | lda ret ; | ||
adc | adc value1 ; | ||
sta ret ; | sta ret ; | ||
lda ret+1 ; | lda ret+1 ; | ||
adc temp ; | adc temp ; | ||
sta ret+1 ; | sta ret+1 ; | ||
lda | lda value2 ; | ||
cmp #01 ; is 2nd value 1? if so, we're done | cmp #01 ; is 2nd value 1? if so, we're done | ||
bne -loop: ; otherwise, loop | bne -loop: ; otherwise, loop | ||
+end: | |||
.endm | .endm | ||
</pre> | </pre> | ||
Assuming no page crossings and zero page, this routine takes | Assuming no page crossings and zero page, this routine takes 17-403 cycles, though it is an in-place macro generating 46 bytes of new code each time it is used. | ||
== MMC5 == | |||
The [[MMC5]] contains an 8x8-bit multiplier at $5205-$5206. | |||
== External | == External links == | ||
* [//forums.nesdev.org/viewtopic.php?f=2&t=16571 Forum post:] Fast multi, ... - Faster multiplication routine using lookup tables, by keldon. | * [//forums.nesdev.org/viewtopic.php?f=2&t=16571 Forum post:] Fast multi, ... - Faster multiplication routine using lookup tables, by keldon. | ||
* [//forums.nesdev.org/viewtopic.php?f=2&t=16616 Forum post:] What's the best (fastest) multiplication/division code? | |||
* [//atariage.com/forums/topic/71120-6502-killer-hacks/page-2#entry896028 AtariAge forum post:] 6502 Killer hacks | |||
[[Category:Arithmetic]] | [[Category:Arithmetic]] |
Latest revision as of 00:35, 18 March 2023
Since the 6502 CPU has no multiplication instruction, this feature has to be written in software.
See also: Fast signed multiply
tepples
This 8x8 multiply routine from Thwaite does binary long multiplication (a.k.a. "the Russian Peasant method").
;; ; Multiplies two 8-bit factors to produce a 16-bit product ; in about 153 cycles. ; @param A one factor ; @param Y another factor ; @return high 8 bits in A; low 8 bits in $0000 ; Y and $0001 are trashed; X is untouched .proc mul8 prodlo = $0000 factor2 = $0001 ; Factor 1 is stored in the lower bits of prodlo; the low byte of ; the product is stored in the upper bits. lsr a ; prime the carry bit for the loop sta prodlo sty factor2 lda #0 ldy #8 loop: ; At the start of the loop, one bit of prodlo has already been ; shifted out into the carry. bcc noadd clc adc factor2 noadd: ror a ror prodlo ; pull another bit out for the next iteration dey ; inc/dec don't modify carry; only shifts and adds do bne loop rts .endproc
This is a similar technique to Bregalad's method below, but is further optimized to keep part of the running total in the accumulator.
Assuming no page crossings and zero page, this routine takes 137-169 cycles, not counting the JSR to call it. (Each non-zero bit in A adds 4 cycles.)
tepples unrolled
The above code can be made to run slightly faster by both unrolling the loop and pre-decrementing factor2 so that CLC isn't required. Note that the low byte is now returned in A, the high byte in Y, and that CA65 syntax is used.
; @param A one factor ; @param Y another factor ; @return low 8 bits in A; high 8 bits in Y mul8_multiply: lsr sta prodlo tya beq mul8_early_return dey sty factor2 lda #0 .repeat 8, i .if i > 0 ror prodlo .endif bcc :+ adc factor2 : ror .endrepeat tay lda prodlo ror mul8_early_return: rts
The code takes up 70 bytes and runs in 120 cycles or less.
Bregalad
This routine by Bregalad is similar to tepples' method above, but is less optimized.
;8-bit multiply ;by Bregalad ;Enter with A,Y, numbers to multiply ;Output with YA = 16-bit result (X is unchanged) Multiply: sty Factor ;Store input factor ldy #$00 sty Res sty Res2 ;Clear result ldy #$08 ;Number of shifts needed - lsr A ;Shift right input number bcc + ;Check if bit is set pha lda Res2 clc adc Factor sta Res2 ;If so add number to result pla + lsr Res2 ;Shift result right ror Res dey bne - lda Res ldy Res2 rts
An optimization for efficiency is made here; binary long multiplication requires adding one multiplicand to the result at various bit-shifts (i.e. multiply by each power of 2). The naive approach might maintain the value to add as a 16-bit value, left shifting it once each iteration to reach the next power of 2. This one, however, takes advantage of the input being only 8-bits wide, and instead pre-multiplies the result by 256 (8 bits), and each iteration instead right-shifts the result. After 8 iterations the pre-multiply is undone, and the advantage gained is that only the shift is 16-bit; adding the multiplicand remains an efficient 8-bit add.
Assuming no page crossings and zero page, this routine takes 184-320 cycles, not counting the JSR to call it. (Each non-zero bit in A adds 17 cycles.)
frantik
This routine by frantik is another binary long multiplication.
; Multiply two bytes in memory using Russian peasant algorithm ; by frantik ; Accepts: value1 and value2, labels for bytes in memory ; value2 should ideally be the lesser of the two input values ; for increased efficiency ; Uses: $00, $01, $02 for temporary variables ; Returns: 16 bit value in $00 and $01 .macro multiply value1, value2 ret = $00 ; return value temp = $02 ; temp storage lda #$00 ; clear temporary variables sta ret sta ret+1 sta temp lda value2 bne +end" jmp +start: -loop: asl value1 ; double first value rol temp ; using 16bit precision lsr value2 ; halve second vale +start: lda value2 ; and #01 ; is new 2nd value an odd number? beq -loop: ; clc ; if so, add new 1st value to running total lda ret ; adc value1 ; sta ret ; lda ret+1 ; adc temp ; sta ret+1 ; lda value2 ; cmp #01 ; is 2nd value 1? if so, we're done bne -loop: ; otherwise, loop +end: .endm
Assuming no page crossings and zero page, this routine takes 17-403 cycles, though it is an in-place macro generating 46 bytes of new code each time it is used.
MMC5
The MMC5 contains an 8x8-bit multiplier at $5205-$5206.
External links
- Forum post: Fast multi, ... - Faster multiplication routine using lookup tables, by keldon.
- Forum post: What's the best (fastest) multiplication/division code?
- AtariAge forum post: 6502 Killer hacks