Cycle counting
Latest revision as of 03:22, 3 June 2024
It is often useful to delay a specific number of CPU cycles, for example when timing raster effects or generating PCM audio. This article outlines a few relevant techniques.
Instruction timings
You can use a comprehensive guide[1][2] as a reference for instruction timings, but there are some rules of thumb[3] that can help you remember most of them:
- There is a minimum of 2 cycles per instruction.
- Each byte of memory read or written adds 1 more cycle to the instruction. This includes fetching the opcode, each byte of its operand, and any memory it references.
- Indexed instructions which cross a page take 1 extra cycle to adjust the high byte of the effective address first.
- Read-modify-write instructions perform a dummy write during the "modify" stage and thus take 1 extra cycle.
- Instructions that push data onto the stack take 1 extra cycle.
- Instructions that pop data from the stack take 2 extra cycles, since they also need to pre-increment the stack pointer.
- "Extra" cycles often include an extra read or write that usually does not affect the outcome.
Examples:
Instructions | Cycles | Bytes | Details |
---|---|---|---|
SEC | 2 | 1 | opcode, but has to wait for the 2-cycle minimum. |
AND #imm | 2 | 2 | opcode + operand. Only affects registers. |
LDA zp | 3 | 2 | opcode + operand + byte fetched from zp. |
STA abs | 4 | 3 | opcode + 2 byte operand + byte written to abs. |
LDA abs,X | 4 or 5 | 3 | opcode + 2 byte operand + read from abs, but it delays 1 extra cycle if the addition of the X index causes a page crossing. |
STA abs,X | 5 | 3 | Like LDA abs,X, but assumes the worst case of page crossing and always spends 1 extra read cycle. |
ASL zp | 5 | 2 | opcode + operand + read from zp + write to zp, but it takes 1 extra cycle to modify the value. |
LDA (indirect),Y | 5 or 6 | 2 | opcode + operand + two reads from zp + read from indirect address. 1 extra cycle if a page is crossed. |
STA (indirect),Y | 6 | 2 | Like LDA (indirect),Y, but assumes the worst case of page crossing and always spends 1 extra read cycle. |
PHA | 3 | 1 | opcode + stack write, but requires 1 extra cycle to perform the stack operation. |
RTS | 6 | 1 | opcode + two stack reads, but requires 2 extra cycles to perform the stack operations, plus 1 cycle to post-increment the program counter (to compensate for the off-by-1 address pushed by JSR). |
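The rules of thumb above can be written down as a small calculator and checked against the examples table. The following Python sketch is only an illustration: the function name `estimate_cycles` and its parameters are hypothetical, and it models nothing beyond the listed rules (it does not cover branches, for instance).

```python
def estimate_cycles(instr_bytes, mem_reads=0, mem_writes=0,
                    page_cross=False, rmw=False, push=False, pull=False,
                    extra=0):
    """Estimate a 6502 instruction's cycle count from the rules of thumb.

    instr_bytes -- opcode plus operand bytes (each fetch costs 1 cycle)
    mem_reads   -- data bytes read from memory (including the stack)
    mem_writes  -- data bytes written to memory (including the stack)
    page_cross  -- indexed access pays the high-byte fix-up cycle
    rmw         -- read-modify-write instruction (1 dummy write cycle)
    push / pull -- stack push (+1 cycle) or pull (+2 cycles)
    extra       -- any further instruction-specific cycles (e.g. RTS)
    """
    cycles = instr_bytes + mem_reads + mem_writes
    if page_cross:
        cycles += 1
    if rmw:
        cycles += 1
    if push:
        cycles += 1
    if pull:
        cycles += 2
    cycles += extra
    return max(cycles, 2)  # every instruction takes at least 2 cycles


# Rows from the examples table, expressed with the hypothetical parameters.
sec = estimate_cycles(1)
lda_zp = estimate_cycles(2, mem_reads=1)
sta_abs_x = estimate_cycles(3, mem_writes=1, page_cross=True)  # always pays
asl_zp = estimate_cycles(2, mem_reads=1, mem_writes=1, rmw=True)
lda_ind_y = estimate_cycles(2, mem_reads=3)
pha = estimate_cycles(1, mem_writes=1, push=True)
rts = estimate_cycles(1, mem_reads=2, pull=True, extra=1)
```

Stores such as STA abs,X are modelled by always passing `page_cross=True`, which mirrors the "assumes the worst case" wording in the table.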
Short delays
Here are a few ways to create short delays with few or no side effects. As the shortest instruction takes 2 cycles, it is not possible to delay by just 1 cycle. NOP is essential for 2 cycle delays. 3 cycle delays take at least 2 bytes and usually involve some compromise. More options become available as delays become longer.
Instructions | Cycles | Bytes | Side effects and notes |
---|---|---|---|
NOP | 2 | 1 | |
JMP *+3 | 3 | 3 | |
Bxx *+2 | 3 | 2 | None, but requires a known flag state (e.g. BCC if carry is known to be clear). |
BIT zp | 3 | 2 | Clobbers NVZ flags. Reads zp. |
IGN zp | 3 | 2 | Reads zp. (Unofficial instruction.) |
NOP, NOP | 4 | 2 | |
NOP, ... | 5 | 3 | (... = 3 cycle delay of choice) |
CLV, BVC *+2 | 5 | 3 | Clears V flag. (Can instead use C flag with CLC, BCC or SEC, BCS.) |
NOP, NOP, NOP | 6 | 3 | |
PHP, PLP | 7 | 2 | Modifies 1 byte of stack. |
NOP, NOP, NOP, NOP | 8 | 4 | |
PHP, PLP, NOP | 9 | 3 | Modifies 1 byte of stack. |
PHP, CMP zp, PLP | 10 | 4 | Modifies 1 byte of stack. Reads zp. |
PHP, PLP, NOP, NOP | 11 | 4 | Modifies 1 byte of stack. |
JSR, RTS | 12 | 3 | Modifies 2 bytes of stack. (Takes 3 bytes only if reusing an existing RTS; otherwise 4.) |
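As a rough illustration of how the table entries combine, the Python sketch below builds a delay of any length of 2 cycles or more from NOPs plus at most one 3-cycle filler, and totals the cycles and bytes of a sequence. The names `COST`, `short_delay` and `total` are hypothetical, and `BIT zp` is just one possible 3-cycle choice (it clobbers NVZ). The result is correct in cycles but not necessarily the smallest in bytes.

```python
# mnemonic -> (cycles, bytes), taken from the table above
COST = {
    "NOP": (2, 1),
    "BIT zp": (3, 2),
    "CMP zp": (3, 2),
    "PHP": (3, 1),
    "PLP": (4, 1),
}


def short_delay(cycles):
    """Return a list of mnemonics that wastes exactly `cycles` cycles."""
    if cycles < 2:
        raise ValueError("a delay shorter than 2 cycles is not possible")
    seq = []
    if cycles % 2 == 1:
        # An odd count needs exactly one 3-cycle instruction.
        seq.append("BIT zp")
        cycles -= 3
    # The remainder is even and is filled with 2-cycle NOPs.
    seq.extend(["NOP"] * (cycles // 2))
    return seq


def total(seq):
    """Return (cycles, bytes) for a sequence of mnemonics."""
    return (sum(COST[m][0] for m in seq), sum(COST[m][1] for m in seq))


naive_7 = total(short_delay(7))     # BIT zp, NOP, NOP
stack_7 = total(["PHP", "PLP"])     # same cycles, fewer bytes
```

Comparing `naive_7` with `stack_7` shows why the table prefers PHP, PLP for 7 cycles: both take 7 cycles, but the stack version needs 2 bytes instead of 4, at the cost of touching the stack.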
Clockslide
A clockslide[4] is a sequence of instructions that wastes a small constant number of cycles plus one cycle per executed byte, no matter whether it's entered on an odd or even address.
With official instructions, one can construct a clockslide from CMP instructions: ... C9 C9 C9 C9 C5 EA
Disassemble from the start and you get CMP #$C9, CMP #$C9, CMP $EA (6 bytes, 7 cycles).
Disassemble one byte in and you get CMP #$C9, CMP #$C5, NOP (5 bytes, 6 cycles).
The entry point can be controlled with an indirect jump or the RTS Trick to precisely control raster effect or sample playback timing.
CMP has a side effect of destroying most of the flags, but you can substitute other instructions with the same size and timing that preserve whichever flags/registers you need at the end of the slide. There are unofficial instructions that can avoid altering any state: replace $C9 (CMP) with $89 or $80, which ignore an immediate operand, and replace $C5 with $04, $44, or $64, which ignore a read from the zero page.
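To see the "one cycle per executed byte" property concretely, the CMP-based slide above can be stepped through from every entry point. The Python sketch below is a toy model, not an emulator: `OPCODES`, `SLIDE` and `run_slide` are hypothetical names, and only the three opcodes used in the example are decoded.

```python
# opcode -> (mnemonic, size in bytes, cycles); only what the slide needs
OPCODES = {
    0xC9: ("CMP #imm", 2, 2),
    0xC5: ("CMP zp", 2, 3),
    0xEA: ("NOP", 1, 2),
}

# The tail of the slide from the text: C9 C9 C9 C9 C5 EA
SLIDE = bytes([0xC9, 0xC9, 0xC9, 0xC9, 0xC5, 0xEA])


def run_slide(code, entry):
    """Decode `code` starting at byte offset `entry`.

    Returns (total cycles, list of mnemonics executed).
    """
    pc = entry
    cycles = 0
    trace = []
    while pc < len(code):
        name, size, cost = OPCODES[code[pc]]
        trace.append(name)
        cycles += cost
        pc += size  # skip the operand byte, if any
    return cycles, trace


# Cycle count for each possible entry offset into the slide.
cycles_by_entry = [run_slide(SLIDE, e)[0] for e in range(len(SLIDE))]
```

In this model each byte of later entry removes exactly one cycle, whether the entry offset is odd or even: odd entries end on the 2-cycle NOP, even entries end on the 3-cycle CMP zp whose operand is the NOP byte.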
Resources
- Delay code - various variable-cycle delays
- Fixed cycle delay - shortest fixed-cycle delays
- Fixed-cycle delay code vending machine (https://bisqwit.iki.fi/utils/nesdelay.php) - code for generating shortest-possible delay routines at compile-time.
- 6502 vdelay (https://github.com/bbbradsmith/6502vdelay) - code for delaying a variable number of cycles at run-time.