Cycle counting
Rainwarrior (talk | contribs) (→Short delays: completing this up to 12 with the minimum suggestions from bisqwit's generator, adding BIT ZP) |
Rainwarrior (talk | contribs) m (→Short delays: zp consistency) |
Revision as of 23:25, 11 October 2020
It is often useful to delay a specific number of CPU cycles. Timing raster effects and generating PCM audio are two examples that rely on this. This article outlines a few relevant techniques.
Instruction timings
You can use a comprehensive guide[1][2] as a reference for instruction timings, but there are some rules of thumb[3] that can help you remember most of them:
- Each byte of memory read or write adds another cycle to the instruction. This includes fetching the instruction, and each byte of its operand, then any memory it references.
- Indexed instructions which cross a page take 1 extra cycle to adjust the high byte of the effective address first.
- Read-modify-write instructions perform a dummy write during the "modify" stage and thus take an extra cycle.
- Instructions that push data onto the stack take 1 extra cycle.
- Instructions that pop data from the stack take 2 extra cycles, since they also need to pre-increment the stack pointer.
- "Extra" cycles often include an extra read or write that usually does not affect the outcome.
- There is a minimum of 2 cycles per instruction.
Examples:
- SEC - 2 cycles: 1 byte opcode, but has to wait for the 2-cycle minimum.
- AND #imm - 2 cycles: opcode + operand = 2 bytes. Only affects registers.
- LDA zp - 3 cycles: opcode + operand + byte fetched from zp.
- STA abs - 4 cycles: opcode + 2 byte operand + byte written to abs.
- LDA abs, X - 4 or 5 cycles: opcode + 2 byte operand + read from abs, but if the addition of the X index causes a page crossing it delays 1 extra cycle.
- ASL zp - 5 cycles: opcode + operand + read from zp + write to zp, but it takes 1 extra cycle to modify the value.
- LDA (indirect), Y - 5 or 6 cycles: opcode + operand + two reads from zp + read from indirect address. 1 extra cycle if a page is crossed.
- STA (indirect), Y - 6 cycles: like LDA (indirect), Y but assumes the worst case of page crossing, so always spends 1 extra cycle reading in case the page correction is being applied.
- PHA - 3 cycles: opcode + stack write, but requires 1 extra cycle to perform the stack operation.
- RTS - 6 cycles: opcode + two stack reads, but requires 2 extra cycles to perform the stack operations, plus an additional cycle to post-increment the program counter (to compensate for the off-by-1 address pushed by JSR).
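The rules of thumb above can be turned into a small estimator. The following Python sketch is only an illustration of the arithmetic, not a timing reference: the function name estimate_cycles and its parameters are hypothetical, and the "extra" cycle counts are entered by hand from the examples rather than derived from the opcode.

```python
def estimate_cycles(instruction_bytes, memory_accesses=0, extra_cycles=0):
    """Rule-of-thumb estimate: 1 cycle per instruction byte fetched,
    1 per memory byte read or written, plus any extra cycles,
    with a minimum of 2 cycles per instruction."""
    return max(2, instruction_bytes + memory_accesses + extra_cycles)

# (instruction bytes, memory accesses, extra cycles), copied from the examples.
EXAMPLES = {
    "SEC":               (1, 0, 0),  # raised to the 2-cycle minimum
    "AND #imm":          (2, 0, 0),
    "LDA zp":            (2, 1, 0),
    "STA abs":           (3, 1, 0),
    "ASL zp":            (2, 2, 1),  # read + write, 1 extra for the modify stage
    "STA (indirect), Y": (2, 3, 1),  # two zp reads + write, always 1 extra
    "PHA":               (1, 1, 1),  # stack write, 1 extra for the push
    "RTS":               (1, 2, 3),  # two stack reads, 2 extra for pops, 1 for PC
}

def lda_abs_x(page_crossed):
    # Indexed read: 1 extra cycle only when the index crosses a page.
    return estimate_cycles(3, 1, 1 if page_crossed else 0)

ESTIMATES = {name: estimate_cycles(*parts) for name, parts in EXAMPLES.items()}
```

Looking up ESTIMATES["ASL zp"] or calling lda_abs_x(True) reproduces the counts quoted in the examples. For anything the rules do not cover, the comprehensive guides remain the authority.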
Short delays
Here are a few ways to create short delays with few or no side effects. As the shortest instruction time is 2 cycles, it is not possible to delay 1 cycle on its own. NOP is essential for 2-cycle delays. 3-cycle delays take at least 2 bytes and usually involve some compromise. More options become available as delays become longer.
- NOP - 2 cycles, 1 byte, no side effects
- JMP *+3 - 3 cycles, 3 bytes, no side effects
- Bxx *+2 - 3 cycles, 2 bytes, no side effects but requires a known flag state (e.g. BCC if carry is known to be clear)
- BIT zp - 3 cycles, 2 bytes, clobbers NVZ but preserves C, reads zp
- IGN zp - 3 cycles, 2 bytes, only side effect is a read, unofficial instruction
- NOP, NOP - 4 cycles, 2 bytes
- NOP, ... - 5 cycles, 3 bytes (... = 3 cycle delay of choice)
- NOP, NOP, NOP - 6 cycles, 3 bytes
- PHP, PLP - 7 cycles, 2 bytes, modifies 1 byte of stack
- NOP, NOP, NOP, NOP - 8 cycles, 4 bytes
- PHP, PLP, NOP - 9 cycles, 3 bytes, modifies 1 byte of stack
- PHP, CMP zp, PLP - 10 cycles, 4 bytes, modifies 1 byte of stack, reads zp
- PHP, PLP, NOP, NOP - 11 cycles, 4 bytes, modifies 1 byte of stack
- JSR, RTS - 12 cycles, 3 bytes (if taking advantage of an existing RTS elsewhere), modifies 2 bytes of stack
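The totals in this list are simply the sums of the individual instruction costs. The Python sketch below checks a few of them. The COSTS table and the sequence_cost helper are hypothetical names introduced for illustration, and the table covers only the official instructions used in the list.

```python
# Instruction -> (cycles, bytes) for the building blocks in the list above.
COSTS = {
    "NOP":    (2, 1),
    "BIT zp": (3, 2),
    "CMP zp": (3, 2),
    "PHP":    (3, 1),
    "PLP":    (4, 1),
    "JSR":    (6, 3),
    "RTS":    (6, 1),
}

def sequence_cost(sequence):
    """Return (total cycles, total bytes) for a list of instruction names."""
    cycles = sum(COSTS[name][0] for name in sequence)
    size = sum(COSTS[name][1] for name in sequence)
    return cycles, size

SEVEN = sequence_cost(["PHP", "PLP"])
TEN = sequence_cost(["PHP", "CMP zp", "PLP"])
# JSR + RTS counts 4 bytes here; the list says 3 because it assumes the
# RTS already exists elsewhere in the program and costs nothing extra.
TWELVE = sequence_cost(["JSR", "RTS"])
```

The same helper can be used to compare candidate sequences for a given delay before choosing the one whose side effects are acceptable.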
Clockslide
A clockslide[4] is a sequence of instructions that wastes a small constant number of cycles plus one cycle per executed byte, no matter whether it is entered on an odd or even address.
With official instructions, one can construct a clockslide from CMP instructions: ... C9 C9 C9 C9 C5 EA
Disassemble from the start and you get CMP #$C9, CMP #$C9, CMP $EA (6 bytes, 7 cycles).
Disassemble one byte in and you get CMP #$C9, CMP #$C5, NOP (5 bytes, 6 cycles).
The entry point can be controlled with an indirect jump or the RTS Trick to precisely control raster effect or sample playback timing.
CMP has a side effect of destroying most of the flags, but unofficial instructions that skip one byte can be used to preserve them. For example, replace $C9 (CMP) with $89 or $80, each of which skips one immediate byte, and replace $C5 with $04, $44, or $64, each of which reads a byte from zero page and ignores it.
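The "one cycle per executed byte" property of the CMP clockslide can be checked by decoding the six bytes from every possible entry point. The Python sketch below is a toy decoder written for this illustration: it knows only the three official opcodes in the slide, and the names OPCODES, SLIDE and run_slide are hypothetical.

```python
# Opcode -> (mnemonic, instruction length in bytes, cycles).
OPCODES = {
    0xC9: ("CMP #imm", 2, 2),
    0xC5: ("CMP zp", 2, 3),
    0xEA: ("NOP", 1, 2),
}

# The six bytes shown in the CMP clockslide example.
SLIDE = bytes([0xC9, 0xC9, 0xC9, 0xC9, 0xC5, 0xEA])

def run_slide(code, entry):
    """Decode from the given entry offset to the end of the slide.
    Returns (list of mnemonics executed, total cycles)."""
    pc = entry
    executed = []
    cycles = 0
    while pc < len(code):
        mnemonic, length, cost = OPCODES[code[pc]]
        executed.append(mnemonic)
        cycles += cost
        pc += length  # operand bytes are skipped, whatever their value
    return executed, cycles

# Cycle totals for entry offsets 0 through 5.
CYCLES_BY_ENTRY = [run_slide(SLIDE, entry)[1] for entry in range(len(SLIDE))]
```

Each entry one byte later costs exactly one cycle less, and every path ends on the same instruction boundary after the final byte, which is what makes the slide usable as a variable delay.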
Resources
- Fixed-cycle delay code vending machine - code for generating shortest-possible delay routines at compile-time.
- 6502 vdelay - code for delaying a variable number of cycles at run-time.