DMA
The 2A03 contains a pair of DMA units, one for copying sprite data to PPU OAM and the other for copying DPCM sample data to the APU's DMC sample buffer. DMA is required for DPCM playback, and it is difficult to fill OAM without DMA. Unfortunately, DMC DMA can also result in data loss when reading registers with side effects, such as the joypads.
Summary
- DMA alternates between cycles on which it can get (read) and cycles on which it can put (write). These are the first and second halves of APU cycles, respectively. The power-on alignment with the CPU is random.
- DMA can only halt on CPU read cycles. On write cycles, the halt fails and the DMA unit tries again next cycle, repeating until successful.
- OAM DMA halts the CPU, performs an optional alignment cycle, and then gets and puts 256 times, taking 513 or 514 cycles. This happens on the first cycle after the $4014 write.
- DMC DMA halts the CPU, performs a dummy cycle and an optional alignment cycle, and then gets once, taking 3 or 4 cycles.
- The first DMC DMA after the $4015 write happens on the get cycle during the second following APU cycle. The other DMC DMAs happen on put cycles.
- The DMC DMA get takes precedence over OAM DMA get, delaying it, but the DMA cycles otherwise overlap, reducing the total cycle cost.
- DMC DMA has bugs triggered by stopping the sample at specific times.
- DMC DMA can corrupt joypad and PPUDATA ($2007) reads.
Cadence
The DMA units cannot just read or write on any cycle as they wish. Instead, they alternate between get cycles where they can read and put cycles where they can write. If they need to perform an action that is not permitted on the current cycle, they wait.
Get and put cycles are aligned to the first and second halves of the APU clock, respectively (called apu_clk1 and apu_clk2 in Visual2A03). While these cycles are sometimes described as even and odd CPU cycles, this is not accurate because the CPU and APU randomly power into either of 2 alignments relative to each other. Therefore, get and put may occur on different CPU cycle parities across different power cycles.
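The alignment rule above can be sketched as a small model. This is only an illustration: `cycle_kind`, `cpu_cycle`, and `get_parity` are hypothetical names, and `get_parity` stands in for the random power-on alignment rather than anything software can read directly.

```python
def cycle_kind(cpu_cycle: int, get_parity: int) -> str:
    """Classify a CPU cycle as "get" or "put".

    cpu_cycle:  CPU cycles elapsed since power-on (hypothetical counter).
    get_parity: 0 or 1, the CPU-cycle parity that lines up with get cycles
                for this power-on (hypothetical stand-in for the random
                CPU/APU alignment).
    """
    if get_parity not in (0, 1):
        raise ValueError("get_parity must be 0 or 1")
    return "get" if cpu_cycle % 2 == get_parity else "put"
```

With this model, the same CPU cycle number is a get cycle under one alignment and a put cycle under the other, which is why "even" and "odd" are not reliable names for them.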
Behavior
During the DMA process, the CPU is halted and the 2A03's address and data lines are used for data transfer. The process involves a combination of no-operation cycles and access cycles. No-operation cycles come in 3 equivalent types: the halt cycle, a DMC-only dummy cycle, and an optional alignment cycle.
When DMA is scheduled, the associated DMA unit attempts to halt the CPU. The CPU only allows this on read cycles. If the CPU is writing, it ignores the halt and the DMA unit waits until the next cycle to try again, repeating until successful. Delays of up to 3 cycles are possible, with read-modify-write instructions having 2 consecutive writes and interrupts having 3. The halting process itself takes 1 CPU cycle, during which no useful work is done.
Once the CPU is halted, the DMA unit may need to perform some amount of non-access setup, taking up to 2 cycles. The exact timing of when DMA is scheduled and what kind of setup it needs depends on the type of DMA.
The CPU is halted using its internal RDY input. When RDY is deasserted, the 6502 core repeats the last read cycle indefinitely, making no forward progress nor handling interrupts. On 2A03 CPUs, these repeated reads are externally visible on any no-operation DMA cycle, causing data loss if reading a register with side effects. On 2A07 CPUs, it is suspected that a different address (perhaps the DMA address) is on the bus, instead, during all no-operation cycles. See Register conflicts for more information.
When the DMA process completes, the CPU performs the read it attempted when halted.
- DMA halts normally:
(halted) CPU reads from address A    <- DMA halt cycle
(halted) [DMA occurs]
         CPU reads from address A    <- CPU resumes execution
- DMA halt is delayed by writes:
         CPU writes                  <- DMA attempts to halt
         CPU writes                  <- DMA attempts to halt
(halted) CPU reads from address A    <- DMA halt cycle
(halted) [DMA occurs]
         CPU reads from address A    <- CPU resumes execution
- DMA has a non-access cycle:
(halted) CPU reads from address A    <- DMA halt cycle
(halted) CPU reads from address A    <- DMA does not read or write
(halted) DMA accesses address B
         CPU reads from address A    <- CPU resumes execution
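The retry-until-read rule for halting can be modeled in a few lines. This is a minimal sketch, not a description of the hardware logic: `halt_cycle` and its string encoding of CPU bus activity ("R" for a read cycle, "W" for a write cycle) are assumptions made for illustration.

```python
def halt_cycle(accesses: str, first_attempt: int) -> int:
    """Return the index of the cycle on which the CPU is actually halted.

    accesses:      one character per CPU cycle, "R" (read) or "W" (write).
                   This encoding is hypothetical.
    first_attempt: index of the cycle on which DMA first tries to halt.
    """
    if any(kind not in "RW" for kind in accesses):
        raise ValueError("accesses may contain only 'R' and 'W'")
    cycle = first_attempt
    # The halt fails on every write cycle and is retried on the next cycle.
    while cycle < len(accesses) and accesses[cycle] == "W":
        cycle += 1
    if cycle >= len(accesses):
        raise ValueError("no read cycle found at or after first_attempt")
    return cycle
```

For example, `halt_cycle("RWWWR", 1) - 1` gives a delay of 3 cycles, the worst case described above for interrupts.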
OAM DMA
OAM DMA copies 256 bytes from a CPU page to PPU OAM via the OAMDATA ($2004) register. It is triggered by writing the page number (the high byte of the address) to OAMDMA ($4014). OAM DMA is scheduled to halt the CPU on the first cycle after the register write. In the common case, it performs a halt cycle, an optional alignment cycle, and 256 get/put pairs.
The 256 get/put pairs copy forward from the start of the page. Because DMA can only read on get cycles, an alignment cycle performing no useful work may be required before being able to read. All together, OAM DMA on its own takes 513 or 514 cycles, depending on whether alignment is needed.
OAM DMA will copy from the page most recently written to $4014. This means that read-modify-write instructions such as INC $4014, which are able to perform a second write before the CPU can be halted, will copy from the second page written, not the first.
OAM DMA has a lower priority than DMC DMA. If a DMC DMA get occurs during OAM DMA, OAM DMA is briefly paused. (See DMC DMA during OAM DMA)
- Alignment is not needed:
         (get) CPU writes to $4014
(halted) (put) CPU reads from address A    <- DMA halt cycle
(halted) (get) DMA reads from $xx00
(halted) (put) DMA writes to $2004
(halted) (get) DMA reads from $xx01
(halted) (put) DMA writes to $2004
...
(halted) (get) DMA reads from $xxFF
(halted) (put) DMA writes to $2004         <- DMA completes
         (get) CPU reads from address A    <- CPU resumes execution
- Alignment is needed:
         (put) CPU writes to $4014
(halted) (get) CPU reads from address A    <- DMA halt cycle
(halted) (put) CPU reads from address A    <- DMA alignment cycle
(halted) (get) DMA reads from $xx00
(halted) (put) DMA writes to $2004
(halted) (get) DMA reads from $xx01
(halted) (put) DMA writes to $2004
...
(halted) (get) DMA reads from $xxFF
(halted) (put) DMA writes to $2004         <- DMA completes
         (get) CPU reads from address A    <- CPU resumes execution
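The 513 versus 514 cycle count follows directly from which cycle type the $4014 write lands on. The sketch below works through that arithmetic; `oam_dma_cycles` is a hypothetical helper, and it assumes the cycle after the write is a read cycle so the halt is not delayed.

```python
def oam_dma_cycles(write_on: str) -> int:
    """Cycles spent halted by OAM DMA, given the cycle type of the $4014 write.

    Assumes the halt succeeds on the very next cycle (no delaying writes)
    and that no DMC DMA overlaps the transfer.
    """
    if write_on not in ("get", "put"):
        raise ValueError("write_on must be 'get' or 'put'")
    halt_on = "put" if write_on == "get" else "get"
    # The first DMA read must land on a get cycle. If the halt cycle is a
    # get, the following cycle is a put and is spent on alignment.
    alignment = 1 if halt_on == "get" else 0
    return 1 + alignment + 2 * 256  # halt + optional alignment + get/put pairs
```

A write on a get cycle yields 513 cycles and a write on a put cycle yields 514, matching the two diagrams above.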
DMC DMA
DMC DMA copies a single byte to the DMC unit's sample buffer. This occurs automatically after the DMC enable bit, bit 4, of the sound channel enable register ($4015) is set to 1, which starts DPCM sample playback using the current DMC settings in registers $4010-4013. DMC DMA is scheduled when all of the following are true: DPCM playback is enabled, there are bytes left in the sample, and the sample buffer is empty (see Memory reader and Output unit). In the common cases, DMC DMA performs a halt cycle, a dummy cycle, an optional alignment cycle, and a get.
The exact timing depends on the type of DMC DMA. There are two types: load and reload. Load DMAs occur after $4015 D4 is set, but only if the sample buffer is empty. They are scheduled to halt the CPU on a get cycle during the 2nd APU cycle after the write (that is, the 3rd or 4th CPU cycle). Reload DMAs occur in response to the sample buffer being emptied. Unlike load DMAs, they are scheduled to halt the CPU on a put cycle.
After the halt, DMC DMA always performs a dummy cycle where no work is done. If the next cycle is not a get cycle, then a cycle will be spent on alignment. Then the DMA read is performed.
DMC DMA normally takes 3 or 4 cycles, depending on whether alignment is needed. Because load and reload DMAs schedule on different cycle types, load DMAs take 3 cycles and reload DMAs take 4 unless the halt is delayed by an odd number of cycles. However, bugs can cause additional cycles; see Bugs below.
- Load DMA:
         (get) \ CPU writes to $4015       <- DMC enabled
         (put) / during this APU cycle     <- DMC enabled
         (get) CPU reads
         (put) CPU reads
(halted) (get) CPU reads from address A    <- DMA halt cycle
(halted) (put) CPU reads from address A    <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A    <- CPU resumes execution
- Load DMA (delayed 1 cycle):
         (get) \ CPU writes to $4015       <- DMC enabled
         (put) / during this APU cycle     <- DMC enabled
         (get) CPU reads
         (put) CPU reads
         (get) CPU writes                  <- DMA attempts to halt
(halted) (put) CPU reads from address A    <- DMA halt cycle
(halted) (get) CPU reads from address A    <- DMA dummy cycle
(halted) (put) CPU reads from address A    <- DMA alignment cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A    <- CPU resumes execution
- Reload DMA:
(halted) (put) CPU reads from address A    <- DMA halt cycle
(halted) (get) CPU reads from address A    <- DMA dummy cycle
(halted) (put) CPU reads from address A    <- DMA alignment cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A    <- CPU resumes execution
- Reload DMA (delayed 1 cycle):
         (put) CPU writes                  <- DMA attempts to halt
(halted) (get) CPU reads from address A    <- DMA halt cycle
(halted) (put) CPU reads from address A    <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A    <- CPU resumes execution
- Reload DMA (delayed 2 cycles):
         (put) CPU writes                  <- DMA attempts to halt
         (get) CPU writes                  <- DMA attempts to halt
(halted) (put) CPU reads from address A    <- DMA halt cycle
(halted) (get) CPU reads from address A    <- DMA dummy cycle
(halted) (put) CPU reads from address A    <- DMA alignment cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A    <- CPU resumes execution
- Reload DMA (delayed 3 cycles):
         (put) CPU writes                  <- DMA attempts to halt
         (get) CPU writes                  <- DMA attempts to halt
         (put) CPU writes                  <- DMA attempts to halt
(halted) (get) CPU reads from address A    <- DMA halt cycle
(halted) (put) CPU reads from address A    <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A    <- CPU resumes execution
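The 3-or-4-cycle rule for load and reload DMAs can be expressed as a short calculation. This is a sketch under the timing rules stated in this section; `dmc_dma_cycles` and its parameters are hypothetical names, and the bug cases described in the next section are not modeled.

```python
def dmc_dma_cycles(kind: str, delay: int = 0) -> int:
    """Halted cycles for one DMC DMA (halt + dummy + optional alignment + get).

    kind:  "load" (first halt attempt on a get cycle) or
           "reload" (first halt attempt on a put cycle).
    delay: number of failed halt attempts caused by CPU write cycles (0-3).
           Those write cycles are not counted, since the CPU still does
           useful work on them.
    """
    if kind not in ("load", "reload"):
        raise ValueError("kind must be 'load' or 'reload'")
    if not 0 <= delay <= 3:
        raise ValueError("delay must be between 0 and 3")
    first_attempt_is_get = kind == "load"
    # Each failed attempt moves the halt to the opposite cycle type.
    halt_is_get = first_attempt_is_get if delay % 2 == 0 else not first_attempt_is_get
    # Halt on get -> dummy on put -> next cycle is a get: no alignment needed.
    # Halt on put -> dummy on get -> next cycle is a put: one alignment cycle.
    alignment = 0 if halt_is_get else 1
    return 2 + alignment + 1
```

Evaluating it for each diagram above gives 3 cycles for an undelayed load, 4 for an undelayed reload, and the opposite result whenever the halt is delayed by an odd number of cycles.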
Bugs
DMC DMA suffers from two bugs[1][2] related to sample playback stopping around the time a DMC output cycle ends, which is what empties the sample buffer and triggers a reload DMA. This can happen explicitly, where the sample is stopped by clearing $4015 D4, or implicitly, where a non-looping 1-byte sample is started while the sample buffer is empty shortly before a reload DMA would happen. Implicit stops require this type of sample because the load DMA for the first byte when the sample buffer is empty is the only way to implicitly end the sample just before a reload DMA; longer samples will instead be ended by the reload DMA itself, as normal.
When sample playback is stopped during the APU cycle before a reload DMA would happen (that is, on the 2nd or 3rd CPU cycle before the halt attempt), the DMA starts, but is aborted after a single cycle. If the halt is delayed due to a write cycle, the aborted DMA doesn't occur at all. This aborted DMA happens regardless of how playback was stopped, whether explicitly or implicitly. In the implicit case, the write to begin the sample will normally be the 4th APU cycle before the reload DMA would happen (8th or 9th CPU cycle before).
On RP2A03H and late RP2A03G CPUs, when playback is stopped implicitly on the same APU cycle that a reload DMA would happen (that is, the 1st CPU cycle before the halt attempt), an unexpected reload DMA occurs. It is not known what address is read nor if the byte is played. It is suspected the address is either the same sample address or the next one, and that it is played. On RP2A03G CPUs, this bug was introduced sometime in 1990; earlier chips are unaffected.
It is not known whether 2A07 CPUs are affected by these bugs. Clone hardware may or may not be affected, and behavior on affected clones may differ from official CPUs. For example, the UM6561AF-2 features both the aborted DMA and unexpected DMA, but to trigger these with implicit stops, playback must be started 1 APU cycle earlier for unknown reasons.
- Explicit-stop aborted DMA:
         (get) \ CPU writes to $4015       <- DMC disabled
         (put) / during this APU cycle     <- DMC disabled
         (get) CPU reads
(halted) (put) CPU reads from address A    <- DMA halt cycle
         (get) CPU reads from address A    <- CPU resumes execution
- Implicit-stop aborted DMA:
         (get) \ CPU writes to $4015       <- DMC enabled w/ buffer empty
         (put) / during this APU cycle     <- DMC enabled w/ buffer empty
         (get) CPU reads
         (put) CPU reads
(halted) (get) CPU reads from address A    <- DMA halt cycle
(halted) (put) CPU reads from address A    <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A    <- CPU resumes execution
         (get) CPU reads
(halted) (put) CPU reads from address C    <- DMA halt cycle
         (get) CPU reads from address C    <- CPU resumes execution
- Implicit-stop unexpected DMA:
         (get) \ CPU writes to $4015       <- DMC enabled w/ buffer empty
         (put) / during this APU cycle     <- DMC enabled w/ buffer empty
         (get) CPU reads
         (put) CPU reads
(halted) (get) CPU reads from address A    <- DMA halt cycle
(halted) (put) CPU reads from address A    <- DMA dummy cycle
(halted) (get) DMA reads from address B
(halted) (put) CPU reads from address A    <- DMA halt cycle
(halted) (get) CPU reads from address A    <- DMA dummy cycle
(halted) (put) CPU reads from address A    <- DMA alignment cycle
(halted) (get) DMA reads
         (put) CPU reads from address A    <- CPU resumes execution
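The conditions for the two bugs can be summarized as a small decision function, for instance as a starting point for an emulator. This is a rough sketch of the rules as described in this section only: `stop_bug` and its parameters are hypothetical, "stop" means the moment playback actually ends (not the $4015 write that eventually causes an implicit stop), and the effect of a delayed halt on the unexpected DMA is not stated here and is therefore not modeled.

```python
def stop_bug(cycles_before_halt: int, implicit: bool,
             halt_delayed: bool = False, late_revision: bool = True) -> str:
    """Classify what happens when playback stops shortly before a reload DMA.

    cycles_before_halt: CPU cycles between the stop and the reload DMA's
                        halt attempt.
    implicit:           True if the sample ended on its own, False if it was
                        stopped by clearing $4015 D4.
    halt_delayed:       True if the halt attempt lands on a CPU write cycle.
    late_revision:      True for RP2A03H and late RP2A03G chips.
    Returns "aborted", "unexpected", or "none".
    """
    if cycles_before_halt in (2, 3):
        # Stopped during the APU cycle before the reload DMA: the DMA is
        # aborted after one cycle, unless a write cycle delays the halt.
        return "none" if halt_delayed else "aborted"
    if cycles_before_halt == 1 and implicit and late_revision:
        # Stopped implicitly on the same APU cycle as the reload DMA.
        return "unexpected"
    return "none"
```

The `halt_delayed` branch is what the write-cycle trick described below relies on to suppress the aborted DMA.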
The implicit-stop aborted DMA can be prevented with carefully placed write cycles. This can be necessary for synchronized code, where the aborted DMA's odd number of cycles can invert cycle parity. The following code synchronizes to a put cycle and uses precise write cycles to prevent any aborted DMA that may occur:
STx $4015  ; Initiate DMC DMA
STx zp     ; Force load DMA to the 4th cycle
STx zp     ; Override the aborted DMA
; The first cycle of the next instruction is a put cycle.
- Implicit-stop aborted DMA bypass (write on get):
         (get) CPU writes to $4015         <- DMC enabled w/ buffer empty
         (put) CPU reads
         (get) CPU reads
         (put) CPU writes
(halted) (get) CPU reads from address A    <- DMA halt cycle
(halted) (put) CPU reads from address A    <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A    <- CPU resumes execution
         (get) CPU reads
         (put) CPU writes                  <- DMA attempts to halt
         (get) CPU reads
- Implicit-stop aborted DMA bypass (write on put):
         (put) CPU writes to $4015         <- DMC enabled w/ buffer empty
         (get) CPU reads
         (put) CPU reads
         (get) CPU writes                  <- DMA attempts to halt
(halted) (put) CPU reads from address A    <- DMA halt cycle
(halted) (get) CPU reads from address A    <- DMA dummy cycle
(halted) (put) CPU reads from address A    <- DMA alignment cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A    <- CPU resumes execution
         (get) CPU reads
         (put) CPU writes                  <- DMA attempts to halt
         (get) CPU reads
DMC DMA during OAM DMA
DMC and OAM use independent DMA units that only interact when both attempt to access memory on the same cycle. When accesses collide, DMC DMA is allowed to run and OAM DMA is paused, trying again on the next cycle. This can cause OAM DMA to have to perform an additional alignment cycle before continuing. No-operation cycles are allowed to overlap with each other and with access cycles, allowing cycles to be saved.
In the common case, DMC DMA occurring during OAM DMA will cost only 2 cycles: 1 cycle for the DMC DMA get and then 1 cycle for OAM DMA to align back to a get. However, if DMC DMA occurs at the end of OAM DMA, it can take 1 or 3 cycles. If it schedules for the second-to-last put, its get will occur on the first cycle after OAM DMA, taking just 1 cycle total. If it schedules for the last put, it will instead extend 3 cycles beyond the end of OAM DMA.
OAM DMA is sometimes used to synchronize code to avoid conflicts with DMC DMA when reading hardware registers, but because DMC DMA takes an odd number of cycles if it lands at the end, synchronization is not guaranteed.
- DMC DMA at the start of OAM DMA (write on get), taking 2 cycles
         (get) CPU writes to $4014
(halted) (put) CPU reads from address A         <- DMC and OAM DMA halt cycle
(halted) (get) OAM DMA reads from $xx00         <- DMC DMA dummy cycle
(halted) (put) OAM DMA writes to $2004          <- DMC DMA alignment cycle
(halted) (get) DMC DMA reads from address B
(halted) (put) CPU reads from address A         <- OAM DMA alignment cycle
(halted) (get) DMA reads from $xx01
(halted) (put) DMA writes to $2004
...
- DMC DMA at the start of OAM DMA (write on put), taking 2 cycles
         (put) CPU writes to $4014              <- DMC DMA attempts to halt
(halted) (get) CPU reads from address A         <- DMC and OAM DMA halt cycle
(halted) (put) CPU reads from address A         <- DMC DMA dummy cycle, OAM DMA alignment cycle
(halted) (get) DMC DMA reads from address B
(halted) (put) CPU reads from address A         <- OAM DMA alignment cycle
(halted) (get) OAM DMA reads from $xx00
(halted) (put) OAM DMA writes to $2004
...
- DMC DMA in the middle of OAM DMA, taking 2 cycles
...
(halted) (get) OAM DMA reads from address C
(halted) (put) OAM DMA writes to $2004          <- DMC DMA halt cycle
(halted) (get) OAM DMA reads from address C+1   <- DMC DMA dummy cycle
(halted) (put) OAM DMA writes to $2004          <- DMC DMA alignment cycle
(halted) (get) DMC DMA reads from address B
(halted) (put) CPU reads from address A         <- OAM DMA alignment cycle
(halted) (get) OAM DMA reads from address C+2
...
- DMC DMA on second-to-last OAM DMA put, taking 1 cycle
...
(halted) (get) OAM DMA reads from $xxFE
(halted) (put) OAM DMA writes to $2004          <- DMC DMA halt cycle
(halted) (get) OAM DMA reads from $xxFF         <- DMC DMA dummy cycle
(halted) (put) OAM DMA writes to $2004          <- DMC DMA alignment cycle
(halted) (get) DMC DMA reads from address B
         (put) CPU reads from address A         <- CPU resumes execution
- DMC DMA on last OAM DMA put, taking 3 cycles
...
(halted) (get) OAM DMA reads from $xxFF
(halted) (put) OAM DMA writes to $2004          <- DMC DMA halt cycle
(halted) (get) CPU reads from address A         <- DMC DMA dummy cycle
(halted) (put) CPU reads from address A         <- DMC DMA alignment cycle
(halted) (get) DMC DMA reads from address B
         (put) CPU reads from address A         <- CPU resumes execution
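The 1, 2, and 3 cycle costs above can be reproduced with a small cycle-stepping model of the two units. This is a simplified sketch, not a hardware-accurate simulation: the function names are hypothetical, even cycle numbers are arbitrarily treated as get cycles, and the DMC halt is assumed to happen no earlier than the OAM halt.

```python
from typing import Optional

TRANSFERS = 256  # bytes copied by one OAM DMA


def last_dma_cycle(oam_halt: int, dmc_halt: Optional[int] = None) -> int:
    """Return the last cycle on which the CPU is still halted.

    Assumptions of this sketch: even cycle numbers are get cycles, odd ones
    are put cycles, and dmc_halt (if given) is not earlier than oam_halt.
    """
    dmc_get = None
    if dmc_halt is not None:
        dmc_get = dmc_halt + 2      # halt cycle, then dummy cycle
        if dmc_get % 2 != 0:        # wait one more cycle for a get if needed
            dmc_get += 1

    cycle = oam_halt + 1
    done = 0
    holding = False                 # True between an OAM get and its put
    last_oam = oam_halt
    while done < TRANSFERS:
        is_get = cycle % 2 == 0
        if not holding and is_get and cycle != dmc_get:
            holding = True          # OAM DMA get
        elif holding and not is_get:
            holding = False         # OAM DMA put
            done += 1
            last_oam = cycle
        # Any other cycle is an alignment cycle for OAM DMA.
        cycle += 1
    return last_oam if dmc_get is None else max(last_oam, dmc_get)


def dmc_cost(oam_halt: int, dmc_halt: int) -> int:
    """Extra halted cycles caused by a DMC DMA that overlaps OAM DMA."""
    return last_dma_cycle(oam_halt, dmc_halt) - last_dma_cycle(oam_halt)
```

With the OAM halt on cycle 1, the puts fall on cycles 3 through 513, so `dmc_cost(1, 101)` models a DMC DMA in the middle, `dmc_cost(1, 511)` the second-to-last put, and `dmc_cost(1, 513)` the last put.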
Register conflicts
On the 2A03, while the CPU is halted, it repeats the read cycle on which it was halted during every no-operation DMA cycle (that is, when the DMA units are not reading or writing). If the CPU was reading a register with side-effects, this can cause data to be lost. While this isn't realistically a problem with OAM DMA because of its particular timing constraints, it is a very real problem with DMC DMA that must usually be worked around. Example registers include PPUSTATUS ($2002), PPUDATA ($2007), and sound status ($4015). When conflicting with DMC DMA, these will see 2 or more extra reads. (Note that $2007 reads on adjacent cycles may have unexpected behavior.)
Most frequently, these DMC DMA conflicts occur with joypad reads, though the mechanism is slightly different. Joypads are clocked via direct lines from the CPU, called joypad 1 /OE and joypad 2 /OE, rather than going over the address bus. These output enables remain asserted the entire CPU cycle and even across adjacent cycles if they're both reading the same joypad register. Therefore, controllers only see a single read for each contiguous set of reads of a joypad register.
The console type affects joypad extra-read behavior. On the RF Famicom, additional hardware outside the 2A03 only passes joypad 1 /OE and joypad 2 /OE through during one half of the clock cycle, meaning the joypad sees a clock on every single CPU read cycle rather than just every contiguous set. The RF Famicom is the only console known to behave this way. The AV Famicom and NES-001 are confirmed to use the per-contiguous-set behavior.[3]
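The difference between the two console behaviors can be illustrated by counting the clocks a controller sees for a given run of bus cycles. This is a minimal sketch: `joypad_clocks` is a hypothetical helper, and its input is the register address effectively selected on each cycle, which is an abstraction of the /OE lines rather than a literal signal trace.

```python
from typing import Iterable, Optional


def joypad_clocks(reads: Iterable[Optional[int]], register: int = 0x4016,
                  per_read: bool = False) -> int:
    """Count the clocks a controller sees over a sequence of cycles.

    reads:    address read on each cycle (None for a cycle with no read).
    register: the joypad register of interest.
    per_read: False models the per-contiguous-set behavior (AV Famicom,
              NES-001); True models the RF Famicom, where every read cycle
              produces a clock.
    """
    clocks = 0
    previous_hit = False
    for address in reads:
        hit = address == register
        if hit and (per_read or not previous_hit):
            clocks += 1
        previous_hit = hit
    return clocks
```

For a 4-cycle DMC DMA that interrupts a $4016 read and fetches from $C000, the sequence `[0x4016, 0x4016, 0x4016, 0xC000, 0x4016]` produces 2 clocks with the contiguous-set behavior and 4 with the RF Famicom behavior, where only 1 was intended.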
This is further complicated by esoteric behavior regarding how 2A03 registers are activated: instead of checking the full 2A03 address bus, it checks bits 4-0 from the 2A03 address bus and bits 15-5 from the 6502 core. The 6502 core keeps the same address while halted, so if it was reading from $4000-401F, the DMA address can unintentionally activate 2A03 registers. This can lead to bus conflicts and an extra read from the other joypad, and can even prevent an extra read of the current joypad.[4]
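The address combination described above amounts to a pair of bit masks. The sketch below shows the arithmetic; `register_select_address` and `selects_2a03_register` are hypothetical names, and the function models only the address decode, not the resulting bus conflict.

```python
def register_select_address(core_addr: int, bus_addr: int) -> int:
    """Combine bits 15-5 of the 6502 core address with bits 4-0 of the
    2A03 address bus, as used for 2A03 register activation."""
    return (core_addr & 0xFFE0) | (bus_addr & 0x001F)


def selects_2a03_register(core_addr: int, bus_addr: int) -> bool:
    """True if the combined address falls in the $4000-401F register range."""
    return 0x4000 <= register_select_address(core_addr, bus_addr) <= 0x401F
```

For a CPU halted while reading $4016, a DMA read from $C017 combines to $4017 and a DMA read from $C015 combines to $4015, while a CPU halted on $2007 never selects a 2A03 register this way.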
The 2A07 fixes these extra read problems, though perhaps not completely. It is suspected it does so by disconnecting the 6502 core's address bus from the 2A07's while halted. However, it is not known if the register activation behavior has also been changed to use just the 2A07 address bus or if it still uses a combination of the 2A07's and 6502 core's address buses. If the 2A07 has the same register behavior as the 2A03, then reading a joypad could still cause DPCM data corruption and cause an extra read of the other joypad if a conflicting DMA reads from an address matching a readable 2A03 register in the low 5 bits.
Workarounds exist for this issue. Most commonly, joypads are read multiple times until the same result is seen twice in a row, reducing or eliminating the chance of accepting corrupted data. However, any collisions during this time may corrupt the DPCM data, and this strategy is not suitable for all affected registers. Alternatively, reads synchronized using OAM DMA can ensure a collision never happens, but this enforces strict timing constraints on the code and has numerous caveats, particularly for functions longer than one DMC output cycle (sample byte period).
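The repeated-read strategy can be sketched as follows. This shows only the control flow, not console code: `read_until_stable` is a hypothetical helper, `read_once` stands for whatever routine strobes the controller and shifts in all 8 button bits, and the `max_reads` cap is an assumption added so the loop always terminates.

```python
from typing import Callable


def read_until_stable(read_once: Callable[[], int], max_reads: int = 4) -> int:
    """Re-read the joypad until two consecutive reads agree.

    read_once: hypothetical callable returning one full 8-bit joypad state.
    max_reads: cap on the total number of reads (an assumption of this
               sketch); if no two consecutive reads match, the latest
               value is returned.
    """
    if max_reads < 2:
        raise ValueError("max_reads must be at least 2")
    previous = read_once()
    for _ in range(max_reads - 1):
        current = read_once()
        if current == previous:
            return current
        previous = current
    return previous
```

Each extra full read costs CPU time and is itself exposed to further DMC DMA collisions, which is the trade-off noted above.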
- DMC DMA collides with $2007 read (3 extra reads)
(halted) (put) CPU reads from $2007        <- DMA halt cycle
(halted) (get) CPU reads from $2007        <- DMA dummy cycle
(halted) (put) CPU reads from $2007        <- DMA alignment cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from $2007        <- CPU resumes execution
- DMC DMA collides with JOYPAD1 ($4016) read (1 or 3 extra reads)
(halted) (put) CPU reads from $4016        <- DMA halt cycle
(halted) (get) CPU reads from $4016        <- DMA dummy cycle
(halted) (put) CPU reads from $4016        <- DMA alignment cycle
(halted) (get) DMA reads from $C000
         (put) CPU reads from $4016        <- CPU resumes execution
- DMC DMA collides with JOYPAD1 read (0 or 4 extra reads)
- The combined address from 6502 core bits 15-5 and 2A03 bits 4-0 is $4016.
- This triggers a bus conflict, corrupting the DMA read.
(halted) (put) CPU reads from $4016        <- DMA halt cycle
(halted) (get) CPU reads from $4016        <- DMA dummy cycle
(halted) (put) CPU reads from $4016        <- DMA alignment cycle
(halted) (get) DMA reads from $C016 and $4016
         (put) CPU reads from $4016        <- CPU resumes execution
- DMC DMA collides with JOYPAD1 read (1 extra read of JOYPAD2, 1 or 3 extra of JOYPAD1)
- The combined address from 6502 core bits 15-5 and 2A03 bits 4-0 is $4017.
- This triggers a bus conflict, corrupting the DMA read.
(halted) (put) CPU reads from $4016        <- DMA halt cycle
(halted) (get) CPU reads from $4016        <- DMA dummy cycle
(halted) (put) CPU reads from $4016        <- DMA alignment cycle
(halted) (get) DMA reads from $C017 and $4017
         (put) CPU reads from $4016        <- CPU resumes execution
- DMC DMA collides with JOYPAD1 read (1 or 3 extra reads)
- The combined address from 6502 core bits 15-5 and 2A03 bits 4-0 is $4015.
- This causes the 2A03 to read $4015 and ignore the DMA value on the external data bus.
(halted) (put) CPU reads from $4016        <- DMA halt cycle
(halted) (get) CPU reads from $4016        <- DMA dummy cycle
(halted) (put) CPU reads from $4016        <- DMA alignment cycle
(halted) (get) DMA reads from $C015, 2A03 reads from $4015 internally
         (put) CPU reads from $4016        <- CPU resumes execution
References
- Forum post: Disch's OAM/DMC DMA test results
- [1] Forum post: Fiskbit's manual DMA test suite
- [2] Forum post: Fiskbit's explicit and implicit stop tests
- [3] Forum post: Fiskbit's joypad read cycle breakdown
- [4] Forum post: Fiskbit's APU register activation test