DMA: Difference between revisions

From NESdev Wiki
(Replaces DMA article stub with detailed DMA behavior writeup. There are still some edge cases to test and 2A07 behavior to verify, but it's largely complete.)
(→‎Register conflicts: Adds Twin Famicom and Famicom Titler to the list of consoles known to clock joypads per CPU cycle.)
 
(10 intermediate revisions by the same user not shown)
Line 1: Line 1:
The 2A03 contains a DMA unit capable of quickly copying sprite data to PPU [[OAM]] and DPCM sample data to the APU's [[APU DMC|DMC]] unit. DMA is required for DPCM playback, and it is difficult to fill OAM without DMA. Unfortunately, DMC DMA can also result in data loss when reading registers with side effects, such as the joypads.
The 2A03 contains a pair of DMA units, one for copying sprite data to PPU [[OAM]] and the other for copying DPCM sample data to the APU's [[APU DMC|DMC]] sample buffer. DMA is required for DPCM playback, and it is difficult to fill OAM without DMA. Unfortunately, DMC DMA can also result in data loss when reading registers with side effects, such as the joypads.


==Summary==
==Summary==
* DMA alternates between cycles on which it can get (read) and cycles on which it can put (write). These are the first and second halves of APU cycles, respectively. The power-on alignment with the CPU is random.
* The CPU alternates between cycles on which DMA can get (read) and cycles on which DMA can put (write). These are the first and second halves of APU cycles, respectively. At power-on, whether the first CPU cycle is get or put is random.
* DMA can only halt on CPU read cycles. On write cycles, the halt fails and the DMA unit tries again next cycle, repeating until successful.
* If DMA tries to get on a put cycle, it waits and tries again next cycle. This wait is called an alignment cycle.
* OAM DMA halts the CPU, performs an optional alignment cycle, and then gets and puts 256 times, taking 513 or 514 cycles. This happens on the first cycle after the $4014 write.
* DMA can only halt on CPU read cycles. On write cycles, the halt fails and the DMA unit tries again next CPU cycle, repeating until successful.
* OAM DMA halts the CPU, performs an optional alignment cycle, and then gets and puts 256 times, taking 513 or 514 cycles. It attempts to halt on the first CPU cycle after the $4014 write.
* DMC DMA halts the CPU, performs a dummy cycle and an optional alignment cycle, and then gets once, taking 3 or 4 cycles.  
* DMC DMA halts the CPU, performs a dummy cycle and an optional alignment cycle, and then gets once, taking 3 or 4 cycles.  
* The first DMC DMA after the $4015 write happens on the get cycle during the second following APU cycle. The other DMC DMAs happen on put cycles.
* The first, "load" DMC DMA after the $4015 write attempts to halt on the get cycle during the 2nd following APU cycle. The other, "reload" DMC DMAs attempt to halt on a put cycle. Failed halts try again on the next CPU cycle, repeating until successful.
* The DMC DMA get takes precedence over OAM DMA get, delaying it, but the DMA cycles otherwise overlap, reducing the total cycle cost.
* The DMC DMA get takes precedence over OAM DMA get, delaying it, but the DMA cycles otherwise overlap, reducing the total cycle cost. This delay can force OAM DMA to add an alignment cycle.
* DMC DMA has [[#Bugs|bugs triggered by stopping the sample at specific times]].
* DMC DMA has [[#Bugs|bugs triggered by stopping the sample at specific times]].
* DMC DMA can [[#Register conflicts|corrupt joypad and PPUDATA ($2007) reads]].
* DMC DMA can [[#Register conflicts|corrupt joypad and PPUDATA ($2007) reads and cause extraneous reads of $4015-4017]].


==Cadence==
==Cadence==
The DMA unit cannot just read or write on any cycle as it wishes. Instead, it alternates between '''get''' cycles where it can read and '''put''' cycles where it can write. If it needs to perform an action that is not permitted on the current cycle, it waits.
The DMA units cannot just read or write on any cycle as they wish. Instead, they alternate between '''get''' cycles where they can read and '''put''' cycles where they can write. If they need to perform an action that is not permitted on the current cycle, they wait.


Get and put cycles are aligned to the first and second halves of the APU clock, respectively (called apu_clk1 and apu_clk2 in Visual2A03). While these cycles are sometimes described as even and odd CPU cycles, this is not accurate because the CPU and APU randomly power into either of 2 alignments relative to each other. Therefore, get and put may occur on different CPU cycle parities across different power cycles.
Get and put cycles are aligned to the first and second halves of the APU clock, respectively (called apu_clk1 and apu_clk2 in Visual2A03). While these cycles are sometimes described as even and odd CPU cycles, this is not accurate because the CPU and APU randomly power into either of 2 alignments relative to each other. Therefore, get and put may occur on different CPU cycle parities across different power cycles.
Line 19: Line 20:
During the DMA process, the CPU is halted and the 2A03's address and data lines are used for data transfer. The process involves a combination of no-operation cycles and access cycles. No-operation cycles come in 3 equivalent types: the halt cycle, a DMC-only dummy cycle, and an optional alignment cycle.
During the DMA process, the CPU is halted and the 2A03's address and data lines are used for data transfer. The process involves a combination of no-operation cycles and access cycles. No-operation cycles come in 3 equivalent types: the halt cycle, a DMC-only dummy cycle, and an optional alignment cycle.


When DMA is scheduled, the DMA unit attempts to halt the CPU. It is only able to do this on CPU read cycles. If the CPU is writing, the halt fails and the DMA unit waits until the next cycle to try again, repeating until a successful halt. Delays of up to 3 cycles are possible, with read-modify-write instructions having 2 consecutive writes and interrupts having 3. The halting process itself takes 1 CPU cycle, during which no useful work is done.
When DMA is scheduled, the associated DMA unit attempts to halt the CPU. The CPU only allows this on read cycles. If the CPU is writing, it ignores the halt and the DMA unit waits until the next cycle to try again, repeating until successful. Delays of up to 3 cycles are possible, with read-modify-write instructions having 2 consecutive writes and interrupts having 3. The halting process itself takes 1 CPU cycle, during which no useful work is done.


Once the CPU is halted, the DMA unit may need to perform some amount of non-access setup, taking up to 2 cycles. The exact timing of when DMA is scheduled and what kind of setup it needs depends on the type of DMA.
Once the CPU is halted, the DMA unit may need to perform some amount of non-access setup, taking up to 2 cycles. The exact timing of when DMA is scheduled and what kind of setup it needs depends on the type of DMA.
Line 160: Line 161:


====Bugs====
====Bugs====
DMC DMA suffers from two bugs<ref>[https://forums.nesdev.org/viewtopic.php?p=250170#p250170 Forum post:] Fiskbit's manual DMA test suite</ref><ref>[https://forums.nesdev.org/viewtopic.php?p=275734#p275734 Forum post:] Fiskbit's explicit and implicit stop tests</ref> related to sample playback stopping around the time [[APU_DMC#Output_unit|a DMC output cycle ends]], which is what empties the sample buffer and triggers a reload DMA. This can happen explicitly, where the sample is stopped by clearing $4015 D4, or implicitly, where a non-looping 1-byte sample is started while the sample buffer is empty shortly before a reload DMA would happen. Implicit stops require this type of sample because the load DMA for the first byte when the sample buffer is empty is the only way to implicitly end the sample just before a reload DMA; longer samples will instead be ended by the reload DMA itself, as normal.
DMC DMA suffers from two bugs<ref>[https://forums.nesdev.org/viewtopic.php?p=250170#p250170 Forum post:] Fiskbit's manual DMA test suite</ref><ref>[https://forums.nesdev.org/viewtopic.php?p=275734#p275734 Forum post:] Fiskbit's explicit and implicit stop tests</ref> related to sample playback stopping around the time [[APU_DMC#Output_unit|a DMC output cycle ends]], which is what empties the sample buffer and triggers a reload DMA. This can happen explicitly, where the sample is stopped by clearing $4015 D4, or implicitly, where a non-looping 1-byte sample is started while the sample buffer is empty shortly before a reload DMA would schedule. Implicit stops require this type of sample because the load DMA for the first byte when the sample buffer is empty is the only way to implicitly end the sample just before a reload DMA; longer samples will instead be ended by the reload DMA itself, as normal.


When sample playback is stopped during the APU cycle before a reload DMA would happen (that is, on the 2nd or 3rd CPU cycle before the halt attempt), the DMA starts, but is aborted after a single cycle. If the halt is delayed due to a write cycle, the aborted DMA doesn't occur at all. This aborted DMA happens regardless of how playback was stopped, whether explicitly or implicitly. In the implicit case, the write to begin the sample will normally be the 4th APU cycle before the reload DMA would happen (8th or 9th CPU cycle before).
When sample playback is stopped during the APU cycle before a reload DMA would schedule (that is, on the 2nd or 3rd CPU cycle before the halt attempt), the DMA starts, but is aborted after a single cycle. If the halt is delayed due to a write cycle, the aborted DMA doesn't occur at all. This aborted DMA schedules regardless of how playback was stopped, whether explicitly or implicitly. In the implicit case, the write to begin the sample will normally be the 4th APU cycle before the reload DMA would schedule (8th or 9th CPU cycle before).


On RP2A03H and late RP2A03G CPUs, when playback is stopped implicitly on the same APU cycle that a reload DMA would happen (that is, the 1st CPU cycle before the halt attempt), an unexpected reload DMA occurs. It is not known what address is read nor if the byte is played. It is suspected the address is either the same sample address or the next one, and that it is played. On RP2A03G CPUs, this bug was introduced sometime in 1990; earlier chips are unaffected.
On RP2A03H and late RP2A03G CPUs, when playback is stopped implicitly on the same APU cycle that a reload DMA would schedule (that is, the 1st CPU cycle before the halt attempt), an unexpected reload DMA occurs from the same address. This extra byte goes into the sample buffer and is played after the first byte, as with any normal fetch. On RP2A03G CPUs, this bug was introduced sometime in 1990; earlier chips are unaffected.


It is not known whether 2A07 CPUs are affected by these bugs. Clone hardware may or may not be affected, and behavior on affected clones may differ from official CPUs. For example, the UM6561AF-2 features both the aborted DMA and unexpected DMA, but to trigger these with implicit stops, playback must be started 1 APU cycle earlier for unknown reasons.
It is not known whether 2A07 CPUs are affected by these bugs. Some clone hardware is known to be affected, and behavior on affected clones may differ from official CPUs. For example, UA6527P-based clones feature both the aborted DMA and unexpected DMA bugs, but samples take 1 APU cycle longer to end than on official CPUs, so to trigger these bugs with implicit stops, the sample-ending byte must be fetched 1 APU cycle earlier.


<div style="margin-left: 2em;">
<div style="margin-left: 2em;">
Line 203: Line 204:
  (halted) (get) CPU reads from address A  <- DMA dummy cycle
  (halted) (get) CPU reads from address A  <- DMA dummy cycle
  (halted) (put) CPU reads from address A  <- DMA alignment cycle
  (halted) (put) CPU reads from address A  <- DMA alignment cycle
  (halted) (get) DMA reads
  (halted) (get) DMA reads from address B
           (put) CPU reads from address A  <- CPU resumes execution
           (put) CPU reads from address A  <- CPU resumes execution
</div></div></div>
</div></div></div>


The implicit-stop aborted DMA can be prevented with carefully placed write cycles. This can be necessary for synchronized code, where the aborted DMA's odd number of cycles can invert cycle parity. The following code synchronizes to a put cycle and uses precise write cycles to prevent any aborted DMA that may occur:
The implicit-stop aborted DMA can be prevented with carefully placed write cycles. This can be necessary for synchronized code, where the aborted DMA's odd number of cycles can invert cycle parity. The following code synchronizes to a put cycle and uses precise write cycles to prevent any aborted DMA that may occur:
  STx $4015  ; Initiate DMC DMA
  STx $4015  ; Initiate DMC DMA.
  STx zp    ; Force load DMA to the 4th cycle
  STx zp    ; Force load DMA to the 4th cycle.
  STx zp    ; Override the aborted DMA
            ; (If on a UA6527P-based clone, place a NOP here.)
  STx zp    ; Override the aborted DMA.
  ; The first cycle of the next instruction is a put cycle.
  ; The first cycle of the next instruction is a put cycle.


Line 246: Line 248:


===DMC DMA during OAM DMA===
===DMC DMA during OAM DMA===
DMC and OAM DMA are independent from each other and only interact when both attempt to access memory on the same cycle. When accesses collide, DMC DMA is allowed to run and OAM DMA is paused, trying again on the next cycle. This can cause OAM DMA to have to perform an additional alignment cycle before continuing. No-operation cycles are allowed to overlap with each other and with access cycles, allowing cycles to be saved.
DMC and OAM use independent DMA units that only interact when both attempt to access memory on the same cycle. When accesses collide, DMC DMA is allowed to run and OAM DMA is paused, trying again on the next cycle. This can cause OAM DMA to have to perform an additional alignment cycle before continuing. No-operation cycles are allowed to overlap with each other and with access cycles, allowing cycles to be saved.


In the common case, DMC DMA occurring during OAM DMA will cost only 2 cycles: 1 cycle for the DMC DMA get and then 1 cycle for OAM DMA to align back to a get. However, if DMC DMA occurs at the end of OAM DMA, it can take 1 or 3 cycles. If it schedules for the second-to-last put, its get will occur on the first cycle after OAM DMA, taking just 1 cycle total. If it schedules for the last put, it will instead extend 3 cycles beyond the end of OAM DMA.
In the common case, DMC DMA occurring during OAM DMA will cost only 2 cycles: 1 cycle for the DMC DMA get and then 1 cycle for OAM DMA to align back to a get. However, if DMC DMA occurs at the end of OAM DMA, it can take 1 or 3 cycles. If it schedules for the second-to-last put, its get will occur on the first cycle after OAM DMA, taking just 1 cycle total. If it schedules for the last put, it will instead extend 3 cycles beyond the end of OAM DMA.
Line 308: Line 310:


==Register conflicts==
==Register conflicts==
On the 2A03, while the CPU is halted, it repeats the read cycle on which it was halted during every no-operation DMA cycle (that is, when the DMA unit is not reading or writing). If the CPU was reading a register with side-effects, this can cause data to be lost. While this isn't realistically a problem with OAM DMA because of its particular timing constraints, it is a very real problem with DMC DMA that must usually be worked around. Example registers include PPUSTATUS ($2002), PPUDATA ($2007), and sound status ($4015). When conflicting with DMC DMA, these will see 2 or more extra reads. (Note that $2007 reads on adjacent cycles may have unexpected behavior.)
On the 2A03, while the CPU is halted, it repeats the read cycle on which it was halted during every no-operation DMA cycle (that is, when the DMA units are not reading or writing). If the CPU was reading a register with side-effects, this can cause data to be lost. While this isn't realistically a problem with OAM DMA because of its particular timing constraints, it is a very real problem with DMC DMA that must usually be worked around. Example registers include PPUSTATUS ($2002), PPUDATA ($2007), and sound status ($4015). When conflicting with DMC DMA, these will see 2 or more extra reads. (Note that $2007 reads on adjacent cycles may have unexpected behavior.)


Most frequently, these DMC DMA conflicts occur with joypad reads, though the mechanism is slightly different. Joypads are clocked via direct lines from the CPU, called joypad 1 /OE and joypad 2 /OE, rather than going over the address bus. These output enables remain asserted the entire CPU cycle and even across adjacent cycles if they're both reading the same joypad register. Therefore, controllers only see a single read for each contiguous set of reads of a joypad register.
Most frequently, these DMC DMA conflicts occur with joypad reads, though the mechanism is slightly different. Joypads are clocked via direct lines from the CPU, called joypad 1 /OE and joypad 2 /OE, rather than going over the address bus. These output enables remain asserted the entire CPU cycle and even across adjacent cycles if they're both reading the same joypad register. Therefore, controllers only see a single read for each contiguous set of reads of a joypad register.


The console type affects joypad extra-read behavior. On the RF Famicom, additional hardware outside the 2A03 only passes joypad 1 /OE and joypad 2 /OE through during one half of the clock cycle, meaning the joypad sees a clock on every single CPU read cycle rather than just every contiguous set. The RF Famicom is the only console known to behave this way. The AV Famicom and NES-001 are confirmed to use the per-contiguous-set behavior.<ref>[https://forums.nesdev.org/viewtopic.php?p=275468#p275468 Forum post:] Fiskbit's joypad read cycle breakdown</ref>
The console type affects joypad extra-read behavior. On the RF Famicom, additional hardware outside the 2A03 only passes joypad 1 /OE and joypad 2 /OE through during one half of the clock cycle, meaning the joypad sees a clock on every single CPU read cycle rather than just every contiguous set<ref>[https://forums.nesdev.org/viewtopic.php?p=275946#p275946 Forum post:] lidnariq's RF Famicom joypad clocking explanation</ref>. The RF Famicom, Twin Famicom, and Famicom Titler are known to behave this way. The AV Famicom and NES-001 are confirmed to use the per-contiguous-set behavior.<ref>[https://forums.nesdev.org/viewtopic.php?p=275468#p275468 Forum post:] Fiskbit's joypad read cycle breakdown</ref>
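
The difference between the two clocking behaviors can be illustrated with a small model. This is only a sketch: the function name, the list-of-addresses representation, and the <code>per_cycle</code> flag are hypothetical, and it ignores the register activation quirk described in the next paragraph.

```python
from typing import Optional, Sequence


def joypad_clocks(reads: Sequence[Optional[int]], register: int, per_cycle: bool) -> int:
    """Count the clocks a controller sees for a sequence of CPU cycles.

    reads holds the address read on each cycle (None for a cycle that is
    not a read). per_cycle=True models consoles that clock the joypad on
    every read cycle; per_cycle=False models consoles where contiguous
    reads of the same joypad register look like a single read.
    """
    clocks = 0
    previous = None
    for address in reads:
        if address == register and (per_cycle or previous != register):
            clocks += 1
        previous = address
    return clocks


# A joypad read repeated over 3 adjacent cycles, a read elsewhere,
# and then the joypad read once more.
example = [0x4016, 0x4016, 0x4016, 0x8000, 0x4016]
contiguous_set_clocks = joypad_clocks(example, 0x4016, per_cycle=False)
per_cycle_clocks = joypad_clocks(example, 0x4016, per_cycle=True)
```

In this model, the first three reads count as one clock with per-contiguous-set behavior and as three clocks with per-cycle behavior.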


This is further complicated by esoteric behavior regarding how 2A03 registers are activated: instead of checking the full 2A03 address bus, it checks bits 4-0 from the 2A03 address bus and bits 15-5 from the 6502 core. The 6502 core keeps the same address while halted, so if it was reading from $4000-401F, the DMA address can unintentionally activate 2A03 registers. This can lead to bus conflicts and an extra read from the other joypad, and can even prevent an extra read of the current joypad.<ref>[https://forums.nesdev.org/viewtopic.php?p=275132#p275132 Forum post:] Fiskbit's APU register activation test</ref>
This is further complicated by esoteric behavior regarding how 2A03 registers are activated: instead of checking the full 2A03 address bus, it checks bits 4-0 from the 2A03 address bus and bits 15-5 from the 6502 core. The 6502 core keeps the same address while halted, so if it was reading from $4000-401F, the DMA address can unintentionally activate 2A03 registers. This can lead to bus conflicts and an extra read from the other joypad, and can even prevent an extra read of the current joypad.<ref>[https://forums.nesdev.org/viewtopic.php?p=275132#p275132 Forum post:] Fiskbit's APU register activation test</ref>
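
The mixed address decoding can be sketched as follows. The function names are hypothetical and the code only models which register address is decoded, not the resulting bus conflicts.

```python
def register_decode_address(core_address: int, bus_address: int) -> int:
    """Combine bits 15-5 of the 6502 core address with bits 4-0 of the
    2A03 address bus, as described above."""
    return (core_address & 0xFFE0) | (bus_address & 0x001F)


def activates_2a03_register(core_address: int, bus_address: int) -> bool:
    """True if the combined address falls in the $4000-401F register range."""
    return 0x4000 <= register_decode_address(core_address, bus_address) <= 0x401F


# The halted core is reading joypad 1 ($4016) while DMA reads from $C017:
# the combined address is $4017, the other joypad register.
conflict = register_decode_address(0x4016, 0xC017)
```

Because the high bits come from the core, a DMA read can only activate a register in this model when the halted core was itself reading from $4000-401F.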


The 2A07 fixes these extra read problems, though perhaps not completely. It is suspected it does so by disconnecting the 6502 core's address bus from the 2A07's while halted. However, it is not known if the register activation behavior has also been changed to use just the 2A07 address bus or if it still uses a combination of the 2A07's and 6502 core's address buses. If the 2A07 has the same register behavior as the 2A03, then reading a joypad could still cause DPCM data corruption and cause an extra read of the other joypad if a conflicting DMA reads from an address matching a readable 2A03 register in the low 5 bits.
The 2A07 fixes these extra read problems, but the mechanism is not yet understood. Experimentally, the CPU still performs a read on the halting cycle; if OAM DMA is done from the $4000 page, the open bus value used by the OAM DMA is the opcode of the instruction following the $4014 write. This means that 6502 core reads still occur when halted, at least on the first cycle. Like on NTSC, this DMA also does not trigger 2A07 registers if the 6502 core is not reading from $4000-401F. Unlike NTSC, reading a joypad register when DMA reads an address that matches the other joypad register in its low 5 bits does not clock the other joypad.


Workarounds exist for this issue. Most commonly, [[Controller_reading_code#DPCM_Safety_using_Repeated_Reads|joypads are read multiple times]] until the same result is seen twice in a row, reducing or eliminating the chance of accepting corrupted data. However, any collisions during this time may corrupt the DPCM data, and this strategy is not suitable for all affected registers. Alternatively, [[Controller_reading_code#DPCM_Safety_using_OAM_DMA|reads synchronized using OAM DMA]] can ensure a collision never happens, but this enforces strict timing constraints on the code and has numerous caveats, particularly for functions longer than one DMC output cycle (sample byte period).
Workarounds exist for this issue. Most commonly, [[Controller_reading_code#DPCM_Safety_using_Repeated_Reads|joypads are read multiple times]] until the same result is seen twice in a row, reducing or eliminating the chance of accepting corrupted data. However, any collisions during this time may corrupt the DPCM data, and this strategy is not suitable for all affected registers. Alternatively, [[Controller_reading_code#DPCM_Safety_using_OAM_DMA|reads synchronized using OAM DMA]] can ensure a collision never happens, but this enforces strict timing constraints on the code and has numerous caveats, particularly for functions longer than one DMC output cycle (sample byte period).
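
The repeated-read strategy can be modeled at a high level like this. It is a logic sketch only, not NES code: <code>read_joypad</code> is a hypothetical callable standing in for a full 8-bit controller read, and real implementations are written in 6502 assembly.

```python
from typing import Callable


def read_until_stable(read_joypad: Callable[[], int]) -> int:
    """Repeat full joypad reads until two consecutive results match."""
    previous = read_joypad()
    while True:
        current = read_joypad()
        if current == previous:
            return current
        previous = current


# Simulated results: the first read is corrupted, the next two agree.
results = iter([0x81, 0x80, 0x80])
stable = read_until_stable(lambda: next(results))
```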
Line 367: Line 369:


==References==
==References==
:* [https://github.com/emu-russia/breaks/blob/master/BreakingNESWiki_DeepL/APU/dma.md BreakingNESWiki:] DMA circuit analysis
:* [https://forums.nesdev.org/viewtopic.php?t=14120 Forum post:] Disch's OAM/DMC DMA test results
:* [https://forums.nesdev.org/viewtopic.php?t=14120 Forum post:] Disch's OAM/DMC DMA test results
:* [https://forums.nesdev.org/viewtopic.php?p=62690#p62690 Forum post:] Blargg's DMA tests
:* [https://forums.nesdev.org/viewtopic.php?p=231604#p231604 Forum post:] Fiskbit's aligned controller read tests
:* [https://forums.nesdev.org/viewtopic.php?p=95703#95703 Forum post:] cpow's Visual 2A03 DMC vs OAM DMA analysis
<references/>
<references/>

Latest revision as of 01:26, 17 March 2024

The 2A03 contains a pair of DMA units, one for copying sprite data to PPU OAM and the other for copying DPCM sample data to the APU's DMC sample buffer. DMA is required for DPCM playback, and it is difficult to fill OAM without DMA. Unfortunately, DMC DMA can also result in data loss when reading registers with side effects, such as the joypads.

Summary

  • The CPU alternates between cycles on which DMA can get (read) and cycles on which DMA can put (write). These are the first and second halves of APU cycles, respectively. At power-on, whether the first CPU cycle is get or put is random.
  • If DMA tries to get on a put cycle, it waits and tries again next cycle. This wait is called an alignment cycle.
  • DMA can only halt on CPU read cycles. On write cycles, the halt fails and the DMA unit tries again next CPU cycle, repeating until successful.
  • OAM DMA halts the CPU, performs an optional alignment cycle, and then gets and puts 256 times, taking 513 or 514 cycles. It attempts to halt on the first CPU cycle after the $4014 write.
  • DMC DMA halts the CPU, performs a dummy cycle and an optional alignment cycle, and then gets once, taking 3 or 4 cycles.
  • The first, "load" DMC DMA after the $4015 write attempts to halt on the get cycle during the 2nd following APU cycle. The other, "reload" DMC DMAs attempt to halt on a put cycle. Failed halts try again on the next CPU cycle, repeating until successful.
  • The DMC DMA get takes precedence over OAM DMA get, delaying it, but the DMA cycles otherwise overlap, reducing the total cycle cost. This delay can force OAM DMA to add an alignment cycle.
  • DMC DMA has bugs triggered by stopping the sample at specific times.
  • DMC DMA can corrupt joypad and PPUDATA ($2007) reads and cause extraneous reads of $4015-4017.

Cadence

The DMA units cannot just read or write on any cycle as they wish. Instead, they alternate between get cycles where they can read and put cycles where they can write. If they need to perform an action that is not permitted on the current cycle, they wait.

Get and put cycles are aligned to the first and second halves of the APU clock, respectively (called apu_clk1 and apu_clk2 in Visual2A03). While these cycles are sometimes described as even and odd CPU cycles, this is not accurate because the CPU and APU randomly power into either of 2 alignments relative to each other. Therefore, get and put may occur on different CPU cycle parities across different power cycles.
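
As a minimal sketch of this cadence, an emulator-style model might derive the cycle type from the CPU cycle count and a power-on alignment flag. The names here are hypothetical, and the sketch assumes the alignment stays fixed until the next power cycle.

```python
def cycle_type(cpu_cycle: int, first_cycle_is_get: bool) -> str:
    """Return "get" or "put" for a 0-based CPU cycle count.

    first_cycle_is_get models the random power-on alignment between the
    CPU and APU clocks (assumed constant while powered on).
    """
    is_even = cpu_cycle % 2 == 0
    return "get" if is_even == first_cycle_is_get else "put"
```

With one alignment, even CPU cycles are get cycles; with the other, they are put cycles, which is why describing get and put as fixed CPU cycle parities is inaccurate.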

Behavior

During the DMA process, the CPU is halted and the 2A03's address and data lines are used for data transfer. The process involves a combination of no-operation cycles and access cycles. No-operation cycles come in 3 equivalent types: the halt cycle, a DMC-only dummy cycle, and an optional alignment cycle.

When DMA is scheduled, the associated DMA unit attempts to halt the CPU. The CPU only allows this on read cycles. If the CPU is writing, it ignores the halt and the DMA unit waits until the next cycle to try again, repeating until successful. Delays of up to 3 cycles are possible, with read-modify-write instructions having 2 consecutive writes and interrupts having 3. The halting process itself takes 1 CPU cycle, during which no useful work is done.

Once the CPU is halted, the DMA unit may need to perform some amount of non-access setup, taking up to 2 cycles. The exact timing of when DMA is scheduled and what kind of setup it needs depends on the type of DMA.

The CPU is halted using its internal RDY input. When RDY is deasserted, the 6502 core repeats the last read cycle indefinitely, neither making forward progress nor handling interrupts. On 2A03 CPUs, these repeated reads are externally visible on any no-operation DMA cycle, causing data loss if reading a register with side effects. On 2A07 CPUs, it is suspected that a different address (perhaps the DMA address) is on the bus instead during all no-operation cycles. See Register conflicts for more information.

When the DMA process completes, the CPU performs the read it attempted when halted.

Examples - General behavior
  • DMA halts normally:
(halted) CPU reads from address A  <- DMA halt cycle
(halted) [DMA occurs]
         CPU reads from address A  <- CPU resumes execution
  • DMA halt is delayed by writes:
         CPU writes                <- DMA attempts to halt
         CPU writes                <- DMA attempts to halt
(halted) CPU reads from address A  <- DMA halt cycle
(halted) [DMA occurs]
         CPU reads from address A  <- CPU resumes execution
  • DMA has a non-access cycle:
(halted) CPU reads from address A  <- DMA halt cycle
(halted) CPU reads from address A  <- DMA does not read or write
(halted) DMA accesses address B
         CPU reads from address A  <- CPU resumes execution
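
The halt retry logic shown in these examples can be sketched as a small function. The representation of CPU cycles as a list of "read" and "write" strings is an assumption made for illustration.

```python
from typing import Sequence


def halt_cycle(cpu_cycles: Sequence[str], scheduled: int) -> int:
    """Return the index of the CPU cycle on which the halt succeeds.

    cpu_cycles lists what the CPU does on each cycle ("read" or "write");
    scheduled is the index of the first halt attempt. The list must
    contain a read at or after that index.
    """
    i = scheduled
    while cpu_cycles[i] == "write":
        i += 1  # The halt fails on a write cycle; retry on the next cycle.
    return i


# A halt attempted during the 2 consecutive writes of a
# read-modify-write instruction succeeds on the following read.
delayed = halt_cycle(["write", "write", "read"], 0)
```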

OAM DMA


OAM DMA copies 256 bytes from a CPU page to PPU OAM via the OAMDATA ($2004) register. It is triggered by writing the page number (the high byte of the address) to OAMDMA ($4014). OAM DMA is scheduled to halt the CPU on the first cycle after the register write. In the common case, it performs a halt cycle, an optional alignment cycle, and 256 get/put pairs.

The 256 get/put pairs copy forward from the start of the page. Because DMA can only read on get cycles, an alignment cycle performing no useful work may be required before being able to read. All together, OAM DMA on its own takes 513 or 514 cycles, depending on whether alignment is needed.

OAM DMA will copy from the page most recently written to $4014. This means that read-modify-write instructions such as INC $4014, which are able to perform a second write before the CPU can be halted, will copy from the second page written, not the first.

OAM DMA has a lower priority than DMC DMA. If a DMC DMA get occurs during OAM DMA, OAM DMA is briefly paused. (See DMC DMA during OAM DMA)

Examples - OAM DMA
  • Alignment is not needed:
         (get) CPU writes to $4014
(halted) (put) CPU reads from address A  <- DMA halt cycle
(halted) (get) DMA reads from $xx00
(halted) (put) DMA writes to $2004
(halted) (get) DMA reads from $xx01
(halted) (put) DMA writes to $2004
             ...
(halted) (get) DMA reads from $xxFF
(halted) (put) DMA writes to $2004       <- DMA completes
         (get) CPU reads from address A  <- CPU resumes execution
  • Alignment is needed:
         (put) CPU writes to $4014
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA alignment cycle
(halted) (get) DMA reads from $xx00
(halted) (put) DMA writes to $2004
(halted) (get) DMA reads from $xx01
(halted) (put) DMA writes to $2004
             ...
(halted) (get) DMA reads from $xxFF
(halted) (put) DMA writes to $2004       <- DMA completes
         (get) CPU reads from address A  <- CPU resumes execution
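
The 513 or 514 cycle total follows directly from the two examples above. A minimal sketch, with a hypothetical function name and ignoring any DMC DMA that overlaps the transfer:

```python
def oam_dma_cycles(halt_on_put: bool) -> int:
    """Total cycles for OAM DMA on its own, counted from the halt cycle.

    halt_on_put is True when the successful halt cycle is a put cycle,
    so that the next cycle is already a get.
    """
    halt = 1
    # If the halt lands on a get, the next cycle is a put and the DMA
    # must wait one alignment cycle before its first read.
    alignment = 0 if halt_on_put else 1
    transfers = 256 * 2  # 256 get/put pairs
    return halt + alignment + transfers
```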

DMC DMA


DMC DMA copies a single byte to the DMC unit's sample buffer. This occurs automatically after the DMC enable bit, bit 4, of the sound channel enable register ($4015) is set to 1, which starts DPCM sample playback using the current DMC settings in registers $4010-4013. DMC DMA is scheduled when all of the following are true: DPCM playback is enabled, there are bytes left in the sample, and the sample buffer is empty (see Memory reader and Output unit). In the common case, DMC DMA performs a halt cycle, a dummy cycle, an optional alignment cycle, and a get.

The exact timing depends on the type of DMC DMA. There are two types: load and reload. Load DMAs occur after $4015 D4 is set, but only if the sample buffer is empty. They are scheduled to halt the CPU on a get cycle during the 2nd APU cycle after the write (that is, the 3rd or 4th CPU cycle). Reload DMAs occur in response to the sample buffer being emptied. Unlike load DMAs, they are scheduled to halt the CPU on a put cycle.

After the halt, DMC DMA always performs a dummy cycle where no work is done. If the next cycle is not a get cycle, then a cycle will be spent on alignment. Then the DMA read is performed.

DMC DMA normally takes 3 or 4 cycles, depending on whether alignment is needed. Because load and reload DMAs schedule on different cycle types, load DMAs take 3 cycles and reload DMAs take 4 unless the halt is delayed by an odd number of cycles. However, bugs can cause additional cycles; see Bugs below.
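
The relationship between DMA type, halt delay, and cycle count can be sketched as below. The function and parameter names are hypothetical, the count starts at the successful halt cycle, and the bugs mentioned above are not modeled.

```python
def dmc_dma_cycles(kind: str, write_delay: int = 0) -> int:
    """Cycles taken by a DMC DMA, counted from the successful halt cycle.

    kind is "load" or "reload"; write_delay is the number of CPU write
    cycles that delayed the halt.
    """
    if kind not in ("load", "reload"):
        raise ValueError("kind must be 'load' or 'reload'")
    # Load DMAs schedule on a get cycle, reload DMAs on a put cycle.
    scheduled_on_get = kind == "load"
    # Each cycle of delay flips the cycle type on which the halt lands.
    halt_on_get = scheduled_on_get == (write_delay % 2 == 0)
    halt, dummy, get = 1, 1, 1
    # After a halt on put, the dummy cycle lands on get, so the read has
    # to wait one alignment cycle for the next get.
    alignment = 0 if halt_on_get else 1
    return halt + dummy + alignment + get
```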

Examples - DMC DMA
  • Load DMA:
         (get) \ CPU writes to $4015     <- DMC enabled 
         (put) / during this APU cycle   <- DMC enabled
         (get) CPU reads
         (put) CPU reads
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution
  • Load DMA (delayed 1 cycle):
         (get) \ CPU writes to $4015     <- DMC enabled 
         (put) / during this APU cycle   <- DMC enabled
         (get) CPU reads
         (put) CPU reads
         (get) CPU writes                <- DMA attempts to halt
(halted) (put) CPU reads from address A  <- DMA halt cycle
(halted) (get) CPU reads from address A  <- DMA dummy cycle
(halted) (put) CPU reads from address A  <- DMA alignment cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution
  • Reload DMA:
(halted) (put) CPU reads from address A  <- DMA halt cycle
(halted) (get) CPU reads from address A  <- DMA dummy cycle
(halted) (put) CPU reads from address A  <- DMA alignment cycle
(halted) (get) DMA reads from address B 
         (put) CPU reads from address A  <- CPU resumes execution
  • Reload DMA (delayed 1 cycle):
         (put) CPU writes                <- DMA attempts to halt
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution
  • Reload DMA (delayed 2 cycles):
         (put) CPU writes                <- DMA attempts to halt
         (get) CPU writes                <- DMA attempts to halt
(halted) (put) CPU reads from address A  <- DMA halt cycle
(halted) (get) CPU reads from address A  <- DMA dummy cycle
(halted) (put) CPU reads from address A  <- DMA alignment cycle
(halted) (get) DMA reads from address B 
         (put) CPU reads from address A  <- CPU resumes execution
  • Reload DMA (delayed 3 cycles):
         (put) CPU writes                <- DMA attempts to halt
         (get) CPU writes                <- DMA attempts to halt
         (put) CPU writes                <- DMA attempts to halt
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution

Bugs

DMC DMA suffers from two bugs[1][2] related to sample playback stopping around the time a DMC output cycle ends, which is what empties the sample buffer and triggers a reload DMA. This can happen explicitly, where the sample is stopped by clearing $4015 D4, or implicitly, where a non-looping 1-byte sample is started while the sample buffer is empty shortly before a reload DMA would schedule. Implicit stops require this type of sample because the load DMA for the first byte when the sample buffer is empty is the only way to implicitly end the sample just before a reload DMA; longer samples will instead be ended by the reload DMA itself, as normal.

When sample playback is stopped during the APU cycle before a reload DMA would schedule (that is, on the 2nd or 3rd CPU cycle before the halt attempt), the DMA starts, but is aborted after a single cycle. If the halt is delayed due to a write cycle, the aborted DMA doesn't occur at all. This aborted DMA schedules regardless of how playback was stopped, whether explicitly or implicitly. In the implicit case, the write to begin the sample will normally be the 4th APU cycle before the reload DMA would schedule (8th or 9th CPU cycle before).

On RP2A03H and late RP2A03G CPUs, when playback is stopped implicitly on the same APU cycle that a reload DMA would schedule (that is, the 1st CPU cycle before the halt attempt), an unexpected reload DMA occurs from the same address. This extra byte goes into the sample buffer and is played after the first byte, as with any normal fetch. On RP2A03G CPUs, this bug was introduced sometime in 1990; earlier chips are unaffected.

It is not known whether 2A07 CPUs are affected by these bugs. Some clone hardware is known to be affected, and behavior on affected clones may differ from official CPUs. For example, UA6527P-based clones feature both the aborted DMA and unexpected DMA bugs, but samples take 1 APU cycle longer to end than on official CPUs, so to trigger these bugs with implicit stops, the sample-ending byte must be fetched 1 APU cycle earlier.
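As a rough summary of the two bugs on official 2A03 CPUs, the conditions above can be written as a lookup. This is a sketch that encodes only what this section states; the function name, parameter names, and return strings are hypothetical, and clone timing differences are not modeled.

```python
def dmc_stop_outcome(cpu_cycles_before_halt, implicit,
                     halt_delayed_by_write=False,
                     has_unexpected_dma_bug=True):
    """Classify what happens when playback stops near a reload DMA.

    cpu_cycles_before_halt: how many CPU cycles before the reload DMA's
        halt attempt the sample playback was stopped.
    implicit: True for an implicit stop (non-looping 1-byte sample),
        False for an explicit stop (clearing $4015 D4).
    halt_delayed_by_write: True if the halt attempt lands on a write cycle.
    has_unexpected_dma_bug: True for RP2A03H and late RP2A03G (assumption:
        the caller knows which revision is being modeled).
    """
    if cpu_cycles_before_halt in (2, 3):
        # Stopped during the APU cycle before the reload DMA would schedule.
        if halt_delayed_by_write:
            return "suppressed"       # the aborted DMA does not occur at all
        return "aborted_dma"          # DMA starts, then aborts after 1 cycle
    if cpu_cycles_before_halt == 1 and implicit and has_unexpected_dma_bug:
        return "unexpected_dma"       # extra reload DMA from the same address
    return "no_bug"


if __name__ == "__main__":
    print(dmc_stop_outcome(2, implicit=False))
    print(dmc_stop_outcome(1, implicit=True))
```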

Examples - DMC DMA bugs
  • Explicit-stop aborted DMA:
         (get) \ CPU writes to $4015     <- DMC disabled
         (put) / during this APU cycle   <- DMC disabled
         (get) CPU reads
(halted) (put) CPU reads from address A  <- DMA halt cycle
         (get) CPU reads from address A  <- CPU resumes execution
  • Implicit-stop aborted DMA:
         (get) \ CPU writes to $4015     <- DMC enabled w/ buffer empty
         (put) / during this APU cycle   <- DMC enabled w/ buffer empty
         (get) CPU reads
         (put) CPU reads
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution
         (get) CPU reads
(halted) (put) CPU reads from address C  <- DMA halt cycle
         (get) CPU reads from address C  <- CPU resumes execution
  • Implicit-stop unexpected DMA:
         (get) \ CPU writes to $4015     <- DMC enabled w/ buffer empty
         (put) / during this APU cycle   <- DMC enabled w/ buffer empty
         (get) CPU reads
         (put) CPU reads
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA dummy cycle
(halted) (get) DMA reads from address B
(halted) (put) CPU reads from address A  <- DMA halt cycle
(halted) (get) CPU reads from address A  <- DMA dummy cycle
(halted) (put) CPU reads from address A  <- DMA alignment cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution

The implicit-stop aborted DMA can be prevented with carefully placed write cycles. This can be necessary for synchronized code, where the aborted DMA's odd number of cycles can invert cycle parity. The following code synchronizes to a put cycle and uses precise write cycles to prevent any aborted DMA that may occur:

STx $4015  ; Initiate DMC DMA.
STx zp     ; Force load DMA to the 4th cycle.
           ; (If on a UA6527P-based clone, place a NOP here.)
STx zp     ; Override the aborted DMA.
; The first cycle of the next instruction is a put cycle.

Examples - aborted DMA workaround
  • Implicit-stop aborted DMA bypass (write on get):
         (get) CPU writes to $4015       <- DMC enabled w/ buffer empty
         (put) CPU reads
         (get) CPU reads
         (put) CPU writes
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution
         (get) CPU reads
         (put) CPU writes                <- DMA attempts to halt
         (get) CPU reads
  • Implicit-stop aborted DMA bypass (write on put):
         (put) CPU writes to $4015       <- DMC enabled w/ buffer empty
         (get) CPU reads
         (put) CPU reads
         (get) CPU writes                <- DMA attempts to halt
(halted) (put) CPU reads from address A  <- DMA halt cycle
(halted) (get) CPU reads from address A  <- DMA dummy cycle
(halted) (put) CPU reads from address A  <- DMA alignment cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution
         (get) CPU reads
         (put) CPU writes                <- DMA attempts to halt
         (get) CPU reads

DMC DMA during OAM DMA

DMC and OAM use independent DMA units that only interact when both attempt to access memory on the same cycle. When accesses collide, DMC DMA is allowed to run and OAM DMA is paused, trying again on the next cycle. This can cause OAM DMA to have to perform an additional alignment cycle before continuing. No-operation cycles are allowed to overlap with each other and with access cycles, allowing cycles to be saved.

In the common case, DMC DMA occurring during OAM DMA will cost only 2 cycles: 1 cycle for the DMC DMA get and then 1 cycle for OAM DMA to align back to a get. However, if DMC DMA occurs at the end of OAM DMA, it can take 1 or 3 cycles. If it schedules for the second-to-last put, its get will occur on the first cycle after OAM DMA, taking just 1 cycle total. If it schedules for the last put, it will instead extend 3 cycles beyond the end of OAM DMA.

OAM DMA is sometimes used to synchronize code to avoid conflicts with DMC DMA when reading hardware registers, but because DMC DMA takes an odd number of cycles if it lands at the end, synchronization is not guaranteed.
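The cycle costs described above can be reproduced with a small Python model of the two DMA units. This is a simplified sketch under the rules stated in this section (DMC DMA wins colliding reads, and no-operation cycles may overlap); the function `halted_cycles` and its parameters are invented for illustration, and the DMC DMA bugs are not modeled.

```python
def halted_cycles(halt_on_get, dmc_halt_cycle=None):
    """Count the cycles the CPU stays halted for one OAM DMA.

    halt_on_get: True if the OAM DMA halt cycle (cycle 0) is a get cycle.
    dmc_halt_cycle: index of the DMC DMA halt cycle, counted from the
        OAM DMA halt cycle, or None if no DMC DMA occurs.
    """
    oam_bytes_done = 0
    oam_has_byte = False
    dmc_state = "idle" if dmc_halt_cycle is not None else "done"
    t = 0
    while oam_bytes_done < 256 or dmc_state in ("want_dummy", "want_read"):
        is_get = (t % 2 == 0) == halt_on_get
        dmc_reads = False

        # DMC DMA unit: halt cycle, dummy cycle, then read on the next get.
        if dmc_state == "idle" and t == dmc_halt_cycle:
            dmc_state = "want_dummy"      # this cycle is the DMC halt cycle
        elif dmc_state == "want_dummy":
            dmc_state = "want_read"       # this cycle is the DMC dummy cycle
        elif dmc_state == "want_read" and is_get:
            dmc_reads = True              # DMC DMA get
            dmc_state = "done"
        # (want_read on a put cycle is a DMC alignment cycle)

        # OAM DMA unit: cycle 0 is its halt cycle; it yields to a DMC read.
        if t > 0 and oam_bytes_done < 256 and not dmc_reads:
            if oam_has_byte and not is_get:
                oam_has_byte = False      # OAM DMA put to $2004
                oam_bytes_done += 1
            elif not oam_has_byte and is_get:
                oam_has_byte = True       # OAM DMA get
            # otherwise this is an OAM alignment cycle

        t += 1
    return t


if __name__ == "__main__":
    base = halted_cycles(False)
    print("no DMC DMA:", base)
    print("DMC mid-transfer:", halted_cycles(False, 10) - base, "extra")
    print("DMC on second-to-last put:", halted_cycles(False, 510) - base, "extra")
    print("DMC on last put:", halted_cycles(False, 512) - base, "extra")
```

With the halt on a put cycle, the puts fall on even cycle indices, so index 510 is the second-to-last put and 512 is the last.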

Examples - DMC DMA during OAM DMA
  • DMC DMA at the start of OAM DMA (write on get), taking 2 cycles
         (get) CPU writes to $4014
(halted) (put) CPU reads from address A      <- DMC and OAM DMA halt cycle
(halted) (get) OAM DMA reads from $xx00      <- DMC DMA dummy cycle
(halted) (put) OAM DMA writes to $2004       <- DMC DMA alignment cycle
(halted) (get) DMC DMA reads from address B
(halted) (put) CPU reads from address A      <- OAM DMA alignment cycle
(halted) (get) OAM DMA reads from $xx01
(halted) (put) OAM DMA writes to $2004
             ...
  • DMC DMA at the start of OAM DMA (write on put), taking 2 cycles
         (put) CPU writes to $4014           <- DMC DMA attempts to halt
(halted) (get) CPU reads from address A      <- DMC and OAM DMA halt cycle
(halted) (put) CPU reads from address A      <- DMC DMA dummy cycle, OAM DMA alignment cycle
(halted) (get) DMC DMA reads from address B
(halted) (put) CPU reads from address A      <- OAM DMA alignment cycle
(halted) (get) OAM DMA reads from $xx00
(halted) (put) OAM DMA writes to $2004
             ...
  • DMC DMA in the middle of OAM DMA, taking 2 cycles
             ...
(halted) (get) OAM DMA reads from address C
(halted) (put) OAM DMA writes to $2004         <- DMC DMA halt cycle
(halted) (get) OAM DMA reads from address C+1  <- DMC DMA dummy cycle
(halted) (put) OAM DMA writes to $2004         <- DMC DMA alignment cycle
(halted) (get) DMC DMA reads from address B
(halted) (put) CPU reads from address A        <- OAM DMA alignment cycle
(halted) (get) OAM DMA reads from address C+2
             ...
  • DMC DMA on second-to-last OAM DMA put, taking 1 cycle
             ...
(halted) (get) OAM DMA reads from $xxFE
(halted) (put) OAM DMA writes to $2004       <- DMC DMA halt cycle
(halted) (get) OAM DMA reads from $xxFF      <- DMC DMA dummy cycle
(halted) (put) OAM DMA writes to $2004       <- DMC DMA alignment cycle
(halted) (get) DMC DMA reads from address B
         (put) CPU reads from address A      <- CPU resumes execution
  • DMC DMA on last OAM DMA put, taking 3 cycles
             ...
(halted) (get) OAM DMA reads from $xxFF
(halted) (put) OAM DMA writes to $2004       <- DMC DMA halt cycle
(halted) (get) CPU reads from address A      <- DMC DMA dummy cycle
(halted) (put) CPU reads from address A      <- DMC DMA alignment cycle
(halted) (get) DMC DMA reads from address B
         (put) CPU reads from address A      <- CPU resumes execution

Register conflicts

On the 2A03, while the CPU is halted, it repeats the read cycle on which it was halted during every no-operation DMA cycle (that is, when the DMA units are not reading or writing). If the CPU was reading a register with side effects, this can cause data to be lost. While this isn't realistically a problem with OAM DMA because of its particular timing constraints, it is a very real problem with DMC DMA that must usually be worked around. Example registers include PPUSTATUS ($2002), PPUDATA ($2007), and sound status ($4015). When conflicting with DMC DMA, these will see 2 or more extra reads. (Note that $2007 reads on adjacent cycles may have unexpected behavior.)

Most frequently, these DMC DMA conflicts occur with joypad reads, though the mechanism is slightly different. Joypads are clocked via direct lines from the CPU, called joypad 1 /OE and joypad 2 /OE, rather than going over the address bus. These output enables remain asserted the entire CPU cycle and even across adjacent cycles if they're both reading the same joypad register. Therefore, controllers only see a single read for each contiguous set of reads of a joypad register.

The console type affects joypad extra-read behavior. On the RF Famicom, additional hardware outside the 2A03 only passes joypad 1 /OE and joypad 2 /OE through during one half of the clock cycle, meaning the joypad sees a clock on every single CPU read cycle rather than just every contiguous set[3]. The RF Famicom, Twin Famicom, and Famicom Titler are known to behave this way. The AV Famicom and NES-001 are confirmed to use the per-contiguous-set behavior.[4]
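The difference between the two clocking behaviors can be illustrated by counting clocks over a per-cycle trace of register reads. This is a sketch only: the function name `count_joypad_clocks` and the trace representation (one entry per CPU cycle, `None` when no joypad register is being read) are assumptions made for this example.

```python
JOYPAD1 = 0x4016

def count_joypad_clocks(register_reads, joypad=JOYPAD1, per_cycle=False):
    """Count how many clocks a controller sees.

    register_reads: one entry per CPU cycle, giving the register address
        being read on that cycle, or None if no register is read.
    per_cycle: True for consoles that clock on every read cycle
        (RF Famicom style), False for one clock per contiguous set of
        reads (AV Famicom and NES-001 style).
    """
    clocks = 0
    previous_hit = False
    for address in register_reads:
        hit = address == joypad
        # A new clock is seen on every hit, or only on the first hit
        # of a contiguous run, depending on the console type.
        if hit and (per_cycle or not previous_hit):
            clocks += 1
        previous_hit = hit
    return clocks


if __name__ == "__main__":
    # Halt, dummy, and alignment cycles repeat the $4016 read, the DMA read
    # goes elsewhere, and then the CPU performs its real $4016 read.
    trace = [0x4016, 0x4016, 0x4016, None, 0x4016]
    print(count_joypad_clocks(trace) - 1, "extra read(s), per contiguous set")
    print(count_joypad_clocks(trace, per_cycle=True) - 1, "extra read(s), per cycle")
```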

This is further complicated by esoteric behavior regarding how 2A03 registers are activated: instead of checking the full 2A03 address bus, it checks bits 4-0 from the 2A03 address bus and bits 15-5 from the 6502 core. The 6502 core keeps the same address while halted, so if it was reading from $4000-401F, the DMA address can unintentionally activate 2A03 registers. This can lead to bus conflicts and an extra read from the other joypad, and can even prevent an extra read of the current joypad.[5]
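The address combination described above can be expressed as a bit mask. The following Python sketch assumes only the bit split given in the text; the function names are hypothetical.

```python
def effective_register_address(core_address, bus_address):
    """Combine bits 15-5 of the 6502 core address with bits 4-0 of the
    2A03 address bus, as used for 2A03 register activation."""
    return (core_address & 0xFFE0) | (bus_address & 0x001F)


def activates_2a03_register(core_address, bus_address):
    """True if the combined address falls in the $4000-$401F register range."""
    return 0x4000 <= effective_register_address(core_address, bus_address) <= 0x401F


if __name__ == "__main__":
    # CPU halted while reading JOYPAD1 ($4016); DMC DMA reads from $C017.
    print(hex(effective_register_address(0x4016, 0xC017)))
    # CPU halted while reading $2007: the DMA read cannot hit a 2A03 register.
    print(activates_2a03_register(0x2007, 0xC016))
```

Because bits 15-5 come from the halted 6502 core, a DMA read can only activate a 2A03 register when the core itself was reading from $4000-401F.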

The 2A07 fixes these extra read problems, but the mechanism is not yet understood. Experimentally, the CPU still performs a read on the halting cycle; if OAM DMA is done from the $4000 page, the open bus value used by the OAM DMA is the opcode of the instruction following the $4014 write. This means that 6502 core reads still occur when halted, at least on the first cycle. As on the NTSC 2A03, this DMA does not trigger 2A07 registers if the 6502 core is not reading from $4000-401F. Unlike on the NTSC 2A03, reading a joypad register when DMA reads an address that matches the other joypad register in its low 5 bits does not clock the other joypad.

Workarounds exist for this issue. Most commonly, joypads are read multiple times until the same result is seen twice in a row, reducing or eliminating the chance of accepting corrupted data. However, any collisions during this time may corrupt the DPCM data, and this strategy is not suitable for all affected registers. Alternatively, reads synchronized using OAM DMA can ensure a collision never happens, but this enforces strict timing constraints on the code and has numerous caveats, particularly for functions longer than one DMC output cycle (sample byte period).
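The repeated-read workaround has a simple control flow, sketched here in Python for readability. Real NES code would be written in 6502 assembly; `read_once` is a hypothetical callable standing in for a full strobe-and-read of one controller, and the `max_attempts` limit is an assumption added so the sketch always terminates.

```python
def read_until_stable(read_once, max_attempts=8):
    """Re-read the joypad until two consecutive reads agree.

    read_once: callable returning one complete button-state byte.
    max_attempts: upper bound on the number of reads (illustrative only).
    """
    previous = read_once()
    for _ in range(max_attempts - 1):
        current = read_once()
        if current == previous:
            return current        # same result twice in a row: accept it
        previous = current        # mismatch: keep the newer read and retry
    raise RuntimeError("joypad reads never stabilized")


if __name__ == "__main__":
    # Simulated reads: the first one was corrupted by a DMC DMA collision.
    samples = iter([0x81, 0x80, 0x80])
    print(hex(read_until_stable(lambda: next(samples))))
```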

Examples - Register conflicts
  • DMC DMA collides with $2007 read (3 extra reads)
(halted) (put) CPU reads from $2007      <- DMA halt cycle
(halted) (get) CPU reads from $2007      <- DMA dummy cycle
(halted) (put) CPU reads from $2007      <- DMA alignment cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from $2007      <- CPU resumes execution
  • DMC DMA collides with JOYPAD1 ($4016) read (1 or 3 extra reads)
(halted) (put) CPU reads from $4016      <- DMA halt cycle
(halted) (get) CPU reads from $4016      <- DMA dummy cycle
(halted) (put) CPU reads from $4016      <- DMA alignment cycle
(halted) (get) DMA reads from $C000
         (put) CPU reads from $4016      <- CPU resumes execution
  • DMC DMA collides with JOYPAD1 read (0 or 4 extra reads)
    • The combined address from 6502 core bits 15-5 and 2A03 bits 4-0 is $4016.
    • This triggers a bus conflict, corrupting the DMA read.
(halted) (put) CPU reads from $4016      <- DMA halt cycle
(halted) (get) CPU reads from $4016      <- DMA dummy cycle
(halted) (put) CPU reads from $4016      <- DMA alignment cycle
(halted) (get) DMA reads from $C016 and $4016
         (put) CPU reads from $4016      <- CPU resumes execution
  • DMC DMA collides with JOYPAD1 read (1 extra read of JOYPAD2, 1 or 3 extra of JOYPAD1)
    • The combined address from 6502 core bits 15-5 and 2A03 bits 4-0 is $4017.
    • This triggers a bus conflict, corrupting the DMA read.
(halted) (put) CPU reads from $4016      <- DMA halt cycle
(halted) (get) CPU reads from $4016      <- DMA dummy cycle
(halted) (put) CPU reads from $4016      <- DMA alignment cycle
(halted) (get) DMA reads from $C017 and $4017
         (put) CPU reads from $4016      <- CPU resumes execution
  • DMC DMA collides with JOYPAD1 read (1 or 3 extra reads)
    • The combined address from 6502 core bits 15-5 and 2A03 bits 4-0 is $4015.
    • This causes the 2A03 to read $4015 and ignore the DMA value on the external data bus.
(halted) (put) CPU reads from $4016      <- DMA halt cycle
(halted) (get) CPU reads from $4016      <- DMA dummy cycle
(halted) (put) CPU reads from $4016      <- DMA alignment cycle
(halted) (get) DMA reads from $C015, 2A03 reads from $4015 internally
         (put) CPU reads from $4016      <- CPU resumes execution

References

  1. Forum post: Fiskbit's manual DMA test suite
  2. Forum post: Fiskbit's explicit and implicit stop tests
  3. Forum post: lidnariq's RF Famicom joypad clocking explanation
  4. Forum post: Fiskbit's joypad read cycle breakdown
  5. Forum post: Fiskbit's APU register activation test