Consistent frame synchronization

From NESdev Wiki
Revision as of 03:57, 11 September 2014

Introduction

This page describes a method for consistently synchronizing with the PPU every frame from inside an NMI handler, without having to cycle-time everything. This method allows synchronization just as good as is possible with completely cycle-timed code. At startup, the code synchronizes precisely with the PPU, so it behaves the same every time it's run, whether the NES was powered up or reset. It's fully predictable.

Currently only the PAL version is covered, since the PAL PPU's frame timing is simpler. The NTSC version operates in a similar manner, and will be covered eventually.

PAL timing

The PAL NES has a master clock shared by the PPU and CPU. The CPU divides the master clock by 16 to get its instruction cycle clock, which we'll call cycle for simplicity. For example, a NOP instruction takes 2 cycles. The PPU divides the master clock by 5 to get its pixel clock, which we'll call pixel for simplicity. There are 16/5 = 3.2 pixels per cycle.

A video frame consists of 312 scanlines, each 341 pixels long. Unlike NTSC, there are no short frames, and rendering being enabled or disabled has no effect on frame length. Thus, every frame is exactly 312*341 = 106392 pixels = 33247.5 cycles long. We'll have pixel 0 refer to the first pixel of a frame, and pixel 106391 refer to the last pixel of a frame.

A frame begins with the vertical blanking interval (VBL), then the visible scanlines. The notation VBL+N refers to N cycles after the cycle that VBL began within, VBL+0. To talk about pixels since VBL, we simply refer to pixel P, where pixel 0 is the beginning of VBL, and pixel 106391 is the last pixel in the frame.
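The frame length can be checked with a few lines of arithmetic. This is only a Python sketch for verifying the numbers, not NES code, and the constant names are made up for illustration:

```python
from fractions import Fraction

# PAL clock dividers and frame geometry, as described in the text above.
MASTER_PER_CYCLE = 16       # master clocks per CPU cycle
MASTER_PER_PIXEL = 5        # master clocks per PPU pixel
SCANLINES = 312
PIXELS_PER_SCANLINE = 341

# Exact fractions avoid floating-point rounding in the half-cycle arithmetic.
PIXELS_PER_CYCLE = Fraction(MASTER_PER_CYCLE, MASTER_PER_PIXEL)
PIXELS_PER_FRAME = SCANLINES * PIXELS_PER_SCANLINE
CYCLES_PER_FRAME = PIXELS_PER_FRAME / PIXELS_PER_CYCLE

print(float(PIXELS_PER_CYCLE))   # pixels per CPU cycle
print(PIXELS_PER_FRAME)          # pixels per frame
print(float(CYCLES_PER_FRAME))   # CPU cycles per frame
```

The half cycle left over in the frame length is the source of every complication discussed in the rest of this page.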

Basic synchronization

If we're going to write at a particular pixel, we must first synchronize the CPU to the beginning of a frame, so that pixel 0 begins at the beginning of a cycle, and we know how many cycles ago that was. Reading $2002 gives the current status of the VBL flag in bit 7, then clears it. The VBL flag is set at pixel 0 of each frame, and cleared around when VBL ends. We can use the VBL flag to achieve synchronization.

A frame is 33247.5 cycles long. If we could somehow read $2002 every 33247.5 cycles, we'd read at the same point in each frame. But if we read $2002 every 33248 cycles, we'll be reading 0.5 cycles (1.6 pixels) later each successive frame. If we have a loop do this until it finds the VBL flag set, it will synchronize with the PPU. Each time through, it will read later in the frame, until it reads just as the VBL flag for the next frame is set.

        ; Fine synchronize
:       delay 33241
        bit $2002
        bpl :-
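{| class="tabular"
! Cycle !! PPU !! CPU
|-
| 0 || ||
|-
| 1 || ||
|-
| ... || ||
|-
| 33246 || || Read $2002 = 0
|-
| 33246.5 || ||
|-
| 33247 || ||
|-
| 33247.5 || Set VBL flag ||
|-
| ... || ||
|-
| 66494 || || Read $2002 = 0
|-
| 66494.5 || ||
|-
| 66495 || Set VBL flag ||
|-
| ... || ||
|-
| 99742 || || Read $2002 = 0
|-
| 99742.5 || Set VBL flag ||
|-
| ... || ||
|-
| 132990 || Set VBL flag || Read $2002 = $80
|}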

Looking at it relative to each frame, we more clearly see how the CPU effectively reads later by half a cycle each frame.

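{| class="tabular"
! Cycle !! Frame 1 !! Frame 2 !! Frame 3 !! Frame 4 !! Event
|-
| -1.5 || read || || || ||
|-
| -1.0 || || read || || ||
|-
| -0.5 || || || read ||
|-
| 0 || || || || read || VBL flag set
|}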
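The convergence of the fine loop can be reproduced with a small simulation. This is a sketch in Python, not NES code. It assumes a simplified model in which the flag is set at every multiple of the frame length and stays visible for roughly 7459 cycles (it is cleared near the end of VBL); the names are illustrative.

```python
from fractions import Fraction

CYCLES_PER_FRAME = Fraction(66495, 2)   # 33247.5 CPU cycles per PAL frame
LOOP_CYCLES = 33248                      # delay 33241 + bit $2002 + taken bpl
FLAG_WINDOW = 7459                       # assumed: cycles the flag stays set after VBL begins

def fine_sync(first_read):
    """Return (number of reads, cycle of the read that finds the flag set)."""
    t = Fraction(first_read)
    reads = 1
    # A read misses the flag while it lands late in a frame, before the next VBL.
    while t % CYCLES_PER_FRAME >= FLAG_WINDOW:
        t += LOOP_CYCLES
        reads += 1
    return reads, t

reads, final = fine_sync(33246)
print(reads, final)
```

Starting from a first read at cycle 33246, as in the tables, the loop needs four reads and ends on the cycle in which a frame begins.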

The loop must be started so that the first $2002 read is slightly before the end of the frame, otherwise it might start out reading well after the flag has been set. We can do this by starting with a simpler coarse synchronization loop.

sync_ppu:
        ; Coarse synchronize
        bit $2002
:       bit $2002
        bpl :-
        
        delay 33231
        jmp first
        
        ; Fine synchronize
:       delay 33241
first:  bit $2002
        bpl :-
 
        rts

The coarse synchronization loop might read $2002 just as the VBL flag was set, or read it nearly 7 cycles after it was set. Then, in the fine synchronization loop, $2002 is read 33240 to 33247 cycles later. In most cases, this will be slightly before the VBL flag is set, so the loop will delay and read $2002 again 33248 cycles later, etc.

Once done, the CPU will have executed two cycles after the final $2002 read that found the VBL flag just set.
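The intervals quoted above follow from the standard 6502 instruction timings. The following Python sketch adds them up; it assumes the branches do not cross a page boundary, and the dictionary keys are just labels chosen for this example.

```python
# Standard 6502 cycle counts for the instructions used in sync_ppu
# (assuming no page crossing on the branches).
CYCLES = {
    "bit abs": 4,        # bit $2002; the read happens on the last cycle
    "bpl taken": 3,
    "bpl not taken": 2,
    "jmp abs": 3,
}

# Time between successive $2002 reads in the coarse loop.
coarse_loop = CYCLES["bit abs"] + CYCLES["bpl taken"]

# From the coarse read that found the flag set to the first fine read:
# bpl falls through, delay 33231, jmp first, bit $2002.
coarse_to_fine = (CYCLES["bpl not taken"] + 33231
                  + CYCLES["jmp abs"] + CYCLES["bit abs"])

# Time between successive $2002 reads in the fine loop:
# delay 33241, bit $2002, bpl taken.
fine_loop = 33241 + CYCLES["bit abs"] + CYCLES["bpl taken"]

print(coarse_loop, coarse_to_fine, fine_loop)
```

Because the coarse read can be up to almost 7 cycles after the flag was set, the first fine read lands 33240 to 33247 cycles after VBL began, which is 7.5 to 0.5 cycles before the next frame.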

Writing to a particular pixel

In order to achieve some graphical effect, we want to write to the PPU at a particular pixel every frame. As an example, we'll write to $2006 at pixel 30400, which is near the upper-center of the screen. To simplify things, we'll not care what value we write. This requires that we write to $2006 at VBL+9500.

        ; Synchronize to PPU
        jsr sync_ppu
        
        ; Delay almost a full frame, so that the code below begins on
        ; a frame.
        delay 33238
        
vbl:    ; VBL begins in this cycle
        
        delay 9497
        sta $2006
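{| class="tabular"
! Pixel !! Cycle !! Event
|-
| 0 || 0 || VBL begins
|-
| || || delay 9497
|-
| || ... ||
|-
| || 9497 || STA $2006
|-
| || 9498 ||
|-
| || 9499 ||
|-
| 30400 || 9500 || $2006 write
|}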
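The delay value is derived from the target pixel. The Python sketch below is a worked check, not NES code; the function name is hypothetical. It converts a pixel to its cycle and subtracts the three cycles that STA spends before its write.

```python
from fractions import Fraction

PIXELS_PER_CYCLE = Fraction(16, 5)   # 3.2 pixels per CPU cycle on PAL
STA_ABS_CYCLES = 4                   # sta $2006 writes on its 4th (last) cycle

def delay_for_pixel(target_pixel):
    """Return (write cycle, delay before STA) for a write at target_pixel.

    Assumes pixel 0 starts together with cycle VBL+0 and that the delay
    starts at VBL+0, as in the listing above.
    """
    cycle = Fraction(target_pixel) / PIXELS_PER_CYCLE
    if cycle.denominator != 1:
        raise ValueError("pixel does not start on a cycle boundary")
    write_cycle = int(cycle)
    return write_cycle, write_cycle - (STA_ABS_CYCLES - 1)

print(delay_for_pixel(30400))
```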

If we try to make this write to the same pixel each frame, we run into a problem: the frame length isn't a whole number of cycles. We'll count frames and treat odd frames as being 33247 cycles long, and even frames 33248 cycles long, which will average to the correct 33247.5 cycles per frame.

        ; Synchronize to PPU
        jsr sync_ppu
         
        ; Delay almost a full frame, so that the code below begins on
        ; a frame.
        delay 33233
        
        ; We were on frame 1 after sync_ppu, but vbl will begin on frame 2
        lda #2
        sta frame_count
        
vbl:    ; VBL begins in this cycle
        
        delay 9497
        sta $2006
        
        delay 23731
        
        ; Delay extra cycle on even frames
        lda frame_count
        and #$01
        beq extra
extra:  inc frame_count

        jmp vbl

Now our write time doesn't drift, but it still doesn't write to the same pixel each frame. Since even frames begin in the middle of a cycle, our write is half a cycle/1.6 pixels earlier.

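{| class="tabular"
! Odd frame pixel !! Even frame pixel !! Cycle !! Event
|-
| 0 || || 0 || VBL begins
|-
| || || || delay 9497
|-
| || 0 || 0.5 ||
|-
| || || ... ||
|-
| || || 9497 || STA $2006
|-
| || || 9498 ||
|-
| || || 9499 ||
|-
| 30400 || 30398.4 || 9500 || $2006 write
|}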

Our write will thus fall on pixel 30400 on odd frames, and pixel 30398.4 on even frames. That's the best we can do, regardless of how we write our code, as this is a hardware limitation.
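The alternation can be confirmed by tracking where the `vbl` label falls relative to the true start of each frame. This Python sketch assumes frame 1 starts exactly at cycle 0 (the final read in sync_ppu) and that `vbl` is first reached at cycle 33247, which is what the instruction counts after that read add up to; the names are illustrative.

```python
from fractions import Fraction

PIXELS_PER_CYCLE = Fraction(16, 5)
CYCLES_PER_FRAME = Fraction(66495, 2)     # 33247.5

def write_pixels(frames):
    """Yield (frame number, pixel hit by the $2006 write) for frames 2, 3, ..."""
    vbl_label = 33247                      # cycle at which `vbl:` is first reached
    for frame in range(2, 2 + frames):
        frame_start = (frame - 1) * CYCLES_PER_FRAME
        write_cycle = vbl_label + 9500     # delay 9497, write on STA's 4th cycle
        yield frame, (write_cycle - frame_start) * PIXELS_PER_CYCLE
        # The loop takes 33248 cycles on even frames and 33247 on odd frames.
        vbl_label += 33248 if frame % 2 == 0 else 33247

for frame, pixel in write_pixels(4):
    print(frame, float(pixel))
```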

Another similar limitation is that when the NES is powered up or reset, the CPU and PPU master clock dividers start in random states, adding up to 1.6 additional pixels of variance. This offset doesn't change until the NES is powered off or reset.

Ideal NMI

Above, all the code had to be cycle-timed to ensure that each write occurred at the correct time. This isn't practical in most programs, which instead use NMI for synchronizing roughly to VBL. In these programs, timing-critical code is at the beginning of the NMI handler, followed by code that isn't carefully timed. Thus, such code relies on NMI occurring shortly after VBL, and not being delayed.

Ideally, NMI would begin a fixed number of cycles after VBL, without waiting for the current instruction to finish. If that were the case, we'd have it nearly as easy as before. Here, we'll imagine NMI always occurs at VBL+2. NMI takes 7 cycles to vector to our NMI handler, so that our NMI handler begins at VBL+9. To simplify the code and timing diagrams, we won't bother saving any registers as we'd normally do in an NMI handler.

nmi:    ; VBL+9
        delay 9488
        sta $2006         ; write at VBL+9500
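{| class="tabular"
! Odd frame pixel !! Even frame pixel !! Cycle !! Event
|-
| 0 || || 0 || VBL begins
|-
| || 0 || 0.5 ||
|-
| || || 1 ||
|-
| || || 2 || NMI vectored
|-
| || || 3 ||
|-
| || || 4 ||
|-
| || || 5 ||
|-
| || || 6 ||
|-
| || || 7 ||
|-
| || || 8 ||
|-
| || || 9 || delay 9488
|-
| || || ... ||
|-
| || || 9497 || STA $2006
|-
| || || 9498 ||
|-
| || || 9499 ||
|-
| 30400 || 30398.4 || 9500 || $2006 write
|}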

NMI delay

In reality, NMI waits until the current instruction completes before vectoring to the NMI handler, adding an extra delay as compared to the ideal NMI described above. Also, sometimes the NES powers up with the PPU and CPU dividers such that the NMI occurs an additional cycle later.

By ensuring that a short instruction is executing when VBL occurs, we can minimize the delay before NMI is vectored. For example, if we have a series of NOP instructions executing when VBL occurs, NMI will occur from 2 to 4 cycles after VBL. The table shows the three possible timings, with each column titled with the time NMI vectoring begins.

        nop
        nop
        nop
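{| class="tabular"
! Cycle !! VBL + 2 !! VBL + 3 !! VBL + 4 !! Event
|-
| -1 || || NOP || ||
|-
| 0 || NOP || || NOP || VBL begins
|-
| 1 || || NOP || ||
|-
| 2 || NMI vectored || || NOP ||
|-
| 3 || || NMI vectored || ||
|-
| 4 || || || NMI vectored ||
|}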

So, at best, we have 2 to 4 cycles of delay between VBL and our NMI handler.

Using a long sequence of NOP instructions isn't practical, because it requires either a large number of NOP instructions, or that we know how long the code before them takes so that we can delay entry into the NOP sequence until NMI is about to occur. If we instead have a simple infinite loop made of a single JMP instruction, we only increase the maximum delay by one cycle, to 5.

 loop:   jmp loop
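{| class="tabular"
! Cycle !! VBL + 2 !! VBL + 3 !! VBL + 4 !! VBL + 5 !! Event
|-
| -1 || JMP || || || JMP ||
|-
| 0 || || JMP || || || VBL begins
|-
| 1 || || || JMP || ||
|-
| 2 || NMI vectored || || || JMP ||
|-
| 3 || || NMI vectored || || ||
|-
| 4 || || || NMI vectored || ||
|-
| 5 || || || || NMI vectored ||
|}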
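Both tables can be reproduced with a very small model. This Python sketch is a simplification inferred from the tables rather than a description of the 6502's actual interrupt logic. It assumes back-to-back instructions of equal length, with vectoring starting at the first instruction boundary at least 2 cycles after VBL, or at least 3 cycles after VBL for the power-up alignment in which NMI is seen one cycle later.

```python
def nmi_latencies(instr_cycles, threshold):
    """Cycles (relative to VBL) at which NMI vectoring can begin.

    instr_cycles: length of the instruction being repeated (2 for NOP, 3 for JMP)
    threshold:    earliest boundary at which the NMI is taken (assumed 2 or 3)
    """
    result = set()
    for phase in range(instr_cycles):    # where a boundary falls relative to VBL
        boundary = phase
        while boundary < threshold:
            boundary += instr_cycles     # the current instruction must finish first
        result.add(boundary)
    return result

def all_latencies(instr_cycles):
    """Union over both assumed CPU/PPU divider alignments."""
    return sorted(nmi_latencies(instr_cycles, 2) | nmi_latencies(instr_cycles, 3))

print(all_latencies(2))   # NOP sequence
print(all_latencies(3))   # JMP loop
```

Under this model, longer instructions widen the range further, which is why the wait loop should consist of the shortest instructions practical.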

Compensating for NMI delay

With a JMP loop to wait for NMI, we have 2 to 5 cycles of delay between VBL and the start of NMI vectoring. We want to compensate for this delay D by delaying an additional 5-D cycles at the start of the handler. Here, we have the NOP always begin at VBL+12. We can't actually write this, since the handler has no direct way of knowing D, but it shows the effect we need to achieve.

nmi:    delay 5-D
        nop
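{| class="tabular"
! Cycle !! VBL + 2 !! VBL + 3 !! VBL + 4 !! VBL + 5 !! Event
|-
| -1 || JMP || || || JMP ||
|-
| 0 || || JMP || || || VBL begins
|-
| 1 || || || JMP || ||
|-
| 2 || NMI vectored || || || JMP ||
|-
| 3 || || NMI vectored || || ||
|-
| 4 || || || NMI vectored || ||
|-
| 5 || || || || NMI vectored ||
|-
| 6 || || || || ||
|-
| 7 || || || || ||
|-
| 8 || || || || ||
|-
| 9 || delay 3 || || || ||
|-
| 10 || || delay 2 || || ||
|-
| 11 || || || delay 1 || ||
|-
| 12 || NOP || NOP || NOP || NOP (no delay) ||
|}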

We just have to find out how to determine the number of cycles of delay to add.

Sprite DMA always ends on even cycle

When sprite DMA ($4014) is written to, the next instruction always begins on an odd cycle. If the $4014 write is on an odd cycle, it pauses the CPU for an additional 513 cycles, otherwise 514 cycles. We can use this aspect to partially compensate for NMI's variable delay.

nmi:    lda #$07          ; sprites at $700
        sta $4014
        nop
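{| class="tabular"
! Cycle !! VBL + 2 !! VBL + 3 !! VBL + 4 !! VBL + 5
|-
| 0 || colspan="4" | VBL begins
|-
| 1 || || || ||
|-
| 2 || NMI || || ||
|-
| 3 || || NMI || ||
|-
| 4 || || || NMI ||
|-
| 5 || || || || NMI
|-
| 6 || || || ||
|-
| 7 || || || ||
|-
| 8 || || || ||
|-
| 9 || LDA #$07 || || ||
|-
| 10 || || LDA #$07 || ||
|-
| 11 || STA $4014 || || LDA #$07 ||
|-
| 12 || || STA $4014 || || LDA #$07
|-
| 13 || || || STA $4014 ||
|-
| 14 || $4014 write || || || STA $4014
|-
| 15 || 514-cycle DMA || $4014 write || ||
|-
| 16 || || 513-cycle DMA || $4014 write ||
|-
| 17 || || || 514-cycle DMA || $4014 write
|-
| 18 || || || || 513-cycle DMA
|-
| ... || || || ||
|-
| 527 || || || ||
|-
| 528 || DMA finishes || DMA finishes || ||
|-
| 529 || NOP || NOP || ||
|-
| 530 || || || DMA finishes || DMA finishes
|-
| 531 || || || NOP || NOP
|}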

This reduces the number of different delays from four to two. The NOP always executes at either VBL+529 or VBL+531. This is an improvement. We just need a way to determine which time DMA finished at, and delay two extra cycles if it was the earlier one.
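The collapse from four timings to two can be checked directly. This Python sketch uses the timings stated above (7 cycles of NMI vectoring, a 2-cycle LDA, a 4-cycle STA that writes on its last cycle, and a 513- or 514-cycle DMA); the function name is illustrative.

```python
def cycle_after_dma(nmi_cycle):
    """First cycle of the instruction following `sta $4014`, relative to VBL."""
    handler = nmi_cycle + 7          # NMI takes 7 cycles to reach the handler
    sta = handler + 2                # lda #$07
    write = sta + 3                  # sta $4014 writes on its 4th cycle
    dma = 513 if write % 2 else 514  # odd write cycle: 513, even: 514
    return write + dma + 1

for nmi in range(2, 6):
    print(nmi, cycle_after_dma(nmi))
```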

VBL flag cleared at end of VBL

The VBL flag is cleared near the end of VBL. If we read $2002 around the time the flag is cleared, we can determine whether the read occurred before or after the flag was cleared. We will have to avoid reading $2002 elsewhere in the NMI handler, since reading $2002 clears the flag.

The VBL flag is cleared around pixel 23869, sometimes one less, so we want to read $2002 at VBL+7458 or VBL+7460. It works out nicely that sprite DMA leaves two cycles between the possible ending times, as this ensures that our $2002 read is several pixels before or after when the flag is cleared, giving us a good margin for error. If we find the flag set, we know we are on the earlier of the two DMA ending times, so we delay an extra two cycles.

nmi:    lda #$07          ; sprites at $700
        sta $4014
        delay 6926
        bit $2002         ; read at VBL+7458 or VBL+7460
        bpl skip
        bit 0
skip:   nop
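{| class="tabular"
! Cycle !! VBL + 2 !! VBL + 3 !! VBL + 4 !! VBL + 5
|-
| 0 || colspan="4" | VBL begins
|-
| 1 || || || ||
|-
| 2 || NMI || || ||
|-
| 3 || || NMI || ||
|-
| 4 || || || NMI ||
|-
| 5 || || || || NMI
|-
| 6 || || || ||
|-
| 7 || || || ||
|-
| 8 || || || ||
|-
| 9 || LDA #$07 || || ||
|-
| 10 || || LDA #$07 || ||
|-
| 11 || STA $4014 || || LDA #$07 ||
|-
| 12 || || STA $4014 || || LDA #$07
|-
| 13 || || || STA $4014 ||
|-
| 14 || $4014 write || || || STA $4014
|-
| 15 || 514-cycle DMA || $4014 write || ||
|-
| 16 || || 513-cycle DMA || $4014 write ||
|-
| 17 || || || 514-cycle DMA || $4014 write
|-
| 18 || || || || 513-cycle DMA
|-
| ... || || || ||
|-
| 527 || || || ||
|-
| 528 || DMA finishes || DMA finishes || ||
|-
| 529 || delay 6926 || delay 6926 || ||
|-
| 530 || || || DMA finishes || DMA finishes
|-
| 531 || || || delay 6926 || delay 6926
|-
| ... || || || ||
|-
| 7455 || BIT $2002 || BIT $2002 || ||
|-
| 7456 || || || ||
|-
| 7457 || || || BIT $2002 || BIT $2002
|-
| 7458 || $2002 read = $80 || $2002 read = $80 || ||
|-
| 7459 || BPL not taken || BPL not taken || VBL cleared || VBL cleared
|-
| 7460 || || || $2002 read = 0 || $2002 read = 0
|-
| 7461 || BIT 0 || BIT 0 || BPL taken || BPL taken
|-
| 7462 || || || ||
|-
| 7463 || || || ||
|-
| 7464 || NOP || NOP || NOP || NOP
|}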

This achieves our goal, but not in all cases.
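For the even-cycle case covered so far, the equalizing effect of the $2002 test can be verified as follows. This is a Python sketch of the cycle counting only; the point at which the flag is cleared is taken from the pixel figure quoted above, and the names are illustrative.

```python
from fractions import Fraction

# Pixel 23869 expressed in CPU cycles after VBL (about 7459.06).
FLAG_CLEARED = Fraction(23869 * 5, 16)

def nop_cycle(after_dma):
    """Return ($2002 read cycle, cycle at which the final NOP begins).

    after_dma is the first cycle after sprite DMA, 529 or 531 in the table.
    """
    bit = after_dma + 6926           # delay 6926
    read = bit + 3                   # bit $2002 reads on its 4th cycle
    if read < FLAG_CLEARED:
        # Flag still set: bpl not taken (2 cycles), then bit 0 (3 cycles).
        nop = read + 1 + 2 + 3
    else:
        # Flag already cleared: bpl taken (3 cycles) skips bit 0.
        nop = read + 1 + 3
    return read, nop

print(nop_cycle(529), nop_cycle(531))
```

Both DMA ending times lead to the NOP starting on the same cycle, because the untaken branch plus BIT 0 costs exactly two cycles more than the taken branch.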

VBL begins on odd cycles

Unfortunately, VBL doesn't always begin during an even cycle, as we've so far assumed. When VBL begins during an odd cycle, our code doesn't work so well:

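{| class="tabular"
! Cycle !! VBL + 2 !! VBL + 3 !! VBL + 4 !! VBL + 5
|-
| 1 || colspan="4" | VBL begins
|-
| 2 || || || ||
|-
| 3 || NMI || || ||
|-
| 4 || || NMI || ||
|-
| 5 || || || NMI ||
|-
| 6 || || || || NMI
|-
| 7 || || || ||
|-
| 8 || || || ||
|-
| 9 || || || ||
|-
| 10 || LDA #$07 || || ||
|-
| 11 || || LDA #$07 || ||
|-
| 12 || STA $4014 || || LDA #$07 ||
|-
| 13 || || STA $4014 || || LDA #$07
|-
| 14 || || || STA $4014 ||
|-
| 15 || $4014 write || || || STA $4014
|-
| 16 || 513-cycle DMA || $4014 write || ||
|-
| 17 || || 514-cycle DMA || $4014 write ||
|-
| 18 || || || 513-cycle DMA || $4014 write
|-
| 19 || || || || 514-cycle DMA
|-
| ... || || || ||
|-
| 527 || || || ||
|-
| 528 || DMA finishes || || ||
|-
| 529 || || || ||
|-
| 530 || || DMA finishes || DMA finishes ||
|-
| 531 || || || ||
|-
| 532 || || || || DMA finishes
|}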

Now DMA ends at three different times, covering a wider range than the original NMI times did, thus making things worse!

We need to keep track of when VBL begins during an odd cycle, and compensate before we begin DMA. After our PPU synchronization routine finishes, the last $2002 read it makes will have just found the VBL flag set. In the following table, that is cycle 0.

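{| class="tabular"
! Pixel !! Cycle !! Frame
|-
| 0 || 0 || 1
|-
| 106392 || 33247.5 || 2
|-
| 212784 || 66495 || 3
|-
| 319176 || 99742.5 || 4
|-
| 425568 || 132990 || 5
|-
| 531960 || 166237.5 || 6
|-
| 638352 || 199485 || 7
|-
| 744744 || 232732.5 || 8
|}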

Looking at which cycle each frame begins on, we see they follow a four-frame pattern: even, odd, odd, even. So we'll just have a variable that starts out at 1 and increments every frame, then examine bit 1 and delay an extra cycle if it's clear. This extra code takes 8 cycles on frames where VBL begins during an even cycle, and 7 cycles otherwise.
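The link between the four-frame pattern and bit 1 of the counter can be checked numerically. In this Python sketch, the counter value is assumed to equal the frame number (1 for the frame in which synchronization completed); the function name is illustrative.

```python
from fractions import Fraction

CYCLES_PER_FRAME = Fraction(66495, 2)   # 33247.5

def vbl_cycle(frame):
    """CPU cycle during which the given frame begins (frame 1 begins at cycle 0)."""
    return int((frame - 1) * CYCLES_PER_FRAME)   # whole cycle containing the start

for frame in range(1, 9):
    cycle = vbl_cycle(frame)
    parity = "odd" if cycle % 2 else "even"
    bit1 = (frame >> 1) & 1              # bit 1 of the frame counter
    print(frame, cycle, parity, bit1)
```

Bit 1 of the counter is set exactly on the frames whose VBL begins during an odd cycle, so testing it with AND #$02 is sufficient.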

But we also need to insert a complementary delay after DMA, before the $2002 read, since on frames where VBL begins during an odd cycle we'll need to read $2002 one cycle later after DMA than for even frames.

nmi:    lda frame_count
        and #$02
        beq even
even:   lda #$07          ; sprites at $700
        sta $4014
        delay 6911
        lda frame_count
        and #$02
        bne odd
odd:    bit $2002
        bpl skip
        bit 0
skip:   inc frame_count
        delay 2028
        sta $2006
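{| class="tabular"
! rowspan="2" | Cycle !! colspan="4" | Frames 1, 4, 5, 8 ... !! colspan="4" | Frames 2, 3, 6, 7 ...
|-
! VBL + 2 !! VBL + 3 !! VBL + 4 !! VBL + 5 !! VBL + 2 !! VBL + 3 !! VBL + 4 !! VBL + 5
|-
| 0 || VBL || VBL || VBL || VBL || || || ||
|-
| 1 || || || || || VBL || VBL || VBL || VBL
|-
| 2 || NMI || || || || || || ||
|-
| 3 || || NMI || || || NMI || || ||
|-
| 4 || || || NMI || || || NMI || ||
|-
| 5 || || || || NMI || || || NMI ||
|-
| 6 || || || || || || || || NMI
|-
| 7 || || || || || || || ||
|-
| 8 || || || || || || || ||
|-
| 9 || LDA frame_count || || || || || || ||
|-
| 10 || || LDA frame_count || || || LDA frame_count || || ||
|-
| 11 || || || LDA frame_count || || || LDA frame_count || ||
|-
| 12 || AND #$02 || || || LDA frame_count || || || LDA frame_count ||
|-
| 13 || || AND #$02 || || || AND #$02 || || || LDA frame_count
|-
| 14 || BEQ taken || || AND #$02 || || || AND #$02 || ||
|-
| 15 || || BEQ taken || || AND #$02 || BEQ not taken || || AND #$02 ||
|-
| 16 || || || BEQ taken || || || BEQ not taken || || AND #$02
|-
| 17 || LDA #$07 || || || BEQ taken || LDA #$07 || || BEQ not taken ||
|-
| 18 || || LDA #$07 || || || || LDA #$07 || || BEQ not taken
|-
| 19 || STA $4014 || || LDA #$07 || || STA $4014 || || LDA #$07 ||
|-
| 20 || || STA $4014 || || LDA #$07 || || STA $4014 || || LDA #$07
|-
| 21 || || || STA $4014 || || || || STA $4014 ||
|-
| 22 || $4014 write || || || STA $4014 || $4014 write || || || STA $4014
|-
| 23 || 514-cycle DMA || $4014 write || || || 514-cycle DMA || $4014 write || ||
|-
| 24 || || 513-cycle DMA || $4014 write || || || 513-cycle DMA || $4014 write ||
|-
| 25 || || || 514-cycle DMA || $4014 write || || || 514-cycle DMA || $4014 write
|-
| 26 || || || || 513-cycle DMA || || || || 513-cycle DMA
|-
| ... || || || || || || || ||
|-
| 535 || || || || || || || ||
|-
| 536 || DMA finishes || DMA finishes || || || DMA finishes || DMA finishes || ||
|-
| 537 || delay 6911 || delay 6911 || || || delay 6911 || delay 6911 || ||
|-
| 538 || || || DMA finishes || DMA finishes || || || DMA finishes || DMA finishes
|-
| 539 || || || delay 6911 || delay 6911 || || || delay 6911 || delay 6911
|-
| ... || || || || || || || ||
|-
| 7448 || LDA frame_count || LDA frame_count || || || LDA frame_count || LDA frame_count || ||
|-
| 7449 || || || || || || || ||
|-
| 7450 || || || LDA frame_count || LDA frame_count || || || LDA frame_count || LDA frame_count
|-
| 7451 || AND #$02 || AND #$02 || || || AND #$02 || AND #$02 || ||
|-
| 7452 || || || || || || || ||
|-
| 7453 || BNE not taken || BNE not taken || AND #$02 || AND #$02 || BNE taken || BNE taken || AND #$02 || AND #$02
|-
| 7454 || || || || || || || ||
|-
| 7455 || BIT $2002 || BIT $2002 || BNE not taken || BNE not taken || || || BNE taken || BNE taken
|-
| 7456 || || || || || BIT $2002 || BIT $2002 || ||
|-
| 7457 || || || BIT $2002 || BIT $2002 || || || ||
|-
| 7458 || $2002 read = $80 || $2002 read = $80 || || || || || BIT $2002 || BIT $2002
|-
| 7459 || BPL not taken || BPL not taken || VBL cleared || VBL cleared || $2002 read = $80 || $2002 read = $80 || ||
|-
| 7460 || || || $2002 read = 0 || $2002 read = 0 || BPL not taken || BPL not taken || VBL cleared || VBL cleared
|-
| 7461 || BIT 0 || BIT 0 || BPL taken || BPL taken || || || $2002 read = 0 || $2002 read = 0
|-
| 7462 || || || || || BIT 0 || BIT 0 || BPL taken || BPL taken
|-
| 7463 || || || || || || || ||
|-
| 7464 || INC frame_count || INC frame_count || INC frame_count || INC frame_count || || || ||
|-
| 7465 || || || || || INC frame_count || INC frame_count || INC frame_count || INC frame_count
|-
| 7466 || || || || || || || ||
|-
| 7467 || || || || || || || ||
|-
| 7468 || || || || || || || ||
|-
| 7469 || delay 2028 || delay 2028 || delay 2028 || delay 2028 || || || ||
|-
| 7470 || || || || || delay 2028 || delay 2028 || delay 2028 || delay 2028
|-
| ... || || || || || || || ||
|-
| 9497 || STA $2006 || STA $2006 || STA $2006 || STA $2006 || || || ||
|-
| 9498 || || || || || STA $2006 || STA $2006 || STA $2006 || STA $2006
|-
| 9499 || || || || || || || ||
|-
| 9500 || $2006 write at VBL+9500 || $2006 write at VBL+9500 || $2006 write at VBL+9500 || $2006 write at VBL+9500 || || || ||
|-
| 9501 || || || || || $2006 write at VBL+9500 || $2006 write at VBL+9500 || $2006 write at VBL+9500 || $2006 write at VBL+9500
|}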

The $2006 write is done at VBL+9500 in all cases. Remember that the right four columns have VBL beginning on cycle 1 (an odd cycle), which is why the final writes appear to be one cycle later than the others.

Synchronizing with even CPU cycle

Our final synchronization method relies on knowing whether a given frame begins on an even or odd CPU cycle, so the PPU synchronization routine's final $2002 read must itself fall on an even cycle. Because the fine synchronization loop takes an even number of cycles (33248), we only need to ensure that the first $2002 read in the loop is on an even cycle. We can do this by initiating sprite DMA before the fine synchronization loop.

sync_ppu:
        ; Coarse synchronize
        bit $2002
:       bit $2002
        bpl :-
        
        sta $4014
        delay 32713
        jmp first
        
        ; Fine synchronize
:       delay 33241
first:  bit $2002
        bpl :-
        
        ; NMI won't be fired until frame 2
        lda #2
        sta frame_count
        
        rts

The STA $4014 takes up to 518 cycles (4 for the store plus 513 or 514 for the DMA), so we subtract that from the initial delay. After the STA $4014, the delay begins on an odd cycle. Since it is also an odd number of cycles from there until the $2002 read, the read occurs on an even cycle, as desired.
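
The parity argument can be checked with a few lines of arithmetic. The Python sketch below is only a model, not NES code. It assumes the usual 6502 timings (JMP absolute takes 3 cycles, BIT absolute takes 4 with the read on its last cycle) and takes the statement that the delay begins on an odd cycle as given; the names are made up for illustration.

```python
# Model of the cycle parity of the first fine-sync $2002 read.
# Assumption (from the text): the delay after STA $4014 begins on an odd cycle.
DELAY = 32713               # delay 32713
JMP_ABS = 3                 # jmp first
BIT_ABS_BEFORE_READ = 3     # bit $2002 is 4 cycles; the read is its last cycle
FINE_LOOP = 33241 + 4 + 3   # delay 33241 + bit $2002 + taken bpl

def first_read_cycle(delay_start):
    """Cycle number on which the first fine-sync $2002 read happens."""
    return delay_start + DELAY + JMP_ABS + BIT_ABS_BEFORE_READ

delay_start = 1  # any odd cycle number works
read = first_read_cycle(delay_start)
print(read % 2)                  # 0: the first read is on an even cycle
print(FINE_LOOP, FINE_LOOP % 2)  # 33248 0: later reads keep that parity
```

Because the loop period is even, every later $2002 read has the same parity as the first one.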

Simpler synchronization routine

The PPU synchronization routine is pretty short, but it requires use of the delay macro, which takes a fair amount of code to implement. It's possible to eliminate that without any negative impact.

The fine synchronization loop needs to read $2002 every 33248 cycles, so that it can detect the VBL flag being set just before the read. This seems to require a long delay between reads, because until the final iteration the loop must not find the VBL flag set. If it were like the coarse loop and read the VBL flag every 7 cycles, it would clearly stop somewhere near the beginning of the first frame, but rarely right at the beginning: it might read $2002 one cycle before the VBL flag is set, loop, then read it 7 cycles later and find it now set. This isn't what we want. If we read it twice as often, every 33248/2 = 16624 cycles, it would still work, since the VBL flag is automatically cleared near the end of VBL.

sync_ppu:
        ; Coarse synchronize
        bit $2002
:       bit $2002
        bpl :-
        
        sta $4014
        delay 16089
        jmp first
        
        ; Fine synchronize
:       delay 16617
first:  bit $2002
        bpl :-

        rts

Cycle       PPU                 CPU
0           Set VBL flag
7459        Clear VBL flag
16622                           Read $2002 = 0
33246                           Read $2002 = 0
33247.5     Set VBL flag
40706.5     Clear VBL flag
49870                           Read $2002 = 0
66494                           Read $2002 = 0
66495       Set VBL flag
73954       Clear VBL flag
83118                           Read $2002 = 0
99742                           Read $2002 = 0
99742.5     Set VBL flag
107201.5    Clear VBL flag
116366                          Read $2002 = 0
132990      Set VBL flag        Read $2002 = $80
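
The timeline above can be reproduced with a small model. This Python sketch is an illustration rather than part of the routine: it works in half-cycle units so the 33247.5-cycle frame is an integer, takes the first read's phase (cycle 16622) and the 7459-cycle flag duration from the table, and assumes that a read on the same cycle the flag is set sees it set, as the last table row shows.

```python
# Model of the half-period fine sync loop, in half-cycle units.
FRAME = 66495            # 33247.5 CPU cycles per frame
VBL_LENGTH = 2 * 7459    # flag is set at frame start, cleared 7459 cycles later
LOOP = 2 * 16624         # one $2002 read every 16624 cycles
FIRST_READ = 2 * 16622   # phase of the first read, taken from the table

def first_read_with_flag(first_read=FIRST_READ, max_reads=1000):
    """Return the CPU cycle of the first read that sees the VBL flag set."""
    t = first_read
    for _ in range(max_reads):
        if t % FRAME < VBL_LENGTH:   # read lands while the flag is set
            return t / 2
        t += LOOP
    return None

print(first_read_with_flag())  # 132990.0
```

Every earlier read lands either mid-frame or just before the flag is set, both well after the PPU has cleared it, so the loop only exits on a read that coincides with the flag being set.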

That works, but reducing the delays doesn't eliminate the need for them. The important thing is that the $2002 read can only find the VBL flag set just after it was set, rather than many cycles after. Rather than rely on the PPU to clear the VBL flag, we can clear it ourselves by reading $2002. 16 is a factor of 33248, so we can have the loop take only 16 cycles and still synchronize properly.

sync_ppu:
        ; Coarse synchronize
        bit $2002
:       bit $2002
        bpl :-
        
        sta $4014
        bit <0
        
        ; Fine synchronize
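:       bit <0          ; 3 cycles of padding
        nop             ; 2 cycles of padding
        bit $2002       ; 4 cycles; dummy read clears the VBL flag
        bit $2002       ; 4 cycles; sees the flag only if set after the dummy read
        bpl :-          ; 3 cycles when taken, so the loop is 16 cycles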

        rts

Cycle       PPU                 CPU
0           Set VBL flag
10                              Dummy read $2002 = $80
14                              Read $2002 = 0
26                              Dummy read $2002 = 0
30                              Read $2002 = 0
...
33242                           Dummy read $2002 = 0
33246                           Read $2002 = 0
33247.5     Set VBL flag
33258                           Dummy read $2002 = $80
33262                           Read $2002 = 0
...
66490                           Dummy read $2002 = 0
66494                           Read $2002 = 0
66495       Set VBL flag
66506                           Dummy read $2002 = $80
66510                           Read $2002 = 0
...
99738                           Dummy read $2002 = 0
99742                           Read $2002 = 0
99742.5     Set VBL flag
99754                           Dummy read $2002 = $80
99758                           Read $2002 = 0
...
132986                          Dummy read $2002 = 0
132990      Set VBL flag        Read $2002 = $80

Essentially, the second $2002 read in the loop watches a four-cycle window for the VBL flag being set. On entry to the loop, we ensure that the flag will never be set within this window. Every 33248/16 = 2078 iterations, the second $2002 read is half a cycle later in the frame, just like the original version. On all other iterations, the dummy $2002 read four cycles earlier has ensured that the VBL flag is cleared.
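
The four-cycle window can also be modelled directly. The Python sketch below is a rough check, not a cycle-exact emulation: it works in half-cycle units, uses the standard 6502 cycle counts for the loop's instructions, takes the phase of the first real read (cycle 14) from the table, and assumes that a read on the same cycle the flag is set sees it set. The function name and defaults are hypothetical.

```python
# Model of the 16-cycle fine sync loop, in half-cycle units.
FRAME = 66495                      # 33247.5 CPU cycles per frame
LOOP_CYCLES = 3 + 2 + 4 + 4 + 3    # bit <0, nop, bit $2002, bit $2002, taken bpl
WINDOW = 2 * 4                     # dummy read is 4 cycles before the real read
FIRST_REAL_READ = 2 * 14           # phase of the first real read, from the table

def sync(first_real_read=FIRST_REAL_READ, max_iterations=100_000):
    """Return (iterations, cycle) for the first real read that sees the flag."""
    t = first_real_read
    for i in range(1, max_iterations + 1):
        # Half-cycles since the flag was last set. The dummy read cleared
        # anything set earlier than WINDOW ago, so only a recent set counts.
        since_set = t % FRAME
        if since_set < WINDOW:
            return i, t / 2
        t += 2 * LOOP_CYCLES
    return None

print(LOOP_CYCLES, 33248 % LOOP_CYCLES, 33248 // LOOP_CYCLES)  # 16 0 2078
print(sync())  # (8312, 132990.0)
```

With this starting phase the loop exits after 4 × 2078 = 8312 iterations, on the same cycle as the long-delay versions.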