Subject: branch prediction analysis

Continuing with the loop example from earlier, we explore how branch prediction impacts performance. Modern CPUs decode multiple instructions per cycle, but jumps like jb (jump if below) introduce uncertainty: the CPU must guess whether the branch will be taken based on conditional flags. A misprediction forces the frontend to flush decoded instructions and re-fetch the correct path - a costly penalty.

To test branch prediction efficiency, we modified the loop to include a conditional jump:

          
asm_branch_pred:  
  xor rax, rax  
.loop:  
  mov r10, [rdx+rax]  
  inc rax  
  test r10, 1  
  jnz .skip  
    nop  
.skip:  
  cmp rax, rcx  
  jb .loop  
  ret

(Source: github.com/cmuratori)

The loop tests the least significant bit of r10 register value: if set, it skips a nop. No computational work occurs in the branch-only frontend decoding overhead. We tested buffers with patterns like all zeros (never taken), all ones (always taken), periodic intervals (e.g., every 2nd, 4th, or 8th iteration), and random values (BCrypt-generated).

Results:

Never Taken:
Total time: 562.4655 ms
Min/Max: 1.817 GB/s and 1.394 GB/s.

Always Taken:
Total time: 786.6097 ms
Min/Max: 1.274 GB/s and 1.113 GB/s.

Every 2nd Taken:
Total time: 639.6436 ms
Min/Max: 1.650 GB/s and 1.086 GB/s.

Every 4th/8th Taken:
Total time: ~506–509 ms
Min/Max: ~1.98–2.0 GB/s.

BCrypt Random:
Total time: 4,905.2049 ms
Min/Max: 0.229 GB/s and 0.204 GB/s.

The results show that the best performance occurs when branches are never taken or taken every 4th/8th iteration (surprisingly consistent). Intel's optimization manual recommends structuring code to favor never-taken conditions, which aligns with our findings. As expected, random branch patterns perform poorly, causing significant slowdowns due to frequent mispredictions.

To quantify the penalty, we calculate:
0.229 * 1024 * 1024 * 1024 = 245886877.696, 2.8 * 1000 * 1000 * 1000 = 2800000000, 2800000000 / 245886877.696 = 11.38735025730716088974 or 11.4 cycles/byte.

This implies a ~11-cycle penalty per misprediction, aligning with Agner Fog's analysis of Broadwell CPUs: "The branch misprediction penalty varies a lot. It was measured to 15–20 clock cycles." While our result (~11 cycles) is slightly lower, it reflects pipeline overlap and matches the expected order of magnitude.

In progress

Multithreading.