Subject: branch prediction analysis
Continuing with the loop example from earlier, we explore how branch prediction impacts performance. Modern CPUs decode multiple instructions per cycle, but jumps like jb (jump if below) introduce uncertainty: the CPU must guess whether the branch will be taken based on conditional flags. A misprediction forces the frontend to flush decoded instructions and re-fetch the correct path - a costly penalty.
To test branch prediction efficiency, we modified the loop to include a conditional jump:
asm_branch_pred:
xor rax, rax
.loop:
mov r10, [rdx+rax]
inc rax
test r10, 1
jnz .skip
nop
.skip:
cmp rax, rcx
jb .loop
ret
(Source: github.com/cmuratori)
The loop tests the least significant bit of r10 register value: if set, it skips a nop. No computational work occurs in the branch-only frontend decoding overhead. We tested buffers with patterns like all zeros (never taken), all ones (always taken), periodic intervals (e.g., every 2nd, 4th, or 8th iteration), and random values (BCrypt-generated).
Results:Never Taken:
Total time: 562.4655 ms
Min/Max: 1.817 GB/s and 1.394 GB/s.
Always Taken:
Total time: 786.6097 ms
Min/Max: 1.274 GB/s and 1.113 GB/s.
Every 2nd Taken:
Total time: 639.6436 ms
Min/Max: 1.650 GB/s and 1.086 GB/s.
Every 4th/8th Taken:
Total time: ~506–509 ms
Min/Max: ~1.98–2.0 GB/s.
BCrypt Random:
Total time: 4,905.2049 ms
Min/Max: 0.229 GB/s and 0.204 GB/s.
The results show that the best performance occurs when branches are never taken or taken every 4th/8th iteration (surprisingly consistent). Intel's optimization manual recommends structuring code to favor never-taken conditions, which aligns with our findings. As expected, random branch patterns perform poorly, causing significant slowdowns due to frequent mispredictions.
To quantify the penalty, we calculate:
0.229 * 1024 * 1024 * 1024 = 245886877.696, 2.8 * 1000 * 1000 * 1000 = 2800000000, 2800000000 / 245886877.696 = 11.38735025730716088974
or 11.4 cycles/byte.
This implies a ~11-cycle penalty per misprediction, aligning with Agner Fog's analysis of Broadwell CPUs: "The branch misprediction penalty varies a lot. It was measured to 15–20 clock cycles." While our result (~11 cycles) is slightly lower, it reflects pipeline overlap and matches the expected order of magnitude.
- Overview
- Profiling the game code;
- File I/O and page faults;
- Measuring memory bandwidth;
- Instruction decoding;
- Testing branch prediction;
- Execution ports and schedulers;
- Cache sizes and bandwidth;
- Introducing SIMD;
- Multithreading.
In progress