Subject: memory write bandwidth
In addition to reading files, the engine must write to memory (specifically, to a buffer passed to the OS for display). To test write performance, we measured memory bandwidth using a method similar to the file I/O benchmarks. Instead of reading a file, we allocated 1 GB of memory and wrote to it one byte at a time. The code:
for (u64 index = 0; index < allocated_size; index++) {
    source[index] = (u8)index;
}
Results:
Total time: 436.9061 ms
Min: 1,024 MB at 2.356 GB/s (0 page faults)
Max: 1,024 MB at 1.555 GB/s (262,655 page faults)
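For readers who want to reproduce a measurement like this, here is a minimal sketch of a single timed pass. It is not the harness used for the numbers above. The function names, the use of clock() from the C standard library, and the volatile qualifier (added so the compiler keeps one store per byte) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef uint8_t  u8;
typedef uint64_t u64;

/* Byte-at-a-time write, same shape as the loop in the text.
   `volatile` is an assumption of this sketch: it stops the compiler
   from merging or vectorizing the stores. */
void fill_bytes(volatile u8 *destination, u64 size) {
    for (u64 index = 0; index < size; index++) {
        destination[index] = (u8)index;
    }
}

/* Converts bytes and elapsed seconds to GB/s, with 1 GB = 1024^3 bytes. */
double bandwidth_gb_per_s(u64 bytes, double seconds) {
    assert(seconds > 0.0);
    return ((double)bytes / seconds) / (1024.0 * 1024.0 * 1024.0);
}

/* Times one pass over a freshly allocated buffer (hypothetical helper).
   Returns GB/s, or 0.0 if allocation fails or the pass is too short
   for clock() to resolve. */
double measure_write_pass(u64 size) {
    u8 *buffer = malloc((size_t)size);
    if (buffer == NULL) {
        return 0.0;
    }
    clock_t start = clock();
    fill_bytes(buffer, size);
    clock_t end = clock();
    free(buffer);

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return seconds > 0.0 ? bandwidth_gb_per_s(size, seconds) : 0.0;
}
```

Calling measure_write_pass(1024ull * 1024 * 1024) would mirror the 1 GB test. Because the buffer is freshly allocated, that pass includes page faults, like the slower of the two runs above. Writing the same buffer a second time would be closer to the run with 0 page faults.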
The fastest observed write speed was ~2.36 GB/s. The compiled assembly loop (built with -O2 optimization) is:
00007FF655E9A4F0 mov byte ptr [rdx+rax*1], al
00007FF655E9A4F3 inc rax
00007FF655E9A4F6 cmp rax, rcx
00007FF655E9A4F9 jb 0x7ff655e9a4f0
00007FF655E9A4FB ret
This loop spans 11 bytes of instructions (from 0x7FF655E9A4F0 up to, but not including, the ret at 0x7FF655E9A4FB), so each iteration executes 11 bytes of instructions to write 1 byte. The CPU has a base frequency of 2.3 GHz (2,290,000,000 Hz) and Turbo Boost up to 2.8 GHz. The peak measured bandwidth is 2.356 GB/s (2.356 * 1024^3 = 2,529,735,737.344 bytes/s). Assuming the core ran at the 2.8 GHz turbo frequency during the test, the cycles per byte (equivalently, cycles per loop iteration) are: 2800000000 / 2529735737 ~= 1.10683 cycles/loop.
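The arithmetic above can be written as a small helper so it is easy to repeat for other measurements. This is only a sketch: the function names are made up for illustration, and the clock frequency passed in is an assumption about what the core was actually running at during the test.

```c
#include <assert.h>

/* Bytes per second for a bandwidth given in GB/s, with 1 GB = 1024^3 bytes. */
double gb_per_s_to_bytes_per_s(double gb_per_s) {
    return gb_per_s * 1024.0 * 1024.0 * 1024.0;
}

/* Cycles per loop iteration, assuming the loop handles one byte per
   iteration and the core runs at clock_hz for the whole measurement. */
double cycles_per_iteration(double clock_hz, double gb_per_s) {
    assert(gb_per_s > 0.0);
    return clock_hz / gb_per_s_to_bytes_per_s(gb_per_s);
}
```

With the numbers from the text, cycles_per_iteration(2.8e9, 2.356) gives about 1.107. Plugging in the base frequency instead, cycles_per_iteration(2.29e9, 2.356), gives about 0.905, which shows how much the result depends on the assumed clock.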
Let's add two more tests. In the first one, we remove the mov instruction:
asm_no_mov_loop:
xor rax, rax
.loop:
inc rax
cmp rax, rcx
jb .loop
ret
The second reduces the loop to just a dec and a conditional jump:
asm_dec_loop:
.loop:
dec rcx
jnz .loop
ret
Results:
No mov loop:
Total time: 382.4215 ms
Min/Max: 2.651 GB/s and 2.439 GB/s (0 page faults).
dec loop:
Total time: 381.4530 ms
Min/Max: 2.649 GB/s and 2.512 GB/s (0 page faults).
Calculating cycles per loop for the dec variant, again assuming a 2.8 GHz clock (2.649 * 1024^3 = 2,844,342,091.776 bytes/s):
2800000000 / 2844342091.776 ~= 0.984410422 cycles/loop
Can this loop be optimized further? The bottleneck lies in the dependency chain. The mov instruction uses rax for address calculation, and the inc that follows depends on the previous value of rax. This creates a loop-carried dependency: the inc in each iteration cannot execute until the inc from the previous iteration has produced its result. Modern superscalar CPUs can execute multiple independent instructions in parallel, but a dependency chain forces the dependent instructions to run one after another. Consider what happens when the mov instruction is removed.
Even the modified loop (without mov) retains a dependency chain: each dec needs the result of the dec from the previous iteration. As a result, the CPU cannot do much better than one iteration per cycle for this specific code, which is consistent with the ~0.984 cycles per loop iteration measured above.
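One way to see what "this specific code" means is to change the code so that the counter update is paid once per several bytes instead of once per byte. The sketch below is not something we measured, and the function name is hypothetical. It unrolls the write loop by four, so each index update is shared by four stores that do not depend on each other. Whether this is actually faster depends on the compiler and the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  u8;
typedef uint64_t u64;

/* Writes the same byte pattern as the original loop, but advances the
   index once per four stores. The four stores in the body only read
   `index`; the loop-carried dependency is the single `index += 4`. */
void fill_bytes_unrolled(u8 *destination, u64 size) {
    assert(destination != NULL || size == 0);

    u64 index = 0;
    for (; index + 4 <= size; index += 4) {
        destination[index + 0] = (u8)(index + 0);
        destination[index + 1] = (u8)(index + 1);
        destination[index + 2] = (u8)(index + 2);
        destination[index + 3] = (u8)(index + 3);
    }

    /* Handle the last 0..3 bytes that do not fill a group of four. */
    for (; index < size; index++) {
        destination[index] = (u8)index;
    }
}
```

The output buffer is byte-for-byte identical to the one produced by the original loop, so the two versions can be timed against each other with the same harness.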
- Overview
- Profiling the game code;
- File I/O and page faults;
- Measuring memory bandwidth;
- Instruction decoding;
- Testing branch prediction;
- Execution ports and schedulers;
- Cache sizes and bandwidth;
- Introducing SIMD;
- Multithreading.
In progress