Subject: execution ports and scheduler
After the frontend decodes instructions, they are sent to the backend scheduler, which decides when each instruction actually executes. The scheduler tracks dependencies between instructions and dispatches an instruction only when its inputs are ready and an execution port that can handle it is free.
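As a quick illustration (these snippets are only illustrative, not part of the benchmark suite), compare a chain of dependent additions with three independent ones:
; dependent chain: each add needs the result of the previous one,
; so they have to execute one after another
add rax, 1
add rax, 1
add rax, 1
; independent additions: none of them reads another's result,
; so the scheduler can send them to different ports in the same cycle
add rax, 1
add rbx, 1
add rcx, 1
The dependent chain cannot run faster than one addition per cycle no matter how many ALU ports the CPU has, while the independent version can, in principle, use as many ports as are able to execute add in parallel.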
Ultimately, many throughput bottlenecks come down to how many execution ports can handle a particular kind of operation. For example, if a CPU has two ALUs and the code wants to perform three independent additions in the same cycle, one of them has to wait for the next cycle. To test how many ports can service mov (loads and stores), we designed microbenchmarks that reuse the same memory address, so the data stays hot in the L1 cache and memory bandwidth does not interfere with the measurement.
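Before the mov variants, here is a sketch of what a purely ALU-bound probe in the same style could look like. This routine is hypothetical (it is not one of the benchmarks measured below), and the loop counter update (sub/jnle) occupies an ALU port of its own, so any real measurement would have to account for that overhead:
asm_ports_add_3x:
align 64
.loop:
add rax, 1 ; three independent additions per iteration
add r8, 1
add r9, 1
sub rcx, 3 ; rcx counts the total number of additions to execute
jnle .loop
ret
The mov probes below follow exactly the same pattern, replacing the additions with loads and stores.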
asm_ports_read_mov_1x:
align 64
.loop:
mov rax, [rdx] ; one load per iteration, always from the same address
sub rcx, 1 ; rcx counts the total number of movs to execute
jnle .loop ; keep looping while the counter is still positive
ret
asm_ports_read_mov_2x:
align 64
.loop:
mov rax, [rdx] ; both loads write rax, but register renaming keeps them independent
mov rax, [rdx]
sub rcx, 2
jnle .loop
ret
asm_ports_read_mov_3x:
align 64
.loop:
mov rax, [rdx]
mov rax, [rdx]
mov rax, [rdx]
sub rcx, 3
jnle .loop
ret
The store versions are the same, except the mov direction is reversed to mov [memory], register:
asm_ports_write_mov_1x:
align 64
xor rax, rax ; the stored value is always zero; it does not matter for the measurement
.loop:
mov [rdx], rax ; one store per iteration, always to the same address
sub rcx, 1
jnle .loop
ret
asm_ports_write_mov_2x:
align 64
xor rax, rax
.loop:
mov [rdx], rax
mov [rdx], rax
sub rcx, 2
jnle .loop
ret
asm_ports_write_mov_3x:
align 64
xor rax, rax
.loop:
mov [rdx], rax
mov [rdx], rax
mov [rdx], rax
sub rcx, 3
jnle .loop
ret
Results:
Loads:
1x: 379.9099 ms (2.646 GB/s).
2x: 189.9677 ms (5.314 GB/s).
3x: 192.7740 ms (5.307 GB/s).
Stores:
1x: 387.4054 ms (2.651 GB/s).
2x: 389.9868 ms (2.640 GB/s).
3x: 382.9486 ms (2.638 GB/s).
Loads scale well from 1x to 2x (throughput roughly doubles), but 3x brings no further improvement. Stores stay at 1x throughput no matter how many stores are in the loop, suggesting only one store port is available. This matches Intel’s documentation for Broadwell CPUs, which specifies two load ports and one store port (see the Intel 64 and IA-32 Architectures Optimization Reference Manual, Section 2.3.4).
- Overview
- Profiling the game code
- File I/O and page faults
- Measuring memory bandwidth
- Instruction decoding
- Testing branch prediction
- Execution ports and schedulers
- Cache sizes and bandwidth
- Introducing SIMD
- Multithreading
In progress