Subject: execution ports and scheduler
After the frontend decodes instructions, they are sent to the backend scheduler, which decides when each instruction actually executes. The scheduler tracks dependencies between instructions and dispatches an instruction only when its inputs are ready and an execution port that can handle it is free.
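As a quick illustration (these snippets are only illustrative, not part of the benchmark suite), compare a chain of dependent additions with three independent ones:
; dependent chain: each add needs the result of the previous one,
; so they have to execute one after another
add rax, 1
add rax, 1
add rax, 1
; independent additions: none of them reads another's result,
; so the scheduler can send them to different ports in the same cycle
add rax, 1
add rbx, 1
add rcx, 1
The dependent chain cannot run faster than one addition per cycle no matter how many ALU ports the CPU has, while the independent version can, in principle, use as many ports as are able to execute add in parallel.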
Ultimately, many throughput bottlenecks come down to how many execution ports can handle a particular kind of operation. For example, if a CPU has two ALUs and the code wants to perform three independent additions in the same cycle, one of them has to wait for the next cycle. To test how many ports can service mov (loads and stores), we designed microbenchmarks that reuse the same memory address, so the data stays hot in the L1 cache and memory bandwidth does not interfere with the measurement.
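Before the mov variants, here is a sketch of what a purely ALU-bound probe in the same style could look like. This routine is hypothetical (it is not one of the benchmarks measured below), and the loop counter update (sub/jnle) occupies an ALU port of its own, so any real measurement would have to account for that overhead:
asm_ports_add_3x:
align 64
.loop:
add rax, 1 ; three independent additions per iteration
add r8, 1
add r9, 1
sub rcx, 3 ; rcx counts the total number of additions to execute
jnle .loop
ret
The mov probes below follow exactly the same pattern, replacing the additions with loads and stores.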
asm_ports_read_mov_1x:
align 64
.loop:
mov rax, [rdx] ; one load per iteration, always from the same address
sub rcx, 1 ; rcx counts the total number of movs to execute
jnle .loop ; keep looping while the counter is still positive
ret
asm_ports_read_mov_2x:
align 64
.loop:
mov rax, [rdx] ; both loads write rax, but register renaming keeps them independent
mov rax, [rdx]
sub rcx, 2
jnle .loop
ret
asm_ports_read_mov_3x:
align 64
.loop:
mov rax, [rdx]
mov rax, [rdx]
mov rax, [rdx]
sub rcx, 3
jnle .loop
ret
The store versions are the same, except the mov direction is reversed to mov [memory], register:
asm_ports_write_mov_1x:
align 64
xor rax, rax ; the stored value is always zero; it does not matter for the measurement
.loop:
mov [rdx], rax ; one store per iteration, always to the same address
sub rcx, 1
jnle .loop
ret
asm_ports_write_mov_2x:
align 64
xor rax, rax
.loop:
mov [rdx], rax
mov [rdx], rax
sub rcx, 2
jnle .loop
ret
asm_ports_write_mov_3x:
align 64
xor rax, rax
.loop:
mov [rdx], rax
mov [rdx], rax
mov [rdx], rax
sub rcx, 3
jnle .loop
ret
Results:
Loads:
1x: 379.9099 ms (2.646 GB/s).
2x: 189.9677 ms (5.314 GB/s).
3x: 192.7740 ms (5.307 GB/s).
Stores:
1x: 387.4054 ms (2.651 GB/s).
2x: 389.9868 ms (2.640 GB/s).
3x: 382.9486 ms (2.638 GB/s).
Loads scale well from 1x to 2x (throughput roughly doubles), but 3x brings no further improvement. Stores stay at 1x throughput no matter how many stores are in the loop, suggesting only one store port is available. This matches Intel’s documentation for Broadwell CPUs, which specifies two load ports and one store port (see the Intel 64 and IA-32 Architectures Optimization Reference Manual, Section 2.3.4).
- Overview
- Profiling the game code
- File I/O and page faults
- Measuring memory bandwidth
- Instruction decoding
- Testing branch prediction
- Execution ports and schedulers
- Cache sizes and bandwidth
- Introducing SIMD
- Multithreading
In progress