Subject: instruction decoding

Modern CPUs attempt to execute instructions in parallel whenever possible. However, their frontend - the component responsible for fetching and decoding instructions - has inherent limits. Even if the backend (execution units) is highly parallelized, performance will bottleneck if the frontend cannot supply instructions fast enough. For example, if the frontend decodes only one instruction per clock cycle, the CPU cannot exceed that rate.

To measure frontend throughput on this hardware, we reused the earlier loop structure but replaced the mov instruction with NOP (no-operation) instructions. We tested two variants:

A 3-byte NOP (equivalent in size to the original mov).
Multiple 1-byte NOPs per iteration to stress-test frontend decoding limits.

        
asm_nop_3_byte_loop:
  xor rax, rax
  .loop:
    db 0x0f, 0x1f, 0x00   ; 3-byte NOP, encoded by hand (nop dword [rax])
    inc rax
    cmp rax, rcx
    jb .loop
    ret

asm_nop_1_byte_3_times_loop:
  xor rax, rax
  .loop:
    nop
    nop
    nop
    inc rax
    cmp rax, rcx
    jb .loop
    ret

asm_nop_1_byte_1_times_loop:
  xor rax, rax
  .loop:
    nop
    inc rax
    cmp rax, rcx
    jb .loop
    ret

asm_nop_1_byte_2_times_loop:
  xor rax, rax
  .loop:
    nop
    nop
    inc rax
    cmp rax, rcx
    jb .loop
    ret

asm_nop_1_byte_10_times_loop:
  xor rax, rax
  .loop:
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    inc rax
    cmp rax, rcx
    jb .loop
    ret

asm_nop_1_byte_5_times_loop:
  xor rax, rax
  .loop:
    nop
    nop
    nop
    nop
    nop
    inc rax
    cmp rax, rcx
    jb .loop
    ret


asm_nop_1_byte_7_times_loop:
  xor rax, rax
  .loop:
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    inc rax
    cmp rax, rcx
    jb .loop
    ret

Results (0 page faults in every run; throughput is listed as best, then worst):

3-byte NOP loop:
Total time: 386.4239 ms
Throughput: 2.645 GB/s best, 2.488 GB/s worst

Three 1-byte NOPs per loop:
Total time: 517.9729 ms
Throughput: 1.980 GB/s best, 1.835 GB/s worst

Ten 1-byte NOPs per loop:
Total time: 1,155.5886 ms
Throughput: 0.878 GB/s best, 0.535 GB/s worst

Five 1-byte NOPs per loop:
Total time: 677.1334 ms
Throughput: 1.504 GB/s best, 1.408 GB/s worst

Seven 1-byte NOPs per loop:
Total time: 896.1636 ms
Throughput: 1.127 GB/s best, 1.097 GB/s worst

Additional tests:
Single 1-byte NOP per loop:
Total time: 384.4103 ms and 393.9344 ms (two runs, showing run-to-run variance)
Throughput: 2.645 GB/s best, 2.335 GB/s worst

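These results show that adding instructions to the loop body, even NOPs that do no work, lowers throughput once the frontend is saturated. Notably, the 3-byte NOP loop runs about as fast as the single 1-byte NOP loop (386 ms vs. 384 ms) and clearly faster than the loop with three 1-byte NOPs (518 ms). A 3-byte NOP and three 1-byte NOPs occupy the same number of code bytes, so the cost appears to track the number of instructions the frontend has to decode rather than their size.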
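
One way to check whether the gap between the 3-byte NOP and the three 1-byte NOPs comes from code size or from instruction count is to tabulate both for the two loop bodies. The sketch below does that with the timings reported above. The byte sizes of inc, cmp, and jb are an assumption based on the usual x86-64 encodings (3, 3, and 2 bytes), not something read back from the assembled binary, and the names loop_shape and variants are made up for this illustration.

```python
# Loop overhead shared by every variant: inc rax, cmp rax, rcx, jb .loop.
# ASSUMPTION: the usual x86-64 encodings of 3, 3, and 2 bytes.
OVERHEAD_INSTRS = 3
OVERHEAD_BYTES = 3 + 3 + 2

# name: (NOPs per iteration, bytes per NOP, measured total time in ms)
variants = {
    "one 3-byte NOP": (1, 3, 386.4239),
    "three 1-byte NOPs": (3, 1, 517.9729),
}

def loop_shape(nop_count, nop_bytes, total_ms):
    """Return instruction count, code size, and measured time for one loop body."""
    return {
        "instructions": OVERHEAD_INSTRS + nop_count,
        "bytes": OVERHEAD_BYTES + nop_count * nop_bytes,
        "total_ms": total_ms,
    }

shapes = {name: loop_shape(*v) for name, v in variants.items()}
for name, s in shapes.items():
    print(f"{name:18s} {s['instructions']} instructions, {s['bytes']} bytes, {s['total_ms']:.1f} ms")
```

Under those encodings both loop bodies come to 11 bytes, while one holds 4 instructions and the other 6, so the difference in run time lines up with instruction count rather than code size.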

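According to Agner Fog's Microarchitecture Manual, the Broadwell CPU's frontend is designed for 4 instructions per clock cycle and can fetch 16 bytes of code per cycle in single-threaded workloads. Our tests are consistent with this. Going from 1 to 10 NOPs per iteration raises the loop body from 4 to 13 instructions (counting inc, cmp, and jb) and makes the run roughly 3x slower (384 ms to 1,156 ms). That is about what a frontend limit of ~4 instructions/cycle would predict, provided the shortest loop already runs at close to one iteration per cycle, which we did not measure directly.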
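
The ~4 instructions/cycle figure can be sanity-checked with a little arithmetic on the timings above. The sketch below is only a rough estimate under two assumptions that the measurements do not establish: every variant ran the same number of iterations, and the single-NOP loop takes one cycle per iteration. The names measured_ms and instrs_per_cycle are made up for this illustration.

```python
# Measured total times in ms from the results above,
# keyed by the number of 1-byte NOPs per iteration.
measured_ms = {1: 384.4103, 3: 517.9729, 5: 677.1334, 7: 896.1636, 10: 1155.5886}

LOOP_OVERHEAD_INSTRS = 3        # inc, cmp, jb
BASELINE_CYCLES_PER_ITER = 1.0  # ASSUMPTION: not measured, see the text above

def instrs_per_cycle(nops, total_ms, baseline_ms):
    """Estimate instructions per cycle by scaling the assumed baseline cycle count."""
    cycles_per_iter = BASELINE_CYCLES_PER_ITER * total_ms / baseline_ms
    return (nops + LOOP_OVERHEAD_INSTRS) / cycles_per_iter

baseline_ms = measured_ms[1]
estimates = {n: instrs_per_cycle(n, ms, baseline_ms) for n, ms in measured_ms.items()}
for n, ipc in estimates.items():
    print(f"{n:2d} NOPs per iteration: ~{ipc:.2f} instructions per cycle (estimate)")
```

With these inputs the estimates fall between roughly 4 and 4.5 instructions per cycle. Values slightly above 4 would not contradict a 4-wide frontend if the cmp/jb pair is macro-fused and counted as a single operation, but that is a guess rather than something these timings can confirm.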

In progress

Multithreading.