Monday, March 17, 2014

Probing instruction throughput of recent Intel processors with likwid-bench - Part 1: Theory

Microarchitectures gradually improve over generations, and one aspect that gets addressed is the instruction throughput capability in terms of superscalar execution. This article focuses on the development of the load/store units from the Intel Westmere over the Intel IvyBridge to the Intel Haswell microarchitecture. The following information is correct to the best of my knowledge; if you find any errors, please let me know so I can correct them. The present article performs an architectural analysis of the STREAM triad kernel in order to predict the expected performance with the data located in the L1 cache on the considered architectures. In a subsequent article we will try to check this prediction with measurements using the likwid-bench microbenchmarking tool.

likwid-bench is part of the LIKWID tool suite. It is an application that comes out of the box with many streaming-based testcases and also allows you to easily add further testcases in the form of simple text files. The application takes care of data set size and placement as well as the threading configuration. We choose the so-called STREAM triad as implemented in McCalpin's STREAM microbenchmark. This is simple C code for it (the data type is 32-bit float):

for (int i = 0; i < N; i++)
{
    vecA[i] = vecB[i] + scalar * vecC[i];
}

One iteration requires two loads, one store, one multiply and one add operation. For Intel Westmere the fastest implementation uses packed SSE SIMD instructions. This is the resulting assembly code:

1:
movaps xmm0, [rdx + rax*4]        # load 4 floats of vecC
movaps xmm1, [rdx + rax*4+16]
movaps xmm2, [rdx + rax*4+32]
movaps xmm3, [rdx + rax*4+48]
mulps xmm0, xmm4                  # multiply with the scalar kept in xmm4
mulps xmm1, xmm4
mulps xmm2, xmm4
mulps xmm3, xmm4
addps xmm0, [rcx + rax*4]         # add 4 floats of vecB
addps xmm1, [rcx + rax*4+16]
addps xmm2, [rcx + rax*4+32]
addps xmm3, [rcx + rax*4+48]
movaps [rsi + rax*4] , xmm0       # store 4 floats of vecA
movaps [rsi + rax*4+16], xmm1
movaps [rsi + rax*4+32], xmm2
movaps [rsi + rax*4+48], xmm3
add rax, 16                       # advance index by 16 floats (one cache line)
cmp rax, rdi
jl 1b

Let's look at the theory first. Here is a schematic figure illustrating the instruction throughput capabilities of the Intel Westmere architecture. The architecture can issue and retire 4 uops per cycle. Per cycle it can execute either one load and one store, or a single load, or a single store. Both loads and stores can be up to 16 bytes wide. On the arithmetic side it can execute one multiply and one add, or a single add, or a single multiply, per cycle.

Schematic illustration of instruction throughput capabilities for Intel Westmere

Throughout the article we will consider the cycles it takes to execute the loop iterations equivalent to one cache line (64 bytes). Because packed SSE SIMD is 16 bytes wide, this results in 4 SIMD iterations to update one cache line. The throughput bottleneck on this architecture is the load port: each SIMD iteration issues two loads which have to share the single load port, so the minimum time for executing one SIMD iteration is 2 cycles. Therefore we end up with 4 SIMD iterations x 2 cycles = 8 cycles to update one cache line. We can easily compute the lightspeed performance (let's assume a clock of 3 GHz) as 3 GHz / 2 cycles * 4 iterations * 2 flops/iteration = 12 GFlops/s.
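
To make the arithmetic explicit, here is a minimal C sketch of this lightspeed estimate; the function and parameter names are my own and not part of likwid-bench:

#include <stdio.h>

/* Lightspeed estimate in GFlops/s: clock in GHz divided by the cycles per
   SIMD iteration, times the scalar iterations per SIMD iteration, times the
   flops per scalar iteration. */
static double lightspeed_gflops(double clock_ghz, double cycles_per_simd_iter,
                                double scalar_iters, double flops_per_iter)
{
    return clock_ghz / cycles_per_simd_iter * scalar_iters * flops_per_iter;
}

int main(void)
{
    /* Westmere with SSE: 2 cycles per SIMD iteration, 4 scalar iterations
       per SSE iteration, 2 flops per scalar iteration */
    printf("Westmere (SSE): %.1f GFlops/s\n", lightspeed_gflops(3.0, 2.0, 4.0, 2.0));
    return 0;
}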

After Westmere, which was a so-called tick (an incremental update), Intel released a larger update with the SandyBridge processor. Intel SandyBridge introduced the AVX SIMD instruction set extension with a SIMD width of 32 bytes instead of the previous 16 bytes with SSE. The resulting AVX kernel looks like the following:

1:
vmovaps ymm1, [rdx + rax*4]
vmovaps ymm2, [rdx + rax*4+32]
vmovaps ymm3, [rdx + rax*4+64]
vmovaps ymm4, [rdx + rax*4+96]
vmulps ymm1, ymm1, ymm5
vmulps ymm2, ymm2, ymm5
vmulps ymm3, ymm3, ymm5
vmulps ymm4, ymm4, ymm5
vaddps ymm1, ymm1, [rcx + rax*4]
vaddps ymm2, ymm2, [rcx + rax*4+32]
vaddps ymm3, ymm3, [rcx + rax*4+64]
vaddps ymm4, ymm4, [rcx + rax*4+96]
vmovaps [rsi + rax*4] , ymm1
vmovaps [rsi + rax*4+32], ymm2
vmovaps [rsi + rax*4+64], ymm3
vmovaps [rsi + rax*4+96], ymm4
add rax, 32
cmp rax, rdi
jl 1b

Because its successor IvyBridge performs similarly, we will only consider IvyBridge in this comparison. The following illustration again shows the important properties. The execution units are 32 bytes wide. This microarchitecture also adds a second load port; the load/store units are still 16 bytes wide. Per cycle this architecture can execute either one packed AVX load and half a packed AVX store, or a single packed AVX load, or half a packed AVX store. For SSE code the illustration suggests the architecture can execute two packed SSE loads and one packed SSE store, or one packed SSE load and one packed SSE store, or a single SSE store, per cycle. Still, as can be seen in the illustration, the store (data) unit shares the address generation with the load units (indicated by AGU in the illustration). With an instruction mix of two SSE loads and one SSE store, the three address generations have to share ports 2 and 3, so the store competes with the loads. The maximum throughput therefore cannot be reached with SSE code. This does not apply to AVX: here a 32-byte load only occupies one port, and the other port can be used for the store address generation.

Schematic illustration of instruction throughput capabilities for Intel Ivy Bridge

But what does that mean for the execution of our STREAM triad test case? Again the load/store units are the bottleneck. The two loads as well as the store take 2 cycles for one AVX SIMD iteration. Because we need two AVX SIMD iterations to update one cache line, we end up with 2 SIMD iterations x 2 cycles = 4 cycles. The lightspeed performance is then 3 GHz / 2 cycles * 8 iterations * 2 flops/iteration = 24 GFlops/s.
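
In terms of the small C helper sketched in the Westmere discussion, this corresponds to lightspeed_gflops(3.0, 2.0, 8.0, 2.0): 8 scalar iterations per AVX SIMD iteration at 2 cycles each.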

The next microarchitecture, Haswell, again was a larger update (a tock in Intel nomenclature). Haswell widens all load/store paths to 32 bytes. Moreover the processor adds two additional ports, one of them with an address generation unit for stores. Haswell can now dispatch up to 8 uops per cycle to its execution ports but is still limited to 4 uops issued and retired per cycle. One could ask why one would want to do that: putting more in at the top when no more can exit at the bottom. One explanation is that the throughput of 4 uops per cycle could not be reached in practice on, e.g., the Westmere design; a CPI of 0.35-0.5 is already the best you can expect. By dispatching more uops per cycle you increase the average instruction throughput and get closer to the theoretical limit of 4 uops per cycle. The following illustration shows the basic setup of Haswell.

Schematic illustration of instruction throughput capabilities for Intel Haswell

Haswell has support for AVX2, which adds things like gather and promotes most integer instructions to 32-byte wide execution. Haswell also has fused multiply-add (FMA) instructions. There is a drawback here: a naive view suggests Haswell should be able to execute either one add and one multiply, or two adds, or two multiplies per cycle because it has two FMA units. This holds for multiply instructions but not for adds: for additions the throughput is still one instruction per cycle.
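
For illustration, here is a sketch of the triad written with AVX/FMA compiler intrinsics from immintrin.h; a compiler targeting Haswell could translate it into code close to the assembly shown below. The function and array names are mine, and the sketch assumes 32-byte aligned arrays and a length that is a multiple of 8:

#include <immintrin.h>

/* vecA[i] = vecB[i] + scalar * vecC[i], 8 floats per AVX iteration.
   Compile with FMA support enabled, e.g. -march=core-avx2. */
void triad_fma(float *vecA, const float *vecB, const float *vecC,
               float scalar, long n)
{
    __m256 s = _mm256_set1_ps(scalar);
    for (long i = 0; i < n; i += 8) {
        __m256 c = _mm256_load_ps(&vecC[i]);
        __m256 b = _mm256_load_ps(&vecB[i]);
        /* fused multiply-add: s * c + b in a single instruction */
        __m256 a = _mm256_fmadd_ps(s, c, b);
        _mm256_store_ps(&vecA[i], a);
    }
}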

Again, let's look at the consequences these changes have for the throughput of the STREAM triad kernel. Looking at the execution units alone, it should be possible to execute all instructions of one SIMD iteration in a single cycle. But one AVX iteration of the above kernel consists of five instructions (two loads, a multiply, an add and a store), and only four instructions can retire per cycle, so this throughput cannot be reached. This situation should improve by using the FMA instruction: the multiply and the add are fused into one instruction, and we get away with four instructions for one SIMD iteration. This is the changed code using the FMA instruction:

1:
vmovaps ymm1, [rdx + rax*4]
vmovaps ymm2, [rdx + rax*4+32]
vmovaps ymm3, [rdx + rax*4+64]
vmovaps ymm4, [rdx + rax*4+96]
vfmadd213ps ymm1, ymm5, [rcx + rax*4]
vfmadd213ps ymm2, ymm5, [rcx + rax*4+32]
vfmadd213ps ymm3, ymm5, [rcx + rax*4+64]
vfmadd213ps ymm4, ymm5, [rcx + rax*4+96]
vmovaps [rsi+ rax*4] , ymm1
vmovaps [rsi+ rax*4+32], ymm2
vmovaps [rsi+ rax*4+64], ymm3
vmovaps [rsi+ rax*4+96], ymm4
add rax, 32
cmp rax, rdi
jl 1b

For this code one SIMD iteration should be able to execute in just 1 cycle, ending up with 2 cycles to update a cache line. This is due to the doubled data path of 32 bytes in the load/store units. The lightspeed performance is then 3 GHz / 1 cycle * 8 iterations * 2 flops/iteration = 48 GFlops/s. So theoretically the L1 performance for the STREAM triad doubles from Westmere to IvyBridge and doubles again from IvyBridge to Haswell.
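
Plugging the Haswell numbers into the C helper sketched earlier, lightspeed_gflops(3.0, 1.0, 8.0, 2.0) yields the same 48 GFlops/s, completing the 12 -> 24 -> 48 GFlops/s progression.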

In the next article we will try to confirm these predictions with likwid-bench.