
Monday, March 17, 2014

Probing instruction throughput of recent Intel processors with likwid-bench - Part 1: Theory

Microarchitectures improve gradually from generation to generation, and one aspect that is regularly addressed is the instruction throughput available for superscalar execution. This article focuses on the development of the load/store units from the Intel Westmere over the Intel IvyBridge to the Intel Haswell microarchitecture. The following information is to the best of my knowledge; if you find any errors, please let me know so I can correct them. The present article performs an architectural analysis of the STREAM triad kernel in order to predict the expected performance with the data located in the L1 cache on the considered architectures. In a subsequent article we will try to verify this prediction using the likwid-bench microbenchmarking tool.

likwid-bench is part of the LIKWID tool suite. It is an application that comes out of the box with many streaming-based test cases and also makes it easy to add new ones as simple text files. The application takes care of data set size, data placement, and threading configuration. We choose the so-called STREAM triad as implemented in McCalpin's STREAM microbenchmark. This is a simple C version of it (the data type is float, i.e. 32-bit floating point):

for (int i=0; i<N; i++)
{
    vecA[i] = vecB[i] + scalar *  vecC[i];
}

One iteration requires two loads, one store, one multiply, and one add. For Intel Westmere the fastest implementation uses packed SSE SIMD instructions. This is the resulting assembly code:

1:
movaps xmm0, [rdx + rax*4]
movaps xmm1, [rdx + rax*4+16]
movaps xmm2, [rdx + rax*4+32]
movaps xmm3, [rdx + rax*4+48]
mulps xmm0, xmm4
mulps xmm1, xmm4
mulps xmm2, xmm4
mulps xmm3, xmm4
addps xmm0, [rcx + rax*4]
addps xmm1, [rcx + rax*4+16]
addps xmm2, [rcx + rax*4+32]
addps xmm3, [rcx + rax*4+48]
movaps [rsi + rax*4] , xmm0
movaps [rsi + rax*4+16], xmm1
movaps [rsi + rax*4+32], xmm2
movaps [rsi + rax*4+48], xmm3
add rax, 16
cmp rax, rdi
jl 1b

Let's look at the theory first. Here is a schematic figure illustrating the instruction throughput capabilities of the Intel Westmere architecture. The architecture can issue and retire 4 uops per cycle. Per cycle it can execute either one load and one store, or one load, or one store; both loads and stores can be up to 16 bytes wide. On the arithmetic side it can execute either one multiply and one add, or one add, or one multiply.

Schematic illustration of  instruction throughput capabilities for Intel Westmere

Throughout the article we will count the cycles it takes to execute the loop iterations equivalent to one cache line (64 bytes). Because packed SSE SIMD is 16 bytes wide, 4 SIMD iterations are needed to update one cache line. The throughput bottleneck on this architecture is the load port, and the minimum time for executing one SIMD iteration is 2 cycles (two loads at one load per cycle). Therefore we end up with 4 SIMD iterations x 2 cycles = 8 cycles to update one cache line. We can easily compute the lightspeed performance (let's assume a clock of 3 GHz) as 3 GHz / 2 cycles * 4 scalar iterations * 2 flops/iteration = 12 GFlops/s.
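
To make the arithmetic explicit, here is a minimal C sketch of this lightspeed estimate (the helper name and parameters are mine, and the 3 GHz clock is the same assumption as above):

#include <stdio.h>

/* Lightspeed estimate for the STREAM triad in L1 cache:
 * clock_ghz       - assumed core clock in GHz (3 GHz throughout this article)
 * cycles_per_simd - cycles per SIMD loop iteration (2 on Westmere)
 * floats_per_simd - scalar iterations per SIMD iteration (4 floats per SSE register)
 * Every scalar iteration performs 2 flops (one multiply, one add).
 */
static double lightspeed_gflops(double clock_ghz, double cycles_per_simd,
                                double floats_per_simd)
{
    return clock_ghz / cycles_per_simd * floats_per_simd * 2.0;
}

int main(void)
{
    printf("Westmere SSE: %.0f GFlops/s\n", lightspeed_gflops(3.0, 2.0, 4.0)); /* 12 */
    return 0;
}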

After Westmere, which was a so-called tick (an incremental update), Intel released a larger update with the SandyBridge processor. SandyBridge introduced the AVX SIMD instruction set extension with a SIMD width of 32 bytes instead of the previous 16 bytes with SSE. The resulting AVX kernel looks like the following:

1:
vmovaps ymm1, [rdx + rax*4]
vmovaps ymm2, [rdx + rax*4+32]
vmovaps ymm3, [rdx + rax*4+64]
vmovaps ymm4, [rdx + rax*4+96]
vmulps ymm1, ymm1, ymm5
vmulps ymm2, ymm2, ymm5
vmulps ymm3, ymm3, ymm5
vmulps ymm4, ymm4, ymm5
vaddps ymm1, ymm1, [rcx + rax*4]
vaddps ymm2, ymm2, [rcx + rax*4+32]
vaddps ymm3, ymm3, [rcx + rax*4+64]
vaddps ymm4, ymm4, [rcx + rax*4+96]
vmovaps [rsi + rax*4] , ymm1
vmovaps [rsi + rax*4+32], ymm2
vmovaps [rsi + rax*4+64], ymm3
vmovaps [rsi + rax*4+96], ymm4
add rax, 32
cmp rax, rdi
jl 1b

Because its successor IvyBridge performs similarly, we will only consider IvyBridge in this comparison. The following illustration again shows the important properties. The execution units are 32 bytes wide. This microarchitecture also adds a second load port, but the load/store units are still only 16 bytes wide. The architecture can execute either one packed AVX (SIMD) load and half a packed AVX store, or one packed AVX load, or half a packed AVX store per cycle. For SSE code this suggests it can execute two packed SSE loads and one packed SSE store, or one packed SSE load and one packed SSE store, or one SSE store. Still, as can be seen in the illustration, the store (data) unit shares the address generation with the load units (indicated by AGU in the illustration). With an instruction mix of two SSE loads and one store, the store competes with the loads for port 2 or 3. The maximum throughput therefore cannot be reached with SSE code. This does not apply to AVX: there a 32-byte load occupies only one port, so the other port can be used for the store address generation.

Schematic illustration of instruction throughput capabilities for Intel Ivy Bridge

But what does that mean for the execution of our STREAM triad test case? Again the load/store units are the bottleneck. The two loads as well as the store take 2 cycles for one AVX SIMD iteration. Because two AVX SIMD iterations are needed to update one cache line, we end up with 2 SIMD iterations x 2 cycles = 4 cycles. The lightspeed performance is then 3 GHz / 2 cycles * 8 scalar iterations * 2 flops/iteration = 24 GFlops/s.

The next microarchitecture, Haswell, was a larger update (a tock in Intel nomenclature). Haswell widens all load/store paths to 32 bytes. Moreover the processor adds two additional ports, one of them with an address generation unit for the stores. Haswell can now issue up to 8 uops per cycle but is still limited to retiring 4 uops per cycle. One could ask why you would do that: why stuff more in at the top when no more can exit at the bottom? One explanation is that the throughput of 4 uops per cycle in, e.g., the Westmere design could not be reached in practice; a CPI of 0.35-0.5 is already the best you can expect. By being able to issue more uops per cycle you increase the average instruction throughput and get closer to the theoretical limit of 4 uops per cycle. The following illustration shows the basic setup of Haswell.

Schematic illustration of  instruction throughput capabilities for Intel Haswell

Haswell supports AVX2, which adds instructions like gather and promotes most integer SIMD instructions to 32-byte-wide execution. Haswell also gets fused multiply-add (FMA) instructions. There is a drawback here: a naive view suggests that Haswell should be able to execute either one add and one multiply, or two adds, or two multiplies per cycle because it has two FMA units. This holds for multiply instructions but not for adds: for additions the throughput is still one instruction per cycle.

Again let's look at the consequences these changes have for the throughput of the STREAM triad kernel. From the execution units alone it should be possible to issue all instructions of one SIMD iteration in a single cycle. But because only 4 uops can retire per cycle, this throughput cannot be reached. The situation should improve with the FMA instructions: only one arithmetic instruction is necessary, and we can get away with 4 instructions for one SIMD iteration. This is the changed code using the FMA instruction:

1:
vmovaps ymm1, [rdx + rax*4]
vmovaps ymm2, [rdx + rax*4+32]
vmovaps ymm3, [rdx + rax*4+64]
vmovaps ymm4, [rdx + rax*4+96]
vfmadd213ps ymm1, ymm5, [rcx + rax*4]
vfmadd213ps ymm2, ymm5, [rcx + rax*4+32]
vfmadd213ps ymm3, ymm5, [rcx + rax*4+64]
vfmadd213ps ymm4, ymm5, [rcx + rax*4+96]
vmovaps [rsi+ rax*4] , ymm1
vmovaps [rsi+ rax*4+32], ymm2
vmovaps [rsi+ rax*4+64], ymm3
vmovaps [rsi+ rax*4+96], ymm4
add rax, 32
cmp rax, rdi
jl 1b

For this code one SIMD iteration should be able to execute in just 1 cycle, which means 2 cycles to update a cache line. This is due to the load/store data paths being widened to 32 bytes. The lightspeed performance is then 3 GHz / 1 cycle * 8 scalar iterations * 2 flops/iteration = 48 GFlops/s. So in theory the L1 performance for the STREAM triad doubles from Westmere to IvyBridge and doubles again from IvyBridge to Haswell.
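
Summarizing the three architectures in the same cache-line terms (a small sketch; one 64-byte cache line holds 16 floats, i.e. 32 flops for the triad, and the 3 GHz clock is again only an assumption):

#include <stdio.h>

int main(void)
{
    const char  *arch[3]   = { "Westmere", "IvyBridge", "Haswell" };
    const double cycles[3] = { 8.0, 4.0, 2.0 };    /* cycles per cache line, as derived above */
    const double clock_ghz = 3.0;                   /* assumed clock                           */
    const double flops_per_cacheline = 16.0 * 2.0;  /* 16 floats, one mul + one add each       */

    for (int i = 0; i < 3; i++)
        printf("%-9s: %2.0f GFlops/s\n", arch[i],
               clock_ghz / cycles[i] * flops_per_cacheline); /* 12, 24, 48 */
    return 0;
}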

In the next article we will try to confirm this prediction with likwid-bench.

Wednesday, June 13, 2012

Tutorial how to measure energy consumption on Sandy Bridge processors

This small study illustrates how to use likwid-perfctr to measure energy consumption via the RAPL counters on Intel Sandy Bridge processors. You have to set up likwid so that it can access the msr device files.

We will use the famous STREAM triad benchmark for this study. The STREAM triad performs the operation A[i] = B[i] + s * C[i]. For easy measurement we use likwid-bench, which already includes different variants of the STREAM triad out of the box. We want to measure how energy to solution varies with frequency within one ccNUMA domain. The STREAM triad benchmark is regarded as being solely limited by main memory bandwidth. We start out with a scaling study on a SandyBridge system running at a nominal clock of 2.7 GHz. If you have set the power governor to performance you have to be careful, as the actual clock can differ from the nominal one due to turbo mode. We first get the turbo mode steps using likwid-powermeter. This is what the output looks like on our system:


$ likwid-powermeter -i

-------------------------------------------------------------
CPU name: Intel Core SandyBridge EP processor 
CPU clock: 2.69 GHz 
-------------------------------------------------------------
Base clock: 2700.00 MHz 
Minimal clock: 1200.00 MHz 
Turbo Boost Steps:
C1 3500.00 MHz 
C2 3500.00 MHz 
C3 3400.00 MHz 
C4 3200.00 MHz 
C5 3200.00 MHz 
C6 3200.00 MHz 
C7 3100.00 MHz 
C8 3100.00 MHz 
-------------------------------------------------------------
Thermal Spec Power: 130 Watts 
Minimum  Power: 51 Watts 
Maximum  Power: 130 Watts 
Maximum  Time Window: 0.398438 micro sec 
-------------------------------------------------------------


For our benchmark runs we want to verify the actual clock. To do so you can enable the instrumentation available in likwid-bench. If you compile likwid with gcc you have to uncomment the following line in the file include_GCC.mk:



DEFINES  += -DPERFMON


Then rebuild everything. To use the instrumentation you have to call likwid-bench together with likwid-perfctr. While you can also measure energy with likwid-powermeter, it is more convenient to use likwid-perfctr as it is much more flexible with its different operating modes. Moreover you can correlate your energy measurements with other hardware performance counter data. Everything we are interested in is covered by the ENERGY group available on SandyBridge processors. The following example runs an AVX version of the STREAM triad on four cores of socket 1 with a data set of 1 GB. likwid-perfctr sets up the counters to measure the ENERGY group. To use the instrumented regions you have to specify the -m option.


$./likwid-perfctr -c S1:0-3 -g ENERGY -m ./likwid-bench  -g 1 -i 50 -t stream_avx  -w S1:1GB:4
=====================
Region: bench 
=====================
+-------------------+---------+---------+---------+---------+
|    Region Info    | core 8  | core 9  | core 10 | core 11 |
+-------------------+---------+---------+---------+---------+
| RDTSC Runtime [s] | 2.03051 | 2.03034 | 2.03045 | 2.03027 |
|    call count     |    1    |    1    |    1    |    1    |
+-------------------+---------+---------+---------+---------+
+-----------------------+-------------+-------------+-------------+-------------+
|         Event         |   core 8    |   core 9    |   core 10   |   core 11   |
+-----------------------+-------------+-------------+-------------+-------------+
|   INSTR_RETIRED_ANY   | 6.38913e+08 | 6.34148e+08 | 6.32104e+08 | 6.29149e+08 |
| CPU_CLK_UNHALTED_CORE | 6.45153e+09 | 6.45008e+09 | 6.45041e+09 | 6.4504e+09  |
| CPU_CLK_UNHALTED_REF  | 5.44346e+09 | 5.44225e+09 | 5.44251e+09 | 5.44253e+09 |
|    PWR_PKG_ENERGY     |   146.273   |      0      |      0      |      0      |
+-----------------------+-------------+-------------+-------------+-------------+
+-------------------+---------+---------+---------+---------+
|      Metric       | core 8  | core 9  | core 10 | core 11 |
+-------------------+---------+---------+---------+---------+
|    Runtime [s]    | 2.39535 | 2.39481 | 2.39494 | 2.39493 |
| Runtime rdtsc [s] | 2.03051 | 2.03051 | 2.03051 | 2.03051 |
|    Clock [MHz]    | 3192.14 | 3192.13 | 3192.14 | 3192.12 |
|        CPI        | 10.0977 | 10.1713 | 10.2047 | 10.2526 |
|    Energy [J]     |   146   |    0    |    0    |    0    |
|     Power [W]     | 71.9031 |    0    |    0    |    0    |
+-------------------+---------+---------+---------+---------+
+------------------------+---------+---------+---------+---------+
|         Metric         |   Sum   |   Max   |   Min   |   Avg   |
+------------------------+---------+---------+---------+---------+
|    Runtime [s] STAT    | 9.58003 | 2.39535 | 2.39481 | 2.39501 |
| Runtime rdtsc [s] STAT | 8.12204 | 2.03051 | 2.03051 | 2.03051 |
|    Clock [MHz] STAT    | 12768.5 | 3192.14 | 3192.12 | 3192.13 |
|        CPI STAT        | 40.7262 | 10.2526 | 10.0977 | 10.1815 |
|    Energy [J] STAT     |   146   |   146   |    0    |  36.5   |
|     Power [W] STAT     | 71.9031 | 71.9031 |    0    | 17.9758 |
+------------------------+---------+---------+---------+---------+

As you can see, the ENERGY group measures, among other things, the cycles and instructions executed and outputs the derived metrics CPI, Clock, Energy in Joules and Power in Watts. For us the interesting parts are the Clock and the Energy. The energy counter can only be measured per package, therefore you only see one result in the table. For the statistics table not all columns make sense, but I guess you can figure out yourself where each one applies. All cores manage to clock up to the 3.2 GHz predicted by likwid-powermeter.
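
As a quick cross-check of the derived metrics, here is a small sketch using the raw counter values of core 8 from the table above. The formulas are my reading of how likwid-perfctr derives these numbers (the base clock of roughly 2693 MHz is the detected clock reported by likwid-powermeter, and the power value only matches the table approximately):

#include <stdio.h>

int main(void)
{
    /* raw counter values for core 8 from the run above */
    const double instr    = 6.38913e8;  /* INSTR_RETIRED_ANY     */
    const double clk_core = 6.45153e9;  /* CPU_CLK_UNHALTED_CORE */
    const double clk_ref  = 5.44346e9;  /* CPU_CLK_UNHALTED_REF  */
    const double energy   = 146.273;    /* PWR_PKG_ENERGY [J]    */
    const double runtime  = 2.03051;    /* RDTSC runtime [s]     */
    const double base_mhz = 2693.0;     /* detected base clock   */

    printf("CPI         : %.2f\n", clk_core / instr);              /* ~10.1 */
    printf("Clock [MHz] : %.0f\n", base_mhz * clk_core / clk_ref); /* ~3192 */
    printf("Power [W]   : %.1f\n", energy / runtime);              /* ~72   */
    return 0;
}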

I did this for various frequency settings and obtained the following:

Performance scaling of STREAM triad



You can see that the higher-clocked variants saturate at around 3 cores while the lower-clocked runs need 4 cores to saturate. Also notable is that the bandwidth decreases with lower core frequency. If you repeat this using a variant of the STREAM triad employing so-called non-temporal stores (triad_mem in likwid-bench) the picture changes to:
Performance scaling  of STREAM triad with NT stores

Now all frequencies need 4 cores to saturate the bandwidth. But what about energy to solution?
Energy to solution for STREAM triad

Energy to solution for STREAM triad with NT stores

As you can see, without NT stores the lowest energy is reached at 1.6 GHz with 4 cores. With NT stores the optimum is either also 1.6 GHz, or 1.2 GHz with 8 cores. It is notable that with 8 cores there is a factor-of-two saving in energy to solution between running in turbo mode and running at a fixed 1.2 GHz for this main-memory-limited benchmark. Of course the performance is also lower at 1.2 GHz, but not by a factor of two, only by around 20%. You can figure out many interesting things using likwid-bench with instrumentation turned on together with likwid-perfctr.