Tuesday, February 7, 2012

Raw arithmetic SIMD Instruction throughput on Interlagos and SandyBridge

The arithmetic performance of modern processors comes, apart from task parallelism, mainly from data-parallel instructions. There is a zoo of x86 instruction set extensions: SSEx, AVX, FMA3, FMA4. This article gives a short overview of raw SIMD instruction throughput on the recent processors AMD Interlagos and Intel SandyBridge. Of course a short article cannot cover the topic in full depth; for more detailed tips and tricks please refer to the optimization manuals released by the vendors.

Test machines are an AMD Opteron (trademark) 6276 and an Intel Xeon (trademark) E31280. The methodology is simple: create small loop kernels in likwid-bench using different instruction and operation mixes and measure CPI. To keep things simple we restrict ourselves to the double precision case; this makes no difference with regard to instruction throughput. CPI stands for Cycles Per Instruction and tells you how many cycles a processor core needs on average to execute one instruction. If, e.g., a processor can execute four instructions per cycle, the optimal CPI on this processor is 0.25. The ability to schedule multiple instructions of an otherwise sequential instruction stream in one cycle is also referred to as superscalar execution.
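As a minimal sketch of the derived metric (all counter values here are invented for illustration), CPI is simply the ratio of the two basic core counters:

```python
# Sketch: deriving CPI from cycle and instruction counts.
# The numbers below are made up for illustration.

def cpi(cycles, instructions):
    """Cycles Per Instruction: lower is better."""
    return cycles / instructions

# A hypothetical kernel on a core that sustains four instructions per cycle:
cycles = 1_000_000
instructions = 4_000_000

print(cpi(cycles, instructions))  # optimal CPI on a 4-wide core: 0.25
```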
The following illustration shows the FPU capabilities of both processors:








Interlagos has two 128-bit wide fused multiply-accumulate units. This means Interlagos is superscalar for any mix of operations, while SandyBridge is only superscalar for an equal multiply/add instruction mix. In contrast to Interlagos, SandyBridge has 256-bit wide execution units. For AVX code and a multiply/add operation mix SandyBridge is superscalar while Interlagos is not. Let's look at the results for different instruction mixes:
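The peak double precision throughput implied by these unit layouts can be worked out with back-of-the-envelope arithmetic (a sketch from the widths and unit counts stated above, not measured data):

```python
# Peak DP flops per cycle implied by the FPU layout described above
# (illustrative arithmetic only, not measurements).

DP_BITS = 64

def flops_per_cycle(units, width_bits, flops_per_element):
    lanes = width_bits // DP_BITS            # DP elements per register
    return units * lanes * flops_per_element

# Interlagos module: two 128-bit FMA units, 2 flops per element (mul + add)
interlagos = flops_per_cycle(units=2, width_bits=128, flops_per_element=2)

# SandyBridge core: one 256-bit add pipe plus one 256-bit mult pipe
sandybridge = (flops_per_cycle(1, 256, 1)    # add pipe
               + flops_per_cycle(1, 256, 1)) # mult pipe

print(interlagos, sandybridge)  # both reach 8 DP flops/cycle at peak
```

This is why the two designs look equal on paper: they only reach that peak under different instruction mixes, as the measurements show.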



Blue is equal, green is better and red is worse. Interlagos has advantages with pure add or pure mult SSE code.
For AVX multadd SandyBridge is better, while for AVX pure add and pure mult both are equal, with SandyBridge slightly more efficient. On paper FMA should perform similarly to the SandyBridge AVX variant. Still the code is very dense in this case and it is more difficult to get efficient results. For one core the result was not as expected; using both cores sharing an FPU showed better performance. Not as efficient as possible, but still the best achievable on Interlagos with regard to MFlops/s.

For completeness the results in terms of arithmetic FP instruction throughput:



Please note that this reflects only a single aspect of the processor microarchitecture and is not representative of overall processor performance. Still, I hope this can clear up some uncertainties with regard to SIMD arithmetic instruction throughput.

Friday, February 3, 2012

Intel SandyBridge and counting the flops

In HPC it is all about the flops. While this focus can be viewed critically, it is still handy to measure the MFlops rate with hardware performance counters. This is a derived metric which computes the number of floating point operations executed in a measured time period.

While AMD has counters which measure floating point operations (not instructions), on Intel processors you can measure how many floating point instructions were executed. You have to distinguish whether a packed instruction (with multiple operations per instruction) or a scalar one was used. Together with the data type (double or float) you can then compute the number of operations.
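That derivation can be sketched as follows (SSE registers are 128 bits wide; the counter values are invented for illustration):

```python
# Sketch: computing an operation count from packed/scalar instruction
# counters, assuming SSE (128-bit registers). Counts are made up.

SSE_WIDTH_BITS = 128

def flops(packed, scalar, precision_bits):
    ops_per_packed = SSE_WIDTH_BITS // precision_bits  # 2 for double, 4 for float
    return packed * ops_per_packed + scalar

# Hypothetical counts for a double precision kernel:
print(flops(packed=1000, scalar=10, precision_bits=64))  # 2010 operations
# The same counts in single precision give 4 operations per packed instruction:
print(flops(packed=1000, scalar=10, precision_bits=32))  # 4010 operations
```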

On Core 2 the relevant events are (taken out of the likwid event file):


EVENT_SIMD_COMP_INST_RETIRED     0xCA   PMC
UMASK_SIMD_COMP_INST_RETIRED_PACKED_SINGLE     0x01
UMASK_SIMD_COMP_INST_RETIRED_SCALAR_SINGLE     0x02
UMASK_SIMD_COMP_INST_RETIRED_PACKED_DOUBLE     0x04
UMASK_SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE     0x08

This is a retired event and hence produces very accurate results.
This changed on Nehalem and Westmere. On Westmere for example the relevant events are:

EVENT_FP_COMP_OPS_EXE            0x10   PMC
UMASK_FP_COMP_OPS_EXE_SSE2_INTEGER             0x08
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED            0x10
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR            0x20
UMASK_FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION     0x40
UMASK_FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION     0x80

There are two problems with this. First, you cannot measure mixed precision performance, because the packed/scalar and single/double umasks are not connected, and therefore it is impossible to compute the ratios between packed/scalar and double/float. Second, the event now counts executed rather than retired instructions. This results in a slight overcount due to speculative execution (up to 5%).

So how does this look like on SandyBridge?

EVENT_FP_COMP_OPS_EXE            0x10   PMC
UMASK_FP_COMP_OPS_EXE_X87       0x01
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE     0x10
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE     0x20
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE     0x40
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE     0x80

That looks good: we again have combined events (e.g. packed double) as on Core 2. It is still an executed event, but as long as the overcount stays within acceptable limits this is OK.
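A minimal sketch of how a DP MFlops/s metric can be derived from the SSE events above (packed double carries two operations per instruction, scalar double one; counter values and runtime are invented for illustration):

```python
# Sketch of a derived DP MFlops/s metric from the SandyBridge SSE events
# listed above. All input values are hypothetical.

def mflops_dp(packed_double, scalar_double, runtime_s):
    flops = 2 * packed_double + scalar_double   # 2 DP ops per packed SSE instr.
    return 1.0e-6 * flops / runtime_s

# A purely packed kernel running for one second:
print(mflops_dp(packed_double=500_000_000,
                scalar_double=0,
                runtime_s=1.0))  # 1000.0 MFlops/s
```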

Validation against likwid-bench using the marker API shows the following:


The graph shows the performance in MFlops/s of the stream benchmark as it is part of likwid-bench. It is fully vectorized using SSE instructions, and we consider sequential execution. The red points are results computed by likwid-bench from accurate runtime measurements and computed flop counts; this is the correct flop value. The blue line is the value computed from the hardware performance counters. I executed this benchmark multiple times with different data set sizes (L1, L2, L3, Memory). These are the steps you can see.

The good news is that it is accurate in the L1 cache. The bad news is that it overcounts as soon as the data comes from higher cache levels. I also tried a different variant using RISC-style code separating the loads from the arithmetic instructions, with the same result. I can only guess, but it appears the event is speculatively triggered while waiting for data to arrive.

The result for the triad benchmark is slightly different:

Here it overcounts less. The difference in triad is that it has one more load stream (A=B+C*D). I have to investigate this further. Still, at the moment the Flop group on SandyBridge may (very likely) produce wrong results. I have not yet found an erratum with regard to this.