Friday, February 3, 2012

Intel SandyBridge and counting the flops

In HPC it is all about the flops. While one can be critical of this focus, it is still handy to measure the MFlops/s rate with hardware performance counters. This is a derived metric: the number of floating point operations executed, divided by the measured time period.

While AMD has counters which measure floating point operations (not instructions), on Intel processors you can only measure how many floating point instructions were executed. You have to distinguish whether a packed instruction (multiple operations per instruction) or a scalar one was used. Depending on the data type (double or float), you can then compute the number of operations.
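The conversion from instruction counts to operations can be sketched as follows. This is a minimal illustration, assuming 128-bit SSE registers (two doubles or four floats per packed instruction) and no AVX; the function name, dictionary keys, and counter values are made up for the example and are not part of likwid.

```python
# Operations per instruction, assuming 128-bit SSE registers (no AVX).
OPS_PER_INSTRUCTION = {
    "packed_single": 4,  # four 32-bit floats per 128-bit register
    "scalar_single": 1,
    "packed_double": 2,  # two 64-bit doubles per 128-bit register
    "scalar_double": 1,
}


def mflops(counts, seconds):
    """Return MFlops/s from per-type instruction counts and the runtime."""
    flops = sum(OPS_PER_INSTRUCTION[kind] * n for kind, n in counts.items())
    return flops / seconds / 1.0e6


# Hypothetical counter readings for a one-second measurement.
example = {"packed_double": 500_000_000, "scalar_double": 10_000_000}
print(mflops(example, 1.0))
```

The weights are the only architecture-specific part; with wider registers they would have to change accordingly.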

On Core 2 the relevant events are (taken out of the likwid event file):


EVENT_SIMD_COMP_INST_RETIRED     0xCA   PMC
UMASK_SIMD_COMP_INST_RETIRED_PACKED_SINGLE     0x01
UMASK_SIMD_COMP_INST_RETIRED_SCALAR_SINGLE     0x02
UMASK_SIMD_COMP_INST_RETIRED_PACKED_DOUBLE     0x04
UMASK_SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE     0x08

This is a retired event and hence produces very accurate results.
This changed on Nehalem and Westmere. On Westmere for example the relevant events are:

EVENT_FP_COMP_OPS_EXE            0x10   PMC
UMASK_FP_COMP_OPS_EXE_SSE2_INTEGER             0x08
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED            0x10
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR            0x20
UMASK_FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION     0x40
UMASK_FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION     0x80

There are two problems with this. First, you cannot measure mixed precision performance, because the events are not connected: packed/scalar and double/float are counted separately, so it is impossible to compute the ratios between them. Second, the event is now an executed event rather than a retired one. This results in a slight overcount caused by speculative execution (up to 5%).

So what does this look like on SandyBridge?

EVENT_FP_COMP_OPS_EXE            0x10   PMC
UMASK_FP_COMP_OPS_EXE_X87       0x01
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE     0x10
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE     0x20
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE     0x40
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE     0x80

That looks good: we again have combined events (e.g. packed double) as on Core 2. It is still an executed event, but as long as the overcount is within acceptable limits this is OK.
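For readers who want to program these events outside of likwid, here is a small sketch of how an event code and a unit mask combine into a raw event number. It assumes the usual Intel event-select layout (event select in bits 0-7, unit mask in bits 8-15) and leaves out all other control bits; the helper name is made up for the example.

```python
def raw_event_code(event, umask):
    """Combine event select (bits 0-7) and unit mask (bits 8-15)."""
    return (umask << 8) | event


# Values taken from the SandyBridge event list above.
FP_COMP_OPS_EXE = 0x10
UMASKS = {
    "X87": 0x01,
    "SSE_FP_PACKED_DOUBLE": 0x10,
    "SSE_FP_SCALAR_SINGLE": 0x20,
    "SSE_FP_PACKED_SINGLE": 0x40,
    "SSE_FP_SCALAR_DOUBLE": 0x80,
}

for name, umask in UMASKS.items():
    print(f"{name}: 0x{raw_event_code(FP_COMP_OPS_EXE, umask):04X}")
```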

Validation against likwid-bench using the marker API shows the following:


The graph shows the performance in MFlops/s of the stream benchmark that is part of likwid-bench. The benchmark is fully vectorized using SSE instructions, and we consider sequential execution. The red points are results computed by likwid-bench from accurate runtime measurements and computed flop counts; this is the correct flop value. The blue line is the value computed from hardware performance counters. I executed the benchmark multiple times with different data set sizes (L1, L2, L3, memory). These are the steps you can see.

The good news is that it is accurate in L1 cache. The bad news is that it overcounts as soon as the data comes from higher cache levels. I also tried a different variant using RISC-style code that separates the loads from the arithmetic instructions, with the same result. I can only guess, but it appears the event is speculatively triggered while waiting for data to arrive.
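The comparison behind this validation can be sketched as follows: a reference rate from the known flop count and the runtime, and the relative deviation of the counter-derived rate from it. This is only an illustration; the function names and all numbers are hypothetical and are not measurements from this post.

```python
def reference_mflops(iterations, flops_per_iteration, seconds):
    """MFlops/s from a known flop count and a measured runtime."""
    return iterations * flops_per_iteration / seconds / 1.0e6


def overcount_percent(counter_mflops, ref_mflops):
    """Relative deviation of the counter-derived rate from the reference."""
    return (counter_mflops - ref_mflops) / ref_mflops * 100.0


# Hypothetical run: a kernel with one multiply and one add per iteration.
ref = reference_mflops(iterations=1_000_000, flops_per_iteration=2, seconds=0.001)
measured = 2200.0  # made-up counter-derived MFlops/s
print(f"reference: {ref:.1f} MFlops/s, overcount: {overcount_percent(measured, ref):.1f}%")
```

A positive value means the counters report more flops than the benchmark actually performed.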

The result for the triad benchmark is slightly different:

Here it overcounts less. The difference is that triad has one more load stream (A=B+C*D). I have to investigate this further. Still, at the moment the Flop group on SandyBridge very likely produces wrong results. I have not yet found an erratum regarding this.
