While AMD has counters which measure floating point operations (not instructions), on Intel processors you can only measure how many floating point instructions were executed. You have to distinguish whether a packed instruction (with multiple operations per instruction) or a scalar one was used, and depending on the data type (double or float) you can then compute the number of operations.
On Core 2 the relevant events are (taken from the likwid event file):
EVENT_SIMD_COMP_INST_RETIRED 0xCA PMC
UMASK_SIMD_COMP_INST_RETIRED_PACKED_SINGLE 0x01
UMASK_SIMD_COMP_INST_RETIRED_SCALAR_SINGLE 0x02
UMASK_SIMD_COMP_INST_RETIRED_PACKED_DOUBLE 0x04
UMASK_SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE 0x08
This is a retired event and hence produces very accurate results.
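With 128 bit SSE registers a packed single instruction performs four operations and a packed double instruction two, while scalar instructions perform one. A minimal sketch of the conversion (the helper function is my own, not part of likwid):

```python
# Sketch: convert Core 2 SIMD_COMP_INST_RETIRED counts
# (one per umask) into floating point operations.
def flops_core2(packed_single, scalar_single, packed_double, scalar_double):
    # 128 bit SSE: packed single = 4 ops/inst, packed double = 2 ops/inst,
    # scalar instructions perform 1 op each.
    return (packed_single * 4 + scalar_single * 1
            + packed_double * 2 + scalar_double * 1)

# 1000 retired packed double instructions -> 2000 double precision flops
print(flops_core2(0, 0, 1000, 0))  # 2000
```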
This changed on Nehalem and Westmere. On Westmere for example the relevant events are:
EVENT_FP_COMP_OPS_EXE 0x10 PMC
UMASK_FP_COMP_OPS_EXE_SSE2_INTEGER 0x08
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED 0x10
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR 0x20
UMASK_FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION 0x40
UMASK_FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION 0x80
There are two problems with this. First, you cannot measure mixed precision performance, because the events are not connected: it is impossible to compute the ratios between packed/scalar and double/float. Second, the event is now an executed event. This results in a slight overcount due to speculative execution (up to 5%).
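To see why the Westmere events cannot be combined: the packed/scalar umasks count instructions regardless of precision, and the precision umasks count instructions regardless of width, so only the marginal sums are visible. Two different instruction mixes can produce identical counter readings but different flop counts. A small sketch with hypothetical counter values:

```python
# Westmere only reports marginal sums (packed vs. scalar and
# single vs. double), never the joint split like "packed double".

# Mix A: 100 packed single + 100 scalar double instructions
a = dict(packed=100, scalar=100, single=100, double=100)
a_flops = 100 * 4 + 100 * 1  # 500 flops

# Mix B: 100 packed double + 100 scalar single instructions
b = dict(packed=100, scalar=100, single=100, double=100)
b_flops = 100 * 2 + 100 * 1  # 300 flops

print(a == b, a_flops, b_flops)  # identical counters, different flops
```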
So how does this look on SandyBridge?
EVENT_FP_COMP_OPS_EXE 0x10 PMC
UMASK_FP_COMP_OPS_EXE_X87 0x01
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE 0x10
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE 0x20
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE 0x40
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE 0x80
That looks good: we again have combined events (e.g. packed double) as on Core 2. It is still an executed event, but as long as the overcount stays within acceptable limits this is OK.
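Since the SandyBridge umasks again give the joint split, the flop count can be computed directly. A sketch for the SSE umasks listed above (the helper function is my own):

```python
# Sketch: flops from the Sandy Bridge FP_COMP_OPS_EXE SSE umasks,
# which combine width and precision (e.g. PACKED_DOUBLE).
def flops_sandybridge(packed_double, scalar_single,
                      packed_single, scalar_double):
    # 128 bit SSE: packed double = 2 ops/inst, packed single = 4 ops/inst.
    return (packed_double * 2 + scalar_single * 1
            + packed_single * 4 + scalar_double * 1)

print(flops_sandybridge(500, 0, 0, 100))  # 1100 flops
```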
Validation against likwid-bench using the marker API shows the following:
The graph shows the performance in MFlops/s of the stream benchmark as it is part of likwid-bench. The code is fully vectorized using SSE instructions, and we consider sequential execution. The red points are results computed by likwid-bench from accurate runtime measurements and known flop counts; this is the correct flop value. The blue line is the value computed from the hardware performance counters. I executed this benchmark multiple times with different data set sizes (L1, L2, L3, memory). These are the steps you can see.
The good news is that it is accurate in L1 cache. The bad news is that it overcounts as soon as the data comes from higher cache levels. I also tried a different variant using RISC style code, separating the loads from the arithmetic instructions, with the same result. I can only guess, but it appears the event is triggered speculatively while waiting for data to arrive.
The result for the triad benchmark is slightly different:
Here it overcounts less. The difference is that triad has one more load stream (A=B+C*D). I have to investigate this further. Still, at the moment the Flops group on SandyBridge may (very likely) produce wrong results. I have not yet found an erratum regarding this.