Tuesday, February 7, 2012

Raw arithmetic SIMD Instruction throughput on Interlagos and SandyBridge

Apart from task parallelism, the arithmetic performance of modern processors comes mainly from data-parallel (SIMD) instructions. There is a whole zoo of x86 instruction set extensions: SSEx, AVX, FMA3, FMA4. This article gives a short overview of raw SIMD instruction throughput on two recent processors, AMD Interlagos and Intel SandyBridge. Of course a short article cannot cover the topic in full depth, so for more detailed tips and tricks please refer to the optimization manuals released by the vendors.

The test machines are an AMD Opteron (trademark) 6276 and an Intel Xeon (trademark) E31280. The methodology is simple: create small loop kernels in likwid-bench using different instruction and operation mixes and measure CPI. To keep things simple we restrict ourselves to the double precision case; this makes no difference with regard to instruction throughput. CPI stands for Cycles Per Instruction and tells you how many cycles a processor core needs on average to execute one instruction. If, e.g., a processor can execute four instructions per cycle, the optimal CPI on this processor is 0.25. The ability to schedule multiple instructions of an otherwise sequential instruction stream in one cycle is also referred to as superscalar execution.
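
To make the CPI definition concrete, here is a minimal Python sketch. The function names and the counter values are made up for illustration; they are not likwid-bench output and not measured data.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction for one kernel run (lower is better)."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def optimal_cpi(instructions_per_cycle: int) -> float:
    """Best possible CPI for a core that can execute this many instructions per cycle."""
    if instructions_per_cycle <= 0:
        raise ValueError("instructions_per_cycle must be positive")
    return 1.0 / instructions_per_cycle


# Hypothetical counter values, for illustration only.
cycles = 1_000_000
instructions = 2_000_000

print(cpi(cycles, instructions))  # 0.5 -> two instructions per cycle on average
print(optimal_cpi(4))             # 0.25 -> the four-instructions-per-cycle example
```

A measured CPI close to the optimal value means the kernel keeps the execution units busy; a larger value means cycles were spent without retiring instructions.
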
The following illustration shows the FPU capabilities of both processors:








Interlagos has two 128-bit wide fused multiply-accumulate (FMA) units. This means Interlagos is superscalar for any mix of operations, while SandyBridge is only superscalar for an equal multiply/add instruction mix. In contrast to Interlagos, SandyBridge has 256-bit wide execution units. So for AVX code with a multiply/add operation mix SandyBridge is superscalar while Interlagos is not, because a 256-bit AVX instruction keeps both 128-bit units busy. Let's look at the results for different instruction mixes:



Blue is equal, green is better and red is worse. Interlagos has an advantage for pure add or pure mult SSE code.
For AVX multadd SandyBridge is better, while for AVX pure add and pure mult both are equal, although SandyBridge is slightly more efficient. On paper, FMA on Interlagos should perform similar to the SandyBridge AVX multadd variant. However, the code is very dense in this case, which makes it more difficult to get efficient results. With one core the result was not as expected. Using both cores sharing an FPU gave better performance: not as efficient as theoretically possible, but still the best achievable on Interlagos in terms of MFlops/s.
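
The "on paper" expectations behind these comparisons can be sketched with a small model. This is a simplified, idealized calculation based only on the FPU description above (two 128-bit FMA-capable units on Interlagos, one 256-bit add unit and one 256-bit mult unit on SandyBridge). It ignores decode, load/store and scheduling limits, and the function names are my own, so treat it as a sanity check rather than a prediction of the measured numbers.

```python
DOUBLE_BITS = 64
MIXES = ("add", "mul", "muladd", "fma")


def interlagos_flops_per_cycle(simd_bits: int, mix: str) -> float:
    """Ideal double precision flops per cycle for one Interlagos FPU."""
    if mix not in MIXES or simd_bits not in (128, 256):
        raise ValueError("unsupported mix or SIMD width")
    # Two 128-bit units; assume a 256-bit instruction occupies both of them.
    slots_per_instruction = simd_bits // 128
    instructions_per_cycle = 2 / slots_per_instruction
    flops_per_lane = 2 if mix == "fma" else 1
    return instructions_per_cycle * (simd_bits // DOUBLE_BITS) * flops_per_lane


def sandybridge_flops_per_cycle(simd_bits: int, mix: str) -> float:
    """Ideal double precision flops per cycle for one SandyBridge core."""
    if mix not in ("add", "mul", "muladd") or simd_bits not in (128, 256):
        raise ValueError("unsupported mix or SIMD width (no FMA on SandyBridge)")
    # One add unit and one mult unit: two instructions per cycle
    # only if adds and mults are evenly mixed.
    instructions_per_cycle = 2 if mix == "muladd" else 1
    return float(instructions_per_cycle * (simd_bits // DOUBLE_BITS))


for bits, mix in [(128, "add"), (256, "add"), (256, "muladd")]:
    print(bits, mix,
          interlagos_flops_per_cycle(bits, mix),
          sandybridge_flops_per_cycle(bits, mix))
print(128, "fma", interlagos_flops_per_cycle(128, "fma"))
```

In this model Interlagos is ahead for pure add SSE code (4 vs. 2 flops per cycle), both are equal for pure add AVX code (4 vs. 4), SandyBridge is ahead for AVX multadd (8 vs. 4), and FMA on Interlagos reaches the same 8 flops per cycle as SandyBridge AVX multadd.
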

For completeness, here are the results in terms of arithmetic FP instruction throughput:



Please note that this reflects only a single aspect of the processor microarchitecture and is not representative of overall processor performance. Still, I hope this clears up some uncertainties with regard to SIMD arithmetic instruction throughput.
