Tuesday, April 24, 2012

load/store throughput on SandyBridge

For many codes the biggest improvement with SandyBridge comes not from its AVX capabilities but from the doubled load throughput: two times 16 bytes instead of 16 bytes per cycle. The width of the load/store units is still 16 bytes, so each unit can load or store 16 bytes per instruction. Where Intel processors before SandyBridge (SNB) could execute at most one load and one store per cycle, SNB can in principle sustain two loads and one store per cycle. The 32-byte wide AVX load and store instructions are executed split, which means each load/store takes 2 cycles to execute for 256-bit AVX code.

This means that on paper SSE and AVX should have similar load/store capabilities. Below, results for a triad benchmark (serial execution) are shown. The corresponding C code is:


for (int i = 0; i < size; i++) {
    A[i] = B[i] + C[i] * C[i];
}

Interesting for us is the L1 cache performance. The Intel compiler was used. It can be seen that with -O3 (SSE code) SNB (i7-2600) is, contrary to expectations, not twice as fast as Westmere, despite also being clocked higher. The Westmere clocks at 2.93 GHz with Turbo Mode. This code is still not optimal because split 8-byte loads are used for some of the vectors. Adding the vector aligned pragma leads to optimal code. Now Westmere reaches the theoretical limit of 3 cycles per SIMD update, resulting in 2.93 GHz / 3 cycles * 4 flops = 3.9 GFlops. While SNB also improves, it is still far away from its optimal performance.



You can see that AVX brings an additional boost on SNB. This is surprising, because a purely load/store throughput limited code such as the triad should not profit from the present implementation of AVX on SNB, as the load/store capabilities are the same. It turns out that a specific detail of the scheduler prevents the 128-bit SSE code from reaching the full throughput of SNB. Each data move instruction consists of an address generation part and a data part. The load ports (2 and 3) are also used for the address generation of the stores; the actual store port is port 4. This means that a store blocks two ports (port 4 and one of 2/3), and on ports 2/3 the store address generation competes with possible loads. As a consequence, 128-bit code cannot reach the full throughput. With AVX code it can, because a 256-bit load occupies its port for two cycles but issues only one uop. In the second cycle the port is therefore free, and the address generation uop of the store can be scheduled there. This is also described in the very good microarchitecture manual from Agner Fog (update 29-02-2012, page 100).