Monday, March 17, 2014

Probing instruction throughput of recent Intel processors with likwid-bench - Part 1: Theory

Microarchitectures improve gradually over generations, and one thing that is addressed is the instruction throughput in terms of superscalar execution. This article focuses on the development of the load/store units from the Intel Westmere over the IvyBridge to the Haswell microarchitecture. The following information is to the best of my knowledge; if you find any errors please let me know so I can correct them. The present article performs an architectural analysis of the STREAM triad kernel in order to predict the expected performance with the data located in the L1 cache on the considered architectures. In a subsequent article we will try to verify this prediction using the likwid-bench microbenchmarking tool.

likwid-bench is part of the LIKWID tool suite. It is an application which comes out of the box with many streaming-based testcases and also allows you to easily add more testcases in the form of simple text files. The application takes care of data set size, data placement, and threading configuration. We choose the so-called STREAM triad as implemented in McCalpin's STREAM microbenchmark. This is simple C code for it (the data type is float, i.e. 32-bit floating point):

for (int i = 0; i < size; i++)
{
    vecA[i] = vecB[i] + scalar * vecC[i];
}

One iteration requires two loads, one store, one multiply, and one add operation. For Intel Westmere the fastest implementation uses packed SSE SIMD instructions. This is the resulting assembly code:

1:
movaps xmm0, [rdx + rax*4]
movaps xmm1, [rdx + rax*4+16]
movaps xmm2, [rdx + rax*4+32]
movaps xmm3, [rdx + rax*4+48]
mulps xmm0, xmm4
mulps xmm1, xmm4
mulps xmm2, xmm4
mulps xmm3, xmm4
addps xmm0, [rcx + rax*4]
addps xmm1, [rcx + rax*4+16]
addps xmm2, [rcx + rax*4+32]
addps xmm3, [rcx + rax*4+48]
movaps [rsi + rax*4] , xmm0
movaps [rsi + rax*4+16], xmm1
movaps [rsi + rax*4+32], xmm2
movaps [rsi + rax*4+48], xmm3
add rax, 16
cmp rax, rdi
jl 1b

Let's look at the theory first. Here is a schematic figure illustrating the instruction throughput capabilities of the Intel Westmere architecture. The architecture can issue and retire 4 uops per cycle. It can execute one load and one store per cycle (or just a single load or a single store). Loads and stores can each be up to 16 bytes wide. On the arithmetic side it can execute one multiply and one add per cycle (or just a single add or a single multiply).

Schematic illustration of  instruction throughput capabilities for Intel Westmere

Throughout the article we will consider the cycles it takes to execute the loop iterations equivalent to one cache line (64 bytes). Because packed SSE SIMD is 16 bytes wide, this results in 4 SIMD iterations to update one cache line. The throughput bottleneck on this architecture is the load port, and the minimum time for executing one SIMD iteration is 2 cycles. Therefore we end up with 4 SIMD iterations x 2 cycles = 8 cycles to update one cache line. We can easily compute the lightspeed performance (let's consider a clock of 3 GHz) as 3 GHz / 2 cycles * 4 updates * 2 flops/update = 12 GFlops/s.

After Westmere, which was a so-called tick (incremental) update, Intel released a larger update with the SandyBridge processor. Intel SandyBridge introduced the AVX SIMD instruction set extension with a SIMD width of 32 bytes instead of the previous 16 bytes of SSE. The resulting AVX kernel looks like the following:

1:
vmovaps ymm1, [rdx + rax*4]
vmovaps ymm2, [rdx + rax*4+32]
vmovaps ymm3, [rdx + rax*4+64]
vmovaps ymm4, [rdx + rax*4+96]
vmulps ymm1, ymm1, ymm5
vmulps ymm2, ymm2, ymm5
vmulps ymm3, ymm3, ymm5
vmulps ymm4, ymm4, ymm5
vaddps ymm1, ymm1, [rcx + rax*4]
vaddps ymm2, ymm2, [rcx + rax*4+32]
vaddps ymm3, ymm3, [rcx + rax*4+64]
vaddps ymm4, ymm4, [rcx + rax*4+96]
vmovaps [rsi + rax*4] , ymm1
vmovaps [rsi + rax*4+32], ymm2
vmovaps [rsi + rax*4+64], ymm3
vmovaps [rsi + rax*4+96], ymm4
add rax, 32
cmp rax, rdi
jl 1b

Because its successor IvyBridge performs similarly, we will only consider IvyBridge in this comparison. The following illustration again shows the important properties. The execution units are 32 bytes wide. This microarchitecture also adds a second load port. The load/store units are still 16 bytes wide. This architecture can execute one packed (SIMD) AVX load and half a packed AVX store per cycle (or just one packed AVX load, or half a packed AVX store). For SSE code the architecture suggests it can execute two packed SSE loads and one packed SSE store per cycle (or one packed SSE load and one packed SSE store, or one SSE store). Still, as can be seen in the illustration, the store (data) unit shares the address generation with the load units (indicated by AGU in the illustration). With an instruction mix of two SSE loads and one store, the store competes with the loads for ports 2 and 3. The maximum throughput therefore cannot be reached with SSE code. This does not apply to AVX: a 32-byte load occupies its port for two cycles but only one uop, so the port can also take the address generation uop of the store.

Schematic illustration of instruction throughput capabilities for Intel Ivy Bridge

But what does that mean for the execution of our STREAM triad test case? Again the load/store units are the bottleneck. The two loads as well as the store take 2 cycles for one AVX SIMD iteration. Because you need two AVX SIMD iterations to update one cache line, we end up with 2 SIMD iterations x 2 cycles = 4 cycles. Lightspeed performance is then 3 GHz / 2 cycles * 8 updates * 2 flops/update = 24 GFlops/s.

The next microarchitecture, Haswell, was a larger update (a tock in Intel nomenclature). Haswell widens all load/store paths to 32 bytes. Moreover, the processor adds two additional ports, one of them with an address generation unit for the stores. Haswell can now dispatch up to 8 uops per cycle to its execution ports but is still limited to retiring 4 uops per cycle. One could ask why you would want to do that, stuffing more in at the top when no more can exit at the bottom. One explanation is that the 4 uops per cycle throughput of, e.g., the Westmere design could not be reached in practice; a CPI of 0.35-0.5 is already the best you can expect. By issuing more instructions you increase the average instruction throughput, getting closer to the theoretical limit of 4 uops per cycle. The following illustration shows the basic setup of Haswell.

Schematic illustration of  instruction throughput capabilities for Intel Haswell

Haswell supports AVX2, which adds things like gather and promotes most integer SIMD instructions to 32-byte wide execution. Haswell also has fused multiply-add (FMA) instructions. There is a drawback here: a naive view suggests Haswell should be able to execute either one add and one multiply, or two adds, or two multiplies per cycle because it has two FMA units. This holds for multiplies but not for adds: for additions the throughput is still one instruction per cycle.

Again, let's look at what consequences these changes have for the throughput of the STREAM triad kernel. From the execution units alone it should be possible to issue all instructions of one SIMD iteration in a single cycle. But because only 4 instructions can retire per cycle, this throughput cannot be reached. The situation improves with FMA: only one instruction is needed for the multiply and add, and we get away with 4 instructions for one SIMD iteration. This is the changed code using the FMA instruction:

1:
vmovaps ymm1, [rdx + rax*4]
vmovaps ymm2, [rdx + rax*4+32]
vmovaps ymm3, [rdx + rax*4+64]
vmovaps ymm4, [rdx + rax*4+96]
vfmadd213ps ymm1, ymm5, [rcx + rax*4]
vfmadd213ps ymm2, ymm5, [rcx + rax*4+32]
vfmadd213ps ymm3, ymm5, [rcx + rax*4+64]
vfmadd213ps ymm4, ymm5, [rcx + rax*4+96]
vmovaps [rsi+ rax*4] , ymm1
vmovaps [rsi+ rax*4+32], ymm2
vmovaps [rsi+ rax*4+64], ymm3
vmovaps [rsi+ rax*4+96], ymm4
add rax, 32
cmp rax, rdi
jl 1b

With this code one SIMD iteration should execute in just 1 cycle, ending up with 2 cycles to update a cache line. This is due to the doubled 32-byte data paths of the load/store units. Lightspeed performance is then 3 GHz / 1 cycle * 8 updates * 2 flops/update = 48 GFlops/s. So theoretically the L1 performance for the STREAM triad doubles from Westmere to IvyBridge and doubles again from IvyBridge to Haswell.
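
To make the comparison explicit, here is a small standalone C sketch (not part of the original analysis, just a helper) that reproduces the three lightspeed estimates from the parameters derived above; the 3 GHz clock is simply the example frequency assumed throughout the article.

#include <stdio.h>

int main(void)
{
    const double clock_ghz = 3.0; /* assumed example clock from the text */

    /* cycles per SIMD iteration and single precision elements per SIMD
       iteration, as derived in the analysis above */
    struct arch { const char *name; double cycles; double elements; };
    struct arch archs[] = {
        { "Westmere (SSE)",    2.0, 4.0 },
        { "IvyBridge (AVX)",   2.0, 8.0 },
        { "Haswell (AVX+FMA)", 1.0, 8.0 },
    };

    for (int i = 0; i < 3; i++) {
        /* lightspeed GFlops/s = clock / cycles * elements * 2 flops per element */
        double gflops = clock_ghz / archs[i].cycles * archs[i].elements * 2.0;
        printf("%-18s %5.1f GFlops/s\n", archs[i].name, gflops);
    }
    return 0;
}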

In the next article we will try to confirm this prediction with likwid-bench.
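
As a rough preview, such a measurement could look like the command below. This is only a sketch: the testcase name and the working set size are placeholders and have to be chosen so that the kernel and data set match the analysis above (single precision triad, data fitting into L1).

$ likwid-bench -t stream_avx -w S0:16kB:1 -i 100000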

Friday, July 20, 2012

Notes on the Sandy Bridge Uncore (part 1)

For those of you who have never heard of something called the Uncore on a processor: on recent chips, hardware performance monitoring (HPM) is different for the cores and for the part of the chip shared by multiple cores (called the Uncore). Before the Intel Nehalem processor there was no Uncore and hardware performance monitoring was limited to the cores. Still, even at that time questions came up about how to measure the shared portions of the processor. Nehalem introduced the Uncore, which comprises the parts of the microarchitecture shared by multiple cores. This can be the shared last level cache, the main memory controller, or the QPI bus connecting multiple sockets. The Uncore had its own hardware performance monitoring unit with eight counters per Uncore (i.e. per socket). For many tools which use sampling to relate hardware performance counter measurements to code lines this caused great headaches, as the Uncore measurements cannot be related to code executed on a specific core anymore.

The LIKWID tool suite has no problem with this as it restricts itself to simple end-to-end measurements of hardware performance counter data. The relation between the measurement and your code is established by pinning the execution of the code to dedicated cores (which the tool can also do for you). And if you already think the Nehalem Uncore is a bad idea: with the EX type processors Intel introduced a completely new Uncore design, which is now a system on a chip (Uncore HPM manual: Intel document reference number 323535). In its first implementation this was very complex to program, with tons of MSR registers which needed to be programmed and a lot of dependencies and restrictions. The new mainstream server/desktop SandyBridge microarchitecture also uses this system-on-a-chip type Uncore design, but the implementation of the hardware performance monitoring has changed again.

First I have to warn you: Intel is not very strict about consistency in naming. E.g. the naming of the MSR registers in the SDM manuals can differ from the naming used for the same MSR registers in documents written in other parts of the company (e.g. the Uncore manuals). This is unfortunately also true for the naming of the entities in the Uncore. The Uncore does not have one HPM unit anymore but a bunch of them. On NehalemEX and WestmereEX the different parts of the Uncore were called boxes: there were mbox's (main memory controller), cbox's (last level cache segments), and a bunch of others. While the same types of boxes still exist in SNB, they are named differently now, e.g. the mbox's are now called iMC and the cbox's are called CBo's. Still, in LIKWID I stick with the old naming, since I want to build on the code I implemented for the EX type processors.

The mapping is as follows:

  • Caching agent,  SNB: CBo   EX: CBOX
  • Home agent,  SNB: HA   EX: BBOX
  • Memory controller, SNB: iMC  EX: MBOX
  • Power Control, SNB: PCU  EX: WBOX
  • QPI,  SNB: QPI  EX: SBOX/RBOX

The complexity comes from the large number of places where you can measure something. You have one Uncore per socket, and each socket has one or multiple performance monitoring units of different types. E.g. there are four iMC units, one per memory channel, and each of those has four counters. So on a two socket system you have 2 * 4 * 4 = 32 different memory related counters to measure with.


Previously, hardware performance monitoring was controlled by reading and writing MSR registers (model specific registers). This was still true on the EX type processors. Starting with SNB, the Uncore is now partly programmed through PCI bus address space. Some parts are still programmed using MSR registers, e.g. the CBo boxes, but most of the units are now programmed with PCI config space registers.


I am no specialist on PCI buses, but for the practical part it is enough to know that the operating system maps the PCI configuration space. For PCI this is 256 bytes per device, usually accessed as 32-bit values. The device space is organized as BUS / DEVICE / FUNCTION. The BUS is the socket in the HPM sense, or put the other way round, there is one additional BUS per socket in the system. The DEVICE is the HPM unit type (e.g. the main memory box) and the FUNCTION is then the concrete HPM unit.

On a two socket SandyBridge-EP system there are the following devices (this is taken from LIKWID source):

typedef enum {
    PCI_R3QPI_DEVICE_LINK_0 = 0,
    PCI_R3QPI_DEVICE_LINK_1,
    PCI_R2PCIE_DEVICE,
    PCI_IMC_DEVICE_CH_0,
    PCI_IMC_DEVICE_CH_1,
    PCI_IMC_DEVICE_CH_2,
    PCI_IMC_DEVICE_CH_3,
    PCI_HA_DEVICE,
    PCI_QPI_DEVICE_PORT_0,
    PCI_QPI_DEVICE_PORT_1,
    PCI_QPI_MASK_DEVICE_PORT_0,
    PCI_QPI_MASK_DEVICE_PORT_1,
    PCI_QPI_MISC_DEVICE_PORT_0,
    PCI_QPI_MISC_DEVICE_PORT_1,
    MAX_NUM_DEVICES
} PciDeviceIndex;

static char* pci_DevicePath[MAX_NUM_DEVICES] = {
 "13.5", "13.6", "13.1", "10.0", "10.1", "10.4",
 "10.5", "0e.1", "08.2", "09.2", "08.6", "09.6",
 "08.0", "09.0" };

So, e.g., memory channel 1 (PCI_IMC_DEVICE_CH_1) on socket 0 is: BUS 0x7f, DEVICE 0x10, FUNCTION 0x1.
The Linux OS maps this space at different locations in the /sys and /proc filesystems. In LIKWID I use the /proc filesystem. The above device is accessible via the path /proc/bus/pci/7f/10.1. Unfortunately, if you make a hexdump of such a file as a normal user you only get the header part (the first 30-40 bytes); the rest is only visible to root. For LIKWID this means you have to run the tool as root if you want to use direct access, or you have to set up the daemon mode proxy to access these files.
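
To illustrate what such a direct access looks like, here is a minimal C sketch (not LIKWID code) that reads one 32-bit register from the config space of the iMC channel 1 device on socket 0 through the /proc interface. The register offset 0xA0 is a made-up placeholder; the real config and counter register offsets have to be taken from the Intel Uncore manual, and as explained above the program has to run as root.

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* BUS 0x7f (socket 0), DEVICE 0x10, FUNCTION 0x1 -> iMC channel 1 */
    const char *path = "/proc/bus/pci/7f/10.1";
    const off_t reg_offset = 0xA0;   /* placeholder offset, not a real counter register */
    uint32_t value;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open (are you root?)");
        return 1;
    }
    /* config space registers are read with a plain positioned read on this file */
    if (pread(fd, &value, sizeof(value), reg_offset) != sizeof(value)) {
        perror("pread");
        close(fd);
        return 1;
    }
    printf("%s offset 0x%lx: 0x%08x\n", path, (long)reg_offset, value);
    close(fd);
    return 0;
}

In my next post I will explain how the SNB Uncore is implemented in likwid-perfctr and what performance groups are available.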


Wednesday, June 20, 2012

Validation of performance groups on AMD Interlagos

As I fortunately have access to an AMD Interlagos server system again, I was able to review the groups and events. I also validated some of the performance groups in the new accuracy testsuite. There I use two variants of likwid-bench: one is plain and measures the performance based on flop and byte counts, the other one is instrumented using the LIKWID marker API. I run a variety of testcases with different benchmark types and data set sizes and compare the results with each other. At this time we focus on serial measurements. I do multiple repetitions per data set size. Data set sizes (from left to right in the plots) fit into L1 (12kB), L2 (1MB), L3 (4MB) and MEM (1GB). The red curve is the result output by likwid-bench (the performance as measured by the application). The blue curve is the performance measured with likwid-perfctr based on hardware performance monitoring.
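
To give an idea of the setup, the plain and the instrumented runs look roughly like the following (a sketch only: core, iteration count and working set size are placeholders here, and the likwid-perfctr group is chosen according to the metric under test, e.g. FLOPS_DP):

$ ./likwid-bench -g 1 -i 100 -t triad -w S0:12kB:1
$ ./likwid-perfctr -c S0:0 -g FLOPS_DP -m ./likwid-bench -g 1 -i 100 -t triad -w S0:12kB:1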

The test system is an AMD Opteron(TM) Processor 6276 in a dual socket server system.
The new and updated groups will be available in the upcoming LIKWID release.

The following performance groups are supported:



  • BRANCH: Branch prediction miss rate/ratio
  • CACHE: Data cache miss rate/ratio
  • CPI: Cycles per instruction
  • DATA: Load to store ratio
  • FLOPS_DP: Double Precision MFlops/s
  • FLOPS_SP: Single Precision MFlops/s
  • FPU_EXCEPTION: Floating point exceptions
  • ICACHE: Instruction cache miss rate/ratio
  • L2: L2 cache bandwidth in MBytes/s
  • L2CACHE: L2 cache miss rate/ratio
  • L3: L3 cache bandwidth in MBytes/s
  • L3CACHE: L3 cache miss rate/ratio
  • LINKS: Bandwidth on the Hypertransport links
  • MEM: Main memory bandwidth in MBytes/s
  • NUMA: Read/Write Events between the ccNUMA nodes



The first tests cover the FLOPS_DP and FLOPS_SP groups.
Both groups use the RETIRED_FLOPS event with different umasks.


As can be seen, this event provides a very accurate flop count independent of the arithmetic instructions used. The absolute performance does not matter here as different data set sizes are used.

Next come the cache bandwidth groups L2 and L3. The L2 group is based on the DATA_CACHE_REFILLS_ALL event, while the L3 group uses the L2_FILL_WB_FILL and L2_FILL_WB_WB events.

As can be seen, for the L2 cache the results are very accurate. Because the cache is inclusive and the L1 is write-through, the measured bandwidths are the same as the bandwidth seen by your application.




Also here the results are very accurate. Still, because the L3 cache is write-allocate and exclusive with regard to the L2, the measured bandwidths are in all cases two times the bandwidth seen by your application, due to the cache architecture.




Last but not least the MEM group. Again the results are very accurate. All benchmarks involving a store cause a write-allocate transfer, resulting in a higher measured bandwidth than the bandwidth seen by your application.

The performance groups now work fine on Interlagos. The updated groups shown here will be part of the upcoming LIKWID release.

Wednesday, June 13, 2012

Tutorial how to measure energy consumption on Sandy Bridge processors

This small study illustrates how to use likwid-perfctr to measure energy consumption using the RAPL counters on Intel SandyBridge processors. You have to set up likwid so that it can access the msr device files.

We will use the famous STREAM triad benchmark for this study. The STREAM triad performs the operation A[i] = B[i] + s * C[i]. For easy measurement we will use likwid-bench, which already includes different variants of the STREAM triad out of the box. We want to measure the variation of energy to solution for different frequencies within one ccNUMA domain. The STREAM triad benchmark is regarded as being solely limited by main memory bandwidth. We start out with a scaling study on a SandyBridge system running at a nominal clock of 2.7 GHz. If you have set the power governor to performance you have to be careful, as the clock can differ from this due to turbo mode. We will first get the turbo mode steps using likwid-powermeter. This is what the output looks like on our system:


$ likwid-powermeter -i

-------------------------------------------------------------
CPU name: Intel Core SandyBridge EP processor 
CPU clock: 2.69 GHz 
-------------------------------------------------------------
Base clock: 2700.00 MHz 
Minimal clock: 1200.00 MHz 
Turbo Boost Steps:
C1 3500.00 MHz 
C2 3500.00 MHz 
C3 3400.00 MHz 
C4 3200.00 MHz 
C5 3200.00 MHz 
C6 3200.00 MHz 
C7 3100.00 MHz 
C8 3100.00 MHz 
-------------------------------------------------------------
Thermal Spec Power: 130 Watts 
Minimum  Power: 51 Watts 
Maximum  Power: 130 Watts 
Maximum  Time Window: 0.398438 micro sec 
-------------------------------------------------------------


For our benchmark runs we want to verify the actual clock. To do so you can enable the instrumentation available in likwid-bench. If you compile likwid with gcc you have to uncomment the following line in the file include_GCC.mk:



DEFINES  += -DPERFMON


Then rebuild everything. To use the instrumentation you have to call likwid-bench together with likwid-perfctr. While you can also measure energy with likwid-powermeter, it is more convenient to use likwid-perfctr as it is much more flexible in its different modes. You can also correlate your energy measurements with other hardware performance counter data. All the things we are interested in are in the ENERGY group available on SandyBridge processors. The following example runs an AVX version of the STREAM triad on four cores of socket 1 with a data set of 1GB. likwid-perfctr will set up the counters to measure the ENERGY group. To use the instrumented regions you have to specify the -m option.


$./likwid-perfctr -c S1:0-3 -g ENERGY -m ./likwid-bench  -g 1 -i 50 -t stream_avx  -w S1:1GB:4
=====================
Region: bench 
=====================
+-------------------+---------+---------+---------+---------+
|    Region Info    | core 8  | core 9  | core 10 | core 11 |
+-------------------+---------+---------+---------+---------+
| RDTSC Runtime [s] | 2.03051 | 2.03034 | 2.03045 | 2.03027 |
|    call count     |    1    |    1    |    1    |    1    |
+-------------------+---------+---------+---------+---------+
+-----------------------+-------------+-------------+-------------+-------------+
|         Event         |   core 8    |   core 9    |   core 10   |   core 11   |
+-----------------------+-------------+-------------+-------------+-------------+
|   INSTR_RETIRED_ANY   | 6.38913e+08 | 6.34148e+08 | 6.32104e+08 | 6.29149e+08 |
| CPU_CLK_UNHALTED_CORE | 6.45153e+09 | 6.45008e+09 | 6.45041e+09 | 6.4504e+09  |
| CPU_CLK_UNHALTED_REF  | 5.44346e+09 | 5.44225e+09 | 5.44251e+09 | 5.44253e+09 |
|    PWR_PKG_ENERGY     |   146.273   |      0      |      0      |      0      |
+-----------------------+-------------+-------------+-------------+-------------+
+-------------------+---------+---------+---------+---------+
|      Metric       | core 8  | core 9  | core 10 | core 11 |
+-------------------+---------+---------+---------+---------+
|    Runtime [s]    | 2.39535 | 2.39481 | 2.39494 | 2.39493 |
| Runtime rdtsc [s] | 2.03051 | 2.03051 | 2.03051 | 2.03051 |
|    Clock [MHz]    | 3192.14 | 3192.13 | 3192.14 | 3192.12 |
|        CPI        | 10.0977 | 10.1713 | 10.2047 | 10.2526 |
|    Energy [J]     |   146   |    0    |    0    |    0    |
|     Power [W]     | 71.9031 |    0    |    0    |    0    |
+-------------------+---------+---------+---------+---------+
+------------------------+---------+---------+---------+---------+
|         Metric         |   Sum   |   Max   |   Min   |   Avg   |
+------------------------+---------+---------+---------+---------+
|    Runtime [s] STAT    | 9.58003 | 2.39535 | 2.39481 | 2.39501 |
| Runtime rdtsc [s] STAT | 8.12204 | 2.03051 | 2.03051 | 2.03051 |
|    Clock [MHz] STAT    | 12768.5 | 3192.14 | 3192.12 | 3192.13 |
|        CPI STAT        | 40.7262 | 10.2526 | 10.0977 | 10.1815 |
|    Energy [J] STAT     |   146   |   146   |    0    |  36.5   |
|     Power [W] STAT     | 71.9031 | 71.9031 |    0    | 17.9758 |
+------------------------+---------+---------+---------+---------+

As you can see, the ENERGY group measures among other things the cycles and instructions executed, and outputs the derived metrics CPI, Clock, Energy in Joules, and Power in Watts. For us the interesting parts are the Clock and the Energy. The energy counter can only be measured per package, therefore you only see one result in the table. For the statistics table not all columns make sense, but I guess you can figure out yourself where they apply. All cores manage to clock up to the 3.2 GHz predicted by likwid-powermeter.
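
As a quick sanity check of the derived metrics: the reported power is just the package energy divided by the region runtime, i.e. 146 J / 2.03 s ≈ 71.9 W, which matches the Power [W] row of the table.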

So I did this for various settings to get something like this:

Performance scaling of STREAM triad



You can see that the higher clocked variants saturate at around 3 cores while the lower clocked runs need 4 cores to saturate. Also notable is that the bandwidth decreases with lower core frequency. If you repeat this using a variant of the STREAM triad employing so-called non-temporal stores (triad_mem in likwid-bench), this changes to:
Performance scaling  of STREAM triad with NT stores

Now all frequencies need 4 cores to saturate the bandwidth. Well, what about energy to solution?
Energy to solution for STREAM triad

Energy to solution for STREAM triad with NT stores

As you can see, without NT stores the lowest energy is reached using 1.6 GHz and 4 cores. With NT stores the optimum is either also 1.6 GHz, or 1.2 GHz with 8 cores. It is notable that with 8 cores there is a factor of two saving in energy to solution between running in turbo mode and running at a fixed 1.2 GHz for this main memory limited benchmark. Of course the performance is also lower at 1.2 GHz, but not by a factor of two, rather around 20%. You can figure out interesting things using likwid-bench with instrumentation turned on together with likwid-perfctr.

Tuesday, April 24, 2012

load/store throughput on SandyBridge

For many codes the biggest improvement with SandyBridge does not come from its AVX capabilities but from the doubled load throughput: two times 16 bytes instead of 16 bytes per cycle. The width of the load/store units is still 16 bytes, so each load/store instruction moves up to 16 bytes per cycle. Where Intel processors before SandyBridge (SNB) could execute either one load and one store, or one load, or one store per cycle, SNB can in principle execute two loads and one store per cycle. The 32-byte wide AVX load and store instructions are executed as two split halves. This means the loads/stores take 2 cycles each to execute for 256-bit AVX code.

This means that on paper SSE and AVX should have similar load/store capabilities. Below, results for a triad benchmark code are shown (serial execution). The corresponding C code is:


for (int i = 0; i < size; i++) {
    A[i] = B[i] + C[i] * D[i];
}

Interesting for us is the L1 cache performance. The Intel compiler was used. It can be seen that with -O3 (SSE code) SNB (i7-2600) is not twice as fast as Westmere, as one would expect, despite also being clocked higher (the Westmere runs at 2.93 GHz with turbo mode). This code is still not optimal because split 8-byte loads are used for some of the vectors. Adding the vector aligned pragma leads to optimal code. Now Westmere reaches the theoretical limit of 3 cycles per SIMD update, resulting in 2.93 GHz / 3 cycles * 4 flops = 3.9 GFlops/s. While SNB also improves, it is still far away from its optimal performance.
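
For illustration, the aligned variant mentioned above looks roughly like this with the Intel compiler (a sketch; it assumes the arrays are in fact allocated with suitable 16/32-byte alignment, e.g. via _mm_malloc):

/* assert to the Intel compiler that all accesses in the next loop are aligned,
   so it can use aligned packed loads and stores instead of split moves */
#pragma vector aligned
for (int i = 0; i < size; i++) {
    A[i] = B[i] + C[i] * D[i];
}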



You can see that AVX brings an additional boost on SNB. This is not obvious, because a purely load/store throughput limited code like the triad should not profit from the present implementation of AVX on SNB, as the load/store capabilities are the same. It turns out that a specific detail of the scheduler prevents the 128-bit SSE code from reaching the full load/store throughput of SNB. Each data move instruction consists of an address generation part and a data part. The load ports (2 and 3) are also used for the address generation of the stores; the actual store data port is port 4. This means that a store blocks two ports (4 and one of 2/3), and on ports 2/3 it competes with possible loads. This is why the full throughput cannot be reached with 128-bit code. With AVX code it can be reached, because a 256-bit load apparently occupies its port for two cycles but only one uop, so in the second cycle the port is free and the address generation uop of the store can be scheduled. This is also described in the very good microarchitecture manual from Agner Fog (update 29-02-2012, page 100).
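
To make this concrete for the triad, a back-of-the-envelope accounting based on the port assignment described above: one 128-bit SSE iteration issues 3 loads and 1 store, i.e. 4 address generation uops that all have to go to ports 2 and 3, so it needs at least 2 cycles instead of the 1.5 cycles the two load ports alone would allow. One 256-bit AVX iteration keeps the two load ports busy for 3 cycles of data transfer but issues only 3 load uops, which leaves enough uop slots on ports 2/3 for the store address generation, so the AVX code can actually reach the load/store limit.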

Tuesday, February 7, 2012

Raw arithmetic SIMD Instruction throughput on Interlagos and SandyBridge

The arithmetic performance of modern processors is, apart from task parallelism, mainly generated by data parallel instructions. There exists a zoo of x86 instruction set extensions: SSEx, AVX, FMA3, FMA4. This article wants to give a short overview of the raw SIMD instruction throughput on the recent AMD Interlagos and Intel SandyBridge processors. Of course a short article cannot cover the topic in full depth, so for more detailed tips and tricks please refer to the optimization manuals released by the vendors.

The test machines are an AMD Opteron (TM) 6276 and an Intel Xeon (TM) E31280. The methodology is simple: create small loop kernels in likwid-bench using different instruction and operation mixes and measure the CPI. To keep things simple we restrict ourselves to the double precision case; this makes no difference with regard to instruction throughput. CPI stands for Cycles Per Instruction and tells you how many cycles a processor core needs on average per instruction. If, e.g., a processor can execute four instructions per cycle, the optimal CPI on this processor is 0.25. The ability to schedule multiple instructions of an otherwise sequential instruction stream in one cycle is also referred to as superscalar execution.
The following illustration shows the FPU capabilities of both processors:

Interlagos has two 128-bit wide fused multiply-accumulate units. This means Interlagos is superscalar for any mix of operations, while SandyBridge is only superscalar for an equal multiply/add instruction mix. In contrast to Interlagos, SandyBridge has 256-bit wide execution units. For AVX code with a multiply/add operation mix SandyBridge is superscalar while Interlagos is not. Let's look at the results for different instruction mixes:



Blue means equal, green better, and red worse. Interlagos has advantages with pure add or pure multiply SSE code.
For AVX multiply/add SandyBridge is better, while for AVX pure add and pure multiply both are equal, with SandyBridge being slightly more efficient. On paper FMA should perform similarly to the SandyBridge AVX variant. Still, the code is very dense in this case and it is more difficult to get efficient results. For one core the result was not as expected; using both cores sharing an FPU showed better performance. This is not as efficient as possible, but still the best achievable on Interlagos with regard to MFlops/s.

For completeness, here are the results in terms of arithmetic FP instruction throughput:



Please note that this reflects only a single aspect of the processor microarchitecture and is not representative of the overall processor performance. Still, I hope this can clear up some uncertainties with regard to SIMD arithmetic instruction throughput.

Friday, February 3, 2012

Intel SandyBridge and counting the flops

In HPC it is all about the flops. While this view can be criticized, it is still handy to measure the MFlop rate with hardware performance counters. This is a derived metric which computes the number of floating point operations for a measured time period.

While AMD has counters which measure floating point operations (not instructions), on Intel processors you can only measure how many floating point instructions were executed. You have to distinguish whether a packed instruction (with multiple operations per instruction) or a scalar one was used, and depending on the type (double or float) you can then compute the number of operations.
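
The derived metric is then essentially

MFlops/s = 1.0E-06 * (packed instruction count * operations per packed instruction + scalar instruction count) / runtime

where a 128-bit packed instruction counts 2 operations in double precision and 4 in single precision. This is roughly how the LIKWID FLOPS groups are defined; the exact event names per microarchitecture follow below.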

On Core 2 the relevant events are (taken out of the likwid event file):


EVENT_SIMD_COMP_INST_RETIRED     0xCA   PMC
UMASK_SIMD_COMP_INST_RETIRED_PACKED_SINGLE     0x01
UMASK_SIMD_COMP_INST_RETIRED_SCALAR_SINGLE     0x02
UMASK_SIMD_COMP_INST_RETIRED_PACKED_DOUBLE     0x04
UMASK_SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE     0x08

This is a retired event and hence produces very accurate results.
This changed with Nehalem and Westmere. On Westmere, for example, the relevant events are:

EVENT_FP_COMP_OPS_EXE            0x10   PMC
UMASK_FP_COMP_OPS_EXE_SSE2_INTEGER             0x08
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED            0x10
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR            0x20
UMASK_FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION     0x40
UMASK_FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION     0x80

There are two problems with this. First, you cannot measure mixed precision performance, because the events are not connected and therefore it is impossible to compute the ratios between packed/scalar and double/float. Second, the event is now an executed event. This results in a slight overcount due to speculative execution (up to 5% overcount).

So how does this look on SandyBridge?

EVENT_FP_COMP_OPS_EXE            0x10   PMC
UMASK_FP_COMP_OPS_EXE_X87       0x01
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE     0x10
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE     0x20
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE     0x40
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE     0x80

That looks good: we again have combined events (e.g. packed double) as on Core 2. It is still an executed event, but as long as the overcount is within acceptable limits this is OK.

Validation against likwid-bench using the marker API shows the following:


The graph shows the performance in MFlops/s of the stream benchmark as it is part of likwid-bench. It is fully vectorized using SSE instructions. We consider sequential execution. The red points are results computed by likwid-bench from accurate runtime measurements and known flop counts; this is the correct flop value. The blue line is the value computed from the hardware performance counters. I execute this benchmark multiple times with different data set sizes (L1, L2, L3, Memory); these are the steps you can see.

The good news is that it is accurate in the L1 cache. The bad news is that it overcounts as soon as the data comes from higher cache levels. I also tried a different variant using RISC-style code separating the loads from the arithmetic instructions, with the same result. I can only guess, but it appears the event is triggered speculatively while waiting for data to arrive.

The result for the triad benchmark is slightly different:

Here it overcounts less. The difference with triad is that it has one more load stream (A = B + C * D). I have to investigate this further. Still, at the moment it looks like the FLOPS groups on SandyBridge may (very likely) produce wrong results. I have not yet found an erratum with regard to this.