Friday, July 20, 2012

Notes on the Sandy Bridge Uncore (part 1)

For those of you who have never heard of something called the Uncore on a processor: on recent chips, hardware performance monitoring (HPM) is split between the cores and the part of the chip shared by multiple cores, called the Uncore. Before the Intel Nehalem processor there was no Uncore, and hardware performance monitoring was limited to the cores. Even then, questions came up about how to measure the shared portions of the processor. Nehalem introduced the Uncore, which comprises the parts of the microarchitecture shared by multiple cores: the shared last level cache, the main memory controller, or the QPI bus connecting multiple sockets. The Uncore had its own hardware performance monitoring unit with eight counters per Uncore (that is, per socket). For many tools which use sampling to relate hardware performance counter measurements to code lines this caused great headaches, because Uncore measurements can no longer be related to code executed on a specific core.

The LIKWID tool suite has no problem with this, as it restricts itself to simple end-to-end measurements of hardware performance counter data. The relation between the measurement and your code is established by pinning the execution of the code to dedicated cores (which the tool can also do for you). In case you think the Nehalem Uncore was a bad idea: with the EX type processors Intel introduced a completely new Uncore, which is now a system on a chip (Uncore HPM manual: Intel document reference number 323535). In its first implementation this was very complex to program, with tons of MSR registers that needed to be set up and a lot of dependencies and restrictions. The new mainstream server/desktop SandyBridge microarchitecture also uses this system-on-a-chip type of Uncore design, but the implementation of the hardware performance monitoring changed again.

First I have to warn you: Intel is not very strict about consistency in naming. E.g., the naming of the MSR registers in the SDM manuals can differ from the naming used for the same MSR registers in documents written in other parts of the company (e.g. the Uncore manuals). This is unfortunately also true for the naming of the entities in the Uncore. The Uncore no longer has a single HPM unit but a whole bunch of them. On NehalemEX and WestmereEX the different parts of the Uncore were called boxes; there were MBOXes (main memory controller), CBOXes (last level cache segments) and a bunch of others. While the same types of boxes still exist in SNB, they are named differently now: e.g. the MBOXes are now called iMC and the CBOXes are called CBos. In LIKWID I still stick with the old naming, since I want to build on the code I implemented for the EX type processors.

The mapping is as follows:

  • Caching agent: SNB CBo, EX CBOX
  • Home agent: SNB HA, EX BBOX
  • Memory controller: SNB iMC, EX MBOX
  • Power control: SNB PCU, EX WBOX
  • QPI: SNB QPI, EX SBOX/RBOX

The complexity comes from the large number of places where you can measure something: there is one Uncore per socket, and each socket has one or multiple performance monitoring units of different types. E.g. there are four iMC units, one per memory channel, and each of them has four counters. That already gives 2*4*4=32 different memory-related counters on a two socket system.


Previously, hardware performance monitoring was controlled by reading and writing MSR (model specific) registers. This was still true on the EX type processors. Starting with SNB, the Uncore is partly programmed through PCI config space. Some parts, e.g. the CBo boxes, are still programmed via MSR registers, but most of the units are now programmed with PCI config space registers.
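
To make the MSR part concrete, here is a minimal sketch (not the LIKWID implementation) of how a single MSR can be read on Linux through the msr kernel module. The register address 0x611 (the package energy status register used later in this blog) is just an example; any other MSR address works the same way.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    uint64_t data;
    /* one msr device file per logical core; needs the msr kernel module
       and usually root privileges */
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    /* the MSR address is used as the file offset */
    if (pread(fd, &data, sizeof(data), 0x611) != sizeof(data)) {
        perror("pread"); close(fd); return 1;
    }
    printf("MSR 0x611 = 0x%llx\n", (unsigned long long)data);
    close(fd);
    return 0;
}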


I am no specialist on PCI buses, but for the practical part it is enough to know that the operating system exposes the PCI configuration space, which is 256 bytes per device and usually accessed with 32-bit reads and writes. Devices are addressed by BUS / DEVICE / FUNCTION. In the HPM sense the BUS is the socket, or put the other way round, there is one additional bus per socket in the system. The DEVICE is the HPM unit type (e.g. the main memory box) and the FUNCTION is then the concrete HPM unit.

On a two socket SandyBridge-EP system there are the following devices (this is taken from LIKWID source):




typedef enum {
    PCI_R3QPI_DEVICE_LINK_0 = 0,
    PCI_R3QPI_DEVICE_LINK_1,
    PCI_R2PCIE_DEVICE,
    PCI_IMC_DEVICE_CH_0,
    PCI_IMC_DEVICE_CH_1,
    PCI_IMC_DEVICE_CH_2,
    PCI_IMC_DEVICE_CH_3,
    PCI_HA_DEVICE,
    PCI_QPI_DEVICE_PORT_0,
    PCI_QPI_DEVICE_PORT_1,
    PCI_QPI_MASK_DEVICE_PORT_0,
    PCI_QPI_MASK_DEVICE_PORT_1,
    PCI_QPI_MISC_DEVICE_PORT_0,
    PCI_QPI_MISC_DEVICE_PORT_1,
    MAX_NUM_DEVICES
} PciDeviceIndex;





static char* pci_DevicePath[MAX_NUM_DEVICES] = {
 "13.5", "13.6", "13.1", "10.0", "10.1", "10.4",
 "10.5", "0e.1", "08.2", "09.2", "08.6", "09.6",
 "08.0", "09.0" };

So, e.g., memory channel 1 (PCI_IMC_DEVICE_CH_1) on socket 0 is: BUS 0x7f, DEVICE 0x10, FUNCTION 0x1.
The Linux OS maps this configuration space at different locations in the /sys and /proc filesystems. In LIKWID I use the /proc filesystem, where the above device is accessible via the path /proc/bus/pci/7f/10.1. Unfortunately, if you hexdump such a file as an ordinary user you only get the header part (the first 30-40 bytes); the rest is only visible to root. For LIKWID this means you either have to run the tool as root if you want direct access, or you have to set up the daemon mode proxy to access these files. In my next post I will explain how the SNB Uncore is implemented in likwid-perfctr and what performance groups are available.
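
Before that, as a small appendix, here is a minimal sketch (not the LIKWID code) of how a 32-bit register of such a device can be read through the /proc interface. The register offset 0xA0 is a made-up placeholder; the real counter offsets are listed in the Uncore manual mentioned above.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    uint32_t reg;
    /* BUS 0x7f, DEVICE 0x10, FUNCTION 0x1: socket 0, iMC channel 1 */
    int fd = open("/proc/bus/pci/7f/10.1", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    /* reads beyond the header part require root, see above */
    if (pread(fd, &reg, sizeof(reg), 0xA0) != sizeof(reg)) {
        perror("pread"); close(fd); return 1;
    }
    printf("config register 0xA0 = 0x%x\n", reg);
    close(fd);
    return 0;
}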


Wednesday, June 20, 2012

Validation of performance groups on AMD Interlagos

As I fortunately have access to an AMD Interlagos server system again, I was able to review the groups and events. I also validated some of the performance groups with the new accuracy test suite. It uses two variants of likwid-bench: one is plain and computes the performance from flop and byte counts, the other one is instrumented using the LIKWID marker API. I run a variety of test cases with different benchmark types and data set sizes and compare the results with each other. At this time we focus on serial measurements. I do multiple repetitions per data set size. The data set sizes (from left to right) fit into L1 (12kB), L2 (1MB), L3 (4MB) and MEM (1GB). The red curve is the result output by likwid-bench (the performance as measured by the application). The blue curve is the performance measured with likwid-perfctr based on hardware performance monitoring.
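
For illustration, the instrumented variant looks roughly like this in C (a minimal sketch; the function names follow the marker API in likwid.h, check the header of your LIKWID version, and the measured loop is just a stand-in for the real benchmark kernel):

#include <likwid.h>
#include <stdio.h>

int main(void)
{
    double a[1000], sum = 0.0;
    for (int i = 0; i < 1000; i++) a[i] = (double)i;

    likwid_markerInit();
    /* everything between start and stop is attributed to region "bench" */
    likwid_markerStartRegion("bench");
    for (int i = 0; i < 1000; i++) sum += a[i];
    likwid_markerStopRegion("bench");
    likwid_markerClose();

    printf("%f\n", sum);
    return 0;
}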

The test system is an AMD Opteron(TM) Processor 6276 in a dual socket server system.
The new and updated groups are available in the upcoming LIKWID release.

The following performance groups are supported:



  • BRANCH: Branch prediction miss rate/ratio
  • CACHE: Data cache miss rate/ratio
  • CPI: Cycles per instruction
  • DATA: Load to store ratio
  • FLOPS_DP: Double Precision MFlops/s
  • FLOPS_SP: Single Precision MFlops/s
  • FPU_EXCEPTION: Floating point exceptions
  • ICACHE: Instruction cache miss rate/ratio
  • L2: L2 cache bandwidth in MBytes/s
  • L2CACHE: L2 cache miss rate/ratio
  • L3: L3 cache bandwidth in MBytes/s
  • L3CACHE: L3 cache miss rate/ratio
  • LINKS: Bandwidth on the Hypertransport links
  • MEM: Main memory bandwidth in MBytes/s
  • NUMA: Read/Write Events between the ccNUMA nodes



The first tests cover the FLOPS_DP and FLOPS_SP groups.
Both groups use the RETIRED_FLOPS event with different umasks.


As can be seen, this event provides a very accurate flop count, independent of the arithmetic instructions used. The absolute performance does not matter here, as different data set sizes are used.
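
The derived metric itself is nothing more than the operation count divided by the runtime; as a sketch (variable names are placeholders for the measured values):

/* sketch: MFlops/s from the RETIRED_FLOPS counts; Interlagos counts
   operations directly, so no packed/scalar scaling is needed */
static double mflops(double retired_flops, double runtime_seconds)
{
    return 1.0e-6 * retired_flops / runtime_seconds;
}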

Next are the cache bandwidth groups L2 and L3. The L2 group is based on the DATA_CACHE_REFILLS_ALL event, while the L3 group uses the L2_FILL_WB_FILL and L2_FILL_WB_WB events.
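
Both groups derive the bandwidth from the event counts assuming that every counted event corresponds to a 64 byte cache line transfer; roughly (a sketch, variable names are placeholders):

/* sketch: cache bandwidth in MBytes/s from refill/writeback event counts,
   assuming 64 byte cache lines; for the L3 group the sum of the two
   events listed above is used */
static double bandwidth_MBs(double cacheline_events, double runtime_seconds)
{
    return 1.0e-6 * cacheline_events * 64.0 / runtime_seconds;
}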






As can be seen, the results for the L2 cache are very accurate. Because the cache is inclusive and the L1 is write-through, the measured bandwidth is the same as the bandwidth seen by your application.




Also here the results are very accurate. However, because the L3 cache is write-allocate and exclusive with regard to the L2, the measured bandwidth is in all cases two times the bandwidth seen by your application, due to the cache architecture.




Last but not least, the MEM group. Again the results are very accurate. All benchmarks involving a store cause a write-allocate transfer, resulting in a higher measured bandwidth compared to the bandwidth seen by your application.

The performance groups now work fine on Interlagos. The updated groups shown here will be part of the upcoming LIKWID release.

Wednesday, June 13, 2012

Tutorial how to measure energy consumption on Sandy Bridge processors

This small study illustrates how to use likwid-perfctr to measure energy consumption via the RAPL counters on Intel Sandy Bridge processors. You have to set up likwid so that it can access the msr device files.

We will use the famous STREAM triad benchmark for this study. The STREAM triad performs the operation A[i]=B[i]+s*C[i]. For easy measurement we will use likwid-bench, which already includes different variants of the STREAM triad out of the box. We want to measure the variation of energy to solution for different frequencies within one ccNUMA domain. The STREAM triad benchmark is generally regarded as being limited solely by main memory bandwidth. We start out with a scaling study on a SandyBridge system running at a nominal clock of 2.7 GHz. If you have set the power governor to performance you have to be careful, as the actual clock can differ from the nominal one due to turbo mode. We therefore first get the turbo mode steps using likwid-powermeter. This is what the output looks like on our system:


$ likwid-powermeter -i

-------------------------------------------------------------
CPU name: Intel Core SandyBridge EP processor 
CPU clock: 2.69 GHz 
-------------------------------------------------------------
Base clock: 2700.00 MHz 
Minimal clock: 1200.00 MHz 
Turbo Boost Steps:
C1 3500.00 MHz 
C2 3500.00 MHz 
C3 3400.00 MHz 
C4 3200.00 MHz 
C5 3200.00 MHz 
C6 3200.00 MHz 
C7 3100.00 MHz 
C8 3100.00 MHz 
-------------------------------------------------------------
Thermal Spec Power: 130 Watts 
Minimum  Power: 51 Watts 
Maximum  Power: 130 Watts 
Maximum  Time Window: 0.398438 micro sec 
-------------------------------------------------------------


For our benchmark runs we want to verify the actual clock. To do so you can enable the instrumentation available in likwid-bench. If you compile likwid with gcc, you have to uncomment the following line in the file include_GCC.mk:



DEFINES  += -DPERFMON


Then rebuild everything. To use the instrumentation you have to call likwid-bench together with likwid-perfctr. While you can also measure energy with likwid-powermeter, it is more convenient to use likwid-perfctr, as it is much more flexible in its different modes and you can correlate your energy measurements with other hardware performance counter data. Everything we are interested in is in the ENERGY group available on SandyBridge processors. The following example runs an AVX version of the STREAM triad on four cores of socket 1 with a 1 GB data set. likwid-perfctr sets up the counters to measure the ENERGY group. To use the instrumented regions you have to specify the -m option.


$ ./likwid-perfctr -c S1:0-3 -g ENERGY -m ./likwid-bench  -g 1 -i 50 -t stream_avx  -w S1:1GB:4
=====================
Region: bench 
=====================
+-------------------+---------+---------+---------+---------+
|    Region Info    | core 8  | core 9  | core 10 | core 11 |
+-------------------+---------+---------+---------+---------+
| RDTSC Runtime [s] | 2.03051 | 2.03034 | 2.03045 | 2.03027 |
|    call count     |    1    |    1    |    1    |    1    |
+-------------------+---------+---------+---------+---------+
+-----------------------+-------------+-------------+-------------+-------------+
|         Event         |   core 8    |   core 9    |   core 10   |   core 11   |
+-----------------------+-------------+-------------+-------------+-------------+
|   INSTR_RETIRED_ANY   | 6.38913e+08 | 6.34148e+08 | 6.32104e+08 | 6.29149e+08 |
| CPU_CLK_UNHALTED_CORE | 6.45153e+09 | 6.45008e+09 | 6.45041e+09 | 6.4504e+09  |
| CPU_CLK_UNHALTED_REF  | 5.44346e+09 | 5.44225e+09 | 5.44251e+09 | 5.44253e+09 |
|    PWR_PKG_ENERGY     |   146.273   |      0      |      0      |      0      |
+-----------------------+-------------+-------------+-------------+-------------+
+-------------------+---------+---------+---------+---------+
|      Metric       | core 8  | core 9  | core 10 | core 11 |
+-------------------+---------+---------+---------+---------+
|    Runtime [s]    | 2.39535 | 2.39481 | 2.39494 | 2.39493 |
| Runtime rdtsc [s] | 2.03051 | 2.03051 | 2.03051 | 2.03051 |
|    Clock [MHz]    | 3192.14 | 3192.13 | 3192.14 | 3192.12 |
|        CPI        | 10.0977 | 10.1713 | 10.2047 | 10.2526 |
|    Energy [J]     |   146   |    0    |    0    |    0    |
|     Power [W]     | 71.9031 |    0    |    0    |    0    |
+-------------------+---------+---------+---------+---------+
+------------------------+---------+---------+---------+---------+
|         Metric         |   Sum   |   Max   |   Min   |   Avg   |
+------------------------+---------+---------+---------+---------+
|    Runtime [s] STAT    | 9.58003 | 2.39535 | 2.39481 | 2.39501 |
| Runtime rdtsc [s] STAT | 8.12204 | 2.03051 | 2.03051 | 2.03051 |
|    Clock [MHz] STAT    | 12768.5 | 3192.14 | 3192.12 | 3192.13 |
|        CPI STAT        | 40.7262 | 10.2526 | 10.0977 | 10.1815 |
|    Energy [J] STAT     |   146   |   146   |    0    |  36.5   |
|     Power [W] STAT     | 71.9031 | 71.9031 |    0    | 17.9758 |
+------------------------+---------+---------+---------+---------+

As you can see, the ENERGY group measures, among other things, the cycles and instructions executed and outputs the derived metrics CPI, Clock, Energy in Joule and Power in Watt. For us the interesting parts are the Clock and the Energy. The energy counter can only be measured per package, therefore you only see one result in the table. For the statistics table not all columns make sense, but I guess you can figure out yourself where they apply. All cores manage to clock up to the 3.2 GHz predicted by likwid-powermeter.
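
The derived metrics themselves are simple divisions; as a quick plausibility check with the numbers from the table above (a sketch, not LIKWID code):

#include <stdio.h>

int main(void)
{
    double energy  = 146.0;    /* PWR_PKG_ENERGY in Joule (per package) */
    double runtime = 2.03051;  /* RDTSC runtime in seconds */
    /* prints roughly 71.9 W, matching the Power metric in the table */
    printf("Power: %.1f W\n", energy / runtime);
    return 0;
}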

I did this for various settings and ended up with something like this:

Performance scaling of STREAM triad



You can see that the higher clocked variants saturate at around 3 cores, while the lower clocked runs need 4 cores to saturate. Also notable is that the bandwidth decreases with lower core frequency. If you repeat this with a variant of the STREAM triad employing so-called non-temporal stores (triad_mem in likwid-bench), the picture changes to:
Performance scaling  of STREAM triad with NT stores

Now all frequencies need 4 cores to saturate the bandwidth. So what about energy to solution?
Energy to solution for STREAM triad

Energy to solution for STREAM triad with NT stores

As you can see, without NT stores the lowest energy is reached at 1.6 GHz with 4 cores. With NT stores the optimum is either also 1.6 GHz, or 1.2 GHz with 8 cores. It is also notable that with 8 cores there is a factor of two saving in energy to solution between running in turbo mode and running at a fixed 1.2 GHz for this main memory limited benchmark. Of course the performance is also slightly lower at 1.2 GHz, but not by a factor of two, rather around 20%. You can figure out quite a few interesting things using likwid-bench with instrumentation turned on together with likwid-perfctr.

Tuesday, April 24, 2012

load/store throughput on SandyBridge

For many codes the biggest improvement with SandyBridge does not come from its AVX capabilities but from the doubled load throughput: two times 16 bytes instead of 16 bytes per cycle. The width of the load/store units is still 16 bytes, so each load or store instruction handles 16 bytes. Where Intel processors before SandyBridge (SNB) could execute either one load and one store, or one load, or one store per cycle, SNB can in principle sustain two loads and one store per cycle. The 32 byte wide AVX load and store instructions are executed in two halves, which means these loads/stores take 2 cycles each for 256-bit AVX code.

This means that on paper SSE and AVX should have similar load/store capabilities. Below, results for a triad benchmark code are shown (serial execution). The corresponding C code is:


for (int i = 0; i < size; i++) {
    A[i] = B[i] + C[i] * C[i];
}

Interesting for us is the L1 cache performance. The Intel compiler was used. It can be seen that with -O3 (SSE code) SNB (i7-2600) is not twice as fast as Westmere, as one would expect, despite also being clocked higher. The Westmere clocks at 2.93 GHz with turbo mode. This code is still not optimal, because split 8 byte loads are used for some of the vectors. Adding the vector aligned pragma leads to optimal code, as in the sketch below. Now Westmere reaches the theoretical limit of 3 cycles per SIMD update, resulting in 2.93 GHz / 3 cycles * 4 flops = 3.9 GFlops. While SNB also improves, it is still far away from its optimal performance.
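
For reference, the aligned variant looks like this. The pragma is specific to the Intel compiler; with other compilers you would need a different mechanism, and the arrays must actually be allocated with suitable alignment for the hint to be legal:

static void triad_aligned(double *A, const double *B, const double *C, int size)
{
#pragma vector aligned
    for (int i = 0; i < size; i++) {
        A[i] = B[i] + C[i] * C[i];
    }
}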



You can see that AVX brings an additional boost on SNB. This is not obvious, because a purely load/store throughput limited code like the triad should not profit from the present implementation of AVX on SNB, as the load/store capabilities are the same. It turns out that a specific detail of the scheduler prevents the 128-bit SSE code from reaching the full throughput of SNB. Each data move instruction consists of an address generation part and a data part. The load scheduler ports (2 and 3) are also used for the address generation of stores, while the actual store data port is port 4. This means that a store blocks two ports (4 and one of 2/3), and on ports 2/3 it competes with possible loads. As a result, 128-bit code cannot reach the full throughput. With AVX code this is possible, as it appears that one 256-bit load occupies its port for two cycles but uses only one uop. This means that in the second cycle the port is free and the address generation uop of the store can be scheduled there. This is also described in the very good microarchitecture manual from Agner Fog (update 29-02-2012, page 100).

Tuesday, February 7, 2012

Raw arithmetic SIMD Instruction throughput on Interlagos and SandyBridge

The arithmetic performance of modern processors is, apart from task parallelism, generated mainly by data parallel instructions. There exists a zoo of x86 instruction set extensions for this: SSEx, AVX, FMA3, FMA4. This article gives a short overview of raw SIMD instruction throughput on the recent AMD Interlagos and Intel SandyBridge processors. Of course a short article cannot cover the topic in full depth, so for more detailed tips and tricks please refer to the optimization manuals released by the vendors.

Test machines are an AMD Opteron (trademark) 6276 and an Intel Xeon (trademark) E31280. The methodology is simple: create small loop kernels in likwid-bench with different instruction and operation mixes and measure the CPI. To keep things simple we restrict ourselves to the double precision case; this makes no difference with regard to instruction throughput. CPI stands for cycles per instruction and tells you how many cycles a processor core needs on average for one instruction. If, e.g., a processor can execute four instructions per cycle, the optimal CPI on this processor is 0.25. The ability to schedule multiple instructions of an otherwise sequential instruction stream in one cycle is also referred to as superscalar execution.
The following illustration shows the FPU capabilities of both processors:








Interlagos has two 128-bit wide fused multiply-accumulate units. This means Interlagos is superscalar for any mix of operations, while SandyBridge is only superscalar for an equal multiply/add instruction mix. In contrast to Interlagos, SandyBridge has 256-bit wide execution units. For AVX code and a multiply/add operation mix SandyBridge is superscalar while Interlagos is not. Let's look at the results for different instruction mixes:



Blue is equal, green is better and red is worse. Interlagos has advantages with pure add or pure mult SSE code.
For AVX multadd SandyBridge is better, while for pure AVX add and mult both are equal, with SandyBridge being slightly more efficient. On paper FMA should perform similarly to the SandyBridge AVX variant. Still, the code is very dense in this case and it is more difficult to get efficient results. For one core the result was not as expected; using both cores sharing an FPU showed better performance. This is not as efficient as possible, but still the best achievable on Interlagos with regard to MFlops/s. A sketch of the two cases follows below.
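
To make the instruction mixes more concrete, here is a hypothetical intrinsics sketch of the two cases (the kernels in likwid-bench are written in assembly; this is only an illustration, compiled with -mavx and -mfma4 on gcc):

#include <immintrin.h>   /* AVX intrinsics */
#include <x86intrin.h>   /* FMA4 intrinsics with gcc, needs -mfma4 */

/* multiply/add mix as separate instructions: the case where SandyBridge
   can use both of its 256-bit units in the same cycle */
static __m256d avx_multadd(__m256d a, __m256d b, __m256d c)
{
    return _mm256_add_pd(a, _mm256_mul_pd(b, c));
}

/* fused multiply-accumulate (FMA4 on Interlagos): one instruction
   computes b*c + a, so the same work needs half the instructions */
static __m256d fma4_multadd(__m256d a, __m256d b, __m256d c)
{
    return _mm256_macc_pd(b, c, a);
}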

For completeness the results in terms of arithmetic FP instruction throughput:



Please note that this only reflects a single aspect of the processor microarchitecture and is not representative of the overall processor performance. Still, I hope this clears up some uncertainties with regard to SIMD arithmetic instruction throughput.

Friday, February 3, 2012

Intel SandyBridge and counting the flops

In HPC it is all about the flops. While this view can be criticized, it is still handy to measure the MFlop rate with hardware performance counters. This is a derived metric which computes the number of floating point operations over a measured time period.

While AMD has counters which measure floating point operations (not instructions), on Intel processors you can only measure how many floating point instructions were executed. You have to distinguish whether a packed instruction (with multiple operations per instruction) or a scalar one was used, and depending on the type (double or float) you can then compute the number of operations.

On Core 2 the relevant events are (taken out of the likwid event file):


EVENT_SIMD_COMP_INST_RETIRED     0xCA   PMC
UMASK_SIMD_COMP_INST_RETIRED_PACKED_SINGLE     0x01
UMASK_SIMD_COMP_INST_RETIRED_SCALAR_SINGLE     0x02
UMASK_SIMD_COMP_INST_RETIRED_PACKED_DOUBLE     0x04
UMASK_SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE     0x08

This is a retired event and hence produces very accurate results.
This changed with Nehalem and Westmere. On Westmere, for example, the relevant events are:

EVENT_FP_COMP_OPS_EXE            0x10   PMC
UMASK_FP_COMP_OPS_EXE_SSE2_INTEGER             0x08
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED            0x10
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR            0x20
UMASK_FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION     0x40
UMASK_FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION     0x80

There are two problems with this. First, you cannot measure mixed precision performance, because the packed/scalar and single/double information ends up in separate, unconnected events, and therefore it is impossible to compute the ratios between packed/scalar and double/float. Second, the event is now an executed event instead of a retired one. This results in a slight overcount due to speculative execution (up to 5%).

So what does this look like on SandyBridge?

EVENT_FP_COMP_OPS_EXE            0x10   PMC
UMASK_FP_COMP_OPS_EXE_X87       0x01
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE     0x10
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE     0x20
UMASK_FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE     0x40
UMASK_FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE     0x80

That looks good: we again have combined events (e.g. packed double) as on Core 2. It is still an executed event, but as long as the overcount stays within acceptable limits this is OK.
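
The derived FLOPS_DP metric then boils down to weighting the events with the number of operations per instruction; as a sketch for the SSE case (a packed double SSE instruction carries two operations, 256-bit AVX instructions are counted by separate events not shown here):

/* sketch: double precision MFlops/s from the SandyBridge events above */
static double mflops_dp(double packed_double, double scalar_double,
                        double runtime_seconds)
{
    return 1.0e-6 * (2.0 * packed_double + scalar_double) / runtime_seconds;
}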

Validation against likwid-bench using the marker API shows the following:


The graph shows the performance in MFlops/s of the stream benchmark as it is part of likwid-bench. It is fully vectorized using SSE instructions, and we consider sequential execution. The red points are the results computed by likwid-bench from accurate runtime measurements and known flop counts; this is the correct flop value. The blue line is the value computed from the hardware performance counters. I execute this benchmark multiple times with different data set sizes (L1, L2, L3, Memory). These are the steps you can see.

The good news is that it is accurate in the L1 cache. The bad news is that it overcounts as soon as the data comes from higher cache levels. I also tried a different variant using RISC style code which separates the loads from the arithmetic instructions, with the same result. I can only guess, but it appears the event is triggered speculatively while waiting for data to arrive.

The result for the triad benchmark is slightly different:

Here it overcounts less. The difference in the triad is that it has one more load stream (A=B+C*D). I have to investigate this further. Still, at the moment you should assume that the Flops groups on SandyBridge may (very likely) produce too large results as soon as the data does not come from L1. I have not yet found an erratum with regard to this.

Thursday, January 26, 2012

Current status likwid tool suite

As you may have noticed, the last release was already some time ago. I added a lot of things and realized that for the next release I will not get away without automated testing. So it will take some more time.

Still, to keep you informed, I reactivated this blog. Please let me know what you like or dislike about likwid and what new features or tools would be useful for you.

First a notice if you want to look into the mercurial repository: the trunk is not the head ;-). My actual head is in branch v2. The reason is that the trunk was prepared for multi OS support, but I have not yet merged my other head, which moved ahead in the meantime. This is still on the todo list, but at the moment, if you want to browse the new things in the repository, you must look at branch v2.

Things I added in my current developer version:

* Support for Intel SandyBridge
* Support for AMD Interlagos
* likwid-powermeter: A new tool for querying turbo mode capabilities and measuring energy on Sandy Bridge CPUs. The power counters are also available as additional events in likwid-perfctr.
* Extended support for the EX type Xeon Uncore
* test suite for accuracy of performance groups
* and many more things

I will keep you informed here about the further steps.