Wednesday, June 20, 2012

Validation of performance groups on AMD Interlagos

As I fortunately have access to an AMD Interlagos server system again, I was able to review the groups and events. I also validated some of the performance groups with the new accuracy test suite. It uses two variants of likwid-bench: one is plain and computes performance from the known flop and byte counts of the kernel; the other is instrumented using the LIKWID marker API. I run a variety of test cases with different benchmark types and data set sizes and compare the results of the two variants. At this time we focus on serial measurements, with multiple repetitions per data set size. The data set sizes (from left to right in the plots) fit in L1 (12kB), L2 (1MB), L3 (4MB) and MEM (1GB). The red curve is the result output by likwid-bench (the performance as measured by the application). The blue curve is the performance measured with likwid-perfctr based on hardware performance monitoring.
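
For reference, the instrumented variant brackets the measured kernel with marker calls. The following is only a minimal sketch using the C marker API declared in likwid.h (function names may differ slightly between LIKWID releases, so check the header of your version); the program is then run under likwid-perfctr with the -m option:

#include <likwid.h>

/* Minimal sketch of a region instrumented with the LIKWID marker API.
 * likwid-perfctr programs the counters; the markers only delimit the
 * code whose counts are attributed to the region. Link with -llikwid. */
int main(void)
{
    likwid_markerInit();                /* initialize once per process */

    likwid_markerStartRegion("bench");
    /* ... kernel to be measured ... */
    likwid_markerStopRegion("bench");

    likwid_markerClose();               /* write results for likwid-perfctr -m */
    return 0;
}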

The test system is an AMD Opteron(TM) Processor 6276 in a dual-socket server system.
The new and updated groups are available in the upcoming LIKWID release.

The following performance groups are supported:

  • BRANCH: Branch prediction miss rate/ratio
  • CACHE: Data cache miss rate/ratio
  • CPI: Cycles per instruction
  • DATA: Load to store ratio
  • FLOPS_DP: Double Precision MFlops/s
  • FLOPS_SP: Single Precision MFlops/s
  • FPU_EXCEPTION: Floating point exceptions
  • ICACHE: Instruction cache miss rate/ratio
  • L2: L2 cache bandwidth in MBytes/s
  • L2CACHE: L2 cache miss rate/ratio
  • L3: L3 cache bandwidth in MBytes/s
  • L3CACHE: L3 cache miss rate/ratio
  • LINKS: Bandwidth on the HyperTransport links
  • MEM: Main memory bandwidth in MBytes/s
  • NUMA: Read/write events between the ccNUMA nodes

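You can list the performance groups your installed version actually provides with the -a switch of likwid-perfctr:

$ likwid-perfctr -a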

The first tests cover the FLOPS_DP and FLOPS_SP groups.
Both groups are based on the RETIRED_FLOPS event with different umasks.


As can be seen, this event provides a very accurate flop count, independent of the arithmetic instructions used. The absolute performance does not matter here, as it varies with the data set size; what counts is that both curves agree.
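
As an illustration, a single validation point can be reproduced by wrapping an instrumented likwid-bench run with likwid-perfctr. The kernel name, working set size and iteration count below are only an example; likwid-bench -a lists the available kernels:

$ likwid-perfctr -c 0 -g FLOPS_DP -m ./likwid-bench -g 1 -i 100 -t stream -w S0:12kB:1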

Next are the cache bandwidth groups L2 and L3. The L2 group is based on the DATA_CACHE_REFILLS_ALL event, while the L3 group uses the L2_FILL_WB_FILL and L2_FILL_WB_WB events.
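
If you want to see which events and derived metrics a group programs on your machine, likwid-perfctr can print extended help for a group:

$ likwid-perfctr -H -g L2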

As can be seen, the results for the L2 cache are very accurate. Because the L2 cache is inclusive and the L1 is write-through, the measured bandwidths are the same as the bandwidth seen by your application.

Here, too, the results are very accurate. Still, because the L3 cache is write-allocate and exclusive with regard to the L2, the measured bandwidths are in all cases twice the bandwidth seen by your application: every cache line refilled into the L2 is accompanied by a victim line written back from the L2 into the L3.

Last but not least, the MEM group. Again the results are very accurate. All benchmarks involving a store cause an additional write-allocate transfer, resulting in a higher measured bandwidth than the bandwidth seen by your application.
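
The size of this effect follows from counting the data streams. For the STREAM triad A[i]=B[i]+s*C[i], for example:

application: load B, load C, store A                   -> 3 streams
hardware:    load B, load C, write-allocate A, store A -> 4 streams

so the measured bandwidth is roughly 4/3 of the bandwidth the application sees.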

The performance groups now work fine on Interlagos. The updated groups shown here will be part of the upcoming LIKWID release.

Wednesday, June 13, 2012

Tutorial: How to measure energy consumption on Sandy Bridge processors

This small study illustrates how to use likwid-perfctr to measure energy consumption with the RAPL counters on Intel Sandy Bridge processors. You have to set up likwid so that it can access the msr device files.

We will use the famous STREAM triad benchmark for this study. The STREAM triad performs the operation A[i]=B[i]+s*C[i]. For easy measurement we use likwid-bench, which already includes different variants of the STREAM triad out of the box. We want to measure the variation of energy to solution for different frequencies within one ccNUMA domain. The STREAM triad is commonly regarded as being limited solely by main memory bandwidth. We start out with a scaling study on a Sandy Bridge system with a nominal clock of 2.7 GHz. If you have set the power governor to performance you have to be careful, as the actual clock can differ from the nominal one due to turbo mode. We therefore first query the turbo mode steps using likwid-powermeter. This is what the output looks like on our system:


$ likwid-powermeter -i

-------------------------------------------------------------
CPU name: Intel Core SandyBridge EP processor 
CPU clock: 2.69 GHz 
-------------------------------------------------------------
Base clock: 2700.00 MHz 
Minimal clock: 1200.00 MHz 
Turbo Boost Steps:
C1 3500.00 MHz 
C2 3500.00 MHz 
C3 3400.00 MHz 
C4 3200.00 MHz 
C5 3200.00 MHz 
C6 3200.00 MHz 
C7 3100.00 MHz 
C8 3100.00 MHz 
-------------------------------------------------------------
Thermal Spec Power: 130 Watts 
Minimum  Power: 51 Watts 
Maximum  Power: 130 Watts 
Maximum  Time Window: 0.398438 micro sec 
-------------------------------------------------------------


For our benchmark runs we want to verify the actual clock. To do so you can enable the instrumentation available in likwid-bench. If you compile likwid with gcc you have to uncomment the following line in the file include_GCC.mk:

DEFINES  += -DPERFMON


Then rebuild everything. To use the instrumentation you call likwid-bench together with likwid-perfctr. While you can also measure energy with likwid-powermeter, it is more convenient to use likwid-perfctr, as it is much more flexible in its different modes and lets you correlate the energy measurements with other hardware performance counter data. Everything we are interested in is contained in the ENERGY group available on Sandy Bridge processors. The following example runs an AVX version of the STREAM triad on four cores of socket 1 with a 1 GB data set. likwid-perfctr will set up the counters to measure the ENERGY group. To use the instrumented regions you have to specify the -m option.


$ ./likwid-perfctr -c S1:0-3 -g ENERGY -m ./likwid-bench -g 1 -i 50 -t stream_avx -w S1:1GB:4
=====================
Region: bench 
=====================
+-------------------+---------+---------+---------+---------+
|    Region Info    | core 8  | core 9  | core 10 | core 11 |
+-------------------+---------+---------+---------+---------+
| RDTSC Runtime [s] | 2.03051 | 2.03034 | 2.03045 | 2.03027 |
|    call count     |    1    |    1    |    1    |    1    |
+-------------------+---------+---------+---------+---------+
+-----------------------+-------------+-------------+-------------+-------------+
|         Event         |   core 8    |   core 9    |   core 10   |   core 11   |
+-----------------------+-------------+-------------+-------------+-------------+
|   INSTR_RETIRED_ANY   | 6.38913e+08 | 6.34148e+08 | 6.32104e+08 | 6.29149e+08 |
| CPU_CLK_UNHALTED_CORE | 6.45153e+09 | 6.45008e+09 | 6.45041e+09 | 6.4504e+09  |
| CPU_CLK_UNHALTED_REF  | 5.44346e+09 | 5.44225e+09 | 5.44251e+09 | 5.44253e+09 |
|    PWR_PKG_ENERGY     |   146.273   |      0      |      0      |      0      |
+-----------------------+-------------+-------------+-------------+-------------+
+-------------------+---------+---------+---------+---------+
|      Metric       | core 8  | core 9  | core 10 | core 11 |
+-------------------+---------+---------+---------+---------+
|    Runtime [s]    | 2.39535 | 2.39481 | 2.39494 | 2.39493 |
| Runtime rdtsc [s] | 2.03051 | 2.03051 | 2.03051 | 2.03051 |
|    Clock [MHz]    | 3192.14 | 3192.13 | 3192.14 | 3192.12 |
|        CPI        | 10.0977 | 10.1713 | 10.2047 | 10.2526 |
|    Energy [J]     |   146   |    0    |    0    |    0    |
|     Power [W]     | 71.9031 |    0    |    0    |    0    |
+-------------------+---------+---------+---------+---------+
+------------------------+---------+---------+---------+---------+
|         Metric         |   Sum   |   Max   |   Min   |   Avg   |
+------------------------+---------+---------+---------+---------+
|    Runtime [s] STAT    | 9.58003 | 2.39535 | 2.39481 | 2.39501 |
| Runtime rdtsc [s] STAT | 8.12204 | 2.03051 | 2.03051 | 2.03051 |
|    Clock [MHz] STAT    | 12768.5 | 3192.14 | 3192.12 | 3192.13 |
|        CPI STAT        | 40.7262 | 10.2526 | 10.0977 | 10.1815 |
|    Energy [J] STAT     |   146   |   146   |    0    |  36.5   |
|     Power [W] STAT     | 71.9031 | 71.9031 |    0    | 17.9758 |
+------------------------+---------+---------+---------+---------+

As you can see, the ENERGY group measures, among other things, the cycles and instructions executed and outputs the derived metrics CPI, Clock, Energy in Joule and Power in Watt. For us the interesting parts are the Clock and the Energy. The energy counter can only be measured per package, which is why you see only one result in the table. For the statistics table not all columns make sense, but I guess you can figure out yourself where they apply. All cores clock up to 3.2 GHz, matching the four-core turbo step predicted by likwid-powermeter.
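
The Power metric is simply the package energy divided by the region runtime, which you can verify from the table above:

Power = Energy / Runtime (RDTSC) = 146 J / 2.03051 s = 71.9 W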

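To fix the core frequency for such runs you can, for example, use cpufreq-set from the cpufrequtils package (an assumption on my side: the userspace governor must be available; cores 8-15 correspond to socket 1 on this machine, and 1.6GHz is just one of the settings used below):

$ for c in $(seq 8 15); do cpufreq-set -c $c -g userspace; cpufreq-set -c $c -f 1.6GHz; done
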
So I did this for various frequency settings and core counts to get the following:

Performance scaling of STREAM triad

You can see that the higher-clocked variants saturate at around 3 cores, while the lower-clocked runs need 4 cores to saturate. Also notable is that the saturated bandwidth decreases with lower core frequency. If you repeat this using a variant of the STREAM triad employing so-called non-temporal stores (triad_mem in likwid-bench), the picture changes to:
Performance scaling  of STREAM triad with NT stores

Now all frequencies need 4 cores to saturate the bandwidth. So what about energy to solution?
Energy to solution for STREAM triad

Energy to solution for STREAM triad with NT stores

As you can see, without NT stores the lowest energy to solution is reached at 1.6 GHz with 4 cores. With NT stores the optimum is either also 1.6 GHz, or 1.2 GHz with 8 cores. It is notable that with 8 cores there is a factor of two saving in energy to solution between running in turbo mode and running at a fixed 1.2 GHz for this main-memory-limited benchmark. Of course the performance at 1.2 GHz is also slightly lower, but only by around 20%, not by a factor of two: since energy is power times runtime, the modestly longer runtime is far outweighed by the much lower package power. You can figure out many interesting things using likwid-bench with instrumentation turned on together with likwid-perfctr.