Wednesday, June 13, 2012

Tutorial how to measure energy consumption on Sandy Bridge processors

This small study illustrates how to use likwid-perfctr to measure energy consumption using the RAPL counter on Intel Sandy Bridge Processors. You have to set up likwid to access the msr device files.

We will use the famous STREAM triad benchmark for this study. The STREAM triad performs the operation A[i]=B[i]+s*C[i]. For easy measurement we will use likwid-bench which already includes different variants of STREAM triad out of the box. We want to measure the variations of energy to solution for different frequencies within one ccNUMA domain. The STREAM triad benchmark is regarded to be solely limited by main memory bandwidth. We start out doing a scaling study on a SandyBridge system running with a nominal clock of 2.7 GHz. If you have set the power governor to performance you have to be careful as the clock can be different from this due to turbo mode. We first will get the turbo mode steps using likwid-powermeter. This is what the output looks like on our system:


$ likwid-powermeter -i

-------------------------------------------------------------
CPU name: Intel Core SandyBridge EP processor 
CPU clock: 2.69 GHz 
-------------------------------------------------------------
Base clock: 2700.00 MHz 
Minimal clock: 1200.00 MHz 
Turbo Boost Steps:
C1 3500.00 MHz 
C2 3500.00 MHz 
C3 3400.00 MHz 
C4 3200.00 MHz 
C5 3200.00 MHz 
C6 3200.00 MHz 
C7 3100.00 MHz 
C8 3100.00 MHz 
-------------------------------------------------------------
Thermal Spec Power: 130 Watts 
Minimum  Power: 51 Watts 
Maximum  Power: 130 Watts 
Maximum  Time Window: 0.398438 micro sec 
-------------------------------------------------------------


For our benchmark runs we want to verify the actual clock. To do so you can enable the instrumentation available in likwid-bench. If you compile likwid with gcc you have to uncomment the following line in the file include_GCC.mk:



DEFINES  += -DPERFMON


Then rebuild everything. To use the instrumentation you have to call likwid-bench together with likwid-perfctr. While you can measure energy with likwid-powermeter, it is more convenient to use likwid-perfctr as it is much more flexible in different modes. Also you can correlate your energy measurements with other hardware performance counter data. All things we are interested in are in the ENERGY group available on SandyBridge processors.The following example will run a AVX version of the STREAM triad on all 8 cores of socket 1 with the data set being 1GB large. likwid-perfctr will setup the counters to measure the ENERGY group. To use the instrumented regions you have to specify the -m option.


$./likwid-perfctr -c S1:0-3 -g ENERGY -m ./likwid-bench  -g 1 -i 50 -t stream_avx  -w S1:1GB:4
=====================
Region: bench 
=====================
+-------------------+---------+---------+---------+---------+
|    Region Info    | core 8  | core 9  | core 10 | core 11 |
+-------------------+---------+---------+---------+---------+
| RDTSC Runtime [s] | 2.03051 | 2.03034 | 2.03045 | 2.03027 |
|    call count     |    1    |    1    |    1    |    1    |
+-------------------+---------+---------+---------+---------+
+-----------------------+-------------+-------------+-------------+-------------+
|         Event         |   core 8    |   core 9    |   core 10   |   core 11   |
+-----------------------+-------------+-------------+-------------+-------------+
|   INSTR_RETIRED_ANY   | 6.38913e+08 | 6.34148e+08 | 6.32104e+08 | 6.29149e+08 |
| CPU_CLK_UNHALTED_CORE | 6.45153e+09 | 6.45008e+09 | 6.45041e+09 | 6.4504e+09  |
| CPU_CLK_UNHALTED_REF  | 5.44346e+09 | 5.44225e+09 | 5.44251e+09 | 5.44253e+09 |
|    PWR_PKG_ENERGY     |   146.273   |      0      |      0      |      0      |
+-----------------------+-------------+-------------+-------------+-------------+
+-------------------+---------+---------+---------+---------+
|      Metric       | core 8  | core 9  | core 10 | core 11 |
+-------------------+---------+---------+---------+---------+
|    Runtime [s]    | 2.39535 | 2.39481 | 2.39494 | 2.39493 |
| Runtime rdtsc [s] | 2.03051 | 2.03051 | 2.03051 | 2.03051 |
|    Clock [MHz]    | 3192.14 | 3192.13 | 3192.14 | 3192.12 |
|        CPI        | 10.0977 | 10.1713 | 10.2047 | 10.2526 |
|    Energy [J]     |   146   |    0    |    0    |    0    |
|     Power [W]     | 71.9031 |    0    |    0    |    0    |
+-------------------+---------+---------+---------+---------+
+------------------------+---------+---------+---------+---------+
|         Metric         |   Sum   |   Max   |   Min   |   Avg   |
+------------------------+---------+---------+---------+---------+
|    Runtime [s] STAT    | 9.58003 | 2.39535 | 2.39481 | 2.39501 |
| Runtime rdtsc [s] STAT | 8.12204 | 2.03051 | 2.03051 | 2.03051 |
|    Clock [MHz] STAT    | 12768.5 | 3192.14 | 3192.12 | 3192.13 |
|        CPI STAT        | 40.7262 | 10.2526 | 10.0977 | 10.1815 |
|    Energy [J] STAT     |   146   |   146   |    0    |  36.5   |
|     Power [W] STAT     | 71.9031 | 71.9031 |    0    | 17.9758 |
+------------------------+---------+---------+---------+---------+

As you can see the ENERGY group measures among other things the cycles and instructions executed and will output the derived metrics CPI, Clock, Energy in Joule and Power in Watt. For us the interesting parts are the Clock and the Energy. The Energy counter can only measured per package, therefore you only see one result in the table. For the statistics table not all columns make sense, still I guess you can figure out yourself where this can be applied. All cores manage to overclock up to the 3.2 GHz as predicted by likwid-powermeter.

So I did this for various settings to get something like that:

Performance scaling of STREAM triad



You can see the higher clocked variants saturate around 3 cores while the low clocked runs need 4 cores to saturate. Also notable is that the bandwidth decreases with lower core frequency. If you repeat this using a variant of STREAM triad employing so called non temporal stores (triad_mem in likwid-bench) this changes to:
Performance scaling  of STREAM triad with NT stores

Now all requencies need  4 cores to saturate the bandwidth. Well what about energy to solution?
Energy to solution for STREAM triad

Energy to solution for STREAM triad with NT stores

As you can see without NT stores the lowest Energy is reached using 1.6GHz and 4 cores. With NT stores the optimum is either also 1.6GHz or 1.2 GHz with 8 cores. And it is notable that with 8 cores there is a factor of two saving in Energy to solution between running with Turbo mode against 1.2GHz fixed frequency for this  main memory limited benchmark. Of course also the performance is slightly lower with 1.2 GHz, still of course not a factor of two but around 20%.  You can figure out interesting things using likwid-bench with instrumentation turned on and likwid-perfctr. 

No comments:

Post a Comment