Wednesday, June 4, 2014

LIKWID, the capabilities system and setuid root

The LIKWID tools provide important measuring and management functionality that accesses hardware registers and privileged sections of the operating system.
Like other system tools that open, manipulate and close operating system features, they need special permissions to perform these operations.
The common way used to be setuid root applications, which are allowed to change their uid to root at runtime. Examples of such applications are ping, (un)mount and su.

But there is also another way to switch privileges between the two user groups, root and non-root. The Linux kernel has supported capabilities for processes since version 2.2. When a process is forked it inherits the capabilities of its parent. The parent process could then restrict the scope of a child process by removing capabilities from the child's set. With version 2.6.24 full support for the capabilities feature was implemented: executable files can be associated with sets of capability flags for fine-grained permission management. The management of these sets is rather complicated, however, and the names of the three sets per executable are misleading. The sets are called permitted, effective and inheritable.

  • The effective set contains the capabilities that are currently active for the executable and that are checked when it triggers system calls.
  • The permitted set is a superset of the effective set and specifies the flags that can maximally be enabled for the executable.
  • The inheritable set is used at fork time to determine which flags are set in the child process's permitted set.
For a more detailed and possibly more correct explanation of capabilities, see the manpage (man 7 capabilities or online).
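
To see these sets in practice, a process can read its own capability bitmasks from /proc/self/status (the lines CapInh, CapPrm and CapEff). The following C sketch is a hypothetical helper, not code from LIKWID, that extracts one of these hex masks:

```c
#include <stdio.h>
#include <string.h>

/* Sketch: read a capability bitmask (e.g. "CapEff") from a
   /proc/<pid>/status style file. Returns 0 if the line is missing,
   which is also a legal empty set -- good enough for an illustration. */
static unsigned long long read_cap_set(const char *status_path,
                                       const char *set_name)
{
    char line[256];
    unsigned long long mask = 0;
    FILE *f = fopen(status_path, "r");
    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        /* status lines look like "CapEff:\t0000000000001000" */
        if (strncmp(line, set_name, strlen(set_name)) == 0 &&
            line[strlen(set_name)] == ':') {
            sscanf(line + strlen(set_name) + 1, "%llx", &mask);
            break;
        }
    }
    fclose(f);
    return mask;
}
```

Called as read_cap_set("/proc/self/status", "CapEff"), it returns the effective set of the current process; the bit for CAP_SYS_RAWIO (capability number 17 on Linux) is the one relevant below.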

So how does this affect the LIKWID tools?
For measuring performance counters with likwid-perfctr we need permission to read and write the MSR registers. Since the related assembler instructions RDMSR and WRMSR are privileged operations, they can only be executed in kernel space. User space applications can access the registers through a kernel module that exports the functionality via device files.
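
As an illustration of how this device file interface works, here is a minimal C sketch (not LIKWID's actual code): the MSR number serves as the file offset and the value is an 8-byte read.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Minimal sketch of reading an MSR through the msr device file.
   The register number serves as the file offset; every MSR is 8 bytes.
   Real code adds error reporting and write support via pwrite. */
static int read_msr(const char *path, uint32_t reg, uint64_t *value)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t ret = pread(fd, value, sizeof(*value), (off_t)reg);
    close(fd);
    return (ret == (ssize_t)sizeof(*value)) ? 0 : -1;
}
```

A call like read_msr("/dev/cpu/0/msr", 0x10, &tsc) would read the time stamp counter MSR of core 0, provided the process has the permissions discussed below.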

Access Daemon

LIKWID therefore offers a daemon application that forwards the read and write requests to the MSR registers with higher privileges. To gain these privileges the daemon must be owned by root and needs permission to set its uid to root. The common way was:

$ sudo chown root:root likwid-accessD
$ sudo chmod u+s likwid-accessD


This still works, but from a security point of view it gives LIKWID's access daemon more privileges than it needs. One advantage is that the permissions of the MSR device files do not need to be modified, because the daemon accesses the files as the root user.

The alternative is to give the daemon the capability to read and write the MSR device files. The commands to do this:

$ sudo chmod 666 /dev/cpu/*/msr
$ sudo setcap cap_sys_rawio+ep likwid-accessD


Changing the file permissions is mandatory because the daemon is still executed with the uid of the user, only with more capabilities. This only works until the next reboot because the file permissions are not persistent. You can use udev to set them every time the MSR kernel module is loaded:

<must be done by root>
$ cat > /etc/udev/rules.d/99-persistent-msr.rules <<EOF
> KERNEL=="msr*", MODE="0666"
> EOF


Alternatively, one can avoid setting read and write permissions for _others_ on the device files: create a likwid group for all LIKWID users and assign it to the MSR device files. To achieve this, add GROUP="<likwid-group>" to the rule and adjust the MODE string accordingly.
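
Put together, the group-based setup could look like the following configuration (the group name likwid and the placeholder user are examples; everything must be run as root):

```
$ groupadd likwid
$ usermod -aG likwid <user>
$ cat > /etc/udev/rules.d/99-persistent-msr.rules <<EOF
> KERNEL=="msr*", MODE="0660", GROUP="likwid"
> EOF
```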

All commands were tested on an Ubuntu 12.04.4 LTS system with Haswell CPUs.
Since the common Haswell CPU does not have Uncore support, I wanted to test the same procedures on a SandyBridge EP system with SuSE Linux Enterprise Server 11, patchlevel 3. The setuid root way works there, but setting capabilities on the daemon is not enough: the system forbids access to the MSR device files even if the file permissions are valid. The setuid root method has another advantage and may even be mandatory for accessing the Uncore counters. Those counters are mapped into the PCI address space and are accessed through PCI device files. The current documentation of capabilities does not mention a flag that could permit these accesses; the cap_sys_rawio flag used above only enables MSR device file access.

Consequently, since the method works for both operating systems and Uncore counters, we recommend using the setuid root method for LIKWID's access daemon.

One might ask: if we need this anyway, why not apply these procedures to the likwid-perfctr executable directly? When building LIKWID with a static library, both the setuid root method and the capabilities method work. But when LIKWID is linked against the shared library, another problem arises. When an application has the setuid root bit set, some environment variables are ignored for security reasons. One of these variables is LD_LIBRARY_PATH, so the application cannot find its library anymore.
The capabilities method does not ignore those variables. Therefore an application linked against the shared library should either use the capabilities method, be built with static libraries, or be built with a static search path for the library. A static search path can be added to LIKWID like this:

$ vi make/include_<Compiler>.mk
<add rpath to SHARED_CFLAGS and SHARED_LFLAGS>
SHARED_CFLAGS = -fpic -Wl,-rpath=$(PREFIX)/lib
SHARED_LFLAGS = -shared -Wl,-rpath=$(PREFIX)/lib

Sometimes setting capabilities is not possible because the underlying file system does not support extended attributes. You can check the kernel configuration with:

$ zcat /proc/config.gz | grep FS_XATTR
or
$ cat /boot/config-<kernel-version> | grep FS_XATTR

For EXT3 you also need to set the EXT3_FS_SECURITY kernel option to enable storing capabilities.
For kernels with a version only slightly higher than 2.6.24, the first kernel supporting file capabilities, TMPFS does not support extended attributes. One example of such an operating system is CentOS 6.5.

Frequency Daemon

Starting with version 3.1.2, LIKWID also contains an executable to manipulate the frequency of CPU cores. The following commands enable setting the scaling governor and the frequency for each CPU core on the Ubuntu 12.04.4 LTS system:

$ sudo setcap cap_sys_rawio+ep likwid-setFreq
$ cat > /etc/udev/rules.d/99-persistent-setFreq.rules <<EOF
> KERNEL=="cpu*", RUN+="/bin/chmod 0666 /sys/devices/system/cpu/cpu%n/cpufreq/scaling_governor"
> KERNEL=="cpu*", RUN+="/bin/chmod 0666 /sys/devices/system/cpu/cpu%n/cpufreq/scaling_setspeed"
> EOF


On the SuSE Linux Enterprise Server system it is also sufficient to set the capabilities, so the capabilities method is preferable for the likwid-setFreq daemon.
For completeness, the setuid root method for the frequency daemon:

$ sudo chown root:root likwid-setFreq
$ sudo chmod u+s likwid-setFreq



Monday, March 17, 2014

Probing instruction throughput of recent Intel processors with likwid-bench - Part 1: Theory

Microarchitectures gradually improve over generations, and one thing that is addressed is instruction throughput in terms of superscalar execution. This article focuses on the development of the load/store units from the Intel Westmere through IvyBridge to Haswell microarchitectures. The following information is to the best of my knowledge; if you find any errors please let me know so I can correct them. The present article performs an architectural analysis of the STREAM triad kernel in order to predict the expected performance with the data located in the L1 cache on the considered architectures. In a subsequent article we will try to verify this prediction using the likwid-bench microbenchmarking tool.

likwid-bench is part of the LIKWID tool suite. It is an application that comes out of the box with many streaming-based test cases but also allows you to easily add more test cases in the form of simple text files. The application takes care of data set size, data placement and threading configuration. We choose the so-called STREAM triad as implemented in McCalpin's STREAM microbenchmark. This is simple C code for it (the data type is float, i.e. 32-bit floating point):

for (int i=0; i < N; i++)
{
    vecA[i] = vecB[i] + scalar * vecC[i];
}

One iteration requires two loads, one store, one multiply and one add operation. For Intel Westmere the fastest implementation uses packed SSE SIMD instructions. This is the resulting assembly code:

1:
movaps xmm0, [rdx + rax*4]
movaps xmm1, [rdx + rax*4+16]
movaps xmm2, [rdx + rax*4+32]
movaps xmm3, [rdx + rax*4+48]
mulps xmm0, xmm4
mulps xmm1, xmm4
mulps xmm2, xmm4
mulps xmm3, xmm4
addps xmm0, [rcx + rax*4]
addps xmm1, [rcx + rax*4+16]
addps xmm2, [rcx + rax*4+32]
addps xmm3, [rcx + rax*4+48]
movaps [rsi + rax*4] , xmm0
movaps [rsi + rax*4+16], xmm1
movaps [rsi + rax*4+32], xmm2
movaps [rsi + rax*4+48], xmm3
add rax, 16
cmp rax, rdi
jl 1b

Let's look at the theory first. Here is a schematic figure illustrating the instruction throughput capabilities of the Intel Westmere architecture. The architecture can issue and retire 4 uops per cycle. It is capable of executing either one load and one store, or a single load, or a single store per cycle. Loads and stores can each be up to 16 bytes wide. On the arithmetic side it can execute either one multiply and one add, or a single add, or a single multiply per cycle.

Schematic illustration of  instruction throughput capabilities for Intel Westmere

Throughout the article we will consider the cycles it takes to execute the loop iterations equivalent to one cache line (64 bytes). Because packed SSE SIMD is 16 bytes wide, this results in 4 SIMD iterations to update one cache line. The throughput bottleneck on this architecture is the load port, and the minimum time to execute one SIMD iteration is 2 cycles. Therefore we end up with 4 SIMD iterations x 2 cycles = 8 cycles to update one cache line. We can easily compute the lightspeed performance (let's assume a clock of 3 GHz) as 3 GHz / 2 cycles * 4 scalar iterations * 2 flops/iteration = 12 GFlops/s.

After Westmere, which was a so-called tick (incremental) update, Intel released a larger update with the SandyBridge processor. SandyBridge introduced the AVX SIMD instruction set extension with a 32-byte SIMD width instead of the previous 16 bytes with SSE. The resulting AVX kernel looks like the following:

1:
vmovaps ymm1, [rdx + rax*4]
vmovaps ymm2, [rdx + rax*4+32]
vmovaps ymm3, [rdx + rax*4+64]
vmovaps ymm4, [rdx + rax*4+96]
vmulps ymm1, ymm1, ymm5
vmulps ymm2, ymm2, ymm5
vmulps ymm3, ymm3, ymm5
vmulps ymm4, ymm4, ymm5
vaddps ymm1, ymm1, [rcx + rax*4]
vaddps ymm2, ymm2, [rcx + rax*4+32]
vaddps ymm3, ymm3, [rcx + rax*4+64]
vaddps ymm4, ymm4, [rcx + rax*4+96]
vmovaps [rsi + rax*4] , ymm1
vmovaps [rsi + rax*4+32], ymm2
vmovaps [rsi + rax*4+64], ymm3
vmovaps [rsi + rax*4+96], ymm4
add rax, 32
cmp rax, rdi
jl 1b

Because its successor IvyBridge performs similarly, we will only consider IvyBridge in this comparison. The following illustration again shows the important properties. The execution units are 32 bytes wide. This microarchitecture also adds a second load port, while the load/store units are still 16 bytes wide. The architecture can execute either one packed AVX load and half a packed AVX store, or one packed AVX load, or half a packed AVX store per cycle. For SSE code this suggests it can execute two packed SSE loads and one packed SSE store, or one packed SSE load and one packed SSE store, or one SSE store per cycle. But as can be seen in the illustration, the store (data) unit shares the address generation with the load units (indicated by AGU in the illustration). With an instruction mix of two SSE loads and one store, the store competes with the loads for ports 2 and 3. The maximum throughput therefore cannot be reached with SSE code. This does not apply to AVX: here a 32-byte load occupies only one port, so the other port can be used for the store address generation.

Schematic illustration of  instruction throughput capabilities for Intel Ivy Bridge
But what does that mean for the execution of our STREAM triad test case? Again the load/store units are the bottleneck. The two loads as well as the store take 2 cycles per AVX SIMD iteration. Because two AVX SIMD iterations are needed to update one cache line, we end up with 2 SIMD iterations x 2 cycles = 4 cycles. Lightspeed performance is then 3 GHz / 2 cycles * 8 scalar iterations * 2 flops/iteration = 24 GFlops/s.

The next microarchitecture, Haswell, was a larger update (a tock in Intel nomenclature). Haswell widens all load/store paths to 32 bytes. Moreover, the processor adds two additional ports, one of them with an address generation unit for the stores. Haswell can now issue up to 8 uops per cycle but is still limited to 4 uops retired per cycle. One could ask why: what is the point of stuffing more in at the top when no more can exit at the bottom? One explanation is that the 4 uops per cycle throughput of, e.g., the Westmere design could not be reached in practice; a CPI of 0.35-0.5 is the best you can expect. By issuing more instructions per cycle, the average instruction throughput gets closer to the theoretical limit of 4 uops per cycle. The following illustration shows the basic setup of Haswell.

Schematic illustration of  instruction throughput capabilities for Intel Haswell

Haswell supports AVX2, which adds features like gather and promotes most instructions to 32-byte wide execution. Haswell also has fused multiply-add instructions (FMA). There is a drawback here: a naive view suggests that Haswell, having two FMA units, should be able to execute either one add and one multiply, or two adds, or two multiplies per cycle. This holds for multiplies, which can execute on both FMA ports, but not for adds: floating point addition executes on only one port, so its throughput remains one instruction per cycle.

Again, let's look at the consequences these changes have for the throughput of the STREAM triad kernel. From the execution units alone it should be possible to issue all instructions of one SIMD iteration in one cycle. But because only 4 instructions can retire per cycle, this throughput cannot be reached. The situation improves with FMA instructions: only one instruction is necessary for the multiply-add, and we get away with 4 instructions for one SIMD iteration. This is the changed code using the FMA instruction:

1:
vmovaps ymm1, [rdx + rax*4]
vmovaps ymm2, [rdx + rax*4+32]
vmovaps ymm3, [rdx + rax*4+64]
vmovaps ymm4, [rdx + rax*4+96]
vfmadd213ps ymm1, ymm5, [rcx + rax*4]
vfmadd213ps ymm2, ymm5, [rcx + rax*4+32]
vfmadd213ps ymm3, ymm5, [rcx + rax*4+64]
vfmadd213ps ymm4, ymm5, [rcx + rax*4+96]
vmovaps [rsi+ rax*4] , ymm1
vmovaps [rsi+ rax*4+32], ymm2
vmovaps [rsi+ rax*4+64], ymm3
vmovaps [rsi+ rax*4+96], ymm4
add rax, 32
cmp rax, rdi
jl 1b

With this code one SIMD iteration should execute in just 1 cycle, ending up with 2 cycles to update a cache line. This is due to the doubled 32-byte data paths of the load/store units. Lightspeed performance is then 3 GHz / 1 cycle * 8 scalar iterations * 2 flops/iteration = 48 GFlops/s. So theoretically the L1 performance for the STREAM triad doubles from Westmere to IvyBridge and doubles again from IvyBridge to Haswell.
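
As a cross-check, the three lightspeed numbers follow from one small formula; this C sketch merely restates the arithmetic of the text (16 single precision updates per 64-byte cache line, 2 flops per update, and the assumed 3 GHz clock):

```c
/* Lightspeed estimate used throughout the text:
   clock * flops per cache line / cycles per cache line. */
static double lightspeed_gflops(double clock_ghz, double cycles_per_cacheline)
{
    const double updates_per_cacheline = 16.0; /* 64 byte / 4 byte float */
    const double flops_per_update = 2.0;       /* one multiply + one add */
    return clock_ghz * updates_per_cacheline * flops_per_update
           / cycles_per_cacheline;
}
```

With 8, 4 and 2 cycles per cache line this reproduces the 12, 24 and 48 GFlops/s quoted above for Westmere, IvyBridge and Haswell.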

In the next article we will try to confirm this prediction with likwid-bench.

Friday, July 20, 2012

Notes on the Sandy Bridge Uncore (part 1)

For those of you who have never heard of something called the Uncore on a processor: on recent chips, hardware performance monitoring (HPM) is different for the cores and for the part of the chip shared by multiple cores (called the Uncore). Before the Intel Nehalem processor there was no Uncore; hardware performance monitoring was limited to the cores. Even then, questions came up about how to measure the shared portions of the processor. Nehalem introduced the Uncore, the parts of the microarchitecture shared by multiple cores: the shared last level cache, the main memory controller, and the QPI bus connecting multiple sockets. The Uncore has its own hardware performance monitoring unit with eight counters per Uncore (i.e. per socket). For many tools which use sampling to relate hardware performance counter measurements to code lines this caused great headaches, as the Uncore measurements cannot be related to code executed on a specific core anymore.

The LIKWID tool suite has no problem with this, as it restricts itself to simple end-to-end measurements of hardware performance counter data. The relation between the measurement and your code is established by pinning the execution of the code to dedicated cores (which the tool can also do for you). If you think the Nehalem Uncore was complex, consider the EX type processors Intel introduced next. This new design brought a completely new Uncore, which is now a system on a chip (Uncore HPM manual: Intel document reference number 323535). In its first implementation this was very complex to program, with tons of MSR registers that needed to be programmed and a lot of dependencies and restrictions. The new mainstream server/desktop SandyBridge microarchitecture also uses this system-on-a-chip type Uncore design, but the implementation of the hardware performance monitoring was changed.

First I have to warn you: Intel is not very strict about consistency in naming. E.g., the naming of the MSR registers in the SDM manuals can differ from the naming used for the same MSR registers in documents written in other parts of the company (e.g. the Uncore manuals). This is unfortunately also true for the naming of the entities in the Uncore. The Uncore does not have one HPM unit anymore but a bunch of them. On NehalemEX and WestmereEX the different parts of the Uncore were called boxes: there were MBOXes (main memory controllers), CBOXes (last level cache segments) and a bunch of others. While the same types of boxes still exist in SNB, they are named differently now: e.g. the MBOXes are now called iMC and the CBOXes are called CBo's. In LIKWID I stick with the old naming, since I want to build on the code I implemented for the EX type processors.

The mapping is as follows:

  • Caching agent: SNB CBo, EX CBOX
  • Home agent: SNB HA, EX BBOX
  • Memory controller: SNB iMC, EX MBOX
  • Power control: SNB PCU, EX WBOX
  • QPI: SNB QPI, EX SBOX/RBOX
The complexity comes from the large number of places where you can measure something. You have one Uncore per socket, and each socket has one or multiple performance monitoring units of several types. E.g., there are four iMC units, one per memory channel, and each of those has four counters. So there are 2*4*4 = 32 different memory related counters on a two socket system.


Previously, hardware performance monitoring was controlled by writing and reading MSR registers (model specific registers). This was still true on EX type processors. Starting with SNB, the Uncore is partly programmed through the PCI bus address space. Some parts, e.g. the CBo boxes, are still programmed using MSR registers, but most of the units are now programmed with PCI config space registers.


I am no specialist on PCI buses, but for the practical part the operating system maps the PCI configuration space. For PCI this is 256 bytes per device, usually with 32-bit addressing. The device memory is organized as BUS / DEVICE / FUNCTION. The BUS is the socket in the HPM sense, or the other way round: there is one new BUS per socket in the system. The DEVICE is the HPM unit type (e.g. main memory box) and the FUNCTION is then the concrete HPM unit.

On a two socket SandyBridge-EP system there are the following devices (this is taken from LIKWID source):

typedef enum {
    PCI_R3QPI_DEVICE_LINK_0 = 0,
    PCI_R3QPI_DEVICE_LINK_1,
    PCI_R2PCIE_DEVICE,
    PCI_IMC_DEVICE_CH_0,
    PCI_IMC_DEVICE_CH_1,
    PCI_IMC_DEVICE_CH_2,
    PCI_IMC_DEVICE_CH_3,
    PCI_HA_DEVICE,
    PCI_QPI_DEVICE_PORT_0,
    PCI_QPI_DEVICE_PORT_1,
    PCI_QPI_MASK_DEVICE_PORT_0,
    PCI_QPI_MASK_DEVICE_PORT_1,
    PCI_QPI_MISC_DEVICE_PORT_0,
    PCI_QPI_MISC_DEVICE_PORT_1,
    MAX_NUM_DEVICES
} PciDeviceIndex;

static char* pci_DevicePath[MAX_NUM_DEVICES] = {
 "13.5", "13.6", "13.1", "10.0", "10.1", "10.4",
 "10.5", "0e.1", "08.2", "09.2", "08.6", "09.6",
 "08.0", "09.0" };

So e.g. the memory channel 1 (PCI_IMC_DEVICE_CH_1) on socket 0 is: BUS 0x7f  DEVICE 0x10 FUNCTION 0x1 .
The Linux OS maps this memory at different locations in the /sys and /proc filesystems. In LIKWID I use the /proc filesystem; the device above is accessible via the path /proc/bus/pci/7f/10.1. Unfortunately, if you hexdump such a file as a normal user, you only get the header part (the first 30-40 bytes); the rest is only visible to root. For LIKWID this means you have to run the tool as root if you want to use direct access, or you have to set up the daemon mode proxy to access these files. In my next post I will explain how the SNB Uncore is implemented in likwid-perfctr and what performance groups are available.
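
The path lookup just described can be sketched in a few lines of C. The helper below is illustrative, not LIKWID's actual code; the bus number 0x7f for socket 0 and the DEVICE.FUNCTION string "10.1" are taken from the example above:

```c
#include <stdio.h>

/* Sketch: build the /proc/bus/pci path for an Uncore HPM unit.
   bus: one PCI bus per socket (0x7f for socket 0 on this system);
   dev_func: the "DEVICE.FUNCTION" string, e.g. "10.1" for iMC channel 1. */
static void pci_proc_path(char *buf, size_t len,
                          unsigned bus, const char *dev_func)
{
    snprintf(buf, len, "/proc/bus/pci/%02x/%s", bus, dev_func);
}
```

For instance, pci_proc_path(path, sizeof(path), 0x7f, "10.1") yields /proc/bus/pci/7f/10.1, the file for memory channel 1 (PCI_IMC_DEVICE_CH_1) on socket 0.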


Wednesday, June 20, 2012

Validation of performance groups on AMD Interlagos

As I fortunately have access to an AMD Interlagos server system again, I was able to review the groups and events. I also validated some of the performance groups with the new accuracy test suite. It uses two variants of likwid-bench: one is plain and measures the performance based on flop and byte counts, the other is instrumented using the LIKWID marker API. I run a variety of test cases with different benchmark types and data set sizes and compare the results with each other. At this time we focus on serial measurements, with multiple repetitions per data set size. The data set sizes (from left to right) fit into L1 (12kB), L2 (1MB), L3 (4MB) and MEM (1GB). The red curve is the result output by likwid-bench (the performance as measured by the application). The blue curve is the performance measured with likwid-perfctr based on hardware performance monitoring.

The test system is an AMD Opteron(TM) Processor 6276 in a dual socket server system.
The new and updated groups are available in the upcoming LIKWID release.

The following performance groups are supported:

  • BRANCH: Branch prediction miss rate/ratio
  • CACHE: Data cache miss rate/ratio
  • CPI: Cycles per instruction
  • DATA: Load to store ratio
  • FLOPS_DP: Double Precision MFlops/s
  • FLOPS_SP: Single Precision MFlops/s
  • FPU_EXCEPTION: Floating point exceptions
  • ICACHE: Instruction cache miss rate/ratio
  • L2: L2 cache bandwidth in MBytes/s
  • L2CACHE: L2 cache miss rate/ratio
  • L3: L3 cache bandwidth in MBytes/s
  • L3CACHE: L3 cache miss rate/ratio
  • LINKS: Bandwidth on the Hypertransport links
  • MEM: Main memory bandwidth in MBytes/s
  • NUMA: Read/Write Events between the ccNUMA nodes

The first tests cover the FLOPS_DP and FLOPS_SP groups. Both groups use the RETIRED_FLOPS event with different umasks.


As can be seen, this event provides a very accurate flop count, independent of the arithmetic instructions used. The absolute performance does not matter here as different data set sizes are used.

Next are the cache bandwidth groups L2 and L3. The L2 group is based on the DATA_CACHE_REFILLS_ALL event, while the L3 group uses the L2_FILL_WB_FILL and L2_FILL_WB_WB events.

As can be seen, the results for the L2 cache are very accurate. Because the cache is inclusive and the L1 is write-through, the measured bandwidths are the same as the bandwidth seen by your application.

Here the results are also very accurate. Still, because the L3 cache is write-allocate and exclusive with regard to the L2, the measured bandwidths are in all cases two times the bandwidth seen by your application, due to the cache architecture.

Last but not least, the MEM group. Again the results are very accurate. All benchmarks involving a store cause a write-allocate transfer, resulting in a higher measured bandwidth compared to the bandwidth seen by your application.

The performance groups now work fine on Interlagos. The updated groups as shown here will be part of the upcoming release of LIKWID.

Wednesday, June 13, 2012

Tutorial how to measure energy consumption on Sandy Bridge processors

This small study illustrates how to use likwid-perfctr to measure energy consumption using the RAPL counters on Intel SandyBridge processors. You have to set up LIKWID for access to the MSR device files first.

We will use the famous STREAM triad benchmark for this study. The STREAM triad performs the operation A[i] = B[i] + s * C[i]. For easy measurement we use likwid-bench, which already includes different variants of the STREAM triad out of the box. We want to measure the variation of energy to solution for different frequencies within one ccNUMA domain. The STREAM triad benchmark is regarded as being solely limited by main memory bandwidth. We start with a scaling study on a SandyBridge system running at a nominal clock of 2.7 GHz. If you have set the power governor to performance you have to be careful, as the actual clock can differ due to turbo mode. We first get the turbo mode steps using likwid-powermeter. This is what the output looks like on our system:


$ likwid-powermeter -i

-------------------------------------------------------------
CPU name: Intel Core SandyBridge EP processor 
CPU clock: 2.69 GHz 
-------------------------------------------------------------
Base clock: 2700.00 MHz 
Minimal clock: 1200.00 MHz 
Turbo Boost Steps:
C1 3500.00 MHz 
C2 3500.00 MHz 
C3 3400.00 MHz 
C4 3200.00 MHz 
C5 3200.00 MHz 
C6 3200.00 MHz 
C7 3100.00 MHz 
C8 3100.00 MHz 
-------------------------------------------------------------
Thermal Spec Power: 130 Watts 
Minimum  Power: 51 Watts 
Maximum  Power: 130 Watts 
Maximum  Time Window: 0.398438 micro sec 
-------------------------------------------------------------


For our benchmark runs we want to verify the actual clock. To do so you can enable the instrumentation available in likwid-bench. If you compile LIKWID with gcc you have to uncomment the following line in the file include_GCC.mk:

DEFINES  += -DPERFMON


Then rebuild everything. To use the instrumentation you have to call likwid-bench together with likwid-perfctr. While you can measure energy with likwid-powermeter, it is more convenient to use likwid-perfctr as it is much more flexible with its different modes. Also, you can correlate your energy measurements with other hardware performance counter data. Everything we are interested in is in the ENERGY group available on SandyBridge processors. The following example runs an AVX version of the STREAM triad on four cores of socket 1 with a 1GB data set. likwid-perfctr will set up the counters to measure the ENERGY group. To use the instrumented regions you have to specify the -m option.


$./likwid-perfctr -c S1:0-3 -g ENERGY -m ./likwid-bench  -g 1 -i 50 -t stream_avx  -w S1:1GB:4
=====================
Region: bench 
=====================
+-------------------+---------+---------+---------+---------+
|    Region Info    | core 8  | core 9  | core 10 | core 11 |
+-------------------+---------+---------+---------+---------+
| RDTSC Runtime [s] | 2.03051 | 2.03034 | 2.03045 | 2.03027 |
|    call count     |    1    |    1    |    1    |    1    |
+-------------------+---------+---------+---------+---------+
+-----------------------+-------------+-------------+-------------+-------------+
|         Event         |   core 8    |   core 9    |   core 10   |   core 11   |
+-----------------------+-------------+-------------+-------------+-------------+
|   INSTR_RETIRED_ANY   | 6.38913e+08 | 6.34148e+08 | 6.32104e+08 | 6.29149e+08 |
| CPU_CLK_UNHALTED_CORE | 6.45153e+09 | 6.45008e+09 | 6.45041e+09 | 6.4504e+09  |
| CPU_CLK_UNHALTED_REF  | 5.44346e+09 | 5.44225e+09 | 5.44251e+09 | 5.44253e+09 |
|    PWR_PKG_ENERGY     |   146.273   |      0      |      0      |      0      |
+-----------------------+-------------+-------------+-------------+-------------+
+-------------------+---------+---------+---------+---------+
|      Metric       | core 8  | core 9  | core 10 | core 11 |
+-------------------+---------+---------+---------+---------+
|    Runtime [s]    | 2.39535 | 2.39481 | 2.39494 | 2.39493 |
| Runtime rdtsc [s] | 2.03051 | 2.03051 | 2.03051 | 2.03051 |
|    Clock [MHz]    | 3192.14 | 3192.13 | 3192.14 | 3192.12 |
|        CPI        | 10.0977 | 10.1713 | 10.2047 | 10.2526 |
|    Energy [J]     |   146   |    0    |    0    |    0    |
|     Power [W]     | 71.9031 |    0    |    0    |    0    |
+-------------------+---------+---------+---------+---------+
+------------------------+---------+---------+---------+---------+
|         Metric         |   Sum   |   Max   |   Min   |   Avg   |
+------------------------+---------+---------+---------+---------+
|    Runtime [s] STAT    | 9.58003 | 2.39535 | 2.39481 | 2.39501 |
| Runtime rdtsc [s] STAT | 8.12204 | 2.03051 | 2.03051 | 2.03051 |
|    Clock [MHz] STAT    | 12768.5 | 3192.14 | 3192.12 | 3192.13 |
|        CPI STAT        | 40.7262 | 10.2526 | 10.0977 | 10.1815 |
|    Energy [J] STAT     |   146   |   146   |    0    |  36.5   |
|     Power [W] STAT     | 71.9031 | 71.9031 |    0    | 17.9758 |
+------------------------+---------+---------+---------+---------+

As you can see, the ENERGY group measures among other things the cycles and instructions executed, and outputs the derived metrics CPI, Clock, Energy in Joules and Power in Watts. For us the interesting parts are the Clock and the Energy. The energy counter can only be measured per package, therefore you only see one result in the table. Not all columns of the statistics table make sense for every metric, but I guess you can figure out yourself where they apply. All cores manage to clock up to the 3.2 GHz predicted by likwid-powermeter.
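
The derived Power metric in the table is simply the measured package energy divided by the region runtime; a trivial sketch with the numbers from above:

```c
/* Derived metric from the ENERGY group:
   Power [W] = PWR_PKG_ENERGY [J] / region runtime [s]. */
static double derived_power_watts(double pkg_energy_joules,
                                  double runtime_seconds)
{
    return pkg_energy_joules / runtime_seconds;
}
```

With the table values, derived_power_watts(146.0, 2.03051) gives roughly 71.9 W, matching the Power row.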

So I did this for various settings to get something like that:

Performance scaling of STREAM triad

You can see that the higher clocked variants saturate around 3 cores while the lower clocked runs need 4 cores to saturate. Also notable is that the bandwidth decreases with lower core frequency. If you repeat this with a variant of the STREAM triad employing so-called non-temporal stores (triad_mem in likwid-bench) this changes to:
Performance scaling  of STREAM triad with NT stores

Now all frequencies need 4 cores to saturate the bandwidth. So what about energy to solution?
Energy to solution for STREAM triad

Energy to solution for STREAM triad with NT stores

As you can see, without NT stores the lowest energy is reached using 1.6 GHz and 4 cores. With NT stores the optimum is either also 1.6 GHz, or 1.2 GHz with 8 cores. It is notable that with 8 cores there is a factor of two saving in energy to solution between running in turbo mode and running at a fixed 1.2 GHz for this main memory limited benchmark. Of course the performance is also slightly lower at 1.2 GHz, but only by around 20%, not a factor of two. You can figure out interesting things using likwid-bench with instrumentation turned on together with likwid-perfctr.

Tuesday, April 24, 2012

load/store throughput on SandyBridge

For many codes the biggest improvement with SandyBridge does not come from its AVX capabilities but from the doubled load throughput: two times 16 bytes instead of 16 bytes per cycle. The width of the load/store units is still 16 bytes, so each load/store instruction moves 16 bytes. Where Intel processors before SandyBridge (SNB) could execute either one load and one store, or a single load, or a single store per cycle, SNB can in principle sustain two loads and one store per cycle. The 32-byte wide AVX load and store instructions are executed as two split halves. This means the loads/stores take 2 cycles each to execute for 256-bit AVX code.

This means that on paper SSE and AVX should have similar load/store capabilities. Below, results for a triad benchmark code are shown (serial execution). The corresponding C code is:


for (int i = 0; i < size; i++) {
    A[i] = B[i] + C[i] * C[i];
}

Interesting for us is the L1 cache performance. The Intel compiler was used. It can be seen that with -O3 (SSE code) SNB (i7-2600) is not twice as fast as Westmere, as one would expect, despite also being clocked higher. The Westmere clocks at 2.93 GHz with Turbo Mode. This code is still not optimal because split 8-byte loads are used for some of the vectors. Adding the vector aligned pragma leads to optimal code. Now Westmere reaches the theoretical limit of 3 cycles per SIMD update, resulting in 2.93 GHz / 3 cycles * 4 flops = 3.9 GFlops. While SNB also improves, it is still far away from its optimal performance.



You can see that AVX brings an additional boost on SNB. This is not obvious, because a purely load/store throughput limited code such as the triad should not profit from the present implementation of AVX on SNB, as the load/store capabilities are the same. It turns out that a specific detail of the scheduler prevents the 128-bit SSE code from reaching the full throughput of SNB. Each data move instruction consists of an address-generation part and a data part. The load scheduler ports (2 and 3) are also used for the address generation of stores; the actual store data port is port 4. This means that a store blocks two ports (4 and one of 2/3), and on ports 2/3 the store address generation competes with possible loads. As a consequence, the full throughput cannot be reached with 128-bit code. With AVX code it can, because a 256-bit load occupies its port for two cycles but consumes only one uop, so in the second cycle the issue slot is free and the address-generation uop of the store can be scheduled. This is also described in the very good microarchitecture manual by Agner Fog (update 29-02-2012, page 100).

Tuesday, February 7, 2012

Raw arithmetic SIMD Instruction throughput on Interlagos and SandyBridge

The arithmetic performance of modern processors is, apart from task parallelism, generated mainly by data-parallel instructions. There exists a zoo of x86 instruction set extensions: SSEx, AVX, FMA3, FMA4. This article gives a short overview of raw SIMD instruction throughput on the recent processors AMD Interlagos and Intel SandyBridge. Of course a short article cannot cover the topic in full depth, so for more detailed tips and tricks please refer to the optimization manuals released by the vendors.

Test machines are an AMD Opteron (trademark) 6276 and an Intel Xeon (trademark) E31280. The methodology is simple: create small loop kernels in likwid-bench using different instruction and operation mixes and measure the CPI. To keep things simple we restrict ourselves to the double precision case; this makes no difference with regard to instruction throughput. CPI stands for Cycles Per Instruction and tells you how many cycles a processor core needs on average to execute one instruction. If, e.g., a processor can execute four instructions per cycle, the optimal CPI on this processor is 0.25. The ability to schedule multiple instructions of an otherwise sequential instruction stream in one cycle is also referred to as superscalar execution.
The following illustration shows the FPU capabilities of both processors:








Interlagos has two 128-bit wide fused multiply-accumulate units. This means Interlagos is superscalar for any mix of operations, while SandyBridge is only superscalar for an equal multiply/add instruction mix. In contrast to Interlagos, SandyBridge has 256-bit wide execution units. For AVX code with a multiply/add operation mix SandyBridge is superscalar, while Interlagos is not. Let's look at the results for different instruction mixes:



Blue means equal, green better and red worse. Interlagos has advantages with pure add or mult SSE code.
For AVX multadd SandyBridge is better, while for pure AVX add and mult both are equal, with SandyBridge being slightly more efficient. On paper FMA should perform similarly to the SandyBridge AVX variant. Still, the code is very dense for this case and it is more difficult to get efficient results. For one core the result was not as expected; using both cores sharing an FPU showed better performance. Not as efficient as possible, but still the best achievable on Interlagos with regard to MFlops/s.

For completeness the results in terms of arithmetic FP instruction throughput:



Please note that this reflects only a single aspect of the processor microarchitecture and is not representative of overall processor performance. Still, I hope this can clear up some uncertainties with regard to SIMD arithmetic instruction throughput.