Wednesday, June 4, 2014

LIKWID, the capabilities system and setuid root

The LIKWID tools provide measurement and management functionality that accesses hardware registers and privileged parts of the operating system.
Like other operating system tools that open, manipulate and close such features, they need special permissions to perform these operations.
The traditional way to grant them is setuid root: applications that are allowed to change their uid to root at runtime. Examples of such applications are ping, (un)mount and su.

But there is also another way to switch privileges between the two user groups, root and non-root. The Linux kernel has supported capabilities for processes since version 2.2. When a process is forked, it inherits the capabilities of its parent; the parent can restrict the scope of a child by removing capabilities from the child's set. With version 2.6.24 the full support for the capabilities feature was implemented: executable files can be associated with sets of capability flags for fine-grained management of permissions. The management of these sets is rather complicated, though, and the names of the three sets per executable are misleading. The sets are called permitted, effective and inheritable.

  • The effective set contains the capabilities that are currently active for the executable and that are checked when it triggers system calls.
  • The permitted set is a superset of the effective set and specifies which flags can at most be enabled for the executable.
  • The inheritable set specifies which flags may be passed on into the permitted set of a new program started by the process.
For a more detailed and possibly more correct explanation of the capabilities, see the manpage (man 7 capabilities or online).
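
As a quick illustration of the mechanism, file capabilities can be set with setcap and inspected with getcap from the libcap tools (the binary name below is only a placeholder):

$ sudo setcap cap_sys_rawio+ep ./mytool
$ getcap ./mytool
$ sudo setcap -r ./mytool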

So how does this affect the LIKWID tools?
For measuring the performance counters with likwid-perfctr we need permission to read and write the MSR registers. Since the related assembler instructions RDMSR and WRMSR are privileged operations, they can only be executed in kernel space. User space applications can access the registers through a kernel module that exports the functionality via device files.

Access Daemon

LIKWID therefore started to offer a daemon application that forwards the read and write requests to the MSR registers with higher privileges. To gain these higher privileges, the daemon must be owned by root and the application needs permission to set its uid to root. The common way is:

$ sudo chown root:root likwid-accessD
$ sudo chmod u+s likwid-accessD
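
You can verify that the bit is set with ls; an s in the owner execute position (e.g. -rwsr-xr-x) indicates a setuid executable:

$ ls -l likwid-accessD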


This still works, but with security in mind it gives LIKWID's access daemon more privileges than it needs. The advantage is that the permissions of the MSR device files do not need to be modified, because the daemon accesses the files as user root.

The alternative is to give the daemon the capability to read and write the MSR device files. The commands to do this:

$ sudo chmod 666 /dev/cpu/*/msr
$ sudo setcap cap_sys_rawio+ep likwid-accessD


Changing the file permissions is mandatory because the daemon is still executed with the uid of the user, just with more capabilities. This only works until the next reboot because the device file permissions are not persistent. You can use udev to set them every time the MSR kernel module is loaded:

<must be done by root>
$ cat > /etc/udev/rules.d/99-persistent-msr.rules <<EOF
> KERNEL=="msr*", MODE="0666"
> EOF


Alternatively, one can avoid setting the read and write permissions for others on the device files: create a likwid group containing all LIKWID users and assign it to the MSR device files. To do this, add GROUP="<likwid-group>" to the rules and adjust the MODE string accordingly, as sketched below.
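
A minimal sketch of such a rule, assuming the group is simply called likwid:

KERNEL=="msr*", MODE="0660", GROUP="likwid"

On most systems the modified rules can be applied without a reboot by reloading them and re-triggering the device events:

$ sudo udevadm control --reload-rules
$ sudo udevadm trigger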

All commands were tested on an Ubuntu 12.04.4 LTS system with Haswell CPUs.
Since the common Haswell CPU does not have Uncore support, I wanted to test the same procedures on a SandyBridge EP system with SuSE Linux Enterprise Server 11, patchlevel 3. The setuid root way works, but setting the capabilities for the daemon is not enough on the SuSE system: it forbids access to the MSR device files even if the file permissions are valid. The setuid root method has another advantage and may even be mandatory for accessing the Uncore counters. Those counters are mapped into the PCI address space and are accessed through PCI device files. The current documentation of capabilities does not mention a flag that could permit these accesses; the flag used above, cap_sys_rawio, only covers the MSR device file access.

Consequently, since it works on both operating systems and for the Uncore counters, we recommend the setuid root method for LIKWID's access daemon.

One might ask: if we need this anyway, why not apply these procedures to the likwid-perfctr executable directly? When LIKWID is built with the static library, both the setuid root method and the capabilities method work. But when LIKWID is linked against the shared library, another problem arises. When an application has the setuid root bit set, some environment variables are ignored for security reasons. One of these variables is LD_LIBRARY_PATH, and consequently the application cannot find its library anymore.
The capabilities method does not cause those variables to be ignored. Therefore, an application linked against the shared library should either use the capabilities method, be built with the static library, or be built with a static search path (rpath) for the library. A static search path can be added to LIKWID like this:

$ vi make/include_<Compiler>.mk
<add rpath to SHARED_CFLAGS and SHARED_LFLAGS>
SHARED_CFLAGS = -fpic -Wl,-rpath=$(PREFIX)/lib
SHARED_LFLAGS = -shared -Wl,-rpath=$(PREFIX)/lib
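
Whether the rpath actually ended up in the resulting binary can be checked, for example, with readelf from binutils:

$ readelf -d likwid-perfctr | grep -i rpath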

Sometimes setting capabilities is not possible because the underlying file system does not support the required extended attributes. You can check whether your kernel was built with extended attribute support:

$ zcat /proc/config.gz | grep FS_XATTR
or
$ cat /boot/config-<kernel-version> | grep FS_XATTR

For EXT3 you also need the EXT3_FS_SECURITY kernel option to enable the storing of capabilities.
In kernels only slightly newer than 2.6.24, the first version with file capability support, tmpfs does not support the required extended attributes.
One example of such an operating system is CentOS 6.5.
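
A quick practical test (the file name is just an example) is to set a capability on a scratch file in the file system in question; if setcap fails with an error like "Operation not supported", extended attributes are not available there:

$ touch testfile
$ sudo setcap cap_sys_rawio+ep testfile
$ getcap testfile
$ rm testfile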

Frequency Daemon

Starting with version 3.1.2, LIKWID also contains an executable to manipulate the frequency of CPU cores. The following commands enable setting the scaling governor and the frequency of each CPU core on the Ubuntu 12.04.4 LTS system:

$ sudo setcap cap_sys_rawio+ep likwid-setFreq
$ cat > /etc/udev/rules.d/99-persistent-setFreq.rules <<EOF
> KERNEL=="cpu*", RUN+="/bin/chmod 0666 /sys/devices/system/cpu/cpu%n/cpufreq/scaling_governor"
> KERNEL=="cpu*", RUN+="/bin/chmod 0666 /sys/devices/system/cpu/cpu%n/cpufreq/scaling_setspeed"
> EOF
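
After the next reboot (or after re-triggering the udev events) the files should be world-writable, which can be checked for an individual core:

$ ls -l /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor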


For the SuSE Linux Enterprise Server system, setting the capabilities is sufficient here as well; therefore, the capabilities method is preferable for the likwid-setFreq daemon.
For completeness, the setuid root method for the frequency daemon: 

$ sudo chown root:root likwid-setFreq
$ sudo chmod u+s likwid-setFreq



Monday, March 17, 2014

Probing instruction throughput of recent Intel processors with likwid-bench - Part 1: Theory

Microarchitectures gradually improve over generations, and one aspect that is addressed is the instruction throughput in terms of superscalar execution. This article focuses on the development of the load/store units from the Intel Westmere over the Intel IvyBridge to the Intel Haswell microarchitecture. The following information is to the best of my knowledge; if you find any errors, please let me know so I can correct them. The present article performs an architectural analysis of the STREAM triad kernel in order to predict the expected performance with the data located in the L1 cache on the considered architectures. In a subsequent article we will try to verify this prediction using the likwid-bench microbenchmarking tool.

likwid-bench is part of the LIKWID tool suite. It is an application that comes out of the box with many streaming-based test cases and also makes it easy to add new test cases as simple text files. The application takes care of data set size, data placement and threading configuration. We choose the so-called STREAM triad as implemented in McCalpin's STREAM microbenchmark. This is simple C code for it (the data type is float, 32-bit floating point):

for (int i=0; i<N; i++)
{
    vecA[i] = vecB[i] + scalar * vecC[i];
}

One iteration requires two loads, one store, one multiply and one add operation. For Intel Westmere the fastest implementation uses packed SSE SIMD instructions. This is the resulting assembly code:

1:
movaps xmm0, [rdx + rax*4]
movaps xmm1, [rdx + rax*4+16]
movaps xmm2, [rdx + rax*4+32]
movaps xmm3, [rdx + rax*4+48]
mulps xmm0, xmm4
mulps xmm1, xmm4
mulps xmm2, xmm4
mulps xmm3, xmm4
addps xmm0, [rcx + rax*4]
addps xmm1, [rcx + rax*4+16]
addps xmm2, [rcx + rax*4+32]
addps xmm3, [rcx + rax*4+48]
movaps [rsi + rax*4] , xmm0
movaps [rsi + rax*4+16], xmm1
movaps [rsi + rax*4+32], xmm2
movaps [rsi + rax*4+48], xmm3
add rax, 16
cmp rax, rdi
jl 1b

Let's look at the theory first. Here is a schematic figure illustrating the instruction throughput capabilities of the Intel Westmere architecture. The architecture can issue and retire 4 uops per cycle. It is capable of executing one load and one store per cycle, or only one load, or only one store. Both loads and stores can be up to 16 bytes wide. On the arithmetic side it can execute one multiply and one add per cycle, or only one add, or only one multiply.

Schematic illustration of instruction throughput capabilities for Intel Westmere

Throughout the article we will consider the number of cycles it takes to execute the loop iterations equivalent to one cache line (64 bytes). Because packed SSE SIMD is 16 bytes wide, this results in 4 SIMD iterations to update one cache line. The throughput bottleneck on this architecture is the load port: the two loads of a SIMD iteration have to share the single load port, so the minimum time for executing one SIMD iteration is 2 cycles. Therefore we end up with 4 SIMD iterations x 2 cycles = 8 cycles to update one cache line. We can easily compute the lightspeed performance (let's consider a clock of 3 GHz) as 3 GHz / 2 cycles per SIMD iteration * 4 scalar updates per SIMD iteration * 2 flops per update = 12 GFlops/s.

After Westmere, which was a so-called tick (incremental) update, Intel released a larger update with the SandyBridge processor. Intel SandyBridge introduced the AVX SIMD instruction set extension with a SIMD width of 32 bytes instead of the previous 16 bytes of SSE. The resulting AVX kernel looks like the following:

1:
vmovaps ymm1, [rdx + rax*4]
vmovaps ymm2, [rdx + rax*4+32]
vmovaps ymm3, [rdx + rax*4+64]
vmovaps ymm4, [rdx + rax*4+96]
vmulps ymm1, ymm1, ymm5
vmulps ymm2, ymm2, ymm5
vmulps ymm3, ymm3, ymm5
vmulps ymm4, ymm4, ymm5
vaddps ymm1, ymm1, [rcx + rax*4]
vaddps ymm2, ymm2, [rcx + rax*4+32]
vaddps ymm3, ymm3, [rcx + rax*4+64]
vaddps ymm4, ymm4, [rcx + rax*4+96]
vmovaps [rsi + rax*4] , ymm1
vmovaps [rsi + rax*4+32], ymm2
vmovaps [rsi + rax*4+64], ymm3
vmovaps [rsi + rax*4+96], ymm4
add rax, 32
cmp rax, rdi
jl 1b

Because its successor IvyBridge performs similarly, we will only consider IvyBridge in this comparison. The following illustration again shows the important properties. The execution units are 32 bytes wide. This microarchitecture also adds a second load port. The load/store units are still 16 bytes wide. This architecture can execute either one packed (SIMD) AVX load instruction and half a packed AVX store per cycle, or one packed AVX load, or half a packed AVX store. For SSE code the architecture suggests it can execute two packed SSE loads and one packed SSE store, or one packed SSE load and one packed SSE store, or one SSE store per cycle. Still, as can be seen in the illustration, the store (data) unit shares the address generation with the load units (indicated by AGU in the illustration). With an instruction mix of two SSE loads and one store, the store competes with the loads for port 2 or 3. The maximum throughput therefore cannot be reached with SSE code. This does not apply to AVX: there the 32-byte load only occupies one port, and the other port can be used for the store address generation.

Schematic illustration of instruction throughput capabilities for Intel Ivy Bridge

But what does that mean for the execution of our STREAM triad test case? Again the load/store units are the bottleneck. The two loads as well as the store take 2 cycles for one AVX SIMD iteration. Because you need two AVX SIMD iterations to update one cache line, we end up with 2 SIMD iterations x 2 cycles = 4 cycles. Lightspeed performance is then 3 GHz / 2 cycles per SIMD iteration * 8 scalar updates per SIMD iteration * 2 flops per update = 24 GFlops/s.

The next microarchitecture, Haswell, was a larger update (a tock in Intel nomenclature). Haswell widens all load/store paths to 32 bytes. Moreover, the processor adds two additional ports, one of them with an address generation unit for the stores. Haswell can now issue up to 8 uops per cycle but is still limited to 4 uops retired per cycle. One could ask why you would want that: putting more in at the top when no more can exit at the bottom. One explanation is that the 4 uops per cycle throughput of e.g. the Westmere design could not be reached in practice; a CPI of 0.35-0.5 is about the best you can expect there. By issuing more instructions per cycle you increase the average instruction throughput and get closer to the theoretical limit of 4 uops per cycle. The following illustration shows the basic setup of Haswell.

Schematic illustration of instruction throughput capabilities for Intel Haswell

Haswell has support for AVX2, which adds features like gather and promotes most instructions to 32-byte-wide execution. Haswell also has fused multiply-add (FMA) instructions. There is a drawback here: a naive view suggests that Haswell should be able to execute either one add and one multiply, or two adds, or two multiplies per cycle, because it has two FMA units. This holds for multiply instructions but not for adds: for additions the throughput is still one instruction per cycle.

Again, let's look at what consequences these changes have for the throughput of the STREAM triad kernel. From the execution units it should be possible to issue all instructions of one SIMD iteration in one cycle. But due to the fact that only 4 instructions can retire per cycle, this throughput cannot be reached. The situation improves by using FMA instructions: then only one instruction is necessary for the multiply and add, and we can get away with 4 instructions for one SIMD iteration. This is the changed code using the FMA instruction:

1:
vmovaps ymm1, [rdx + rax*4]
vmovaps ymm2, [rdx + rax*4+32]
vmovaps ymm3, [rdx + rax*4+64]
vmovaps ymm4, [rdx + rax*4+96]
vfmadd213ps ymm1, ymm5, [rcx + rax*4]
vfmadd213ps ymm2, ymm5, [rcx + rax*4+32]
vfmadd213ps ymm3, ymm5, [rcx + rax*4+64]
vfmadd213ps ymm4, ymm5, [rcx + rax*4+96]
vmovaps [rsi+ rax*4] , ymm1
vmovaps [rsi+ rax*4+32], ymm2
vmovaps [rsi+ rax*4+64], ymm3
vmovaps [rsi+ rax*4+96], ymm4
add rax, 32
cmp rax, rdi
jl 1b

With this code one SIMD iteration should be able to execute in just 1 cycle, ending up with 2 cycles to update a cache line. This is due to the doubled data path width of 32 bytes of the load/store units. Lightspeed performance is then 3 GHz / 1 cycle per SIMD iteration * 8 scalar updates per SIMD iteration * 2 flops per update = 48 GFlops/s. So theoretically the L1 performance for the STREAM triad doubles from Westmere to IvyBridge and doubles again from IvyBridge to Haswell.
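
The whole estimate can be condensed into one formula (all numbers taken from the discussion above, assuming the 3 GHz clock used throughout):

lightspeed = clock / (cycles per SIMD iteration) * (scalar updates per SIMD iteration) * (2 flops per update)

Westmere:   3 GHz / 2 * 4 * 2 = 12 GFlops/s
IvyBridge:  3 GHz / 2 * 8 * 2 = 24 GFlops/s
Haswell:    3 GHz / 1 * 8 * 2 = 48 GFlops/s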

In the next article we will try to confirm this prediction with likwid-bench.