Wednesday, June 20, 2012

Validation of performance groups on AMD Interlagos

Fortunately, I once again have access to an AMD Interlagos server system, so I was able to review the groups and events. I also validated some of the performance groups with the new accuracy test suite. It uses two variants of likwid-bench: a plain one that computes the performance from flop and byte counts, and one that is instrumented with the LIKWID marker API. I run a variety of test cases with different benchmark types and data set sizes and compare the results with each other. For now we focus on serial measurements. Each data set size is measured with multiple repetitions. The data set sizes (from left to right) fit into L1 (12kB), L2 (1MB), L3 (4MB) and MEM (1GB). The red curve is the result reported by likwid-bench (the performance as measured by the application). The blue curve is the performance measured with likwid-perfctr based on hardware performance monitoring.
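To make the comparison concrete, here is a minimal sketch of how an application-side value can be computed from known flop and byte counts and then compared against a counter-based value. The function names, the scaling factor of 1e6 and all numbers are assumptions for illustration; this is not the code of likwid-bench or of the accuracy test suite.

```python
def app_metrics(flops_per_iter, bytes_per_iter, iterations, runtime_s):
    """Application-side performance from known operation counts.

    Returns (MFlops/s, MBytes/s), using 1e6 as the scaling factor.
    """
    if runtime_s <= 0:
        raise ValueError("runtime must be positive")
    mflops = flops_per_iter * iterations / runtime_s / 1.0e6
    mbytes = bytes_per_iter * iterations / runtime_s / 1.0e6
    return mflops, mbytes


def relative_deviation(reference, measured):
    """Signed deviation of the counter-based value from the reference."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (measured - reference) / reference


# Illustrative numbers only, not measurements from the test system:
# a triad-like kernel with 2 flops and 24 bytes per iteration.
mflops, mbytes = app_metrics(2, 24, 100_000_000, 0.5)

# Hypothetical counter-based reading for the same run.
deviation = relative_deviation(mflops, 404.0)
print(f"{mflops:.1f} MFlops/s, {mbytes:.1f} MBytes/s, deviation {deviation:+.2%}")
```

A deviation close to zero over all repetitions and data set sizes is what "accurate" means in the rest of this post.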

The test system is an AMD Opteron(TM) Processor 6276 in a dual socket server system.
The new and updated groups are available in the upcoming LIKWID release.

The following performance groups are supported:



  • BRANCH: Branch prediction miss rate/ratio
  • CACHE: Data cache miss rate/ratio
  • CPI: Cycles per instruction
  • DATA: Load to store ratio
  • FLOPS_DP: Double Precision MFlops/s
  • FLOPS_SP: Single Precision MFlops/s
  • FPU_EXCEPTION: Floating point exceptions
  • ICACHE: Instruction cache miss rate/ratio
  • L2: L2 cache bandwidth in MBytes/s
  • L2CACHE: L2 cache miss rate/ratio
  • L3: L3 cache bandwidth in MBytes/s
  • L3CACHE: L3 cache miss rate/ratio
  • LINKS: Bandwidth on the HyperTransport links
  • MEM: Main memory bandwidth in MBytes/s
  • NUMA: Read/Write Events between the ccNUMA nodes



The first tests cover the FLOPS_DP and FLOPS_SP groups.
Both groups use the RETIRED_FLOPS event with different umasks.


As can be seen, this event provides a very accurate flop count, independent of the arithmetic instructions used. The absolute performance does not matter, as different data set sizes are used.
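Why a flop count is more convenient than an instruction count can be illustrated with a small sketch. It assumes double precision arithmetic without fused multiply-add, where a scalar, packed SSE and packed AVX instruction processes 1, 2 and 4 elements respectively. The names `DP_LANES` and `counts_for` are made up for this example.

```python
# Double precision elements processed per arithmetic instruction.
DP_LANES = {"scalar": 1, "sse_packed": 2, "avx_packed": 4}


def counts_for(n_elements, flops_per_element, variant):
    """Return (arithmetic instructions, flops) for one pass over the data."""
    lanes = DP_LANES[variant]
    if n_elements % lanes != 0:
        raise ValueError("n_elements must be a multiple of the SIMD width")
    instructions = (n_elements // lanes) * flops_per_element
    flops = instructions * lanes
    return instructions, flops


for variant in DP_LANES:
    instructions, flops = counts_for(1024, 2, variant)
    print(f"{variant:11s} instructions={instructions:5d} flops={flops}")
```

The instruction count changes with the SIMD width while the flop count stays the same, which is the property a flop-based event offers directly.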

Next are the cache bandwidth groups L2 and L3. The L2 group is based on the DATA_CACHE_REFILLS_ALL event, while the L3 group uses the L2_FILL_WB_FILL and L2_FILL_WB_WB events.






As can be seen, the results for the L2 cache are very accurate. Because the cache is inclusive and the L1 is write-through, the measured bandwidths are the same as the bandwidth seen by your application.




The results for the L3 cache are also very accurate. However, because the L3 cache is write-allocate and exclusive with regard to the L2, the measured bandwidths are in all cases twice the bandwidth seen by your application.




Last but not least, the MEM group. Again the results are very accurate. All benchmarks involving a store incur a write-allocate transfer, resulting in a measured bandwidth that is higher than the bandwidth seen by your application.
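The relation between measured and application-visible bandwidth described for the L2, L3 and MEM groups can be summarized in a simplified model. This is only a sketch under assumptions: all streams have the same element size, every store stream triggers exactly one write-allocate load from memory, and no non-temporal stores are used. The function name `expected_ratio` is hypothetical.

```python
def expected_ratio(level, load_streams, store_streams):
    """Expected measured bandwidth divided by application bandwidth."""
    total = load_streams + store_streams
    if total <= 0:
        raise ValueError("at least one data stream is required")
    if level == "L2":
        # Measured traffic equals the application traffic.
        return 1.0
    if level == "L3":
        # Exclusive, write-allocate L3: measured traffic is doubled.
        return 2.0
    if level == "MEM":
        # Each store stream adds one extra write-allocate load stream.
        return (load_streams + 2 * store_streams) / total
    raise ValueError(f"unknown level: {level}")


# A copy kernel (one load, one store) and a triad (two loads, one store).
print(expected_ratio("MEM", 1, 1))
print(expected_ratio("MEM", 2, 1))
```

For a load-only benchmark the MEM ratio is 1, so measured and application bandwidth are expected to coincide there.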

The performance groups now work fine on Interlagos. The updated groups as shown here will be part of the upcoming release of LIKWID.
