Linux Applications Debugging Techniques/Aiming for and measuring performance

gprof & -pg

To profile the application with gprof:

Compile the code with -pg
Link with -pg
Run the application. This creates a file gmon.out in the current folder of the application.
At the prompt, in the folder where gmon.out lives: gprof path-to-application

PAPI

The Performance Application Programming Interface (PAPI) offers the programmer access to the performance counter hardware found in most major microprocessors. With a decent C++ wrapper, measuring branch mispredictions and cache misses (and much more) is literally one line of code away.

By default, these are the events that papi::counters<Print_policy> is looking for:

static  const events_type events = {  
      {PAPI_TOT_INS, "Total instructions"} 
    , {PAPI_TOT_CYC, "Total cpu cycles"}
    , {PAPI_L1_DCM,  "L1 load  misses"}
//  , {PAPI_L1_STM,  "L1 store  missess"}
    , {PAPI_L2_DCM,  "L2 load  misses"}
//  , {PAPI_L2_STM,  "L2 store  missess"}
    , {PAPI_BR_MSP,  "Branch mispredictions"}
};

The counters class is parameterized with a Print_policy instructing what to do with the counters once they get out of scope.

As an example, lets look a bit at these lines:

   const int nlines = 196608;
   const int ncols  = 64;
   char ctrash[nlines][ncols];
   {
       int x;
       papi::counters<papi::stdout_print> pc("by column"); //<== the famous one-line
       for (int c = 0; c < ncols; ++c) {
           for (int l = 0; l < nlines; ++l) {
               x = ctrash[l][c];
           }
       }
   }

The code just loops over an array but in the wrong order: the innermost loop iterates on the outer index. While the result is the same whether we loop over the first index first or over the last one, theorically, to preserve cache locality, the innermost loop should iterate over the innermost index. This should make a big difference for the time it takes to iterate over the array:

   {
       int x;
       papi::counters<papi::stdout_print> pc("By line");
       for (int l = 0; l < nlines; ++l) {
           for (int c = 0; c < ncols; ++c) {
               x = ctrash[l][c];
           }
       }
   }

papi::counters is a class wrapping around PAPI functionality. It will take a snaphost of some performance counters (in our case, we are interested in cache misses and in branch mispredictions) when a counters object is instantiated and another snapshot when the object is destroyed. Then it will print out the differences.

A first measure, with non-optimized code (-O0), shows the following:

Delta by column:
  PAPI_TOT_INS (Total instructions): 188744788 (380506167-191761379)
  PAPI_TOT_CYC (Total cpu cycles): 92390347 (187804288-95413941)
  PAPI_L1_DCM (L1 load  misses): 28427 (30620-2193)                 <==
  PAPI_L2_DCM (L2 load  misses): 102 (1269-1167)
  PAPI_BR_MSP (Branch mispredictions): 176 (207651-207475)          <==

Delta By line:
  PAPI_TOT_INS (Total instructions): 190909841 (191734047-824206)
  PAPI_TOT_CYC (Total cpu cycles): 94460862 (95387664-926802)
  PAPI_L1_DCM (L1 load  misses): 403 (2046-1643)                    <==
  PAPI_L2_DCM (L2 load  misses): 21 (1081-1060)
  PAPI_BR_MSP (Branch mispredictions): 205934 (207350-1416)         <==

While the cache misses have indeed improved, branch mispredictions exploded. Not exactly a good tradeoff. Down in the pipeline of the processor, a comparison operation translates into a branch operation. Something is funny with the unoptimized code the compiler generated.

Typically, branch machine code is generated directly by if/else and ternary operators; and indirectly by virtual calls and by calls though pointers

Maybe the optimized code (-O2) is behaving better? Or maybe not:

Delta by column:
  PAPI_TOT_INS (Total instructions): 329 (229368-229039)
  PAPI_TOT_CYC (Total cpu cycles): 513 (186217-185704)
  PAPI_L1_DCM (L1 load  misses): 2 (1523-1521)
  PAPI_L2_DCM (L2 load  misses): 0 (993-993)
  PAPI_BR_MSP (Branch mispredictions): 7 (1287-1280)

Delta By line:
  PAPI_TOT_INS (Total instructions): 330 (209614-209284)
  PAPI_TOT_CYC (Total cpu cycles): 499 (173487-172988)
  PAPI_L1_DCM (L1 load  misses): 2 (1498-1496)
  PAPI_L2_DCM (L2 load  misses): 0 (992-992)
  PAPI_BR_MSP (Branch mispredictions): 7 (1225-1218)

This time the compiler optimized the loops out! It figured we do not really use the data in the array, so it got rid of. Completely!

Let's see how this code behaves:

   {
       int x;
       papi::counters<papi::stdout_print> pc("by column");
       for (int c = 0; c < ncols; ++c) {
           for (int l = 0; l < nlines; ++l) {
               x = ctrash[l][c];
               ctrash[l][c] = x + 1;
           }
       }
   }

Delta by column:
  PAPI_TOT_INS (Total instructions): 62918492 (63167552-249060)
  PAPI_TOT_CYC (Total cpu cycles): 224705473 (224904307-198834)
  PAPI_L1_DCM (L1 load  misses): 12415661 (12417203-1542)
  PAPI_L2_DCM (L2 load  misses): 9654638 (9655632-994)
  PAPI_BR_MSP (Branch mispredictions): 14217 (15558-1341)

Delta By line:
  PAPI_TOT_INS (Total instructions): 51904854 (115092642-63187788)
  PAPI_TOT_CYC (Total cpu cycles): 25914254 (250864272-224950018)
  PAPI_L1_DCM (L1 load  misses): 197104 (12614449-12417345)
  PAPI_L2_DCM (L2 load  misses): 6330 (9662090-9655760)
  PAPI_BR_MSP (Branch mispredictions): 296 (16066-15770)

Both cache misses and branch mispredictions improved by at least an order of magnitude. A run with unoptimized code will show the same order of improvement.

References

Locality of reference

OProfile

OProfile offers access to the same hardware counters as PAPI but without having to instrument the code:

It is coarser grained than PAPI - at function level.
Some out of the box kernels (RedHat) are not OProfile-friendly.
You need root access.

#!/bin/bash

#
# A script to OProfile a program. 
# Must be run as root. 
#


if [ $# -ne 1 ]
then
  echo "Usage: `basename $0` <for-binary-image>"
  exit -1
else
  binimg=$1
fi

opcontrol --stop
opcontrol --shutdown

# Out of the box RedHat kernels are OProfile repellent.
opcontrol --no-vmlinux
opcontrol --reset

# List of events for platform to be found in /usr/share/oprofile/<>/events
opcontrol --event=L2_CACHE_MISSES:1000


opcontrol --start

$binimg

opcontrol --stop
opcontrol --dump


rm $binimg.opreport.log
opreport > $binimg.opreport.log

rm $binimg.opreport.sym
opreport -l > $binimg.opreport.sym


opcontrol --shutdown
opcontrol --deinit
echo "Done"

References

perf

perf is a kernel-based subsystem that provide a framework for performance analysis of the impact of the progams being run on the kernel. It covers: hardware (CPU/PMU, Performance Monitoring Unit) features; and software features (software counters, tracepoints).

perf tutorial

perf list

Lists the events available on a particular machine. These events vary based on the performance monitoring hardware and the software configuration of the system.

$ perf list

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]

  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-prefetches                               [Hardware cache event]
  L1-icache-prefetch-misses                          [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  dTLB-prefetches                                    [Hardware cache event]
  dTLB-prefetch-misses                               [Hardware cache event]
  iTLB-loads                                         [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  node-loads                                         [Hardware cache event]
  node-load-misses                                   [Hardware cache event]
  node-stores                                        [Hardware cache event]
  node-store-misses                                  [Hardware cache event]
  node-prefetches                                    [Hardware cache event]
  node-prefetch-misses                               [Hardware cache event]

  rNNN (see 'perf list --help' on how to encode it)  [Raw hardware event descriptor]

  mem:<addr>[:access]                                [Hardware breakpoint]

Note: Running this as root will give out an extended list of events; some of the events (tracepoints?) require root privileges.

perf stat <options> <cmd>

Gathers overall statistics for common performance events, including instructions executed and clock cycles consumed. There are option flags to gather statistics on events other than the default measurement events.

$ g++ -std=c++11 -ggdb -fno-omit-frame-pointer perftest.cpp -o perftest

$ perf stat ./perftest
 Performance counter stats for './perftest':

        379.991103 task-clock                #    0.996 CPUs utilized          
                62 context-switches          #    0.000 M/sec                  
                 0 CPU-migrations            #    0.000 M/sec                  
             6,436 page-faults               #    0.017 M/sec                  
       984,969,006 cycles                    #    2.592 GHz                     [83.27%]
       663,592,329 stalled-cycles-frontend   #   67.37% frontend cycles idle    [83.17%]
       473,904,165 stalled-cycles-backend    #   48.11% backend  cycles idle    [66.42%]
     1,010,613,552 instructions              #    1.03  insns per cycle        
                                             #    0.66  stalled cycles per insn [83.23%]
        53,831,403 branches                  #  141.665 M/sec                   [84.14%]
           401,518 branch-misses             #    0.75% of all branches         [83.48%]

       0.381602838 seconds time elapsed

$ perf stat --event=L1-dcache-load-misses ./perftest
 Performance counter stats for './perftest':

        12,942,048 L1-dcache-load-misses                                       

       0.373719009 seconds time elapsed

perf record

Records performance data into a file which can be later analyzed using perf report.

$ perf record --event=L1-dcache-load-misses ./perftest

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.025 MB perf.data (~1078 samples) ]

$ ls -al
...
-rw-------  1 amelinte amelinte  27764 Feb 17 17:23 perf.data

perf report

Reads the performance data from a file and analyzes the recorded data.

$ perf report --stdio
# ========
# captured on: Sun Feb 17 17:23:34 2013
# hostname : bear
# os release : 3.2.0-4-amd64
# perf version : 3.2.17
# arch : x86_64
# nrcpus online : 4
# nrcpus avail : 4
# cpudesc : Intel(R) Core(TM) i3 CPU M 390 @ 2.67GHz
# cpuid : GenuineIntel,6,37,5
# total memory : 3857640 kB
# cmdline : /usr/bin/perf_3.2 record --event=L1-dcache-load-misses ./perftest 
# event : name = L1-dcache-load-misses, type = 3, config = 0x10000, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, id = { 
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# ========
#
# Events: 274  L1-dcache-load-misses
#
# Overhead         Command   Shared Object      Symbol
# ........         ........  .................  ..................
#
#   95.93 percent  perftest  perftest           [.] 0xd35           
#    1.06 percent  perftest  [kernel.kallsyms]  [k] clear_page_c
#    0.82 percent  perftest  [kernel.kallsyms]  [k] run_timer_softirq
#    0.42 percent  perftest  [kernel.kallsyms]  [k] trylock_page
#    0.41 percent  perftest  [kernel.kallsyms]  [k] __rcu_pending
#    0.41 percent  perftest  [kernel.kallsyms]  [k] update_curr
#    0.33 percent  perftest  [kernel.kallsyms]  [k] do_raw_spin_lock
#    0.26 percent  perftest  [kernel.kallsyms]  [k] __flush_tlb_one
#    0.18 percent  perftest  [kernel.kallsyms]  [k] flush_old_exec
#    0.06 percent  perftest  [kernel.kallsyms]  [k] __free_one_page
#    0.05 percent  perftest  [kernel.kallsyms]  [k] free_swap_cache
#    0.05 percent  perftest  [kernel.kallsyms]  [k] zone_statistics
#    0.04 percent  perftest  [kernel.kallsyms]  [k] alloc_pages_vma
#    0.01 percent  perftest  [kernel.kallsyms]  [k] mm_init
#    0.01 percent  perftest  [kernel.kallsyms]  [k] vfs_read
#    0.00 percent  perftest  [kernel.kallsyms]  [k] __cond_resched
#    0.00 percent  perftest  [kernel.kallsyms]  [k] finish_task_switch
#
# (For a higher level overview, try: perf report --sort comm,dso)
#

$ perf record -g ./perftest
$ perf report -g --stdio
...
# Overhead   Command      Shared Object                  Symbol
# ........  ........  .................  ......................
#
    97.23%  perftest  perftest           [.] 0xc75           
            |
            --- 0x400d2c
                0x400dfb
                __libc_start_main

perf annotate

Reads the input file and displays an annotated version of the code. If the object file has debug symbols then the source code will be displayed alongside assembly code. If there is no debug info in the object, then annotated assembly is displayed. Broken!?

$ perf annotate -i ./perf.data -d ./perftest --stdio -f
Warning:
The ./perf.data file has no samples!

perf top

Performance top tool look-alike. It generates and displays a performance counter profile in realtime.

perf archive

perf buildid-cache

perf buildid-list

perf diff

perf inject

perf kmem

perf kvm

perf probe

perf sched

perf script

perf timechart

Valgrind: cachegrind

Cachegrind simulates a machine with two level ([I1 & D1] and L2) cache and branch (mis)prediction. Useful in that it annotates the code down to line level. Can significantly differ from machine's real CPU. Will not go very far on an AMD64 CPU (vex disassembler issues). Extremely slow, typically slows down the application 12-15 times.

DIY: libhitcount

libmemleak can be easily modified to keep track of what calls into a particular point in the code. Just insert an mtrace() call in that place.

getloadavg(3)

How things scale up

L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns             
Compress 1K bytes with Zippy ............. 3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns  =  20 µs
SSD random read ........................ 150,000 ns  = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns  = 250 µs
Round trip within same datacenter ...... 500,000 ns  = 0.5 ms
Read 1 MB sequentially from SSD* ..... 1,000,000 ns  =   1 ms
Disk seek ........................... 10,000,000 ns  =  10 ms
Read 1 MB sequentially from disk .... 20,000,000 ns  =  20 ms
Send packet CA->Netherlands->CA .... 150,000,000 ns  = 150 ms

Source: https://gist.github.com/2843375

Operation           Cost (ns)     Ratio
Clock period        0.6           1.0
Best-case CAS       37.9          63.2
Best-case lock      65.6          109.3
Single cache miss   139.5         232.5
CAS cache miss      306.0         510.0
Comms Fabric        3,000         5,000
Global Comms        130,000,000   216,000,000
Table 2.1: Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System

Source: Paul E. McKenney

Notes on Hyper Threading

Presumed loss of performance for intensive floating point calculations (only one FPU and one ALU (two pipelines) per core).
http://ayumilove.net/hyperthreading-faq/
http://en.wikipedia.org/wiki/Hyper-threading

Other tools

Varia