Most of this page originates from ANLs excellent gprof introduction.

Enabling Profiling

The first step in generating profile information for your program is to compile and link it with profiling enabled. 

To compile a source file for profiling, specify the `-pg' option when you run the compiler. (This is in addition to the options you normally use.)

To link the program for profiling, if you use a compiler such as cc to do the linking, simply specify `-pg' in addition to your usual options. The same option, "-pg", alters either compilation or linking to do what is necessary for profiling. Here are examples:

cc -g -c myprog.c utils.c -pg
cc -o myprog myprog.o utils.o -pg

If an application is linked but the individual routines not compiled with the '-pg' option instruction, profiling information will be collected, but invocation profiling information will not be collected. This means only estimated routine times will be reported; no call graph information will be available. Check out the instructions under https://sourceware.org/binutils/docs/gprof/Compiling.html#Compiling

Collecting Profile Information

Profile information is collected by running a program that has been compiled so as to enable profiling. Files named gmon.out.<id> will be produced after the program is executed. The gmon.out files contain the raw profile information gathered by the profiling framework during the run. The gmon.out files are written to the current working directory for the program at the time that the program exits. If the program does not exit normally, no gmon.out files will be produced. On many systems multiple gmon.out files will be produced for an MPI program—typically one for each MPI rank with the suffix of the file denoting the rank.

Using gprof

The gprof utility is used to analyze and view the profile data contained in the gmon.out files.  Basic usage of gprof is:

gprof <executable-file> gmon.out[.<id>] {gmon.out.<id> ...}

When multiple gmon.out.<id> files are specified, the profile data is summed across all gmon.out.<id> files and then displayed. The most commonly used options for gprof are:

   -p : display flat profile
   -P : suppress display of flat profile
   -q : display call graph information
   -Q : suppress display of call graph information
   -l : displays instruction profile at source line level instead of function level (requires debug info)
   -C : display routine names and the execution counts obtained by invocation profiling
   -A : print annotated source code, only annotates with function call count (requires debug info)
   -s : sums all the profile data for the specified input files and writes the results to the file gmon.sum
   -I : search path for source files, used mostly for annotating source (can also use GPROF_PATH environment variable)
   -b : prevents gprof from printing the verbose descriptions of the output
   -d : shows gprof debugging information, can be used to access raw data in gmon files
If invoked without any options, the default gprof behavior is the same as specifying the options '-p –q,' meaning a flat profile and call graph information will be displayed.

Five basic output forms can be produced by gprof:

  1. A routine level flat profile;
  2. A line level flat profile;
  3. Call graph information;,
  4. A routine execution count list; and
  5. Source code annotated with routine call counts.

These output forms are described below and share several common aspects and, in many cases, present the same information in different forms. As described earlier, time estimates are based on the number of executing instruction observed at an interrupt interval that belongs to a routine. By default, the sampling interval is 0.01 seconds, so each observation of an instruction represents 0.01 seconds of execution time in that routine. Therefore, all times reported extend to one-hundredth of a second. The number of observed instructions for a routine is the time reported for the routine divided by the sampling interval of 0.01 seconds

Some routines may execute with an execution time low enough that no executing instructions were observed. These routines will be estimated to have an execution time of zero seconds. If invocation profiling was not enabled for these routines or if the invocation count is zero, these routines will not be listed in the gprof output. If, however, the invocation count is non-zero, the routine will be listed even if the estimated execution time is zero.

Due to the sampling methodology employed, execution times are estimates with large relative error for small execution counts. Conversely, routine invocation counts are exact and will not vary between identical runs of a program. The calculated routine child times are subject to error due to error in the routine time estimates and due to the assumption that all invocations of a routine take the same amount of time. In addition,  gprof does not really support profiling of threaded code. So running multi-threaded programs can result in vastly incorrect absolute timings.

 

Routine Level Flat Profile

The flat profile, generated with the '-p' option, presents a list of routines that were estimated to consume time while the program was running or were invoked at least once, sorted by the amount of time the routine consumes. The flat profile provides an easy way to see which routines consume the largest percentage of the program run time and are therefore good candidates for optimization. An example of a flat profile (for the example described in the Invocation Profiling section above) is shown below.

 %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 50.23      9.98     9.98       10     1.00     1.00  routine2
 24.91     14.93     4.95        1     4.95    13.91  routine1
 14.95     17.90     2.97        2     1.49     1.49  routine3
  9.91     19.87     1.97        1     1.97    19.87  main

The third column in the output above, labeled “self seconds,” shows the time estimated to be spent in the routine. The time reported may be estimated to be zero if no instructions for that routine were observed to be executing. The first column shows the self time as a percentage of the total run time. The second column, labeled “cumulative seconds,” shows an accumulation of the self time in the rows above; therefore, for the last entry it represents the total estimated run time of the program. The fourth column shows the number of times a routine was called, as recorded by the framework’s invocation profiling mechanism. If a routine was compiled without invocation profiling enabled (with '-pg'), this information is not recorded for that routine and the call count will be blank.

The fifth column, “self s/call,“ is simply the self seconds divided by the number of call to the routine and represents the average spent in the routine time per call. The sixth column, “total s/call,” is the self time plus the estimated child time divided by the number of calls to the routine. It represents the total time used per call to a routine, including the time in the routine and in the routines that it calls.

Line Level Flat Profile

The line level profile, generated with the '-l -p' option, presents the same information as the routine level flat profile but with times broken out at the source code line level instead of at the routine level. Generating this profile requires that the routine be compiled with the '-g' option so as to include the debugging information. Generating this profile for large programs can take significant time.

   %   cumulative   self             self     total           
 time   seconds   seconds   calls  Ts/call  Ts/call  name    
 36.49      7.25     7.25                            routine2 (routine2.c:5 @ 1001604)
 16.26     10.48     3.23                            routine1 (routine1.c:9 @ 10014e4)
 13.74     13.21     2.73                            routine2 (routine2.c:6 @ 10015f8)
 10.57     15.31     2.10                            routine3 (routine3.c:5 @ 10016c4)
  8.66     17.03     1.72                            routine1 (routine1.c:10 @ 10014d8)
  7.00     18.42     1.39                            main (gprof_test.c:13 @ 10013bc)
  4.38     19.29     0.87                            routine3 (routine3.c:6 @ 10016b8)
  2.92     19.87     0.58                            main (gprof_test.c:14 @ 10013b0)
  0.00     19.87     0.00      10     0.00     0.00  routine2 (routine2.c:1 @ 10015a0)
  0.00     19.87     0.00       2     0.00     0.00  routine3 (routine3.c:1 @ 1001660)
  0.00     19.87     0.00       1     0.00     0.00  main (gprof_test.c:9 @ 1001360)
  0.00     19.87     0.00       1     0.00     0.00  routine1 (routine1.c:5 @ 1001480)

Call Graph Analysis

The call graph analysis, generated with the -q option, displays information about a routine, its child routines, and its parent routines. Each section of the call graph analysis provides information pertaining to a specific routine.

The routine of interest is indicated by its index number in the first column and indentation from the lists of parent and child routines. The percentage of time listed in the second column indicates the percentage of program runtime accounted for by the routine and its children. The self column indicates times with different meanings depending on whether it is for the primary routine for that section or the routine's parent or child routines. For the primary routine, the time represents the time spent in that routine during the execution of the program. For a parent routine, it represents the amount of the primary routine’s self time being propagated to that parent, based on the percentage of calls to the primary routine made by that parent. For a child routine, it represents the amount of that child routine’s self time being propagated to the primary routine based on percentage of calls made to that routine by the primary routine.

Similarly for the “children” column, the times reported vary depending on the routine context. For the primary routine, it is the total amount of time from child routines propagated to that routine; for child routines, it is the amount of that routine’s child time being propagated to the primary routine; and for parents, it is the amount of the primary routine’s child time being propagated to that parent.

The ”called” columns indicate the number of times the primary routine was called, and the fraction of call made from parents and to children. Allocations of time from child routines to parent routines are based on the fraction of calls to the child routine made by the parent, using the (sometimes incorrect) assumption that each call to a routine takes the same amount of time. If a routine was not compiled with the '-pg' option to enable invocation profiling, its parent routines will be unknown and indicated as <spontaneous>. If no routines are compiled with invocation profiling enabled, no call graph analysis information will be available (call graph analysis output may need to be suppressed with the -Q option in order for gprof to run). The line-by-line option, '-l', may be used with the call graph analysis option, '-q', to present the call graph analysis information broken down on a source line basis.

  index % time    self  children    called     name

    1.97   17.90       1/1           generic_start_main [2]
[1]    100.0    1.97   17.90       1         main [1]
                4.95    8.96       1/1           routine1 [3]
                3.99    0.00       4/10          routine2 [4]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00   19.87                 generic_start_main [2]
                1.97   17.90       1/1           main [1]
-----------------------------------------------
                4.95    8.96       1/1           main [1]
[3]     70.0    4.95    8.96       1         routine1 [3]
                5.99    0.00       6/10          routine2 [4]
                2.97    0.00       2/2           routine3 [5]
-----------------------------------------------
                3.99    0.00       4/10          main [1]
                5.99    0.00       6/10          routine1 [3]
[4]     50.2    9.98    0.00      10         routine2 [4]
-----------------------------------------------
                2.97    0.00       2/2           routine1 [3]
[5]     14.9    2.97    0.00       2         routine3 [5]

Routine Execution Count List

The routine execution count list, generated with the -C option, displays a simple list of routines and the execution counts for the routines as recorded by the invocation profiling mechanism. If a routine was not compiled with invocation profiling enabled, it will not appear in the list.

   src/gprof_test.c:9: (main:0x1001360)      1 executions
   src/routine1.c:5:   (routine1:0x1001480)  1 executions
   src/routine2.c:1:   (routine2:0x10015a0) 10 executions
   src/routine3.c:1:   (routine3:0x1001660)  2 executions

Annotated Source Listing

The annotated source listing, generated with the -A option, displays the program source code annotated with the number of calls made to the routines. This output can only be produced if the program was compiled with the debugging information option, '-g.'

Issues in Interpreting Profile Data

Several considerations must be keep in mind when interpreting the information reported by gprof:

  • The times reported by gprof are estimates based on the instructions observed to be executing when the program is interrupted at the end of the sampling interval. A consequence of this sampling methodology is that time estimates may vary from run to run and contain high relative error when based on a small number of observations.
  • The expected error in a time estimate is the square root of the number of observations upon which it is based. For example, for an estimate based on 9 observations, the expected error is approximately 3 and the relative error is 33.3%. For an estimate based on 400 observations, the expected error is 20, or about 5%.
  • Sampling with gprof imposes some overhead on the execution of the program. This overhead is generally small for instruction profling (only linking with -pg) and is generally small, but can in some instances be signficant, for invocation profiling (compiling and linking with -pg). If a significant amount of time is reported in the mcount() (or _mcount or __mcount) routine turning off invocation profiling can reduce the overhead.
  • The time spent in all functions, including all functions called within a library, is reported by gprof. This can make the output hard to interpret as many of the reported functions may be low level library functions that are not otherwise visible user of the library. For example a call to MPI_Bcast() may spend relatively little time in the MPI_Bcast() routine itself and most of its time in supporting function. 
  • Routines that execute quickly may register no observations, and therefore have a time estimate of 0.00 seconds, despite having an actual non-zero runtime.
  • Routines compiled without invocation profiling enabled (without the '-pg' option) will not have call counts reported and will not have identified parent routines. Parents will be labeled as <spontaneous>.
  • Routines with no observations and for which invocation profiling was not enabled (compiled without '-pg') will not be reported by gprof despite being called and having a (very small) runtime.
  • When attributing time from a child routine to a parent routine, it is assumed that all calls to that child routine, from the parent routine of interest or from any other caller of that child routine, take the same amount of time. If this assumption is incorrect, significant errors in child time estimates may occur.
  • gprof show only time spent in user code, not in system calls. Therefore, time spent performing IO and other tasks involving the operating system will not be reported. For codes spending significant time system call the total runtime reported by gprof will be noticeably lower than the actual runtime of the code.
  • Compiler optimization can produce errors in time estimates and call counts.
  • Recursive routines require special consideration in how their call graph analysis is generated. 
  • gprof is not really suited for multi-threaded application, absolute timings may hence become fairly erroneous. Relative counts are usually still correct (within the limits outlined above).