OProfile is an open source project that includes a statistical profiler for Linux systems, capable of profiling all running code at low overhead.
OProfile does not require instrumentation of the application: compiling your code with debugging information enabled ('-g') is sufficient to deduce quite a bit about its runtime behavior. Simply run:
[host$]% operf my_application
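Here my_application is assumed to have been built with debug information; for a Fortran code like the one profiled below this could look as follows (compiler, flags and source file name are only illustrative):
# build with debug symbols; optimization can stay enabled
[host$]% gfortran -g -O2 -o my_application my_source.f90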
operf collects the profiling data in a separate folder named oprofile_data. Getting the basic numbers out of the profiling data is as simple as:
[host$]% opreport
# extract shown!
 10211553 100.000 my_application
          CPU_CLK_UNHALT...|
            samples|      %|
          ------------------
            6916225 67.7294 my_application
            2701236 26.4527 libm-2.12.so
             285151  2.7924 libgfortran.so.3.0.0
             111477  1.0917 libgcc_s-4.4.7-20120601.so.1
             101353  0.9925 libopenblas-r0.2.15.so
              47240  0.4626 libc-2.12.so
              43762  0.4286 vmlinux
               4963  0.0486 libarpack.so.2.0.0
                 61 6.0e-04 libpthread-2.12.so
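Note that the post-processing tools pick up the oprofile_data folder in the current working directory by default; if the profile was collected somewhere else, the data can be selected explicitly (the path below is just a placeholder):
# analyze a profile that was collected in a different directory
[host$]% opreport --session-dir=/path/to/run_dir/oprofile_data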
It occasionally happens that operf complains about too many lost samples and recommends lowering the sampling frequency. This can, for example, be achieved by lengthening the time between two consecutive samples, i.e. by raising the number of events counted per sample:
[host$]% operf --events=CPU_CLK_UNHALTED:1000000 my_application
This reduces the sampling rate by a factor of 10 compared to the default. Check the OProfile documentation for other event types.
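The events available on the current CPU, together with their minimum counts, can be listed with ophelp. As a sketch, sampling could also be switched to a different event; the event name and count below are only an illustration and depend on the processor:
# list the hardware events supported on this machine
[host$]% ophelp
# e.g. sample on last-level cache misses instead of unhalted cycles
[host$]% operf --events=LLC_MISSES:10000 my_application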
If debug packages have been installed, things become more interesting:
[host$]% operf -t --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux --events=CPU_CLK_UNHALTED:1000000 my_application
# or, if an updated vmlinux has been installed, try 'locate vmlinux' and use e.g.
[host$]% operf -t --vmlinux=/usr/lib/debug/lib/modules/2.6.32-573.7.1.el6.x86_64/vmlinux --events=CPU_CLK_UNHALTED:1000000 my_application
[host$]% opannotate --source -t 0.5 > oprofile.test.out
# extract shown!
    239554  2.3627 :    x(nu)=x(nu)*ap
    251841  2.4838 :    rout%tot(i,iaux)=rout%tot(i,iaux)-a*b*d
    264598  2.6097 :    ab=fun(1)/fun(2)
    449400  4.4323 :    cd=fun(3)**fun(4)
    847925  8.3629 :    coe(nu)=coe(nu)*ap
   1234356 12.1741 :    DO i=1,many
   1876789 18.5103 :    ap=dsqrt(cd/ab)
 *  367121  3.6208 routine1.f90
 * 1660295 16.3751 routine2.f90
 * 4800720 47.3482 routine3.f90
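Since opannotate prefixes every annotated source line with its sample count, the hottest lines can be fished out of oprofile.test.out with standard text tools, for example (a minimal sketch):
# show the 20 source lines with the highest sample counts
[host$]% grep -E '^ *[0-9]+' oprofile.test.out | sort -n | tail -20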
This should allow you to pinpoint the pieces of code consuming most of the CPU time!
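As a side remark, the debug vmlinux used above normally comes from the distribution's kernel debuginfo package; on an RPM-based system like the CentOS 6 machine in this example it could be installed roughly as follows (repository setup and package names vary by distribution):
# requires the debuginfo repositories to be available
[host$]% yum install yum-utils
[host$]% debuginfo-install kernel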
Another quick look at the profile, this time resolved per symbol, can be done with opreport:
[host$]% opreport -l -t 1 -c -g -%
# extract shown!
  245419 29.9609  (no location information)   libm-2.12.so       __ieee754_hypot
  168929 20.6230  routine1.f90:653            my_application     __my_subroutine1
 1136708 16.6894  routine2.f90:444            my_application     __my_subroutine2
[...]
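Note that the call-graph part of this report (-c) is only populated if call stacks were recorded in the first place, which operf does not do by default:
# record call-graph information as well (adds some overhead)
[host$]% operf --callgraph --events=CPU_CLK_UNHALTED:1000000 my_application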
Interestingly, in the report above the libm calls have become the most time-consuming part of the code after some optimization. http://www.ibm.com/developerworks/library/l-evaluatelinuxonpower/ has a nice example of how to deduce which operations actually contribute. In that particular example, the cost originates from floating-point to integer conversions, which essentially push a value from the L1/L2 caches into memory and back. The sample code profiled here contains quite a number of (avoidable) integer-to-double conversions. Avoiding these type conversions results in a slight improvement:
[host$]% opreport -l -t 1 -c -g -%
# extract shown!
  293392 30.4990  routine1.f90:653            my_application     __my_subroutine1
  247423 25.7204  (no location information)   libm-2.12.so       __ieee754_hypot
[...]