OProfile is an open source project that includes a statistical profiler for Linux systems, capable of profiling all running code at low overhead. 

There are quite a number of OProfile tutorials available on the web, and the tool is far more powerful than the simple example below suggests. Have a look, for example, at http://www.ibm.com/developerworks/systems/library/es-oprofile/

OProfile does not require instrumentation of the application. Compiling your code with debugging information enabled ('-g') is sufficient to deduce quite a bit about the runtime behaviour of the application.
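For example, a Fortran code could be built as follows (compiler, flags and file names are only illustrative; keep your usual optimization flags, since '-g' does not change the generated code):

[host$]% gfortran -g -O2 -o my_application my_application.f90

Then simply run: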

[host$]% operf my_application

operf collects the profiling data in a separate folder named oprofile_data in the current working directory (a different location can be chosen with --session-dir). Getting the basic numbers out of the profiling data is then simply done as follows:

[host$]% opreport
 
10211553 100.000 my_application
    CPU_CLK_UNHALT...|
      samples|      %|
    ------------------
      6916225 67.7294 my_application
      2701236 26.4527 libm-2.12.so
       285151  2.7924 libgfortran.so.3.0.0
       111477  1.0917 libgcc_s-4.4.7-20120601.so.1
       101353  0.9925 libopenblas-r0.2.15.so
        47240  0.4626 libc-2.12.so
        43762  0.4286 vmlinux
         4963  0.0486 libarpack.so.2.0.0
           61 6.0e-04 libpthread-2.12.so

It occasionally happens that operf complains about too many lost samples and recommends lowering the sampling rate. This can be achieved, for example, by increasing the number of events counted between two consecutive samples:

[host$]% operf --events=CPU_CLK_UNHALTED:1000000 my_application 

Since the default sample count for CPU_CLK_UNHALTED is typically 100000, this reduces the number of samples taken by a factor of 10. Check the OProfile documentation for other event types.
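The hardware events (and their minimum sample counts) available on the local CPU can be listed with the ophelp utility that ships with OProfile:

[host$]% ophelp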

If debug packages have been installed, things become more interesting.
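On RHEL/CentOS-type systems, for example, the kernel symbols are provided by the kernel-debuginfo package matching the running kernel; assuming the yum-utils helpers and the debuginfo repositories are available, it can be installed (as root) with:

[host$]% debuginfo-install kernel

With the debug vmlinux in place, time spent in the kernel can be attributed properly: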

[host$]% operf -t --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux --events=CPU_CLK_UNHALTED:1000000 my_application

(-t makes operf keep the samples of each thread separate.) If an updated kernel has been installed, its vmlinux may live in a differently named directory; find it with 'locate vmlinux' and pass the full path explicitly, e.g.

[host$]% operf -t --vmlinux=/usr/lib/debug/lib/modules/2.6.32-573.7.1.el6.x86_64/vmlinux --events=CPU_CLK_UNHALTED:1000000 my_application

The annotated source, with entries below 0.5% of the samples suppressed (-t 0.5), is then produced with

[host$]% opannotate --source -t 0.5 > oprofile.test.out
# extract shown!
239554  2.3627 :            x(nu)=x(nu)*ap
251841  2.4838 :          rout%tot(i,iaux)=rout%tot(i,iaux)-a*b*d
264598  2.6097 :            ab=fun(1)/fun(2)
449400  4.4323 :            cd=fun(3)**fun(4)
847925  8.3629 :            coe(nu)=coe(nu)*ap
1234356 12.1741 :         DO i=1,many
1876789 18.5103 :            ap=dsqrt(cd/ab)
 * 367121  3.6208    routine1.f90
 * 1660295 16.3751   routine2.f90
 * 4800720 47.3482   routine3.f90

This should make it possible to pinpoint the pieces of code consuming most of the CPU time.
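Since the full annotated output can get long, the hottest source lines can also be fished out with standard tools; a rough-and-ready way, assuming the oprofile.test.out file produced above, is

[host$]% sort -n oprofile.test.out | tail -n 20

which orders the annotated lines by their leading sample counts and prints the 20 largest.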

Another quick look at the profile can be taken with opreport; here -l requests per-symbol output, -c adds callgraph information (only useful if operf was run with --callgraph), -g adds source file and line information, -t 1 drops entries below 1%, and -% makes the percentages relative to the whole profile:

[host$]% opreport -l -t 1 -c -g -% 
# extract shown!
245419 29.9609 (no location information) libm-2.12.so __ieee754_hypot
168929 20.6230 routine1.f90:653 my_application __my_subroutine1
1136708 16.6894 routine2.f90:444 my_application __my_subroutine2
[...]

Interestingly, after some optimization the libm calls make up the most time-consuming part of the code. http://www.ibm.com/developerworks/library/l-evaluatelinuxonpower/ has a nice example of how to deduce which operations actually contribute. In that particular example, the cost originates from floating-point to integer conversions, which essentially push a value out of the L1/L2 caches into memory and back. The sample code profiled here contains quite a number of (avoidable) integer to double conversions; removing them results in a slight improvement:

[host$]% opreport -l -t 1 -c -g -% 
# extract shown!
293392 30.4990 routine1.f90:653 my_application __my_subroutine1
247423 25.7204 (no location information) libm-2.12.so __ieee754_hypot
[...]