Debugging with gdb
Prerequisites
- Install Python debug packages:
$ sudo apt install gdb python-dbg python3-dbg
How-To
Start gdb
- Start process with gdb
$ gdb python
...
(gdb) run <programname>.py <arguments>
OR
- Attach to running process
1. Get the pid:
2. Start gdbpgrep -af <name>
[sudo -u <user>] gdb -p <pid>
This will cause the process to stop
Useful commands
c
continues execution (Ctrl + C stops it again)bt
prints the C stack tracepy-bt
prints the Python stack tracepy-bt-full
prints bothpy-list
lists the current Python source codepy-up
/py-down
moves one Python frame up or downpy-locals
shows local variables in the current framepy-print <var>
prints the representation of a variable in the current frameinfo threads
displays all threadsthread <id>
switches current thread to thread number<id>
thread apply all <cmd>
applies<cmd>
to all threads, e.g.,thread apply all py-list
python MAX_OUTPUT_LEN=<N>
sets the length after which the output of a Python value is truncated (default is 1024)
Custom extensions
It is possible to add custom extensions to gdb. An example is pyfields.py. It can be loaded with source path/to/pyfields.py
and provides a new command:
py-fields obj.attr
prints the value of the attributeattr
of the Python objectobj
in the current framepy-fields obj.*
prints the values of all attributes ofobj
Troubleshouting
- Attaching the debugger to a running process halts its execution. This can cause problems with, e.g., heartbeats. To just execute one command and detach immediately, try, e.g.,
sudo -u <user> gdb --pid <pid> \
--ex "set pagination off" \
--ex "thread apply all py-bt-full" \
--ex "detach" --ex "quit"
Alternatively, create a core dump, see below.
- On some systems, attaching a debugger is only allowed for child processes. To allow attaching to all processes of the same user, execute
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
- There is/was a bug in the Python extensions on a standard Debian 9 installation. Apply the following change to
/usr/share/gdb/auto-load/usr/bin/python3.5-gdb.py
(see https://stackoverflow.com/questions/53084290/python-gdb-error-python-exception-class-runtimeerror-type-does-not-have-a-t)
1127c1127
< fields = gdb.lookup_type('PyUnicodeObject').target().fields()
---
> fields = gdb.lookup_type('PyUnicodeObject').fields()
- If the python macros return messages like
Unable to locate python frame
orUnable to locate gdb frame for python bytecode interpreter
, make sure you usedgdb -p <pid>
and notgdb python -p <pid>
. Furthermore check that there are no other running gdb instances attached to the same process.
Working with core dumps
This is useful if stopping the program causes problems, e.g., in other components in a distributed system.
- Find the pid:
pgrep -af <name>
- Generate core dump:
gcore <pid>
This will stop the program while creating the core dump, which might take a few seconds. - Start the debugger:
gdb python <core_file>
orgdb python3 <core_file>,
depending on the programs Python version - Use the usual gdb commands to investigate the state of the program, see above.
Example session
For this example session, an ascan
was executed in spock. In the middle of the scan, a core dump of the MacroServer
Tango server was created.
- Load the core dump with gdb:
$ gdb python core.26201 GNU gdb (Debian 7.12-6) 7.12.0.20161007-git ... [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/bin/python2'. #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 185 ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory. [Current thread is 1 (Thread 0x7f82f95355c0 (LWP 26201))] (gdb)
- Get an overview of all threads:
(gdb) thread apply all py-bt Thread 36 (Thread 0x7f82698df700 (LWP 4873)): Traceback (most recent call first): Thread 35 (Thread 0x7f826bc9c700 (LWP 4804)): ... Thread 22 (Thread 0x7f828ffff700 (LWP 26227)): Traceback (most recent call first): Waiting for the GIL File "/usr/lib/python2.7/threading.py", line 340, in wait waiter.acquire() File "/usr/lib/python2.7/dist-packages/taurus/core/util/event.py", line 659, in waitEvent self._cond.wait(timeout) File "/usr/lib/python2.7/dist-packages/sardana/taurus/core/tango/sardana/pool.py", line 491, in waitFinish timeout=timeout, retries=1) File "/usr/lib/python2.7/dist-packages/sardana/taurus/core/tango/sardana/pool.py", line 503, in go self.waitFinish(id=eid) File "/usr/lib/python2.7/dist-packages/sardana/taurus/core/tango/sardana/pool.py", line 260, in new_fn return fn(*args, **kwargs) File "/usr/lib/python2.7/dist-packages/sardana/taurus/core/tango/sardana/pool.py", line 1819, in go PoolElement.go(self, *args, **kwargs) File "/usr/lib/python2.7/dist-packages/sardana/macroserver/scan/gscan.py", line 1113, in stepUp state, data_line = mg.count(integ_time) File "/usr/lib/python2.7/dist-packages/sardana/macroserver/scan/gscan.py", line 1029, in scan_loop self.stepUp(i, step, lstep) File "/usr/lib/python2.7/dist-packages/sardana/macroserver/scan/gscan.py", line 949, in step_scan for i in self.scan_loop(): File "/usr/lib/python2.7/dist-packages/sardana/macroserver/macros/scan.py", line 292, in run for step in self._gScan.step_scan(): File "/usr/lib/python2.7/dist-packages/sardana/macroserver/macro.py", line 2460, in exec_ thrown""" File "/usr/lib/python2.7/dist-packages/sardana/macroserver/msmacromanager.py", line 1593, in runMacro self.sendMacroStatusStop() File "/usr/lib/python2.7/dist-packages/sardana/macroserver/msmacromanager.py", line 1567, in __runXMLMacro exc_info = sys.exc_info() File "/usr/lib/python2.7/dist-packages/sardana/macroserver/msmacromanager.py", line 1549, in __runStatelessXML macro_exp, tb, result = None, None, None File "/usr/lib/python2.7/dist-packages/sardana/macroserver/msmacromanager.py", line 1533, in __runXML File "/usr/lib/python2.7/dist-packages/taurus/core/util/threadpool.py", line 133, in run callback(cmd(*args, **kw)) File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner self.run() File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap self.__bootstrap_inner() ...
Here one can see, that the thread that executes the macro is thread 22. All other threads are not executing Python code or are idling threads in the Taurus thread pool. - Switch to the thread that executes the macro
(gdb) thread 22 [Switching to thread 22 (Thread 0x7f828ffff700 (LWP 26227))] #0 0x00007f82f912d556 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f828800cab0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205 205 in ../sysdeps/unix/sysv/linux/futex-internal.h
Use
py-up
to get out of the deep library calls until a frame that executes Taurus code is reached:(gdb) py-up #6 Frame 0x7f829c099bb0, for file /usr/lib/python2.7/threading.py, line 340, in wait (self=<_Condition(_Condition__lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f826d64f110>, _RLock__count=0) at remote 0x7f826d740350>, acquire=<instancemethod at remote 0x7f826d790be0>, _is_owned=<instancemethod at remote 0x7f826d73f690>, _release_save=<instancemethod at remote 0x7f826d73f7d0>, release=<instancemethod at remote 0x7f826d7352d0>, _acquire_restore=<instancemethod at remote 0x7f826d73f730>, _Verbose__verbose=False, _Condition__waiters=[<thread.lock at remote 0x7f826d64f270>]) at remote 0x7f826d740490>, timeout=None, waiter=<thread.lock at remote 0x7f826d64f270>, saved_state=(2, 140198738392832)) waiter.acquire()
The output shows, that the first call to
py-up
went to line 340 in/usr/lib/python2.7/threading.py
inside thewait
function from the Python standard library.(gdb) py-up #10 Frame 0x7f829c017050, for file /usr/lib/python2.7/dist-packages/taurus/core/util/event.py, line 659, in waitEvent (self=<AttributeEventWait(_last_val=<DevState(_value_=6, __objclass__=<EnumMeta(__module__='taurus.core.tango.enums', __ne__=<method-wrapper at remote 0x7f82ebe66210>, STANDBY=<DevState(_value_=7, __objclass__=<...>, _name_='STANDBY') at remote 0x7f82ebe5fd88>, UNKNOWN=<DevState(_value_=13, __objclass__=<...>, _name_='UNKNOWN') at remote 0x7f82ebe69128>, ALARM=<DevState(_value_=11, __objclass__=<...>, _name_='ALARM') at remote 0x7f82ebe5ffc8>, MOVING=<...>, __lt__=<method-wrapper at remote 0x7f82ebe66110>, OPEN=<DevState(_value_=3, __objclass__=<...>, _name_='OPEN') at remote 0x7f82ebe5fb48>, INSERT=<DevState(_value_=4, __objclass__=<...>, _name_='INSERT') at remote 0x7f82ebe5fbd8>, __reduce_ex__=<instancemethod at remote 0x7f82ec0fc7d0>, __new__=<staticmethod at remote 0x7f82f0244fd8>, _value2member_map_={0: <DevState(_value_=0, __objclass__=<...>, _name_='ON') at remote 0x7f82ebe5f950>, 1: <DevS...(truncated) self._cond.wait(timeout)
After a second call to
py-up
, the first frame containing Taurus code is reached, i.e., line 659 in/usr/lib/python2.7/dist-packages/taurus/core/util/event.py
inside thewaitEvent
function.- Get the currently executed code:
(gdb) py-list 654 for v, t in s.items(): 655 if v == val: 656 continue 657 if t >= after: 658 return >659 self._cond.wait(timeout) 660 retries -= 1 661 except Exception, e: 662 sys.stderr.write( 663 "AttributeEventWait: Caught exception while waitting: %s\n" % str(e)) 664 raise e
Unfortunately, the number of lines shown around the current line is hardcoded to ±5 in the
py-list
function in/usr/share/gdb/auto-load/usr/bin/python3.5m-gdb.py
. To view the whole code of thewaitEvent
function, open the corresponding file in an external viewer. List all local variables:
(gdb) py-locals self = <AttributeEventWait(_last_val=<DevState(_value_=6, __objclass__=<EnumMeta(__module__='taurus.core.tango.enums', __ne__=<method-wrapper at remote 0x7f82ebe66210>, STANDBY=<DevState(_value_=7, __objclass__=<...>, _name_='STANDBY') at remote 0x7f82ebe5fd88>, UNKNOWN=<DevState(_value_=13, __objclass__=<...>, _name_='UNKNOWN') at remote 0x7f82ebe69128>, ALARM=<DevState(_value_=11, __objclass__=<...>, _name_='ALARM') at remote 0x7f82ebe5ffc8>, MOVING=<...>, __lt__=<method-wrapper at remote 0x7f82ebe66110>, OPEN=<DevState(_value_=3, __objclass__=<...>, _name_='OPEN') at remote 0x7f82ebe5fb48>, INSERT=<DevState(_value_=4, __objclass__=<...>, _name_='INSERT') at remote 0x7f82ebe5fbd8>, __reduce_ex__=<instancemethod at remote 0x7f82ec0fc7d0>, __new__=<staticmethod at remote 0x7f82f0244fd8>, _value2member_map_={0: <DevState(_value_=0, __objclass__=<...>, _name_='ON') at remote 0x7f82ebe5f950>, 1: <DevState(_value_=1, __objclass__=<...>, _name_='OFF') at remote 0x7f82ebe5fa70>, 2: <DevState(_value_=2, __objclass__=<...>,...(truncated) val = <DevState at remote 0x7f82f6a5a090> after = <float at remote 0x7f8264021230> equal = False timeout = None retries = 2 any = False s = {<DevState(_value_=0, __objclass__=<EnumMeta(__module__='taurus.core.tango.enums', __ne__=<method-wrapper at remote 0x7f82ebe66210>, STANDBY=<DevState(_value_=7, __objclass__=<...>, _name_='STANDBY') at remote 0x7f82ebe5fd88>, UNKNOWN=<DevState(_value_=13, __objclass__=<...>, _name_='UNKNOWN') at remote 0x7f82ebe69128>, ALARM=<DevState(_value_=11, __objclass__=<...>, _name_='ALARM') at remote 0x7f82ebe5ffc8>, MOVING=<DevState(_value_=6, __objclass__=<...>, _name_='MOVING') at remote 0x7f82ebe5fcf8>, __lt__=<method-wrapper at remote 0x7f82ebe66110>, OPEN=<DevState(_value_=3, __objclass__=<...>, _name_='OPEN') at remote 0x7f82ebe5fb48>, INSERT=<DevState(_value_=4, __objclass__=<...>, _name_='INSERT') at remote 0x7f82ebe5fbd8>, __reduce_ex__=<instancemethod at remote 0x7f82ec0fc7d0>, __new__=<staticmethod at remote 0x7f82f0244fd8>, _value2member_map_={0: <...>, 1: <DevState(_value_=1, __objclass__=<...>, _name_='OFF') at remote 0x7f82ebe5fa70>, 2: <DevState(_value_=2, __objclass__=<...>, _name_='CLOSE') at remot...(truncated) v = <DevState(_value_=6, __objclass__=<EnumMeta(__module__='taurus.core.tango.enums', __ne__=<method-wrapper at remote 0x7f82ebe66210>, STANDBY=<DevState(_value_=7, __objclass__=<...>, _name_='STANDBY') at remote 0x7f82ebe5fd88>, UNKNOWN=<DevState(_value_=13, __objclass__=<...>, _name_='UNKNOWN') at remote 0x7f82ebe69128>, ALARM=<DevState(_value_=11, __objclass__=<...>, _name_='ALARM') at remote 0x7f82ebe5ffc8>, MOVING=<...>, __lt__=<method-wrapper at remote 0x7f82ebe66110>, OPEN=<DevState(_value_=3, __objclass__=<...>, _name_='OPEN') at remote 0x7f82ebe5fb48>, INSERT=<DevState(_value_=4, __objclass__=<...>, _name_='INSERT') at remote 0x7f82ebe5fbd8>, __reduce_ex__=<instancemethod at remote 0x7f82ec0fc7d0>, __new__=<staticmethod at remote 0x7f82f0244fd8>, _value2member_map_={0: <DevState(_value_=0, __objclass__=<...>, _name_='ON') at remote 0x7f82ebe5f950>, 1: <DevState(_value_=1, __objclass__=<...>, _name_='OFF') at remote 0x7f82ebe5fa70>, 2: <DevState(_value_=2, __objclass__=<...>, _name_='CLOSE') at remote 0x7...(truncated) t = <float at remote 0x7f8264021230>
Here one can already see that
self
is an instance of typeAttributeEventWait
with an attribute_last_val
of the typeDevState
. Looking into the documentation reveals that this is an enum type where_value=6
corresponds toDevState.MOVING
. Some values are not shown, e.g., forval
orafter
. I didn't find an easy way to get the actual value ofafter
. Fortunately, the value forval
is hardcoded in the caller frame and can thus be obtained by:(gdb) py-up #14 Frame 0x7f829c086530, for file /usr/lib/python2.7/dist-packages/sardana/taurus/core/tango/sardana/pool.py, line 491, in waitFinish (self=<MeasurementGroup(log_full_name='haso232s.desy.de:10000.mntgrp02', _lock_info=<TaurusLockInfo(status=3, status_msg='Lock status unknown', host=None, id=None, klass=None) at remote 0x7f826d7acdd0>, log_name='mntgrp02', _configuration=<MGConfiguration(description='General purpose measurement configuration', tango_dev_channels_in_error=0, tango_channels_info_in_error=0, tango_dev_channels=<CaselessDict at remote 0x7f826d786740>, channels=<CaselessDict at remote 0x7f826d7ba3c8>, initialized=False, channel_list=[{'ndim': 0, 'index': 0, 'name': 'ct05', 'plot_type': 0, 'enabled': True, 'label': 'ct05', 'source': 'tango://haso232s.desy.de:10000/expchan/ctctrl02/1/Value', '_controller_name': 'tango://haso232s.desy.de:10000/controller/dummycountertimercontroller/ctctrl02', 'conditioning': '', 'full_name': 'tango://haso232s.desy.de:10000/expchan/ctctrl02/1', 'output': True, 'plot_axes'...(truncated) timeout=timeout, retries=1) (gdb) py-list 486 id = id[0] 487 evt_wait = self._getEventWait() 488 evt_wait.lock() 489 try: 490 evt_wait.waitEvent(DevState.MOVING, after=id, equal=False, >491 timeout=timeout, retries=1) 492 finally: 493 self.__go_end_time = time.time() 494 self.__go_time = self.__go_end_time - self.__go_start_time 495 evt_wait.unlock() 496 evt_wait.disconnect()
To go back to the previous frame, one can either call
py-down
or refer to it by number:(gdb) frame 10 #10 PyEval_EvalFrameEx () at ../Python/ceval.c:2989 2989 in ../Python/ceval.c
At this point one knows that the current thread is waiting for an attribute to be different from
DevState.MOVING
.Investigate the Python object
self
with the help of the custom extensionpyfields
, see above. First load the extension:(gdb) source ~/Public/pyfields.py
Now, one can either print a single attribute or all attributes of a given object:
(gdb) py-fields self.* local '_last_val' = <DevState(_value_=6, __objclass__=<EnumMeta(__module__='taurus.core.tango.enums', __ne__=<method-wrapper at remote 0x7f82ebe66210>, STANDBY=<DevState(_value_=7, __objclass__=<...>, _name_='STANDBY') at remote 0x7f82ebe5fd88>, UNKNOWN=<DevState(_value_=13, __objclass__=<...>, _name_='UNKNOWN') at remote 0x7f82ebe69128>, ALARM=<DevState(_value_=11, __objclass__=<...>, _name_='ALARM') at remote 0x7f82ebe5ffc8>, MOVING=<...>, __lt__=<method-wrapper at remote 0x7f82ebe66110>, OPEN=<DevState(_value_=3, __objclass__=<...>, _name_='OPEN') at remote 0x7f82ebe5fb48>, INSERT=<DevState(_value_=4, __objclass__=<...>, _name_='INSERT') at remote 0x7f82ebe5fbd8>, __reduce_ex__=<instancemethod at remote 0x7f82ec0fc7d0>, __new__=<staticmethod at remote 0x7f82f0244fd8>, _value2member_map_={0: <DevState(_value_=0, __objclass__=<...>, _name_='ON') at remote 0x7f82ebe5f950>, 1: <DevState(_value_=1, __objclass__=<...>, _name_='OFF') at remote 0x7f82ebe5fa70>, 2: <DevState(_value_=2, __objclass__=<...>, _name_='CLOSE') at remote 0x7...(truncated) local '_attr' = <TangoAttribute(_TangoAttribute__read_lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f826f1fca70>, _RLock__count=0) at remote 0x7f826d7acf10>, log_full_name='haso232s.desy.de:10000.mntgrp02.state', _TangoAttribute__dev_hw_obj=<DeviceProxy(_subscribed_events={32: (<__CallBackPushEvent(push_event=<function at remote 0x7f826d73a410>) at remote 0x7f826d741310>, <EventType at remote 0x7f82f6dc4550>, 'IntegrationTime'), 33: (<__CallBackPushEvent(push_event=<function at remote 0x7f826d73a488>) at remote 0x7f826d741260>, <EventType at remote 0x7f82f6dc4f50>, 'Moveable'), 34: (<__CallBackPushEvent(push_event=<function at remote 0x7f826d73a500>) at remote 0x7f826d741578>, <...>, 'Moveable'), 5: (<__CallBackPushEvent(push_event=<function at remote 0x7f826d7aad70>) at remote 0x7f826d7b25d0>, <...>, 'state'), 6: (<__CallBackPushEvent(push_event=<function at remote 0x7f826d767938>) at remote 0x7f826d76c788>, <...>, 'state'), 7: (<__CallBackPushEvent(push_event=<function at...(truncated) local '_cond' = <_Condition(_Condition__lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f826d64f110>, _RLock__count=0) at remote 0x7f826d740350>, acquire=<instancemethod at remote 0x7f826d790be0>, _is_owned=<instancemethod at remote 0x7f826d73f690>, _release_save=<instancemethod at remote 0x7f826d73f7d0>, release=<instancemethod at remote 0x7f826d7352d0>, _acquire_restore=<instancemethod at remote 0x7f826d73f730>, _Verbose__verbose=False, _Condition__waiters=[<thread.lock at remote 0x7f826d64f270>]) at remote 0x7f826d740490> local '_event_set' = {<DevState(_value_=0, __objclass__=<EnumMeta(__module__='taurus.core.tango.enums', __ne__=<method-wrapper at remote 0x7f82ebe66210>, STANDBY=<DevState(_value_=7, __objclass__=<...>, _name_='STANDBY') at remote 0x7f82ebe5fd88>, UNKNOWN=<DevState(_value_=13, __objclass__=<...>, _name_='UNKNOWN') at remote 0x7f82ebe69128>, ALARM=<DevState(_value_=11, __objclass__=<...>, _name_='ALARM') at remote 0x7f82ebe5ffc8>, MOVING=<DevState(_value_=6, __objclass__=<...>, _name_='MOVING') at remote 0x7f82ebe5fcf8>, __lt__=<method-wrapper at remote 0x7f82ebe66110>, OPEN=<DevState(_value_=3, __objclass__=<...>, _name_='OPEN') at remote 0x7f82ebe5fb48>, INSERT=<DevState(_value_=4, __objclass__=<...>, _name_='INSERT') at remote 0x7f82ebe5fbd8>, __reduce_ex__=<instancemethod at remote 0x7f82ec0fc7d0>, __new__=<staticmethod at remote 0x7f82f0244fd8>, _value2member_map_={0: <...>, 1: <DevState(_value_=1, __objclass__=<...>, _name_='OFF') at remote 0x7f82ebe5fa70>, 2: <DevState(_value_=2, __objclass__=<...>, _name_='CLOSE') at remot...(truncated)
This shows, that the
_attr
attribute points to an instance of theTangoAttribute
class. Usingpy-fields self._attr.*
or looking up the source code ofTangoAttribute
, one can find that is has an attribute named_full_name
:(gdb) py-fields self._attr._full_name local '_full_name' = 'tango://haso232s.desy.de:10000/mntgrp/pool_haso232s/mntgrp02/state'
As a result of the investigation so far, the macro server seems to be waiting for the Tango attribute
mntgrp/pool_haso232s/mntgrp02/state
to exit theMOVING
state. By going further up the stack trace, one could find out why the macro server is waiting for this particular event, or, if it was the device that is hanging, one could attach a new debugger session to the corresponding Tango server (or its core dump) and investigate the state of that device.