Page tree

Debugging with gdb

Prerequisites

  • Install Python debug packages: $ sudo apt install gdb python-dbg python3-dbg

How-To

Start gdb

  • Start process with gdb
    $ gdb python
     ...
    (gdb) run <programname>.py <arguments>

OR

  • Attach to running process

    1. Get the pid: pgrep -af <name>

    2. Start gdb [sudo -u <user>] gdb -p <pid>
    This will cause the process to stop

Useful commands

  • c continues execution (Ctrl + C stops it again)
  • bt prints the C stack trace
  • py-bt prints the Python stack trace
  • py-bt-full prints both
  • py-list lists the current Python source code
  • py-up/py-down moves one Python frame up or down
  • py-locals shows local variables in the current frame
  • py-print <var> prints the representation of a variable in the current frame
  • info threads displays all threads
  • thread <id> switches current thread to thread number <id>
  • thread apply all <cmd> applies <cmd> to all threads, e.g., thread apply all py-list
  • python MAX_OUTPUT_LEN=<N> sets the length after which the output of a Python value is truncated (default is 1024)

Custom extensions

It is possible to add custom extensions to gdb. An example is pyfields.py. It can be loaded with source path/to/pyfields.py and provides a new command:

  • py-fields obj.attr prints the value of the attribute attr of the Python object obj in the current frame
  • py-fields obj.* prints the values of all attributes of obj

Troubleshouting

  • Attaching the debugger to a running process halts its execution. This can cause problems with, e.g., heartbeats. To just execute one command and detach immediately, try, e.g.,
    sudo -u <user> gdb --pid <pid> \
       --ex "set pagination off" \

       --ex "thread apply all py-bt-full" \

       --ex "detach" --ex "quit"

    Alternatively, create a core dump, see below.
  • On some systems, attaching a debugger is only allowed for child processes. To allow attaching to all processes of the same user, execute echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
    1127c1127
      <             fields = gdb.lookup_type('PyUnicodeObject').target().fields()
      ---
      >             fields = gdb.lookup_type('PyUnicodeObject').fields()
  • If the python macros return messages like Unable to locate python frame or Unable to locate gdb frame for python bytecode interpreter, make sure you used gdb -p <pid> and not gdb python -p <pid>. Furthermore check that there are no other running gdb instances attached to the same process.

Working with core dumps

This is useful if stopping the program causes problems, e.g., in other components in a distributed system.

  1. Find the pid:
    pgrep -af <name>
  2. Generate core dump:
    gcore <pid>
    This will stop the program while creating the core dump, which might take a few seconds.
  3. Start the debugger:
    gdb python <core_file> or gdb python3 <core_file>, depending on the programs Python version
  4. Use the usual gdb commands to investigate the state of the program, see above.

Example session

For this example session, an ascan was executed in spock. In the middle of the scan, a core dump of the MacroServer Tango server was created.

  1. Load the core dump with gdb:
    $ gdb python core.26201
    GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
    ...
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    Core was generated by `/usr/bin/python2'.
    #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    185	../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
    [Current thread is 1 (Thread 0x7f82f95355c0 (LWP 26201))]
    (gdb)

  2. Get an overview of all threads:
    (gdb) thread apply all py-bt
    
    Thread 36 (Thread 0x7f82698df700 (LWP 4873)):
    Traceback (most recent call first):
    
    Thread 35 (Thread 0x7f826bc9c700 (LWP 4804)):
    ...
    Thread 22 (Thread 0x7f828ffff700 (LWP 26227)):
    Traceback (most recent call first):
      Waiting for the GIL
      File "/usr/lib/python2.7/threading.py", line 340, in wait
        waiter.acquire()
      File "/usr/lib/python2.7/dist-packages/taurus/core/util/event.py", line 659, in waitEvent
        self._cond.wait(timeout)
      File "/usr/lib/python2.7/dist-packages/sardana/taurus/core/tango/sardana/pool.py", line 491, in waitFinish
        timeout=timeout, retries=1)
      File "/usr/lib/python2.7/dist-packages/sardana/taurus/core/tango/sardana/pool.py", line 503, in go
        self.waitFinish(id=eid)
      File "/usr/lib/python2.7/dist-packages/sardana/taurus/core/tango/sardana/pool.py", line 260, in new_fn
        return fn(*args, **kwargs)
      File "/usr/lib/python2.7/dist-packages/sardana/taurus/core/tango/sardana/pool.py", line 1819, in go
        PoolElement.go(self, *args, **kwargs)
      File "/usr/lib/python2.7/dist-packages/sardana/macroserver/scan/gscan.py", line 1113, in stepUp
        state, data_line = mg.count(integ_time)
      File "/usr/lib/python2.7/dist-packages/sardana/macroserver/scan/gscan.py", line 1029, in scan_loop
        self.stepUp(i, step, lstep)
      File "/usr/lib/python2.7/dist-packages/sardana/macroserver/scan/gscan.py", line 949, in step_scan
        for i in self.scan_loop():
      File "/usr/lib/python2.7/dist-packages/sardana/macroserver/macros/scan.py", line 292, in run
        for step in self._gScan.step_scan():
      File "/usr/lib/python2.7/dist-packages/sardana/macroserver/macro.py", line 2460, in exec_
        thrown"""
      File "/usr/lib/python2.7/dist-packages/sardana/macroserver/msmacromanager.py", line 1593, in runMacro
        self.sendMacroStatusStop()
      File "/usr/lib/python2.7/dist-packages/sardana/macroserver/msmacromanager.py", line 1567, in __runXMLMacro
        exc_info = sys.exc_info()
      File "/usr/lib/python2.7/dist-packages/sardana/macroserver/msmacromanager.py", line 1549, in __runStatelessXML
        macro_exp, tb, result = None, None, None
      File "/usr/lib/python2.7/dist-packages/sardana/macroserver/msmacromanager.py", line 1533, in __runXML
    
      File "/usr/lib/python2.7/dist-packages/taurus/core/util/threadpool.py", line 133, in run
        callback(cmd(*args, **kw))
      File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
        self.run()
      File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
        self.__bootstrap_inner()
    ...

    Here one can see, that the thread that executes the macro is thread 22. All other threads are not executing Python code or are idling threads in the Taurus thread pool.
  3. Switch to the thread that executes the macro
    (gdb) thread 22
    [Switching to thread 22 (Thread 0x7f828ffff700 (LWP 26227))]
    #0  0x00007f82f912d556 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f828800cab0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
    205	in ../sysdeps/unix/sysv/linux/futex-internal.h
  4. Use py-up to get out of the deep library calls until a frame that executes Taurus code is reached:

    (gdb) py-up
    #6 Frame 0x7f829c099bb0, for file /usr/lib/python2.7/threading.py, line 340, in wait (self=<_Condition(_Condition__lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f826d64f110>, _RLock__count=0) at remote 0x7f826d740350>, acquire=<instancemethod at remote 0x7f826d790be0>, _is_owned=<instancemethod at remote 0x7f826d73f690>, _release_save=<instancemethod at remote 0x7f826d73f7d0>, release=<instancemethod at remote 0x7f826d7352d0>, _acquire_restore=<instancemethod at remote 0x7f826d73f730>, _Verbose__verbose=False, _Condition__waiters=[<thread.lock at remote 0x7f826d64f270>]) at remote 0x7f826d740490>, timeout=None, waiter=<thread.lock at remote 0x7f826d64f270>, saved_state=(2, 140198738392832))
        waiter.acquire()

    The output shows, that the first call to py-up went to line 340 in /usr/lib/python2.7/threading.py inside the wait function from the Python standard library.

    (gdb) py-up
    #10 Frame 0x7f829c017050, for file /usr/lib/python2.7/dist-packages/taurus/core/util/event.py, line 659, in waitEvent (self=<AttributeEventWait(_last_val=<DevState(_value_=6, __objclass__=<EnumMeta(__module__='taurus.core.tango.enums', __ne__=<method-wrapper at remote 0x7f82ebe66210>, STANDBY=<DevState(_value_=7, __objclass__=<...>, _name_='STANDBY') at remote 0x7f82ebe5fd88>, UNKNOWN=<DevState(_value_=13, __objclass__=<...>, _name_='UNKNOWN') at remote 0x7f82ebe69128>, ALARM=<DevState(_value_=11, __objclass__=<...>, _name_='ALARM') at remote 0x7f82ebe5ffc8>, MOVING=<...>, __lt__=<method-wrapper at remote 0x7f82ebe66110>, OPEN=<DevState(_value_=3, __objclass__=<...>, _name_='OPEN') at remote 0x7f82ebe5fb48>, INSERT=<DevState(_value_=4, __objclass__=<...>, _name_='INSERT') at remote 0x7f82ebe5fbd8>, __reduce_ex__=<instancemethod at remote 0x7f82ec0fc7d0>, __new__=<staticmethod at remote 0x7f82f0244fd8>, _value2member_map_={0: <DevState(_value_=0, __objclass__=<...>, _name_='ON') at remote 0x7f82ebe5f950>, 1: <DevS...(truncated)
        self._cond.wait(timeout)

    After a second call to py-up, the first frame containing Taurus code is reached, i.e., line 659 in /usr/lib/python2.7/dist-packages/taurus/core/util/event.py inside the waitEvent function.

  5. Get the currently executed code:
    (gdb) py-list
     654                        for v, t in s.items():
     655                            if v == val:
     656                                continue
     657                            if t >= after:
     658                                return
    >659                    self._cond.wait(timeout)
     660                    retries -= 1
     661            except Exception, e:
     662                sys.stderr.write(
     663                    "AttributeEventWait: Caught exception while waitting: %s\n" % str(e))
     664                raise e

    Unfortunately, the number of lines shown around the current line is hardcoded to ±5 in the py-list function in /usr/share/gdb/auto-load/usr/bin/python3.5m-gdb.py. To view the whole code of the waitEvent function, open the corresponding file in an external viewer.

  6. List all local variables:

    (gdb) py-locals
    self = <AttributeEventWait(_last_val=<DevState(_value_=6, __objclass__=<EnumMeta(__module__='taurus.core.tango.enums', __ne__=<method-wrapper at remote 0x7f82ebe66210>, STANDBY=<DevState(_value_=7, __objclass__=<...>, _name_='STANDBY') at remote 0x7f82ebe5fd88>, UNKNOWN=<DevState(_value_=13, __objclass__=<...>, _name_='UNKNOWN') at remote 0x7f82ebe69128>, ALARM=<DevState(_value_=11, __objclass__=<...>, _name_='ALARM') at remote 0x7f82ebe5ffc8>, MOVING=<...>, __lt__=<method-wrapper at remote 0x7f82ebe66110>, OPEN=<DevState(_value_=3, __objclass__=<...>, _name_='OPEN') at remote 0x7f82ebe5fb48>, INSERT=<DevState(_value_=4, __objclass__=<...>, _name_='INSERT') at remote 0x7f82ebe5fbd8>, __reduce_ex__=<instancemethod at remote 0x7f82ec0fc7d0>, __new__=<staticmethod at remote 0x7f82f0244fd8>, _value2member_map_={0: <DevState(_value_=0, __objclass__=<...>, _name_='ON') at remote 0x7f82ebe5f950>, 1: <DevState(_value_=1, __objclass__=<...>, _name_='OFF') at remote 0x7f82ebe5fa70>, 2: <DevState(_value_=2, __objclass__=<...>,...(truncated)
    val = <DevState at remote 0x7f82f6a5a090>
    after = <float at remote 0x7f8264021230>
    equal = False
    timeout = None
    retries = 2
    any = False
    s = {<DevState(_value_=0, __objclass__=<EnumMeta(__module__='taurus.core.tango.enums', __ne__=<method-wrapper at remote 0x7f82ebe66210>, STANDBY=<DevState(_value_=7, __objclass__=<...>, _name_='STANDBY') at remote 0x7f82ebe5fd88>, UNKNOWN=<DevState(_value_=13, __objclass__=<...>, _name_='UNKNOWN') at remote 0x7f82ebe69128>, ALARM=<DevState(_value_=11, __objclass__=<...>, _name_='ALARM') at remote 0x7f82ebe5ffc8>, MOVING=<DevState(_value_=6, __objclass__=<...>, _name_='MOVING') at remote 0x7f82ebe5fcf8>, __lt__=<method-wrapper at remote 0x7f82ebe66110>, OPEN=<DevState(_value_=3, __objclass__=<...>, _name_='OPEN') at remote 0x7f82ebe5fb48>, INSERT=<DevState(_value_=4, __objclass__=<...>, _name_='INSERT') at remote 0x7f82ebe5fbd8>, __reduce_ex__=<instancemethod at remote 0x7f82ec0fc7d0>, __new__=<staticmethod at remote 0x7f82f0244fd8>, _value2member_map_={0: <...>, 1: <DevState(_value_=1, __objclass__=<...>, _name_='OFF') at remote 0x7f82ebe5fa70>, 2: <DevState(_value_=2, __objclass__=<...>, _name_='CLOSE') at remot...(truncated)
    v = <DevState(_value_=6, __objclass__=<EnumMeta(__module__='taurus.core.tango.enums', __ne__=<method-wrapper at remote 0x7f82ebe66210>, STANDBY=<DevState(_value_=7, __objclass__=<...>, _name_='STANDBY') at remote 0x7f82ebe5fd88>, UNKNOWN=<DevState(_value_=13, __objclass__=<...>, _name_='UNKNOWN') at remote 0x7f82ebe69128>, ALARM=<DevState(_value_=11, __objclass__=<...>, _name_='ALARM') at remote 0x7f82ebe5ffc8>, MOVING=<...>, __lt__=<method-wrapper at remote 0x7f82ebe66110>, OPEN=<DevState(_value_=3, __objclass__=<...>, _name_='OPEN') at remote 0x7f82ebe5fb48>, INSERT=<DevState(_value_=4, __objclass__=<...>, _name_='INSERT') at remote 0x7f82ebe5fbd8>, __reduce_ex__=<instancemethod at remote 0x7f82ec0fc7d0>, __new__=<staticmethod at remote 0x7f82f0244fd8>, _value2member_map_={0: <DevState(_value_=0, __objclass__=<...>, _name_='ON') at remote 0x7f82ebe5f950>, 1: <DevState(_value_=1, __objclass__=<...>, _name_='OFF') at remote 0x7f82ebe5fa70>, 2: <DevState(_value_=2, __objclass__=<...>, _name_='CLOSE') at remote 0x7...(truncated)
    t = <float at remote 0x7f8264021230>

    Here one can already see that self is an instance of type AttributeEventWait with an attribute _last_val of the type DevState. Looking into the documentation reveals that this is an enum type where _value=6 corresponds to DevState.MOVING. Some values are not shown, e.g., for val or after. I didn't find an easy way to get the actual value of after. Fortunately, the value for val is hardcoded in the caller frame and can thus be obtained by:

    (gdb) py-up
    #14 Frame 0x7f829c086530, for file /usr/lib/python2.7/dist-packages/sardana/taurus/core/tango/sardana/pool.py, line 491, in waitFinish (self=<MeasurementGroup(log_full_name='haso232s.desy.de:10000.mntgrp02', _lock_info=<TaurusLockInfo(status=3, status_msg='Lock status unknown', host=None, id=None, klass=None) at remote 0x7f826d7acdd0>, log_name='mntgrp02', _configuration=<MGConfiguration(description='General purpose measurement configuration', tango_dev_channels_in_error=0, tango_channels_info_in_error=0, tango_dev_channels=<CaselessDict at remote 0x7f826d786740>, channels=<CaselessDict at remote 0x7f826d7ba3c8>, initialized=False, channel_list=[{'ndim': 0, 'index': 0, 'name': 'ct05', 'plot_type': 0, 'enabled': True, 'label': 'ct05', 'source': 'tango://haso232s.desy.de:10000/expchan/ctctrl02/1/Value', '_controller_name': 'tango://haso232s.desy.de:10000/controller/dummycountertimercontroller/ctctrl02', 'conditioning': '', 'full_name': 'tango://haso232s.desy.de:10000/expchan/ctctrl02/1', 'output': True, 'plot_axes'...(truncated)
        timeout=timeout, retries=1)
    (gdb) py-list
     486                id = id[0]
     487            evt_wait = self._getEventWait()
     488            evt_wait.lock()
     489            try:
     490                evt_wait.waitEvent(DevState.MOVING, after=id, equal=False,
    >491                                   timeout=timeout, retries=1)
     492            finally:
     493                self.__go_end_time = time.time()
     494                self.__go_time = self.__go_end_time - self.__go_start_time
     495                evt_wait.unlock()
     496                evt_wait.disconnect()

    To go back to the previous frame, one can either call py-down or refer to it by number:

    (gdb) frame 10
    #10 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
    2989	in ../Python/ceval.c

    At this point one knows that the current thread is waiting for an attribute to be different from DevState.MOVING.

  7. Investigate the Python object self with the help of the custom extension pyfields, see above. First load the extension:

    (gdb) source ~/Public/pyfields.py

    Now, one can either print a single attribute or all attributes of a given object:

    (gdb) py-fields self.*
    local '_last_val' = <DevState(_value_=6, __objclass__=<EnumMeta(__module__='taurus.core.tango.enums', __ne__=<method-wrapper at remote 0x7f82ebe66210>, STANDBY=<DevState(_value_=7, __objclass__=<...>, _name_='STANDBY') at remote 0x7f82ebe5fd88>, UNKNOWN=<DevState(_value_=13, __objclass__=<...>, _name_='UNKNOWN') at remote 0x7f82ebe69128>, ALARM=<DevState(_value_=11, __objclass__=<...>, _name_='ALARM') at remote 0x7f82ebe5ffc8>, MOVING=<...>, __lt__=<method-wrapper at remote 0x7f82ebe66110>, OPEN=<DevState(_value_=3, __objclass__=<...>, _name_='OPEN') at remote 0x7f82ebe5fb48>, INSERT=<DevState(_value_=4, __objclass__=<...>, _name_='INSERT') at remote 0x7f82ebe5fbd8>, __reduce_ex__=<instancemethod at remote 0x7f82ec0fc7d0>, __new__=<staticmethod at remote 0x7f82f0244fd8>, _value2member_map_={0: <DevState(_value_=0, __objclass__=<...>, _name_='ON') at remote 0x7f82ebe5f950>, 1: <DevState(_value_=1, __objclass__=<...>, _name_='OFF') at remote 0x7f82ebe5fa70>, 2: <DevState(_value_=2, __objclass__=<...>, _name_='CLOSE') at remote 0x7...(truncated)
    local '_attr' = <TangoAttribute(_TangoAttribute__read_lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f826f1fca70>, _RLock__count=0) at remote 0x7f826d7acf10>, log_full_name='haso232s.desy.de:10000.mntgrp02.state', _TangoAttribute__dev_hw_obj=<DeviceProxy(_subscribed_events={32: (<__CallBackPushEvent(push_event=<function at remote 0x7f826d73a410>) at remote 0x7f826d741310>, <EventType at remote 0x7f82f6dc4550>, 'IntegrationTime'), 33: (<__CallBackPushEvent(push_event=<function at remote 0x7f826d73a488>) at remote 0x7f826d741260>, <EventType at remote 0x7f82f6dc4f50>, 'Moveable'), 34: (<__CallBackPushEvent(push_event=<function at remote 0x7f826d73a500>) at remote 0x7f826d741578>, <...>, 'Moveable'), 5: (<__CallBackPushEvent(push_event=<function at remote 0x7f826d7aad70>) at remote 0x7f826d7b25d0>, <...>, 'state'), 6: (<__CallBackPushEvent(push_event=<function at remote 0x7f826d767938>) at remote 0x7f826d76c788>, <...>, 'state'), 7: (<__CallBackPushEvent(push_event=<function at...(truncated)
    local '_cond' = <_Condition(_Condition__lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f826d64f110>, _RLock__count=0) at remote 0x7f826d740350>, acquire=<instancemethod at remote 0x7f826d790be0>, _is_owned=<instancemethod at remote 0x7f826d73f690>, _release_save=<instancemethod at remote 0x7f826d73f7d0>, release=<instancemethod at remote 0x7f826d7352d0>, _acquire_restore=<instancemethod at remote 0x7f826d73f730>, _Verbose__verbose=False, _Condition__waiters=[<thread.lock at remote 0x7f826d64f270>]) at remote 0x7f826d740490>
    local '_event_set' = {<DevState(_value_=0, __objclass__=<EnumMeta(__module__='taurus.core.tango.enums', __ne__=<method-wrapper at remote 0x7f82ebe66210>, STANDBY=<DevState(_value_=7, __objclass__=<...>, _name_='STANDBY') at remote 0x7f82ebe5fd88>, UNKNOWN=<DevState(_value_=13, __objclass__=<...>, _name_='UNKNOWN') at remote 0x7f82ebe69128>, ALARM=<DevState(_value_=11, __objclass__=<...>, _name_='ALARM') at remote 0x7f82ebe5ffc8>, MOVING=<DevState(_value_=6, __objclass__=<...>, _name_='MOVING') at remote 0x7f82ebe5fcf8>, __lt__=<method-wrapper at remote 0x7f82ebe66110>, OPEN=<DevState(_value_=3, __objclass__=<...>, _name_='OPEN') at remote 0x7f82ebe5fb48>, INSERT=<DevState(_value_=4, __objclass__=<...>, _name_='INSERT') at remote 0x7f82ebe5fbd8>, __reduce_ex__=<instancemethod at remote 0x7f82ec0fc7d0>, __new__=<staticmethod at remote 0x7f82f0244fd8>, _value2member_map_={0: <...>, 1: <DevState(_value_=1, __objclass__=<...>, _name_='OFF') at remote 0x7f82ebe5fa70>, 2: <DevState(_value_=2, __objclass__=<...>, _name_='CLOSE') at remot...(truncated)

    This shows, that the _attr attribute points to an instance of the TangoAttribute class. Using py-fields self._attr.* or looking up the source code of TangoAttribute, one can find that is has an attribute named _full_name:

    (gdb) py-fields self._attr._full_name
    local '_full_name' = 'tango://haso232s.desy.de:10000/mntgrp/pool_haso232s/mntgrp02/state'

    As a result of the investigation so far, the macro server seems to be waiting for the Tango attribute mntgrp/pool_haso232s/mntgrp02/state to exit the MOVING state. By going further up the stack trace, one could find out why the macro server is waiting for this particular event, or, if it was the device that is hanging, one could attach a new debugger session to the corresponding Tango server (or its core dump) and investigate the state of that device.