Monday, March 5, 2012

Debugging memory explosions with GDB

When writing large/complex/multithreaded software its inevitable that bugs will happen. Valgrind is a great tool for finding memory leaks, however it's not very useful when a bug causes a malloc "explosion" and the OS responds by killing the process. It is especially difficult to find such a "malloc bomb" when it happens randomly after running the code for many hours (which would be days with valgrind's overheads).

Synopsis of a malloc() bomb:
  • run the compiled program inside a debugger: eg. `gdb yourprogramname`. 
  • after watching the program run normally for many hours the bug occurs and memory usage spikes: the buggy program tries to allocate more memory from the heap than is available. 
  • the OS starts paging memory to swap and the hard disk thrashes, the system starts to freeze and becomes unresponsive.
  • you cant get keyboard focus to the console to hit Ctrl+C, as a result you cant break/stop your code inside the debugger.
  • eventually the OS kills the program in a panic and after more disk thrashing keyboard control returns to the debugger saying "Abnormal program exit". 
  • the program is killed with no opportunity to produce a core dump or break into the debugger. 

Even if you were sitting at the gdb console waiting for this to occur, you likely wouldn't catch it before the system becomes unresponsive to the keyboard Ctrl+C. Unfortunately gdb and other debuggers don't allow breakpoints on total memory usage.

Solution: One solution that works very nicely on Ubuntu/Linux is a simple bash script that loops checking memory usage. First run your program at a console inside the debugger `gdb yourprogramname`, and then run/paste this script into another console:
#set the process name here (make sure there is only one instance)
#set the process memory threshold in limit in KiB
PID=`pgrep $NAME`
while true
    MEM=`echo 0 $(cat /proc/$PID/smaps | grep Rss | awk '{print $2}' | sed 's#^#+#') | bc`
    echo "Memory usage: $MEM"
    if [ $MEM -gt $MEMLIMIT ]; then echo "***sending sigint***"; kill -INT $PID; fi
    sleep 1

Make sure you are running only one instance of your program and it is inside the debugger. The script first finds the PID of your running program and then loops forever checking how much memory it is using. If total memory usage passes a predefined threshold (2GiB default above) the script sends SIGINT to your process. This has the same effect as hitting Ctrl+C in the debugger, however the script is 99.9% more likely to send the SIGINT before the disk thrashing and system becomes unresponsive. The repeated SIGINTs are ignored.

Example: (returning to the computer after several hours of execution)
Memory usage: 1047948
Memory usage: 1049996
Memory usage: 1049996
Memory usage: 1049812
Memory usage: 1893104
Memory usage: 2692788
***sending sigint***
Memory usage: 2695596
***sending sigint***
Memory usage: 2695596
***sending sigint***
Memory usage: 2695596
***sending sigint***
Memory usage: 3116488
***sending sigint***

The program is now frozen with all 3.1GB of memory still allocated and the debugger waiting in the other console window. This particular OpenGL/Qt-based app had 4 worker threads and finding the cause of this particular bug took some time. I've highlighted the GDB commands that narrowed the problem down:

Normal program output...
Program received signal SIGINT, Interrupt.
[Switching to Thread 0x7ffff7f9a9e0 (LWP 8264)]
0x00007ffff3f11773 in __GI___poll (fds=, nfds=, timeout=) at ../sysdeps/unix/sysv/linux/poll.c:87
87 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
in ../sysdeps/unix/sysv/linux/poll.c
(gdb) info threads
Id Target Id Frame
4 Thread 0x7fffdd557700 (LWP 8269) "mapbuilder" __int64_t_decode_array (_buf=0x2ca8e130, offset=469, maxlen=82, p=0x7fff4cdfbc38, elements=1) at /usr/local/include/lcm/lcm_coretypes.h:278
3 Thread 0x7fffddd58700 (LWP 8268) "mapbuilder" 0x00007ffff645aa1b in ?? () from /usr/lib/nvidia-current/
2 Thread 0x7fffde559700 (LWP 8267) "mapbuilder" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216
* 1 Thread 0x7ffff7f9a9e0 (LWP 8264) "mapbuilder" 0x00007ffff3f11773 in __GI___poll (fds=, nfds=, timeout=) at ../sysdeps/unix/sysv/linux/poll.c:87
(gdb) thread 4
[Switching to thread 4 (Thread 0x7fffdd557700 (LWP 8269))]
#0 __int64_t_decode_array (_buf=0x2ca8e130, offset=469, maxlen=82, p=0x7fff4cdfbc38, elements=1) at /usr/local/include/lcm/lcm_coretypes.h:278
278 if (maxlen < total_size)
(gdb) bt
#0 __int64_t_decode_array (_buf=0x2ca8e130, offset=469, maxlen=82, p=0x7fff4cdfbc38, elements=1) at /usr/local/include/lcm/lcm_coretypes.h:278
#1 0x0000000000444db6 in __double_decode_array (elements=1, p=0x7fff4cdfbc38, maxlen=82, offset=469, _buf=0x2ca8e130) at /usr/local/include/lcm/lcm_coretypes.h:345
#2 gps_t::_decodeNoHash (this=0x7fff4cdfbc08, buf=0x2ca8e130, offset=421, maxlen=130) at /home/rob/mapbuilder/include/gps_t.hpp:160
#3 0x0000000000442805 in _decodeNoHash (maxlen=543, offset=8, buf=0x2ca8e130, this=0x7fff4cdfbbb0) at /home/rob/mapbuilder/include/robot_map_data_t.hpp:116
#4 decode (maxlen=543, offset=0, buf=0x2ca8e130, this=0x7fff4cdfbbb0) at /home/rob/mapbuilder/include/robot_map_data_t.hpp:62
#5 LogPlayThread::run (this=0x97ed60) at /home/rob/mapbuilder/src/logthread.cpp:419
#6 0x00007ffff69f0d05 in ?? () from /usr/lib/x86_64-linux-gnu/
#7 0x00007ffff2f2cefc in start_thread (arg=0x7fffdd557700) at pthread_create.c:304
#8 0x00007ffff3f1d89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#9 0x0000000000000000 in ?? ()

I stepped the execution out of the LCM library and back to my code (#5 in the backtrace). By setting a breakpoint there I was able to find the bug after a few loops. Let me know if this helps you out, it certainly saved me hours staring at the console waiting!

No comments:

Post a Comment