Thursday, March 16, 2006

Linode, Debian, libc6 and that stupid TLS problem (again)

After a recent "apt-get install package/testing" to set up some Apache/php4-based products, a number of system services (MySQL, Bind9) mysteriously segfaulted and refused to start. (Skip to the end for the quick fix.)

tail /var/log/syslog
Mar 16 16:02:08 sapphire named[9240]: starting BIND 9.2.4
Mar 16 16:02:08 sapphire named[9240]: using 1 CPU
(end)

Lots of head-scratching... perhaps I had stuffed up the versions:

apt-get install apt-show-versions

sapphire:/var/log# apt-show-versions | fgrep bind
bind9-doc/stable uptodate 1:9.2.4-1
libbind9-0/testing uptodate 1:9.3.2-2
webmin-bind/stable uptodate 1.180-4
bind9-host/stable uptodate 1:9.2.4-1
bind9/stable uptodate 1:9.2.4-1

Hmm, 'libbind9' is from testing? That's a bit odd. Okay, let's update the lot to 'testing':

sapphire:/var/log# apt-get install bind9-host/testing bind9/testing
...
Stopping domain name service: named/etc/init.d/bind9: line 60:
8923 Segmentation fault /usr/sbin/rndc stop
Starting domain name service...:.

Those dots didn't look good...

ps aux | grep bind
(nothing)
So Bind couldn't start, even though I was sure there were no dependency issues and all the packages came from the same 'testing' release. Odd. More thinking... How about downgrading to 'stable'?

sapphire:/var/log# apt-get install bind9/stable libisccc0/stable bind9-host/stable
Reading Package Lists... Done
Building Dependency Tree... Done
Selected version 1:9.2.4-1 (Debian:3.1r1/stable) for bind9
Selected version 1:9.2.4-1 (Debian:3.1r1/stable) for libisccc0
Selected version 1:9.2.4-1 (Debian:3.1r1/stable) for bind9-doc
Selected version 1:9.2.4-1 (Debian:3.1r1/stable) for bind9-host
The following packages will be DOWNGRADED: bind9 bind9-doc bind9-host libisccc0
0 upgraded, 0 newly installed, 4 downgraded, 0 to remove and 0 not upgraded.

ps aux | grep bind
(still nothing)
Crap! Both testing and stable are broken... More thinking... what else did I upgrade to 'testing'?

sapphire:/var/log# apt-show-versions | fgrep libc6
libc6/testing uptodate 2.3.5-13
Yes, in my rush I'd updated libc6 to Debian/testing. After briefly flirting with the idea of downgrading libc6 (not easy on a remote server) I was somewhat stumped for ideas.
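
With hindsight, listing everything that had crept in from 'testing' would have found libc6 much sooner. A minimal sketch, assuming apt-show-versions prints lines in the "name/release status version" form shown above (the function name packages_from_release is made up for this example):

```shell
# Print the name of every package that apt-show-versions reports as
# coming from the given release (default: testing).
# Reads apt-show-versions output on stdin, for example:
#   apt-show-versions | packages_from_release testing
packages_from_release() {
    release="${1:-testing}"
    # Input lines look like "libc6/testing uptodate 2.3.5-13":
    # split the first field on "/" and compare the release part.
    awk -v rel="$release" '{
        split($1, parts, "/")
        if (parts[2] == rel) print parts[1]
    }'
}
```

On a box that is supposed to be all 'stable', anything this prints is worth a second look, libc6 most of all.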

Let's find out where Bind9 was segfaulting with 'strace'. Strace is a great tool, but it's often hard to get any value out of the pages of output. But first, how do I start Bind manually?

sapphire:/home/rob/# grep daemon /etc/init.d/bind9
if start-stop-daemon --start --quiet --exec /usr/sbin/named

Okay, /usr/sbin/named loads the Bind9 server, but it daemonizes itself:

man named
(searching... ahh, '-f Run the server in the foreground')

sapphire:/var/log# strace /usr/sbin/named -f
execve("/usr/sbin/named", ["/usr/sbin/named", "-f"], [/* 18 vars */]) = 0
uname({sys="Linux", node="sapphire", ...}) = 0
brk(0) = 0x808e000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40017000
---------------8<---------------8<-----------------
open("/lib/tls/libpthread.so.0", O_RDONLY) = 4
read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\360G\0"..., 512) = 512
fstat64(4, {st_mode=S_IFREG|0755, st_size=85770, ...}) = 0
old_mmap(NULL, 70104, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x40291000
---------------8<---------------8<-----------------
clone(child_stack=0x405df4c4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND...
+++ killed by SIGSEGV +++
Okay, there is the crash, right at the end. It's rather hard to get anything useful out of 20 screenfuls of trace... until '/lib/tls' in the middle caught my eye. Of course: the libc6 update had replaced the /lib/tls folder that I had carefully renamed out of the way many months ago.
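
Rather than eyeballing the whole trace, the /lib/tls lines can be pulled out of a saved copy. A rough sketch, assuming the trace was saved with strace's -o option and that library loads show up as open() calls like the one above (the function name tls_libs_in_trace is just for illustration):

```shell
# Print the unique paths under /lib/tls that were opened successfully,
# reading saved strace output on stdin. Capture a trace first with e.g.:
#   strace -o /tmp/named.trace /usr/sbin/named -f
#   tls_libs_in_trace < /tmp/named.trace
tls_libs_in_trace() {
    # Keep only open()/openat() calls on a /lib/tls path whose result is a
    # non-negative file descriptor; failed opens (= -1 ENOENT ...) are skipped.
    sed -n 's|^.*open[a-z]*([^"]*"\(/lib/tls/[^"]*\)".*) = [0-9][0-9]*$|\1|p' | sort -u
}
```

Any output at all means the process is loading the TLS builds of its libraries, which is exactly what breaks here.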

mv /lib/tls /lib/tls-disabled-again

/etc/init.d/bind9 start

sapphire:/lib# ps aux | grep bind
bind 9596 0.0 1.7 11300 2652 ? Ss 17:03 0:00 /usr/sbin/named -u bind
bind 9600 0.0 1.7 11300 2652 ? S 17:03 0:00 /usr/sbin/named -u bind
bind 9601 0.0 1.7 11300 2652 ? S 17:03 0:00 /usr/sbin/named -u bind
bind 9602 0.0 1.7 11300 2652 ? S 17:03 0:00 /usr/sbin/named -u bind
bind 9603 0.0 1.7 11300 2652 ? S 17:03 0:00 /usr/sbin/named -u bind
Great, finally an answer that I already knew. From the Linode forum: "UML does not (yet) support Thread Local Storage (TLS) in either 2.4 or 2.6. TLS is required by the Native POSIX Thread Library (NPTL) so NPTL is also not supported by UML".
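
Since any future libc6 upgrade will put /lib/tls back, the rename is worth keeping around as a small script to rerun after upgrades. A minimal sketch of the same 'mv' fix; the function name disable_tls_dir, the optional directory argument, and the timestamp suffix are my own choices for this example, not anything Debian or Linode ships:

```shell
# Rename <libdir>/tls out of the way if it exists, so that programs load
# the regular libraries in <libdir> instead of the TLS builds.
# libdir defaults to /lib; the argument exists so the function can be
# tried on a scratch directory first.
disable_tls_dir() {
    libdir="${1:-/lib}"
    if [ -d "$libdir/tls" ]; then
        # A timestamp keeps repeated runs from colliding with older renames.
        target="$libdir/tls.disabled.$(date +%Y%m%d%H%M%S)"
        mv "$libdir/tls" "$target" && echo "moved $libdir/tls to $target"
    else
        echo "$libdir/tls not present, nothing to do"
    fi
}
```

Run it as root with no argument to act on /lib, then start the affected services again (for example /etc/init.d/bind9 start).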

More info:
http://www.linode.com/forums/viewtopic.php?t=1082
http://www.linode.com/forums/viewtopic.php?t=1160
