From owner-freebsd-questions@FreeBSD.ORG Tue Mar 30 07:42:38 2004
Message-ID: <4069A381.4020206@netfence.it>
Date: Tue, 30 Mar 2004 17:42:41 +0100
From: Andrea Venturoli
Organization: NetFence
To: freebsd-questions@freebsd.org
Subject: A night with threads and gdb

A night with threads and gdb
or
How I began to wonder whether 5.2.1 works or thread support is really broken

It all started on Saturday 2004/3/27: the spring sun was shining hot and I was struggling to get apache working decently on a 5.2.1p3/i386 box (more on this later). While portupgrading mod_php4, the system suddenly stopped working properly: no more "make install", no more "install", even "ls -l" would dump core!!!
I wondered what could have caused this, since changes to installed ports should not affect the stability of binaries from the base system. I tried moving /usr/local/lib out of the way and "ls -l" worked again. Logic or intuition led me to blame nss_ldap, so I disabled it and everything worked fine again.

To make it clear: with nss_ldap enabled, everything that accessed the user database would crash: "ls -l", "id" and so on (but not, e.g., "ls" without "-l").

I recompiled ls and libc with -ggdb3 and found out that the problem was in nsdispatch.c, precisely in the last line of the following function:

    nss_atexit(void)
    {
            (void)_pthread_rwlock_wrlock(&nss_lock);
            vector_free((void **)&_nsmap, &_nsmapsize, sizeof(*_nsmap),
                (vector_free_elem)ns_dbt_free);
            vector_free((void **)&_nsmod, &_nsmodsize, sizeof(*_nsmod),
                (vector_free_elem)ns_mod_free);
            (void)_pthread_rwlock_unlock(&nss_lock);
    }

Once again Google turned out to be man's best friend, by providing me with the following link:

http://groups.google.it/groups?q=vector_free+nss_atexit&hl=it&lr=&ie=UTF-8&oe=UTF-8&selm=1080344625.82158.35.camel_server.mcneil.com%40ns.sol.net&rnum=1

Apart from the psychological help derived from knowing I'm not alone, this suggested patching that file to look like:

    nss_atexit(void)
    {
            if (__isthreaded)
                    (void)_pthread_rwlock_wrlock(&nss_lock);
            vector_free((void **)&_nsmap, &_nsmapsize, sizeof(*_nsmap),
                (vector_free_elem)ns_dbt_free);
            vector_free((void **)&_nsmod, &_nsmodsize, sizeof(*_nsmod),
                (vector_free_elem)ns_mod_free);
            if (__isthreaded)
                    (void)_pthread_rwlock_unlock(&nss_lock);
    }

I did so, and did the same for the other pthread calls in that file, declaring __isthreaded as:

    extern int __isthreaded;

That was one step forward: now "ls -l /bin" would crash no more, but "ls -l /home" was still problematic.
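For readers who have not met this pattern, the guard can be sketched in isolation. This is only an illustration under assumptions: is_threaded is a hypothetical stand-in for libc's private __isthreaded flag, a plain pthread mutex is swapped in for the rwlock to keep it short, and table_lock, table_entries, lock_calls and the table_* functions are made-up names, not anything taken from nsdispatch.c.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for libc's private __isthreaded flag. */
int is_threaded = 0;

/* A plain mutex replaces the rwlock of the real code (assumption). */
pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

int lock_calls = 0;     /* counts real lock acquisitions, for illustration */
int table_entries = 3;  /* stands in for the _nsmap/_nsmod vectors */

/* Take the lock only when the process is really multithreaded. */
void
table_lock_if_threaded(void)
{
	if (is_threaded) {
		int rc = pthread_mutex_lock(&table_lock);

		assert(rc == 0);
		(void)rc;
		lock_calls++;
	}
}

/* Release the lock under the same condition it was taken. */
void
table_unlock_if_threaded(void)
{
	if (is_threaded) {
		int rc = pthread_mutex_unlock(&table_lock);

		assert(rc == 0);
		(void)rc;
	}
}

/* Analogue of nss_atexit(): empty the table under the guarded lock. */
void
table_cleanup(void)
{
	table_lock_if_threaded();
	table_entries = 0;
	table_unlock_if_threaded();
}
```

Calling table_cleanup() while is_threaded is 0 empties the table without touching the lock at all; setting is_threaded to 1 first makes the same call take and release the mutex. The flag must not change between the lock and the unlock, which is the usual weak point of this kind of guard.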
The difference between /bin and /home is obviously that in /bin everything is owned by system accounts, while listing /home implies searching for users in the ldap database. I guessed the problem was that upgrading php had upgraded openldap too, so I looked at FreshPorts and found out that the main difference was in the makefile, where "-with-threads" had been replaced with "-with-threads=posix". I decided to try the three alternatives:

a) -without-threads would not do, as it would cause slapd to crash when ldapsearching with a filter (i.e. "ldapsearch -b 'dc=mydomain'" works fine, but "ldapsearch -b 'dc=mydomain' (objectClass=posixAccount)" does not);
b) -with-threads=posix would exhibit the above mentioned problem with ls;
c) -with-threads would work best.

Now I could even "ls -l /home" and see the correct usernames. However, I could not login or su anymore. (This forced me to go and ask for the keys to the server room and wait until Sunday.)

I ended up finding out (again by 'gdb su') that using nss_ldap now hampers the ability of a process to read from stdin. I can even provide this demonstrative program:

    #include <sys/types.h>
    #include <pwd.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
            int ch;         /* int, so EOF can be told apart from a character */

            getpwent();     /* with "ldap" in nsswitch.conf, later reads fail */
            while ((ch = getchar()) != EOF)
                    putchar(ch);
            return 0;
    }

If I want it to work, I need to comment out either the call to getpwent() or "ldap" in /etc/nsswitch.conf. ktracing su showed "resource temporarily unavailable" (EAGAIN) when it tried to read from descriptor 0. Also, telnetting to localhost:pop3 had qpopper say "I/O error".

Afternoon was over, darkness was coming and the machine had to be up again before morning, so I decided to abandon nss_ldap and migrate the user accounts to the system password files. This will not do in the long run, since it prevents web management, but it has allowed several mail domains to be up again before any message was lost! However, I was forced to increase the username length limit (MAXLOGNAME to 65 in /usr/src/sys/sys/param and UT_NAMESIZE=64 in utmp.h).
Raising those limits is a deviation from a standard system which I'd like to avoid, but it is needed until the day I can get nss_ldap back up. (Long base system recompile.)

Now I had pop3 back up, so it was time to think about smtp. I tried recompiling /usr/ports/mail/sendmail-ldap, but it hangs on the t-event test, after the message:

    ./t-event
    This test may hang. If there is no output within twelve seconds, abort it
    and recompile with -DSM_CONF_SETITIMER=0

I tried make -DSM_CONF_SETITIMER=0, but it makes no difference. This test calls sleep(1) and program flow never gets out of it; if I use gdb and interrupt it, I see it's in poll(); if I single step into that function with gdb, it works fine instead. This looks a lot like PR kern/56339, which is rather old (FreeBSD 4.8), but still open. I'm not sure, however, whether it's really the same problem.

Being already a little suspicious of ldap, I tried /usr/ports/mail/sendmail instead, and it doesn't exhibit this problem. It does fail on the shminit test, but the suggested workaround does its job. I'm not so sure it should be needed, anyway. So I also converted my sendmail maps to files and abandoned ldap completely for now. Later on I realized that sendmail wasn't using authentication, so I deinstalled sendmail and installed sendmail-sasl instead: no problem at all this time (!!!).

In the end, after a 40 km ride, a sleepless night, 20 consecutive hours of work and a couple of pizzas, I finally managed to get my system up again, albeit with some more handicaps than before.
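For anyone who wants to check the sleep(1) behaviour described above outside the sendmail test suite, a watchdog along these lines might do. It is only a sketch under assumptions: sleep_returns() is a made-up helper name, and it relies on alarm() still being able to deliver SIGALRM to the child, which is untested on the affected machine.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: runs sleep(secs) in a child process guarded by
 * an alarm of `limit` seconds.
 * Returns 1 if sleep() came back on its own and the child exited
 * normally, 0 if the child ended any other way (typically killed by
 * the watchdog SIGALRM), and -1 if fork() or waitpid() failed.
 */
int
sleep_returns(unsigned int secs, unsigned int limit)
{
	pid_t pid;
	int status;

	assert(limit > 0);
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: make sure SIGALRM has its default, fatal action. */
		signal(SIGALRM, SIG_DFL);
		alarm(limit);
		sleep(secs);
		_exit(0);
	}
	/* Parent: wait for the child and look at how it ended. */
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```

On a healthy system sleep_returns(1, 5) should give 1 after about a second. A result of 0 would mean the child never came back from sleep(1) within five seconds, which matches the symptom seen in t-event, though it would not say why.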
As for apache, I hoped removing LDAP from PHP would help, but unfortunately nothing has changed:

_ apache 1.3 will core dump on startup if the php module mnogosearch is used (and I need it);

_ apache 2.0 with the default prefork MPM will start, but will chew up all cpu time after a while; using "httpd -DSSL -X" shows that the server dies when nocc is used to forward a mail; needless to say, it's a problem with threads, the exact message being

    Fatal error 'Unable to read from thread kernel pipe' at line 1100 in file /usr/src/lib/libc_r/uthread/uthread_kern.c (errno = 0)

  I guess that when started up without -X, one process dies and the manager httpd does not cope correctly (and starts eating up every cycle);

_ when using the perchild MPM (and recompiling mod_php in a thread-safe manner) httpd doesn't die in the above case, but is very unstable anyway;

_ the worker MPM seems to be the best, but, although no process dies, apache will often stop responding all the same; furthermore SSL is painfully slow, more than ten times slower than plain http.

I have also verified that this same behaviour shows up on another 5.2.1 machine.

From all the above, there are only two possible conclusions I can draw: either there is something really obvious that I'm blindly missing, or the beast is broken down to the bones!

This is at the same time my SOS to the world and an offer to provide the community with any small help I can give in improving this software's stability. If anyone has any hints, please tell me, and if anyone wants core dumps, ktraces or any other test result, just ask!

Please, HELP!!!

 bye
	av.

Ceterum censeo SpamCop delendum esse