Date:      Fri, 22 Feb 2002 09:50:59 -0800 (PST)
From:      Sandeep Kumar <skumar@juniper.net>
To:        freebsd-gnats-submit@FreeBSD.org
Subject:   bin/35214: dump program hangs while exiting
Message-ID:  <200202221750.g1MHoxh45595@freefall.freebsd.org>


>Number:         35214
>Category:       bin
>Synopsis:       dump program hangs while exiting
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Feb 22 10:00:03 PST 2002
>Closed-Date:
>Last-Modified:
>Originator:     Sandeep Kumar
>Release:        4.2
>Organization:
Juniper
>Environment:
>Description:
Backtrace of the hung dump process:
0x88055efc in nanosleep () from /usr/libexec/ld-elf.so.1
(gdb) bt
#0  0x88055efc in nanosleep () from /usr/libexec/ld-elf.so.1
#1  0x88054bd9 in wlock_acquire () from /usr/libexec/ld-elf.so.1
#2  0x880539fa in rtld_exit () from /usr/libexec/ld-elf.so.1
#3  0x880de0c8 in exit () from /usr/lib/libc.so.4
#4  0x804c9d3 in Exit ()
#5  0x804cb20 in enslave ()
#6  0x804c8f3 in startnewtape ()
#7  0x804a622 in main ()
#8  0x8049601 in _start ()

wlock_acquire waits forever for the lock to be released by a reader. Since
dump is a non-threaded application, the reader has to be this process itself.
The lock and unlock invocations for this lock appeared to be paired. The
lock structure did not look corrupted either, and its fields were consistent.
So the only remaining possibility was that the process was interrupted by a
signal while it held the read lock. The SIGUSR2 handler of dump makes this
plausible: it calls longjmp, which can leave the read lock locked if the
signal arrives during the _rtld_bind operation. It remained to confirm that
this is what had happened.

From a symbolic dump of the stack page, I was able to locate the sigframe
structure on the stack. The kernel copies this structure onto the user
stack; it also records the register contents at the time of the trap,
when the signal was delivered. Some of its signature items are the signal
number and the saved return pointer to the signal trampoline code at the base
of the user stack. The structure looked good, and the saved eip was
symlook_list+27. This function ends up getting called after a call to
_rtld_bind, which acquires the read lock. So we did do a longjmp while
holding the read lock.

The dump application makes extensive use of signalling to communicate between
the different children of the dump program. SIGUSR2 is delivered 3-4 times a
second, so it is possible that once in a while it gets delivered while we
are doing a _rtld_bind. An obvious solution would be to mask signals for as
long as the read lock is held. In fact, the same signal-blocking fix
was made for acquiring the writer version of this lock, by jdp@polstra.com,
in BSD. When I suggested making the same change for the reader lock, I
received the following reply:
"It would hurt performance too much.  The rtld would have to do two
system calls for every symbol it resolved lazily."

 
>How-To-Repeat:
Run "dump -f - FS1 | restore -f - FS2" in an infinite loop
>Fix:
May involve redesigning dump to not use longjmp, as it is not safe to call it from a signal handler.
>Release-Note:
>Audit-Trail:
>Unformatted:




