Date:      Tue, 19 Apr 2005 12:38:24 +0800
From:      David Xu <davidxu@freebsd.org>
To:        Peter Edwards <peadar.edwards@gmail.com>
Cc:        FreeBSD Current <current@freebsd.org>
Subject:   Re: Race condition in debugger?
Message-ID:  <42648B40.6040701@freebsd.org>
In-Reply-To: <34cb7c8405041717342891f2@mail.gmail.com>
References:  <20050214014217.GB85932@wantadilla.lemis.com> <34cb7c8405041717342891f2@mail.gmail.com>

Peter Edwards wrote:

>[Very late response: I just experienced the same problem and
>remembered the issue had been brought up before]
>
>On 2/14/05, Greg 'groggy' Lehey <grog@freebsd.org> wrote:
>  
>
>>I'm having some problems with userland gdb on recent -CURRENT builds:
>>at some point it hangs.
>>
>>Specifically, I'm setting a conditional breakpoint like this:
>>
>>  b Minsert_blockletpointer if I->inode_num == 0x1f0bb
>>
>>inode_num increments by 1, so I hit this breakpoint about 100,000
>>times.  Or I should.  What happens is that the debugger hangs at some
>>point on the way.  ktrace shows multiple copies of:
>>
>> 12325 gdb      CALL  ptrace(12,0x3026,0xbfbfd5e0,0)
>> 12325 gdb      RET   ptrace 0
>> 12325 gdb      CALL  ptrace(PT_STEP,0x3026,0x1,0)
>> 12325 gdb      RET   ptrace 0
>> 12325 gdb      CALL  wait4(0xffffffff,0xbfbfd808,0,0)  <-- stops here
>> 12325 gdb      RET   wait4 12326/0x3026
>> 12325 gdb      CALL  kill(0x3026,0)
>> 12325 gdb      RET   kill 0
>> 12325 gdb      CALL  ptrace(PT_GETREGS,0x3026,0xbfbfd5c0,0)
>>
>>When it hangs, it's at the call to wait4, as shown.  It looks like the
>>completion of the ptrace request isn't being reported back.
>>    
>>
>
>I think I know what's going on with this, and I have a feeling that
>a couple of other wait()-related issues that were left open on the
>lists might be explained by it.
>
>Here's my hypothesis: kern_wait() checks each child of the current
>process to see if it has exited, or should otherwise report status
>to wait/wait3/wait4/waitpid. If it finds that no candidate child
>has anything to report, it goes to sleep, waiting to be woken by a
>child reporting status, and then repeats the process. It looks a bit
>like this:
>
>kern_wait()
>{
>loop:
>    foreach child of self {
>        if (child has status to report)
>            return status;
>    }
>    lock self
>    msleep(on "self")
>    unlock self
>    goto loop;
>}
>
>The problem is that no lock guarantees the conditions checked in the
>inner loop still hold by the time the current process locks its own
>"struct proc" and invokes msleep(). (The race is probably most likely
>on an SMP machine or with PREEMPTION, but acquiring curproc's lock
>could also open the window if it needed to sleep.) That is, you can
>miss the wakeup generated by a particular child between checking that
>child in the inner loop and going to sleep.
>
>I can at least reproduce this for the ptrace/gdb case, but AFAICT, it
>could happen for the standard wait()/exit() path, too. I worked up a
>patch to fix the problem: those parts of the kernel that wake the
>process up now flag the fact in the parent's flags and do the wakeup
>while holding the parent process lock, and kern_wait notices whether
>this flag has been set before sleeping. (A simpler solution would be to
>hold the parent lock across the bulk of kern_wait, but from what I can
>gather this will lead to at least one LOR.)
>
>I've been unable to reproduce the problem with a kernel with this
>patch, and a nice sprinkling of printfs shows that when GDB hangs,
>the race has just occurred.
>
>Anyone got opinions on this?
>Cheers,
>Peadar.
>  
>
I just found another case: if the parent masks SIGCHLD, we get the
race too. I have tested the patch and it works. I will tweak the patch
and commit it soon.

David Xu
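
[Illustrative sketch, not the kernel patch discussed above. The fix
Peter describes (the waker sets a flag in the parent and does the
wakeup while holding the parent's lock; the waiter checks that flag
before sleeping) can be modelled in userland C with pthreads. All
names here (struct parent, child_event, child_report, parent_wait,
child_thread) are hypothetical stand-ins, and a mutex plus condition
variable is assumed as a substitute for the proc lock and
msleep()/wakeup().]

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userland stand-in for the parent's "struct proc". */
struct parent {
    pthread_mutex_t lock;         /* stands in for the parent process lock */
    pthread_cond_t  wakeup;       /* stands in for msleep()/wakeup() on "self" */
    bool            child_event;  /* hypothetical "a child reported" flag */
    pthread_mutex_t child_lock;   /* protects child_status only */
    int             child_status; /* 0 means nothing to report */
};

void parent_init(struct parent *p)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->wakeup, NULL);
    pthread_mutex_init(&p->child_lock, NULL);
    p->child_event = false;
    p->child_status = 0;
}

/* Child side: publish status, then flag and wake under the parent lock. */
void child_report(struct parent *p, int status)
{
    pthread_mutex_lock(&p->child_lock);
    p->child_status = status;
    pthread_mutex_unlock(&p->child_lock);

    pthread_mutex_lock(&p->lock);
    p->child_event = true;              /* remembered even if nobody sleeps yet */
    pthread_cond_broadcast(&p->wakeup);
    pthread_mutex_unlock(&p->lock);
}

/* Parent side: the scan runs without the parent lock, as in kern_wait(). */
int parent_wait(struct parent *p)
{
    for (;;) {
        /* Reset the flag before scanning, so any report that arrives
         * during or after the scan sets it again. */
        pthread_mutex_lock(&p->lock);
        p->child_event = false;
        pthread_mutex_unlock(&p->lock);

        /* "foreach child of self": a single child in this sketch. */
        pthread_mutex_lock(&p->child_lock);
        int status = p->child_status;
        p->child_status = 0;
        pthread_mutex_unlock(&p->child_lock);
        if (status != 0)
            return status;

        /* Sleep only if no child reported since the flag was reset;
         * otherwise fall through and rescan. */
        pthread_mutex_lock(&p->lock);
        while (!p->child_event)
            pthread_cond_wait(&p->wakeup, &p->lock);
        pthread_mutex_unlock(&p->lock);
    }
}

/* Hypothetical helper thread body that reports status 7 to the parent. */
void *child_thread(void *arg)
{
    child_report((struct parent *)arg, 7);
    return NULL;
}
```

[Because the flag is set and tested under the same lock, a report that
lands between the scan and the sleep is no longer lost: the waiter sees
child_event set and rescans instead of sleeping.]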


