Message-ID: <42634514.2090902@freebsd.org>
Date: Mon, 18 Apr 2005 13:26:44 +0800
From: David Xu
To: Peter Edwards
cc: Greg 'groggy' Lehey
cc: FreeBSD Current
References: <20050214014217.GB85932@wantadilla.lemis.com> <34cb7c8405041717342891f2@mail.gmail.com>
In-Reply-To: <34cb7c8405041717342891f2@mail.gmail.com>
Subject: Re: Race condition in debugger?

Peter Edwards wrote:

>[Very late response: I just experienced the same problem and
>remembered the issue had been brought up before]
>
>On 2/14/05, Greg 'groggy' Lehey wrote:
>
>>I'm having some problems with userland gdb on recent -CURRENT builds:
>>at some point it hangs.
>>
>>Specifically, I'm setting a conditional breakpoint like this:
>>
>>  b Minsert_blockletpointer if I->inode_num == 0x1f0bb
>>
>>inode_num increments by 1, so I hit this breakpoint about 100,000
>>times.  Or I should.  What happens is that the debugger hangs at some
>>point on the way.  ktrace shows multiple copies of:
>>
>> 12325 gdb      CALL  ptrace(12,0x3026,0xbfbfd5e0,0)
>> 12325 gdb      RET   ptrace 0
>> 12325 gdb      CALL  ptrace(PT_STEP,0x3026,0x1,0)
>> 12325 gdb      RET   ptrace 0
>> 12325 gdb      CALL  wait4(0xffffffff,0xbfbfd808,0,0)  <-- stops here
>> 12325 gdb      RET   wait4 12326/0x3026
>> 12325 gdb      CALL  kill(0x3026,0)
>> 12325 gdb      RET   kill 0
>> 12325 gdb      CALL  ptrace(PT_GETREGS,0x3026,0xbfbfd5c0,0)
>>
>>When it hangs, it's at the call to wait4, as shown.  It looks like the
>>completion of the ptrace request isn't being reported back.
>>
>
>I think I know what's going on with this, and I have a feeling that
>there's a couple of other wait()-related issues that were left open on
>the lists that might be explained by the issue.
>
>Here's my hypothesis: kern_wait() checks each child of the current
>process to see if they have exited, or should otherwise report status
>to wait/wait3/wait4/waitpid.  If it finds that all candidate children
>have nothing to report, it goes to sleep, waiting to be awoken by the/a
>child reporting status, and repeats the process.  It looks a bit like
>this:
>
>kern_wait()
>{
>loop:
>    foreach child of self {
>        if (child has status to report)
>            return status;
>    }
>    lock self
>    msleep(on "self")
>    unlock self
>    goto loop;
>}
>
>The problem is that no lock guarantees that the conditions checked in
>the inner loop still hold by the time the current process locks its
>own "struct proc" and invokes msleep().
>(The race is probably most likely to happen on an SMP machine or with
>PREEMPTION, but acquiring curproc's lock could possibly cause the
>issue if it needed to sleep.)  That is, you can miss the wakeup
>generated by a particular child between checking the process in the
>inner loop and going to sleep.
>
>I can at least reproduce this for the ptrace/gdb case, but AFAICT, it
>could happen for the standard wait()/exit() path, too.  I worked up a
>patch to fix the problem by having those parts of the kernel that wake
>the process up flag the fact in the parent's flags and do the
>wakeup while holding the parent process lock, and by noticing if this
>flag has been set before sleeping.  (A simpler solution would be to
>hold the parent lock across the bulk of kern_wait, but from what I can
>gather this will lead to at least one LOR.)
>
>I've been unable to reproduce the problem with a kernel with this
>patch, and a nice sprinkling of printfs shows that when GDB
>hangs, the race has just occurred.
>
>Anyone got opinions on this?
>Cheers,
>Peadar.
>

If the parent has PS_NOCLDSTOP set, no SIGCHLD will be sent to the
parent, so there is a race in that case.  But if PS_NOCLDSTOP is not
set, the signal will be sent to the parent, and the parent should
resume from msleep() immediately.  I don't know why it still has a
race; I am looking at the code.  I think stop() should be merged into
thread_stopped(), since there is no other caller at all.

David Xu
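
For illustration only, here is a small single-threaded C model of the interleaving described in the quoted hypothesis: the parent scans its children, a child reports status in the window before the parent sleeps, and the wakeup is lost. It is a sketch under stated assumptions, not kernel code and not the patch mentioned above. The struct, the wakeup_pending field and all function names are hypothetical, and the locking is only indicated in comments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the parent's wait state; not FreeBSD's struct proc. */
struct parent_model {
    bool child_has_status;  /* a child has status to report */
    bool wakeup_pending;    /* assumed flag set by the waker (the patch idea) */
    bool asleep;            /* parent is blocked in msleep() */
};

/* Step 1: parent scans its children; true if status was found. */
static bool scan_children(const struct parent_model *p)
{
    return p->child_has_status;
}

/* Step 2: child reports status and issues a wakeup.
 * A wakeup has no effect on a parent that is not yet asleep. */
static void child_reports(struct parent_model *p, bool use_flag)
{
    p->child_has_status = true;
    if (use_flag)
        p->wakeup_pending = true;   /* would be done under the parent lock */
    p->asleep = false;              /* wakeup(): no-op if not sleeping */
}

/* Step 3: parent sleeps unless the flag says a wakeup was already posted. */
static void parent_sleeps(struct parent_model *p, bool use_flag)
{
    assert(!p->asleep);             /* a sleeping parent cannot reach here */
    if (use_flag && p->wakeup_pending) {
        p->wakeup_pending = false;  /* consume it; caller rescans instead */
        return;
    }
    p->asleep = true;
}

/* Replays the racy order: scan, child reports, then sleep.
 * Returns true if the parent ends up asleep with unreported status. */
bool replay_race(bool use_flag)
{
    struct parent_model p = { false, false, false };

    if (scan_children(&p))
        return false;               /* status found, no sleep needed */
    child_reports(&p, use_flag);    /* lands in the unprotected window */
    parent_sleeps(&p, use_flag);
    return p.asleep && p.child_has_status;
}
```

In this model, replay_race(false) ends with the parent asleep while a child has status to report, which corresponds to the hang at wait4. With replay_race(true) the parent notices the pending flag, skips the sleep, and would go back to rescan its children.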