Date:      Thu, 28 Oct 2004 15:54:07 -0400
From:      John Baldwin <jhb@FreeBSD.org>
To:        Daniel Eischen <deischen@FreeBSD.org>
Cc:        threads@FreeBSD.org
Subject:   Re: Infinite loop bug in libc_r on 4.x with condition variables and signals
Message-ID:  <200410281554.07222.jhb@FreeBSD.org>
In-Reply-To: <Pine.GSO.4.43.0410271826590.239-100000@sea.ntplx.net>
References:  <Pine.GSO.4.43.0410271826590.239-100000@sea.ntplx.net>

On Wednesday 27 October 2004 06:30 pm, Daniel Eischen wrote:
> On Wed, 27 Oct 2004, John Baldwin wrote:
> > On Thursday 21 October 2004 07:04 pm, Daniel Eischen wrote:
> > > On Thu, 21 Oct 2004, John Baldwin wrote:
> > > > The behavior seems more to be this:
> > > >
> > > > - thread does pthread_cond_wait*(c1)
> > > > - thread enqueued on c1
> > > > - thread interrupted by a signal while on c1 but still in PS_RUNNING
> > >
> > > This shouldn't happen when signals are deferred.  It should
> > > only happen when the state is PS_COND_WAIT after we've
> > > context switched to the scheduler.
> > >
> > > > - thread saves state which excludes the PTHREAD_FLAGS_IN_CONDQ flag
> > > > (among others)
> > >
> > > Right, because it assumes that the thread will be backed out of
> > > any mutex or CV queues prior to invoking the signal handler.
> > >
> > > > - thread calls _cond_wait_backout() if state is PS_COND_WAIT (but
> > > > it's not in this case; this is the normal case though, which is why
> > > > it's ok to not save the CONDQ flag in the saved state above)
> > >
> > > Right.  The problem is, how is the thread getting set up for a signal
> > > while signals are deferred and the state has not yet been changed
> > > from PS_RUNNING to PS_COND_WAIT?
> > >
> > > > - thread executes signal handler
> > > > - thread restores state
> > > > - pthread_cond_wait*() sees that interrupted is 0, so doesn't try to
> > > > remove the thread from the condition variable (also,
> > > > PTHREAD_FLAGS_IN_CONDQ isn't set either, so we can't detect this case
> > > > that way)
> > > > - thread returns from pthread_cond_wait() (maybe due to timeout,
> > > > etc.)
> > > > - thread calls pthread_cond_wait*(c2)
> > > > - thread enqueued on c2
> > > > - another thread does pthread_cond_broadcast(c2), and bewm
> > > >
> > > > My question is: is it possible for the thread to get interrupted and
> > > > chosen to run a signal handler while it is on c1 somehow, given my
> > > > patch to defer signals around the wait loops (and is that patch
> > > > correct, btw, given the above scenario)?
> > >
> > > Yes (and yes I think).  Deferring signals just means that the signal
> > > handler won't try to install a signal frame on the current thread;
> > > instead it just queues the signal and the scheduler will pick it up and
> > > send it to the correct thread.
> > >
> > > I do think signals should be deferred for condition variables so
> > > that setting the thread state (to PS_COND_WAIT) is atomic.
> > >
> > > It's not obvious to me where the bug is.  If you had a simple
> > > test case to reproduce it that would help.
> >
> > FWIW, we are having (I think) the same problem on 5.3 with libpthread. 
> > The panic there is in the mutex code about an assertion failing because a
> > thread is on a syncq when it is not supposed to be.
>
> David and I recently fixed some races in pthread_join() and
> pthread_exit() in -current libpthread.  Don't know if those
> were responsible...
>
> Here's a test program that shows correct behavior with both
> libc_r and libpthread in -current.

We've started testing on -current and are seeing several problems with
libpthread.  Using a UP kernel (the machines have a single processor with
HTT) seems to make it better, but we seem to be getting SIGSEGVs (signal 11)
in pthread_testcancel() as well as the failed lock assertions that were
mentioned earlier on the list in the PR.  Just running monodevelop from the
bsd-sharp stuff mentioned earlier is enough to break things: one of the
processes dies with the assertion failure.  If you let the other processes
keep running, you can run it again and get the window to pop up, but then
clicking on any of the controls results in the pthread_testcancel() crash.

FWIW, I think the stack traces in the PR's thread may look weird because the
thread was catching a signal.  When we were looking at the problems with
libc_r on 4.x, we would sometimes get weird-looking backtraces when the
assertion that I added in uthread_sig.c failed.  It seems that gdb doesn't
handle the signal frames very well.
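
A rough sketch of the kind of stress test that exercises the sequence quoted
above: one thread alternates timed waits on two condition variables while
another thread keeps sending it a signal and broadcasting on the second
condition variable.  This is not the test program referred to earlier in the
thread, and it has not been confirmed to trigger the failure.  The names
(run_cond_signal_stress, waiter, on_usr1), the choice of SIGUSR1, and the
2 ms / 200 us timings are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t c1 = PTHREAD_COND_INITIALIZER;
static pthread_cond_t c2 = PTHREAD_COND_INITIALIZER;
static volatile sig_atomic_t signals_seen;	/* bumped by the handler */
static int done;				/* protected by m */

/* Hypothetical handler: only counts deliveries. */
static void on_usr1(int sig) { (void)sig; signals_seen++; }

/* Absolute deadline 'ms' milliseconds from now (ms must be < 1000). */
static void deadline_in_ms(struct timespec *ts, long ms)
{
	clock_gettime(CLOCK_REALTIME, ts);
	ts->tv_nsec += ms * 1000000L;
	if (ts->tv_nsec >= 1000000000L) {
		ts->tv_sec++;
		ts->tv_nsec -= 1000000000L;
	}
}

/* Waits on c1 (normally times out), then on c2, counting full rounds. */
static void *waiter(void *arg)
{
	long *rounds = arg;
	struct timespec ts;

	pthread_mutex_lock(&m);
	while (!done) {
		deadline_in_ms(&ts, 2);
		pthread_cond_timedwait(&c1, &m, &ts);
		if (done)
			break;
		deadline_in_ms(&ts, 2);
		pthread_cond_timedwait(&c2, &m, &ts);
		(*rounds)++;
	}
	pthread_mutex_unlock(&m);
	return NULL;
}

/* Returns the number of completed wait rounds, or -1 on setup failure. */
long run_cond_signal_stress(int iterations)
{
	struct timespec pause = { 0, 200000L };	/* 200 us between pokes */
	struct sigaction sa;
	pthread_t tid;
	long rounds = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_usr1;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return -1;

	pthread_mutex_lock(&m);
	done = 0;
	pthread_mutex_unlock(&m);
	if (pthread_create(&tid, NULL, waiter, &rounds) != 0)
		return -1;

	for (int i = 0; i < iterations; i++) {
		pthread_kill(tid, SIGUSR1);	/* interrupt the waiter */
		pthread_mutex_lock(&m);
		pthread_cond_broadcast(&c2);	/* wake it if it is on c2 */
		pthread_mutex_unlock(&m);
		nanosleep(&pause, NULL);
	}

	/* Tell the waiter to stop and wake it from whichever queue it is on. */
	pthread_mutex_lock(&m);
	done = 1;
	pthread_cond_broadcast(&c1);
	pthread_cond_broadcast(&c2);
	pthread_mutex_unlock(&m);
	pthread_join(tid, NULL);
	return rounds;
}
```

To try it, call run_cond_signal_stress(200) from a small main() and link
with -pthread.  A healthy threads library should return a non-negative round
count; the bug described above would presumably show up as a hang or as a
failed assertion inside the library rather than as a bad return value.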

-- 
John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org


