From owner-freebsd-threads@FreeBSD.ORG  Thu Oct 28 19:53:16 2004
From: John Baldwin <jhb@FreeBSD.org>
To: Daniel Eischen <deischen@FreeBSD.org>
Date: Thu, 28 Oct 2004 15:54:07 -0400
User-Agent: KMail/1.6.2
References: <Pine.GSO.4.43.0410271826590.239-100000@sea.ntplx.net>
In-Reply-To: <Pine.GSO.4.43.0410271826590.239-100000@sea.ntplx.net>
MIME-Version: 1.0
Content-Disposition: inline
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <200410281554.07222.jhb@FreeBSD.org>
cc: threads@FreeBSD.org
Subject: Re: Infinite loop bug in libc_r on 4.x with condition variables and
	signals
List-Id: Threading on FreeBSD <freebsd-threads.freebsd.org>

On Wednesday 27 October 2004 06:30 pm, Daniel Eischen wrote:
> On Wed, 27 Oct 2004, John Baldwin wrote:
> > On Thursday 21 October 2004 07:04 pm, Daniel Eischen wrote:
> > > On Thu, 21 Oct 2004, John Baldwin wrote:
> > > > The behavior seems more to be this:
> > > >
> > > > - thread does pthread_cond_wait*(c1)
> > > > - thread enqueued on c1
> > > > - thread interrupted by a signal while on c1 but still in PS_RUNNING
> > >
> > > This shouldn't happen when signals are deferred.  It should
> > > only happen when the state is PS_COND_WAIT after we've
> > > context switched to the scheduler.
> > >
> > > > - thread saves state which excludes the PTHREAD_FLAGS_IN_CONDQ flag
> > > > (among others)
> > >
> > > Right, because it assumes that the thread will be backed out of
> > > any mutex or CV queues prior to invoking the signal handler.
> > >
> > > > - thread calls _cond_wait_backout() if state is PS_COND_WAIT (but
> > > > it's not in this case; this is the normal case though, which is why
> > > > it's ok to not save the CONDQ flag in the saved state above)
> > >
> > > Right.  The problem is, how is the thread getting setup for a signal
> > > while signals are deferred and the state has not yet been changed
> > > from PS_RUNNING to PS_COND_WAIT?
> > >
> > > > - thread executes signal handler
> > > > - thread restores state
> > > > - pthread_cond_wait*() sees that interrupted is 0, so doesn't try to
> > > > remove the thread from the condition variable (also,
> > > > PTHREAD_FLAGS_IN_CONDQ isn't set either, so we can't detect this case
> > > > that way)
> > > > - thread returns from pthread_cond_wait() (maybe due to timeout,
> > > > etc.)
> > > > - thread calls pthread_cond_wait*(c2)
> > > > - thread enqueued on c2
> > > > - another thread does pthread_cond_broadcast(c2), and bewm
> > > >
> > > > My question is: is it possible for the thread to get interrupted and
> > > > chosen to run a signal while it is on c1 somehow given my patch to
> > > > defer signals around the wait loops (and is that patch correct btw
> > > > given the above scenario?)
> > >
> > > Yes (and yes I think).  Deferring signals just means that the signal
> > > handler won't try to install a signal frame on the current thread;
> > > instead it just queues the signal and the scheduler will pick it up and
> > > send it to the correct thread.
> > >
> > > I do think signals should be deferred for condition variables so
> > > that setting the thread state (to PS_COND_WAIT) is atomic.
> > >
> > > It's not obvious to me where the bug is.  If you had a simple
> > > test case to reproduce it that would help.
> >
> > FWIW, we are having (I think) the same problem on 5.3 with libpthread. 
> > The panic there is in the mutex code about an assertion failing because a
> > thread is on a syncq when it is not supposed to be.
>
> David and I recently fixed some races in pthread_join() and
> pthread_exit() in -current libpthread.  Don't know if those
> were responsible...
>
> Here's a test program that shows correct behavior with both
> libc_r and libpthread in -current.

We've started testing on -current and are seeing several problems with
libpthread.  Using a UP kernel (the machines have a single processor with
HTT) seems to help, but we seem to be getting SIGSEGVs (signal 11) in
pthread_testcancel() as well as the failed lock assertions that were
mentioned earlier on the list in the PR.  Just running monodevelop from the
bsd-sharp stuff mentioned earlier is enough to break things: one of the
processes dies with the assertion failure.  If you let the other processes
run, you can run it again and get the window to pop up, but clicking on any
of the controls then results in the pthread_testcancel() crash.

FWIW, I think the stack traces in the PR's thread may look weird because a
signal was caught.  When we were looking at the problems with libc_r on 4.x,
we would sometimes get weird-looking backtraces when the assertion that I
added in uthread_sig.c failed.  It seems that gdb doesn't handle the signal
frames very well.
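
For reference, the sequence from the quoted list (wait on c1, take a signal
while queued there, time out, wait on c2, broadcast on c2) can be sketched
roughly as below.  This is only a minimal sketch assuming a POSIX threads
environment; it is not the test program Daniel posted and it is not a
confirmed reproducer for either the libc_r or the libpthread problem.  All
of the names in it (run_scenario, waiter, on_usr1, nap_ms, wait_for) are
made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t c1 = PTHREAD_COND_INITIALIZER;
static pthread_cond_t c2 = PTHREAD_COND_INITIALIZER;
static volatile sig_atomic_t handled;        /* bumped by the handler */
static int reached_c1, reached_c2, released; /* protected by m */
static int c1_rc;                            /* result of the wait on c1 */

/* Hypothetical handler: it only counts how often it ran. */
static void on_usr1(int sig) { (void)sig; handled++; }

static void nap_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Poll a flag under the mutex until it becomes non-zero. */
static void wait_for(int *flag)
{
    for (;;) {
        pthread_mutex_lock(&m);
        int set = *flag;
        pthread_mutex_unlock(&m);
        if (set)
            return;
        nap_ms(1);
    }
}

static void *waiter(void *arg)
{
    struct timespec deadline;

    (void)arg;
    pthread_mutex_lock(&m);
    /* Step 1: timed wait on c1.  Nobody signals c1, so it should time out. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;
    reached_c1 = 1;
    do {
        c1_rc = pthread_cond_timedwait(&c1, &m, &deadline);
    } while (c1_rc == 0);               /* retry after spurious wakeups */
    /* Step 2: wait on c2 until the main thread broadcasts. */
    reached_c2 = 1;
    while (!released)
        pthread_cond_wait(&c2, &m);
    pthread_mutex_unlock(&m);
    return NULL;
}

/* Returns 0 if the whole sequence completed, -1 otherwise. */
int run_scenario(void)
{
    struct sigaction sa;
    pthread_t t;
    int i;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    if (pthread_create(&t, NULL, waiter, NULL) != 0)
        return -1;

    /* Signal the waiter while it is (most likely) still queued on c1. */
    wait_for(&reached_c1);
    for (i = 0; i < 5; i++) {
        if (pthread_kill(t, SIGUSR1) != 0)
            return -1;
        nap_ms(10);
    }

    /* After the timeout the waiter moves on to c2; wake it with a broadcast. */
    wait_for(&reached_c2);
    pthread_mutex_lock(&m);
    released = 1;
    pthread_cond_broadcast(&c2);
    pthread_mutex_unlock(&m);

    if (pthread_join(t, NULL) != 0)
        return -1;
    return (c1_rc == ETIMEDOUT && handled > 0) ? 0 : -1;
}
```

Calling run_scenario() from main() and building with something like
"cc -pthread" should be enough to try it.  On a working threads library it
should return 0 after about a second; with the bug described above I would
expect a hang or an assertion failure around the broadcast instead, though
whether this particular timing actually hits the race is a guess.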

-- 
John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org