From owner-freebsd-hackers@freebsd.org Wed Aug  5 15:54:29 2015
From: John Baldwin <jhb@freebsd.org>
To: Philippe Jalaber
Cc: freebsd-hackers@freebsd.org
Subject: Re: adaptive rwlock deadlock
Date: Wed, 05 Aug 2015 08:41:15 -0700
Message-ID: <1902697.ny7xAkAVI4@ralph.baldwin.cx>
References: <2768515.JZVZhYiQVE@ralph.baldwin.cx>

On Wednesday, August 05, 2015 04:27:53 PM Philippe Jalaber wrote:
> 2015-08-04 22:10 GMT+02:00 John Baldwin:
>
> > On Tuesday, July 07, 2015 12:10:19 PM Philippe Jalaber wrote:
> > > Hi,
> > >
> > > I am facing a strange problem using the network stack and adaptive
> > > rwlocks on FreeBSD 9.3.  Basically I can reproduce the problem with
> > > 3 threads:
> > >
> > > 1) Thread 1 has taken the rwlock of an inpcb structure in exclusive
> > > mode in tcp_input.c.  This thread also runs my own code, which
> > > repeatedly takes a rwlock (called g_rwlock) in shared mode and
> > > releases it, until a shared object is no longer marked "busy":
> > >
> > > rwlock(inp_lock);
> > > ....
> > > do { // thread is busy-waiting in the loop
> > >     rlock(g_rwlock);
> > >     o = find();
> > >     if (o == NULL)
> > >         break;
> > >     busy = o.busy;
> > >     if (o != NULL && busy)
> > >         runlock(g_rwlock);
> > > } while (busy);
> > >
> > > if (o != NULL) {
> > >     // do something with o
> > >     ....
> > > }
> > > runlock(g_rwlock);
> > > ....
> > >
> > > 2) Thread 2 wants to mark the shared object as "ready", so it tries
> > > to take g_rwlock in exclusive mode and is blocked in
> > > _rw_wlock_hard@kern_rwlock.c:815, "turnstile_wait(ts, rw_owner(rw),
> > > TS_EXCLUSIVE_QUEUE)", because thread 1 already holds the lock in
> > > shared mode:
> > >
> > > wlock(g_rwlock);
> > > o = find();
> > > if (o != NULL)
> > >     o.busy = 1;
> > > wunlock(g_rwlock);
> > >
> > > // o is busy, so work on it without holding any lock
> > > ....
> > >
> > > wlock(g_rwlock); // thread is blocked here
> > > o.busy = 0;
> > > maybe_delete(o);
> > > wunlock(g_rwlock);
> > >
> > > 3) Thread 3 spins on the same inpcb rwlock as thread 1, in
> > > _rw_wlock_hard@kern_rwlock.c:721, "while ((struct
> > > thread*)RW_OWNER(rw->rw_lock) == owner && TD_IS_RUNNING(owner))".
> > >
> > > My target machine has two CPUs.
> > > Thread 1 is pinned to CPU 0.
> > > Threads 2 and 3 are pinned to CPU 1.
> > > Threads 1 and 2 have a priority of 28; thread 3 has a priority of 127.
> > >
> > > Now what seems to happen is that when thread 1 calls
> > > runlock(g_rwlock), it calls turnstile_broadcast@kern_rwlock.c:650,
> > > but thread 2 never regains control because thread 3 is spinning on
> > > the inpcb rwlock.  Also, the condition TD_IS_RUNNING(owner) is
> > > always true because thread 1 is busy-waiting in its loop.  So the
> > > three threads deadlock.
> > > Note that if I compile the kernel without adaptive rwlocks, it works
> > > without any problem.
> > > A workaround is to add a call to "sched_relinquish(curthread)" in
> > > thread 1's loop, just after the call to runlock.

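For reference, here is a minimal C sketch of that loop with the workaround
applied.  It is illustrative only: g_rwlock, find(), and struct obj are
placeholders taken from the pseudocode above, not real interfaces; the
actual kernel calls used are rw_rlock()/rw_runlock() and sched_relinquish().

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/lock.h>
    #include <sys/rwlock.h>
    #include <sys/proc.h>
    #include <sys/sched.h>

    /* Placeholder shared object and lookup, mirroring the pseudocode. */
    struct obj {
            int busy;
            /* ... */
    };

    extern struct rwlock g_rwlock;
    extern struct obj *find(void);

    /*
     * Wait until the object is no longer busy (or does not exist) and
     * return with g_rwlock read-locked, as in the original loop; the
     * caller works on the object, if any, and then drops the lock.
     */
    static struct obj *
    wait_until_ready(void)
    {
            struct obj *o;

            for (;;) {
                    rw_rlock(&g_rwlock);
                    o = find();
                    if (o == NULL || !o->busy)
                            return (o);     /* read lock still held */
                    rw_runlock(&g_rwlock);
                    /*
                     * Workaround from the report: go off-CPU before
                     * retrying.  Per the analysis above, once this thread
                     * is not running, TD_IS_RUNNING(owner) becomes false,
                     * the adaptive spinner on the inpcb lock blocks
                     * instead of spinning, and its CPU becomes available
                     * to the writer blocked on g_rwlock.
                     */
                    sched_relinquish(curthread);
            }
    }
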
> >
> > It sounds like we are not forcing a preemption on CPU 1 in this case
> > via sched_add().
> >
> > For SCHED_4BSD you could try the 'FULL_PREEMPTION' kernel option.
> > For ULE you can adjust 'preempt_thresh' on the fly, though I think the
> > default setting should actually still work.
> >
> > Can you use KTR or some such to determine if IPI_PREEMPT is being sent
> > by CPU 0 to CPU 1 in this case?
> >
> > > I am also wondering about the code in _rw_runlock after
> > > "turnstile_broadcast(ts, queue)".  Isn't the RW_LOCK_WRITE_WAITERS
> > > flag permanently lost if the other thread, which is blocked in
> > > turnstile_wait, never regains control?
> >
> > All the write waiters are awakened by a broadcast (as opposed to a
> > signal operation).  They are on the run queue, not on the turnstile
> > queue anymore, so there aren't any write waiters left (the bit only
> > tracks whether there are waiters on the turnstile).
> >
> > --
> > John Baldwin
>
> I use the ULE scheduler.
> Here is the KTR output, from ktrdump on a vmcore taken after the
> watchdog fired:
>
> 75447 ipi_selected: cpu: 1 ipi: fc
> 75446 stop_cpus() with 252 type
> 75445 ipi_cpu: cpu: 1 ipi: 2
> 75444 ipi_cpu: cpu: 1 ipi: 2
> 75443 ipi_cpu: cpu: 1 ipi: 2
> 75442 ipi_cpu: cpu: 1 ipi: 2
> 75441 ipi_cpu: cpu: 1 ipi: 2
> ....
> 3862 ipi_cpu: cpu: 1 ipi: 2
> 3861 ipi_cpu: cpu: 1 ipi: 2
> 3860 ipi_cpu: cpu: 1 ipi: 2
> 3859 ipi_cpu: cpu: 1 ipi: 2
> 3858 ipi_cpu: cpu: 1 ipi: 2
> 3857 ipi_selected: cpu: 1 ipi: f3
> 3856 ipi_cpu: cpu: 1 ipi: 2
> 3855 ipi_cpu: cpu: 1 ipi: 2
> 3854 ipi_cpu: cpu: 1 ipi: 2
> 3853 ipi_selected: cpu: 0 ipi: f3
> 3852 ipi_cpu: cpu: 1 ipi: 2
> 3851 ipi_selected: cpu: 1 ipi: f3
> 3850 ipi_cpu: cpu: 1 ipi: 2
> 3849 ipi_cpu: cpu: 1 ipi: 2
> 3848 ipi_selected: cpu: 0 ipi: f3
> 3847 ipi_cpu: cpu: 1 ipi: 2
> 3846 ipi_cpu: cpu: 1 ipi: 2
> 3845 ipi_cpu: cpu: 1 ipi: 2
> 3844 ipi_cpu: cpu: 1 ipi: 2
> 3843 ipi_cpu: cpu: 1 ipi: 2
> 3842 ipi_cpu: cpu: 1 ipi: 2
> 3841 ipi_cpu: cpu: 1 ipi: 2
> 3840 ipi_cpu: cpu: 1 ipi: 2
> 3839 ipi_cpu: cpu: 1 ipi: 2
> 3838 ipi_cpu: cpu: 1 ipi: 2
> 3837 ipi_cpu: cpu: 1 ipi: 2
> 3836 ipi_cpu: cpu: 1 ipi: 2
> 3835 ipi_cpu: cpu: 0 ipi: 1
> 3834 ipi_cpu: cpu: 0 ipi: 1
> 3833 ipi_cpu: cpu: 0 ipi: 1
> 3832 ipi_cpu: cpu: 0 ipi: 1
> 3831 ipi_cpu: cpu: 0 ipi: 1
> 3830 ipi_cpu: cpu: 0 ipi: 1

Unfortunately this has a lot of other noise.  Can you add some traces
specifically in sched_ule, in tdq_notify(), to note that it is deciding to
notify a CPU because it is scheduling a thread there?

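A rough sketch of the kind of trace meant here, not a tested patch: the
KTR class and format string are assumptions, and cpu, pri, and td stand
for whatever values tdq_notify() already computes when it decides to send
the IPI.

    /*
     * In sys/kern/sched_ule.c, inside tdq_notify(), just before the IPI
     * is sent.  KTR_SCHED is assumed as the class; any class that is
     * compiled in via KTR_COMPILE and enabled in debug.ktr.mask works.
     */
    CTR3(KTR_SCHED, "tdq_notify: kick cpu %d for td %p pri %d",
        cpu, td, pri);

If the dump then shows these entries next to the ipi_cpu lines but CPU 1
still never switches, a similar trace on the receive side would show
whether the IPI is acted on at all; sched_preempt() appears to be where
ULE reacts to IPI_PREEMPT, though that is worth confirming against the
tree.
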
--
John Baldwin