From owner-freebsd-current  Sun Jun 27  1:44:21 1999
Delivered-To: freebsd-current@freebsd.org
Received: from overcee.netplex.com.au (overcee.netplex.com.au [202.12.86.7])
	by hub.freebsd.org (Postfix) with ESMTP id 86EC614C96
	for <current@FreeBSD.ORG>; Sun, 27 Jun 1999 01:44:14 -0700 (PDT)
	(envelope-from peter@netplex.com.au)
Received: from netplex.com.au (localhost [127.0.0.1])
	by overcee.netplex.com.au (Postfix) with ESMTP
	id 24D4D81; Sun, 27 Jun 1999 16:44:14 +0800 (WST)
	(envelope-from peter@netplex.com.au)
X-Mailer: exmh version 2.0.2 2/24/98
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: current@FreeBSD.ORG, mckusick@mckusick.com
Subject: Re: BUF_LOCK() related panic.. 
In-reply-to: Your message of "Sun, 27 Jun 1999 01:15:43 MST."
             <199906270815.BAA10773@apollo.backplane.com> 
Date: Sun, 27 Jun 1999 16:44:14 +0800
From: Peter Wemm <peter@netplex.com.au>
Message-Id: <19990627084414.24D4D81@overcee.netplex.com.au>
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Matthew Dillon wrote:
>     Ah, yes, some of us were just discussing this in a small mailing list.
>     Hopefully Kirk will pick up on it soon.  Ah well.. someone else gets to be
>     the brunt of it for a change :-).  Kirk doesn't have an SMP box so he
>     didn't see the bug.
> 
>     I have tentatively tracked the problem down to the apparent inability of
>     lockmgr() locks to function from interrupts, even when used in a
>     non-blocking manner, due to the simplelocks it uses internally.  The
>     new buffer cache code Kirk committed switched from B_BUSY (manually
>     implemented locks) to lockmgr() locks.  I think what is going on is
>     that mainline code is getting a simplelock and then an interrupt is
>     coming along and also trying to get the same lock, but I can't be sure
>     because my DDB backtraces are somewhat munged.

It seems to me the main problem (so far) is the buftimelock handling in
BUF_LOCK():

        simple_lock(&buftimelock);
        bp->b_lock.lk_wmesg = buf_wmesg;
        bp->b_lock.lk_prio = PRIBIO + 4;
        bp->b_lock.lk_timo = 0; 
        return (lockmgr(&(bp)->b_lock, locktype, &buftimelock, curproc));

Inside lockmgr():

        simple_lock(&lkp->lk_interlock);
        if (flags & LK_INTERLOCK)
                simple_unlock(interlkp);
                              ^^^^^^^^ <--- &buftimelock;

Note that none of the calls to lockmgr() pass LK_INTERLOCK, so lockmgr()
never releases the interlock it was handed.  On UP, simplelocks are no-ops,
so nobody notices.  On SMP, they are real, and nothing ever releases
buftimelock.
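
To make the handoff concrete, here is a minimal user-space sketch of the
interlock logic quoted above.  It is a toy model, not kernel code: the struct
layouts, the flag values, and the names lockmgr_model(), buf_lock_as_quoted()
and buf_lock_with_interlock() are all assumptions made for illustration.  The
second variant shows one possible way to get buftimelock released (OR-ing
LK_INTERLOCK into the flags); it is not necessarily the fix that should be
committed.

```c
#include <assert.h>

/* Toy stand-ins for the kernel types; not the real FreeBSD definitions. */
struct simplelock { int held; };
struct lock { struct simplelock lk_interlock; int lk_exclusive; };

/* Placeholder flag values chosen for this sketch only. */
#define LK_EXCLUSIVE 0x0002u
#define LK_INTERLOCK 0x0100u

void simple_lock(struct simplelock *l)   { l->held = 1; }
void simple_unlock(struct simplelock *l) { assert(l->held); l->held = 0; }

struct simplelock buftimelock;

/* Models only the interlock handling quoted from lockmgr(). */
int lockmgr_model(struct lock *lkp, unsigned flags, struct simplelock *interlkp)
{
    simple_lock(&lkp->lk_interlock);
    if (flags & LK_INTERLOCK)
        simple_unlock(interlkp);       /* caller's interlock is released here */
    lkp->lk_exclusive = 1;             /* pretend the lock was granted */
    simple_unlock(&lkp->lk_interlock);
    return 0;
}

/* As quoted: buftimelock is taken, but LK_INTERLOCK is never set. */
int buf_lock_as_quoted(struct lock *b_lock, unsigned locktype)
{
    simple_lock(&buftimelock);
    return lockmgr_model(b_lock, locktype, &buftimelock);
}

/* Hypothetical variant: ask lockmgr to drop the interlock it was handed. */
int buf_lock_with_interlock(struct lock *b_lock, unsigned locktype)
{
    simple_lock(&buftimelock);
    return lockmgr_model(b_lock, locktype | LK_INTERLOCK, &buftimelock);
}
```

In this model, buftimelock.held is still 1 after buf_lock_as_quoted() returns,
while it is back to 0 after buf_lock_with_interlock() returns.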

But that doesn't fix the UP problem, where cluster_wbuild() tries to
recursively re-lock a buf that the current process already owns.  I have a
few ideas about that one, but I don't understand the clustering code
well enough yet to fix it.

Speaking of SMP and simple locks, I'd like to turn on the debugging
simplelocks that keep a reference count and check before switching to make
sure that a process doesn't sleep holding a lock.  This is a pretty
fundamental sanity check and would have found the LK_INTERLOCK problem
above before it got committed.
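
The kind of check meant here can be sketched in user space as follows.  This
is only an illustration under stated assumptions: the dbg_* names, the single
global counter (a real kernel would presumably track this per CPU or per
process), and returning -1 instead of panicking are all hypothetical and are
not the existing debugging simplelock code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical debugging simplelock; not the kernel's definition. */
struct dbg_simplelock { int held; };

/* Number of simplelocks currently held (one global counter in this toy). */
int dbg_simplelocks_held;

void dbg_simple_lock(struct dbg_simplelock *l)
{
    l->held = 1;
    dbg_simplelocks_held++;
}

void dbg_simple_unlock(struct dbg_simplelock *l)
{
    /* Catch unlocking a lock that was never taken. */
    assert(l->held && dbg_simplelocks_held > 0);
    l->held = 0;
    dbg_simplelocks_held--;
}

/*
 * Called just before a context switch or sleep.  Returns 0 if it is safe
 * to switch, -1 if a simplelock is still held.  A kernel would more likely
 * panic here; the sketch only reports the problem.
 */
int dbg_check_before_switch(const char *wmesg)
{
    if (dbg_simplelocks_held != 0) {
        fprintf(stderr, "about to sleep on \"%s\" with %d simplelock(s) held\n",
            wmesg, dbg_simplelocks_held);
        return -1;
    }
    return 0;
}
```

With a check like this in place, a path that takes a simplelock and then
sleeps without anything releasing it would be flagged the first time it
tried to switch.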

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au

