Date:      Thu, 15 Mar 2001 11:32:07 +1000 (EST)
From:      Tony Griffiths <tonyg@OntheNet.com.au>
To:        John Baldwin <jhb@FreeBSD.ORG>
Cc:        Andrew Gallatin <gallatin@cs.duke.edu>, <alpha@FreeBSD.ORG>
Subject:   Re: Deadlocks, whee!
Message-ID:  <Pine.BSF.4.30.0103151055520.84004-100000@lancia.onthenet.com.au>
In-Reply-To: <XFMail.010314164809.jhb@FreeBSD.org>

On Wed, 14 Mar 2001, John Baldwin wrote:

>
> On 15-Mar-01 Andrew Gallatin wrote:
> >
> > John Baldwin writes:
> >  > Hi all,
> >  >
> >  > I managed to deadlock my alpha yesterday with a -j 4 buildworld.
> >  > Previously it would die when it trapped with a raised IPL as a
> >  > blockable mtx_lock() of lockmgr in trap().  I'm not sure if these two
> >  > things are related or not.  I'll try a normal world without -j X today
> >  > to see if it fairs better.  Just FYI for those running current that
> >  > heavy load may deadlock right now. :(
> >
> > The machine is really deadlocked, or just one process is wedged and
> > the buildworld stalled?
>
> Well, no messages on the console, no ddb (I have vidconsole), no pings, etc.
> So interrupts aren't getting through, or if they are their threads aren't
> running, and since I use preemption on this alpha, that is very, very unlikely.
> I'm assuming it is genuinely deadlocked or possibly spinning somewhere with a
> raised IPL.

Looks like a "deadlock" to me!

Actually, I'm surprised that the 'fine-grained' SMP project in FreeBSD has
managed to get as far as it has without implementing some form of "sanity"
checking.  I worked for DEC (Digital Equipment Corp) in the Networking Group at
the time Ultrix (BSD 4.2/4.3) was doing fine-grained SMP and we had the
following sanity checks in the locking code as an aid to maintaining our own
sanity!  ;-)

1) Logging of request/release calls
2) Lock hierarchy (ie. take-out ordering)
3) Spin-lock timeout (ie. panic() after 5000000 failed attempts to gain lock)
4) something else that I can't remember 'cause it was too long ago!!!

The lock hierarchy was a BIG WIN in detecting/preventing deadlock conditions
since it forced an order in lock acquisition, although it didn't stop deadlocks
from occurring when the locks were at the same level.  The spin-count-exceeded
check picked those up.
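
As a rough illustration of how a hierarchy check and a spin limit fit
together, here is a minimal user-space sketch in C11.  It is not the Ultrix
code, which is not shown in this message; the names chk_lock, chk_acquire,
chk_release and MAX_HELD are hypothetical, and only the 5000000 attempt limit
is taken from the list above.  Equal levels are deliberately allowed through
the ordering check, so a same-level deadlock is only caught by the spin limit.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_HELD   16        /* hypothetical per-thread nesting limit */
#define SPIN_LIMIT 5000000L  /* attempt count mentioned in the message */

enum chk_result { CHK_OK, CHK_ORDER_VIOLATION, CHK_SPIN_TIMEOUT };

struct chk_lock {
    atomic_flag busy;   /* set while some thread owns the lock */
    int         level;  /* hierarchy level; must not decrease while nesting */
    const char *name;   /* for diagnostics only */
};

/* Locks currently held by the calling thread. */
static _Thread_local struct chk_lock *held[MAX_HELD];
static _Thread_local int held_count;

static enum chk_result chk_acquire(struct chk_lock *lk)
{
    assert(held_count < MAX_HELD);

    /* Hierarchy check: refuse a lock that sits below one already held.
     * Equal levels pass, which is the gap the spin limit has to cover. */
    for (int i = 0; i < held_count; i++)
        if (lk->level < held[i]->level)
            return CHK_ORDER_VIOLATION;

    /* Bounded spin: give up after SPIN_LIMIT failed attempts.
     * A kernel would panic() here rather than return. */
    for (long tries = 0; tries < SPIN_LIMIT; tries++) {
        if (!atomic_flag_test_and_set_explicit(&lk->busy,
                                               memory_order_acquire)) {
            held[held_count++] = lk;
            return CHK_OK;
        }
    }
    return CHK_SPIN_TIMEOUT;
}

static void chk_release(struct chk_lock *lk)
{
    for (int i = 0; i < held_count; i++) {
        if (held[i] == lk) {
            held[i] = held[--held_count];  /* array order is irrelevant */
            atomic_flag_clear_explicit(&lk->busy, memory_order_release);
            return;
        }
    }
    assert(!"chk_release: lock not held by this thread");
}
```

With a "socket" lock at level 1 and a "tcp" lock at level 2, taking tcp first
and then socket is rejected immediately, while re-taking tcp (same level)
passes the ordering check and is only reported once the spin limit runs out.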

We also found a few problems on tri/quad-cpu systems that didn't occur on
dual-cpu systems.

Of course the amount of checking was a compile-time setting so that production
code didn't suffer too badly.
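
One way such a compile-time setting might look, applied to the
request/release logging from item 1 above.  This is only a sketch under
assumed names: LOCK_DEBUG, LOCK_TRACE, lock_log and lock_log_record are
hypothetical and do not come from Ultrix or FreeBSD.  When LOCK_DEBUG is 0 the
macro expands to nothing, so production code pays no cost.

```c
#include <assert.h>
#include <stddef.h>

#ifndef LOCK_DEBUG
#define LOCK_DEBUG 1   /* hypothetical knob; use -DLOCK_DEBUG=0 for production */
#endif

#if LOCK_DEBUG
#define LOG_SLOTS 64

struct lock_event {
    const char *lock_name;
    const char *op;     /* "request" or "release" */
    const char *file;   /* call site, for post-mortem inspection */
    int         line;
};

/* Ring buffer of the most recent events.  Not thread-safe as written;
 * a kernel would keep one per CPU or advance the index atomically. */
static struct lock_event lock_log[LOG_SLOTS];
static unsigned long lock_log_next;   /* total events recorded so far */

static void lock_log_record(const char *name, const char *op,
                            const char *file, int line)
{
    assert(name != NULL && op != NULL);
    struct lock_event *ev = &lock_log[lock_log_next++ % LOG_SLOTS];
    ev->lock_name = name;
    ev->op = op;
    ev->file = file;
    ev->line = line;
}

#define LOCK_TRACE(name, op) \
    lock_log_record((name), (op), __FILE__, __LINE__)
#else
#define LOCK_TRACE(name, op) ((void)0)   /* compiles away entirely */
#endif
```

A call such as LOCK_TRACE("socket", "request") placed next to each lock
request and release leaves the last 64 events available for inspection after
a hang.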

We learnt a lot of hard lessons on Ultrix, the main one being that we were too
ambitious in trying for a VERY FINE-GRAINED locking strategy (especially in the
networking code), far more than was warranted by any possible payback.  Our
OSF/Tru-64 implementation was much cleaner with pretty much a single lock at
each layer of the network code (eg. socket, tcp/ip, driver).  The locking
hierarchy caused a few problems between the socket layer and transport but we
could get around them by using reference counts on objects that needed to stick
around even when there was no 'lock' on them!
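
The reference-count idea can be sketched roughly as follows.  This is an
assumption about the general technique, not the OSF code: struct pcb,
pcb_alloc, pcb_hold and pcb_rele are hypothetical names, and the pcb_freed
counter exists only so the behaviour can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct pcb {            /* hypothetical transport control block */
    atomic_int refs;    /* number of outstanding holders */
    int        port;
};

static int pcb_freed;   /* demonstration only: counts objects freed */

/* Allocate an object with one reference owned by the caller. */
static struct pcb *pcb_alloc(int port)
{
    struct pcb *p = malloc(sizeof(*p));
    if (p != NULL) {
        atomic_init(&p->refs, 1);
        p->port = port;
    }
    return p;
}

/* Take an extra reference before dropping the layer lock. */
static void pcb_hold(struct pcb *p)
{
    int old = atomic_fetch_add(&p->refs, 1);
    assert(old > 0);    /* holding a dead object would be a bug */
}

/* Drop a reference; the last holder frees the object. */
static void pcb_rele(struct pcb *p)
{
    if (atomic_fetch_sub(&p->refs, 1) == 1) {
        free(p);
        pcb_freed++;
    }
}
```

The socket layer would call pcb_hold() while it still holds its own lock,
release that lock, and only then call into the transport layer, so the object
cannot disappear in between even though no lock covers it.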

Hope you have more 'fun' than we did (NOT) ...

Tony





