Date:      Sat, 19 Feb 2005 12:43:38 +0000 (GMT)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        Peter Losher <Peter_Losher@isc.org>
Cc:        stable@freebsd.org
Subject:   Re: Hard lockups using 5.3-RELEASE..
Message-ID:  <Pine.NEB.3.96L.1050219123717.67347O-100000@fledge.watson.org>
In-Reply-To: <4217170A.2030106@isc.org>

On Sat, 19 Feb 2005, Peter Losher wrote:

> We have a Celestica dual-Opteron system w/ 4GB RAM running
> 5.3-RELEASE/i386 (32-bit), and a SMP-aware kernel, which is experiencing
> hard lockups.  Debugging results below. 

Hmm.  So just to summarize:

- The system appears to wedge
- Serial break can get into the debugger

Have you tried updating to the latest RELENG_5_3 patch level?  That
includes at least one significant SMP stability fix.  You can rebuild
along the RELENG_5_3 branch, or just use freebsd-update to pull it in.

> It looks like it's trying to lock Giant while it already has Giant.  In
> any case, we have rebuilt a uniprocessor kernel for now.  If this is
> already fixed in 5-STABLE, then let me know. ;) 

Generally speaking, recursing Giant is fine, as Giant is a recursible
mutex; however, an ithread shouldn't already hold Giant at that point.

This may be fixed in 5-STABLE, but it's hard to say.  I think the order of
operations here is:

- First, slide to RELENG_5_3 head (p5?) to make sure you have the IPI
  stability fix.  See if the problem goes away.

- Generate the following information: when the box is wedged...

  (1) Does it respond to pings?
  (2) Does the num lock light go on and off when the num lock key is hit?
  (3) If it responds to pings, what happens when you build a new TCP
      connection to an open TCP port (a) once (b) twice (c) the 100th
      (or so) time?

- Generate the following DDB output using your serial console:

  show pcpu
  show pcpu 0
  show pcpu 1
  ps
  show lockedvnods

  I may then ask you to generate stack traces of the processes that appear
  "interesting".  The definition of interesting is a little bit
  context-specific, so it's hard to say what it is just now.  If there are a
  lot of processes wedged in VM and VFS, then I'll ask you to trace each
  process that appears in the lockedvnods output. 

- Next, recompile with INVARIANTS and see if the problem triggers an
  assertion failure when it occurs.

- Next, recompile with WITNESS and see if WITNESS creates a warning or
  assertion failure when it occurs.

  Break to the debugger and generate the above DDB output, but also
  "show alllocks" (5-STABLE only), or "show locks" for interesting
  processes if you are on 5-RELEASE-*.
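
For check (3) in the list above, repeated connection attempts are easier to
compare if they are scripted from a second machine.  What follows is only a
minimal sketch, assuming Python 3 is available on that machine; the function
names probe_tcp and first_failure are made up for illustration, and the host
and port are whatever open TCP service the wedged box normally offers.

```python
import socket
import time


def probe_tcp(host, port, attempts=100, timeout=3.0):
    """Open `attempts` fresh TCP connections and record how each one ends.

    Returns a list of (attempt_number, outcome, seconds) tuples, where
    outcome is "ok", "timeout", or the name of the OSError subclass raised.
    """
    results = []
    for n in range(1, attempts + 1):
        start = time.monotonic()
        try:
            # A new socket per attempt, so every try is a full handshake.
            with socket.create_connection((host, port), timeout=timeout):
                outcome = "ok"
        except socket.timeout:
            outcome = "timeout"
        except OSError as exc:
            outcome = type(exc).__name__
        results.append((n, outcome, time.monotonic() - start))
    return results


def first_failure(results):
    """Return the attempt number of the first non-"ok" result, or None."""
    for n, outcome, _ in results:
        if outcome != "ok":
            return n
    return None
```

Calling probe_tcp(host, port) while the box is wedged and then passing the
result to first_failure() shows whether the first, second, or only a much
later connection stops completing, which is the distinction (a)/(b)/(c) is
after.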

Also, I don't think you mentioned what sort of workload is present on the
box.

Thanks!

Robert N M Watson
