Date:      Fri, 14 May 2010 06:29:25 -0700
From:      "Matthew Fleming" <matthew.fleming@isilon.com>
To:        "John Baldwin" <jhb@FreeBSD.org>, "Terry Kennedy" <TERRY@tmk.com>
Cc:        freebsd-stable@freebsd.org
Subject:   RE: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write
Message-ID:  <06D5F9F6F655AD4C92E28B662F7F853E021D4D5D@seaxch09.desktop.isilon.com>
References:  <01NN32EOXMYC006UN1@tmk.com> <4BED3912.9080509@FreeBSD.org>

> > The crash was a "page fault while in kernel mode" with the current process
> > being the interrupt service routine for the bce0 GigE. Things progressed
> > reasonably until partway through the dump, when the system locked up with a
> > "Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock". That's the
> > same PID as reported in the main crash.
>
> Hmm.  You could try changing the code to not do a nested panic in that
> case.  You would update subr_turnstile.c to just return if panicstr is
> not NULL rather than calling panic.  However, there is still a good
> chance you will end up deadlocking in that case.  I have another patch I
> can send you next week that prevents blocking on mutexes during a panic
> which may also help.

It would be instructive to know exactly why we were in turnstile(9), but
it's likely due to mtx contention.

AIX has some code at the beginning of all the locking operations to
avoid taking locks if we were running code out of kdb, though getting
that worked out was slightly tricky with our variant of mtx_assert(9).
I seem to recall there was also some "lockbusting" code that forcibly
reset all owned locks to have no owner, at least in some paths.
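
As a rough user-space sketch of that idea (this is not the AIX or FreeBSD
code; every demo_* name is made up for illustration, and demo_panicstr only
stands in for the kernel's panicstr), the check at the top of each lock
operation plus a "lockbusting" pass could look something like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's panicstr: non-NULL once a panic starts. */
const char *demo_panicstr = NULL;

#define DEMO_UNOWNED (-1)

/* Toy non-sleepable lock; deliberately not the real struct mtx. */
struct demo_mtx {
	const char *name;
	int owner;		/* owning thread id, or DEMO_UNOWNED */
};

enum demo_lock_result { DEMO_ACQUIRED, DEMO_BYPASSED, DEMO_CONTENDED };

enum demo_lock_result
demo_mtx_lock(struct demo_mtx *m, int tid)
{
	assert(m != NULL);
	/* Panicking or dumping: assumed single-threaded, so leave the lock alone. */
	if (demo_panicstr != NULL)
		return (DEMO_BYPASSED);
	/* A real kernel would spin or block on a turnstile here. */
	if (m->owner != DEMO_UNOWNED)
		return (DEMO_CONTENDED);
	m->owner = tid;
	return (DEMO_ACQUIRED);
}

void
demo_mtx_unlock(struct demo_mtx *m)
{
	assert(m != NULL);
	/* Mirror the lock path: a bypassed lock must not be released. */
	if (demo_panicstr != NULL)
		return;
	m->owner = DEMO_UNOWNED;
}

/* "Lockbusting": forcibly clear ownership; returns how many locks were held. */
size_t
demo_lock_bust(struct demo_mtx **locks, size_t n)
{
	size_t busted = 0;

	for (size_t i = 0; i < n; i++) {
		if (locks[i]->owner != DEMO_UNOWNED) {
			locks[i]->owner = DEMO_UNOWNED;
			busted++;
		}
	}
	return (busted);
}
```

Once demo_panicstr is set, demo_mtx_lock() returns DEMO_BYPASSED even for a
lock that another thread still owns, so the caller never reaches the
contended path that would otherwise block.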

Given that the system is single-cpu and should be single-threaded when
dumping, this seems to me to be something worth working through to get
more reliable dumps.  Except for mtx_assert(9) I can't think of a reason
to take locks once we start dumping or when in the debugger.

As an aside, with terribly corrupted locks I've seen double panics when
the attempt to print the lock name faulted in strlen(9) called for
printf(9), due to a bad lockname pointer.  We have been able to get
enough info off these crashes to debug them, but it's useful to remember
that the system may be in a very unstable state depending on why it
panics.

Thanks,
matthew


