Date:      Mon, 29 Jul 2013 13:44:39 -0700
From:      Michael Tratz <michael@esosoft.com>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        freebsd-stable@freebsd.org, Rick Macklem <rmacklem@uoguelph.ca>, Steven Hartland <killing@multiplay.co.uk>
Subject:   Re: NFS deadlock on 9.2-Beta1
Message-ID:  <F20E755D-EE01-4411-8790-1E2BC7D8CD5D@esosoft.com>
In-Reply-To: <20130728062545.GE4972@kib.kiev.ua>
References:  <780BC2DB-3BBA-4396-852B-0EBDF30BF985@esosoft.com> <806421474.2797338.1374956449542.JavaMail.root@uoguelph.ca> <20130727205815.GC4972@kib.kiev.ua> <602747E8-0EBE-4BB1-8019-C02C25B75FA1@esosoft.com> <20130728062545.GE4972@kib.kiev.ua>

On Jul 27, 2013, at 11:25 PM, Konstantin Belousov <kostikbel@gmail.com> wrote:

> On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
>> Let's assume the pid which started the deadlock is 14001 (it will be
>> a different pid when we get the results, because the machine has been
>> restarted)
>>
>> I type:
>>
>> show proc 14001
>>
>> I get the thread numbers from that output and type:
>>
>> show thread xxxxx
>>
>> for each one.
>>
>> And a trace for each thread with the command?
>>
>> tr xxxx
>>
>> Anything else I should try to get or do? Or is that not the data at
>> all you are looking for?
>>
> Yes, everything else which is listed in the 'debugging deadlocks' page
> must be provided, otherwise the deadlock cannot be tracked.
>
> The investigator should be able to see the whole deadlock chain (loop)
> to make any useful advance.

Ok, I have made some excellent progress in debugging the NFS deadlock.

Rick! You are a genius. :-) You found the right commit: r250907 (dated
May 22) is definitely the problem.

Here is how I did the testing: One machine received a kernel before
r250907, the second machine received a kernel after r250907. Sure enough,
within a few hours the machine with r250907 went into the usual deadlock
state. The machine without that commit kept on working fine. Then I went
back to the latest revision (r253726), but left r250907 out. The
machines have been running happily and rock solid without any deadlocks.
I have now expanded the testing to 3 machines, with no reports of any
issues.

I guess now Konstantin has to figure out why that commit is causing the
deadlock. Lovely! :-) I will get the requested debugging information as
soon as possible. I'm a little behind with my normal workload, but I
expect to have the data by Tuesday evening or Wednesday.

Thanks again!!

Michael
