From owner-freebsd-current@FreeBSD.ORG  Tue Apr 13 14:44:32 2010
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 9B79C1065673
	for <freebsd-current@freebsd.org>; Tue, 13 Apr 2010 14:44:32 +0000 (UTC)
	(envelope-from yanefbsd@gmail.com)
Received: from mail-qy0-f175.google.com (mail-qy0-f175.google.com
	[209.85.221.175])
	by mx1.freebsd.org (Postfix) with ESMTP id 00BAD8FC13
	for <freebsd-current@freebsd.org>; Tue, 13 Apr 2010 14:44:31 +0000 (UTC)
Received: by qyk5 with SMTP id 5so7854036qyk.3
	for <multiple recipients>; Tue, 13 Apr 2010 07:44:30 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:mime-version:received:in-reply-to:references
	:date:received:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	bh=TxqVmICc2y9yejiyemRoddl0uXCPXMW3NIAzJhkBGu8=;
	b=jj9wHz36yI3MLKZmtnItXK3TF8s2+AsDS60o0HVkGA/Yt1R5ibKLpN+F9D7/0pmOHb
	f8jyO9SIt5ppW33zFZQzJw59+31b+EhSRERDBXWsvdkauDKr+bkquBMlHugzCpN1uB9l
	5xpHxsGlfx6ZwcgD1sZyUgOFlDRz8+h0HHTfU=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type:content-transfer-encoding;
	b=s6nknyBj3McSnQencS9Na6xJ9Ilxd9i/SK7I5vqqHGXFHitiJ0NAaBOize0flFJslm
	EzeEHqlr4Er7oxBQ+ZwBcM/+YyWk5DmGPgZmYm0lUtwJiimrEeq+HnRfQhuGD+FxaY2s
	ERO22mxgyNKR836UwvtfwVP5a6iyKTqDom2pg=
MIME-Version: 1.0
Received: by 10.229.28.85 with HTTP; Tue, 13 Apr 2010 07:44:27 -0700 (PDT)
In-Reply-To: <h2q3bbf2fe11004130350td06cbb37ha564ba8cf11dd96a@mail.gmail.com>
References: <7d6fde3d1003070207q621e69ado2cb64e431feacd76@mail.gmail.com>
	<7d6fde3d1003070224k3626a9b5y98c11a43eef1bed4@mail.gmail.com>
	<4e6cba831003101356i534341ffr2961b983854ab788@mail.gmail.com>
	<7dc40bd01003101407m605e41ey2d8ace0049cf5e61@mail.gmail.com>
	<7d6fde3d1003102158o7834ca67lce3eca23aa723fd1@mail.gmail.com>
	<7d6fde3d1003121933s4ba7b57fw6542628c16edf723@mail.gmail.com>
	<x2t3bbf2fe11004130348gdeb3003ela5f17acd0b0b780c@mail.gmail.com>
	<h2q3bbf2fe11004130350td06cbb37ha564ba8cf11dd96a@mail.gmail.com>
Date: Tue, 13 Apr 2010 07:44:27 -0700
Received: by 10.229.35.80 with SMTP id o16mr1728203qcd.93.1271169868074; Tue, 
	13 Apr 2010 07:44:28 -0700 (PDT)
Message-ID: <i2k7d6fde3d1004130744u64d927d7x663489813665e95@mail.gmail.com>
From: Garrett Cooper <yanefbsd@gmail.com>
To: Attilio Rao <attilio@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: Tom Couch <tom.couch.storage@gmail.com>,
	Giovanni Trematerra <giovanni.trematerra@gmail.com>,
	FreeBSD Current <freebsd-current@freebsd.org>,
	Scott Long <scottl@freebsd.org>, Pyun YongHyeon <yongari@freebsd.org>
Subject: Re: Removing USB keyboard after filesystems synced causes panic
	with destroyed mutex twa(4)?
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 13 Apr 2010 14:44:32 -0000

On Tue, Apr 13, 2010 at 3:50 AM, Attilio Rao <attilio@freebsd.org> wrote:
> 2010/4/13 Attilio Rao <attilio@freebsd.org>:
>> 2010/3/13 Garrett Cooper <yanefbsd@gmail.com>:
>>> On Wed, Mar 10, 2010 at 9:58 PM, Garrett Cooper <yanefbsd@gmail.com> wr=
ote:
>>>> On Wed, Mar 10, 2010 at 2:07 PM, Tom Couch <tom.couch.storage@gmail.co=
m> wrote:
>>>>> Hi FreeBSD-current,
>>>>> =A0 =A0 My name is Tom Couch,
>>>>> I am part of the 3ware driver team recently acquired by LSI.
>>>>> I believe Giovanni's patch, below, is the correct fix for this bug.
>>>>>
>>>>> I am available to maintain the twa driver, now that I am on this list=
.
>>>>> Let me know how I can help,
>>>>>
>>>>> Tom
>>>>>
>>>>>
>>>>> On Wed, Mar 10, 2010 at 1:56 PM, Giovanni Trematerra <
>>>>> giovanni.trematerra@gmail.com> wrote:
>>>>>
>>>>>> On Sun, Mar 7, 2010 at 11:24 AM, Garrett Cooper <yanefbsd@gmail.com>
>>>>>> wrote:
>>>>>> > On Sun, Mar 7, 2010 at 2:07 AM, Garrett Cooper <yanefbsd@gmail.com=
>
>>>>>> wrote:
>>>>>> >> Hi Alexander and Hans,
>>>>>> >> =A0 =A0I recently did the following which generated a panic on a
>>>>>> >> 9-CURRENT kernel compiled on the 26th:
>>>>>> >>
>>>>>> >> 1. Executed reboot
>>>>>> >> 2. Removed keyboard.
>>>>>> >> 3. Some time after `All buffers synced\nUptime: ...' was displaye=
d,
>>>>>> >> the keyboard was registered disconnected.
>>>>>> >> 4. The interrupt was delivered to my twa(4) enabled card and the
>>>>>> >> kernel panicked, like so:
>>>>>> >>
>>>>>> >> ugen2.2: <Mitsumi Electric> at usbus2 (disconnected)
>>>>>> >> uhub8: at uhub2, port 1, addr 2 (disconnected)
>>>>>> >> ugen2.3: <Mitsumi Electric> at usbus2 (disconnected)
>>>>>> >> ukbd0: at uhub8, port 3, addr 3 (disconnected)
>>>>>> >> uhid0: at uhub8, port 3, addr 3 (disconnected)
>>>>>> >> panic: mtx_lock_spin() of destroyed mutex @
>>>>>> /usr/src/sys/dev/twa/tw_cl_intr.c:88
>>>>>> >>
>>>>>> >> cpuid =3D 1
>>>>>> >> KDB: enter: panic
>>>>>> >> [thread pid 12 tid 100025 ]
>>>>>> >> Stopped at =A0 =A0 =A0 =A0 kdb_enter+0x3d: movq =A0 =A0 $0,0x4028=
9c(%rip)
>>>>>> >> db>
>>>>>> >>
>>>>>> >> =A0 =A0I wish I could provide you with more details, but unfortun=
ately I
>>>>>> >> the USB bus isn't registering the fact that I'm reattaching the
>>>>>> >> keyboard right now and the box won't reboot automatically :( (did=
n't
>>>>>> >> set the right sysctl beforehand to panic automatically). I'll try=
 and
>>>>>> >> reproduce the issue again, but I was just wondering whether or no=
t you
>>>>>> >> guys had seen this problem before.
>>>>>> >
>>>>>> > =A0 =A0Phew... it's reproducible with that kernel. Here's what I d=
id
>>>>>> > exactly (because my original directions were incorrect):
>>>>>> > =A0 =A01. Hit power button (for S5).
>>>>>> > =A0 =A02. Disconnect keyboard RIGHT as `Uptime: ...' is displayed.
>>>>>> > =A0 =A0Kernel panicked on my system again. Now to figure out if it=
 still
>>>>>> > exists with a kernel compiled today, and also how to debug it if i=
t
>>>>>> > still does exist :/...
>>>>>> > Thanks,
>>>>>> > -Garrett
>>>>>>
>>>>>> Hi Garrett,
>>>>>> Could you please try the patch below and report back?
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>> diff -r cab6489de66d sys/dev/twa/tw_cl_intr.c
>>>>>> --- a/sys/dev/twa/tw_cl_intr.c =A0 =A0 =A0 =A0Wed Mar 03 04:51:13 20=
10 -0500
>>>>>> +++ b/sys/dev/twa/tw_cl_intr.c =A0 =A0 =A0 =A0Wed Mar 10 06:29:05 20=
10 -0500
>>>>>> @@ -75,9 +75,12 @@ tw_cl_interrupt(struct tw_cl_ctlr_handle
>>>>>> =A0 =A0 =A0if (ctlr =3D=3D NULL)
>>>>>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out;
>>>>>>
>>>>>> - =A0 =A0 /* If we get an interrupt while resetting, it is a shared
>>>>>> - =A0 =A0 =A0 =A0one for another device, so just bail */
>>>>>> - =A0 =A0 if (ctlr->state & TW_CLI_CTLR_STATE_RESET_IN_PROGRESS)
>>>>>> + =A0 =A0 /*
>>>>>> + =A0 =A0 =A0* =A0If we get an interrupt while resetting or shutting=
 down,
>>>>>> + =A0 =A0 =A0* =A0it is a shared one for another device, so just bai=
l
>>>>>> + =A0 =A0 =A0*/
>>>>>> + =A0 =A0 if (ctlr->state & TW_CLI_CTLR_STATE_RESET_IN_PROGRESS ||
>>>>>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 (ctrl->state & TW_CLI_CTLR=
_STATE_ACTIVE) =3D=3D 0)
>>>>>> =A0 =A0 =A0 =A0 =A0 =A0 =A0goto out;
>>>>>>
>>>>>> =A0 =A0 =A0/*
>>>
>>> Apart from the typo above (s/ctrl/ctlr/), things work appropriately
>>> now at reboot. The only problem is that bootup is really wonky now,
>>> because the RAID had a LOT of issues attaching to cam(4) (failed in
>>> 2/3 cold boot attempts); an additional branch condition may need to be
>>> added to the above if-statement if this change didn't take that into
>>> account. However, if the old behavior was incorrect and the new
>>> behavior is correct, s.t. the RAID controller demonstrating bus
>>> detection timeout issue that is occurring with a lot of USB devices
>>> and some RAID controllers today, this could be extremely problematic.
>>>
>>> So, while it looks better than before at reboot, it's not ready yet
>>> for prime time; I'd rather that the bug was filed with the patch you
>>> provided after the typo fixed, with the caveat mentioned and NOT
>>> committed, because the adverse affect(s) seem a bit more annoying than
>>> the previous panic issue described.
>>
>> I looked briefly at it and I think there are 2 bugs, one in
>> twa_detach() and another one in twa_shutdown().
>> Basically, locks get destroyed in tw_cl_shutdown_ctlr() which is
>> called by twa_shutdown() while interrupts are teared down in
>> tw_osli_free_resource(). twa_shutdown() is called in twa_detach()
>> before than tw_osli_free_resource().
>> tw_cl_shutdown_ctlr() also takes care to disable the interrupts for
>> twa but a problem can arise with shared IRQ. Infact, the handler will
>> remain on the IRQ until the bus_intr_teardown() takes place and it may
>> run, trying to acknowledge the interrupt, but with destroyed lock, if
>> an interrupt is sent by a shared source between twa_shutdown() and
>> tw_osli_free_resource() call in twa_detach() or just after a simple
>> call to twa_shutdown().
>>
>> Problems I see here:
>> - twa_shutdown() should not destroy the mutex at all, it is not
>> something our shutdown handlers generally do and it might be kept in
>> sync
>> - twa_detach() might do a first half of tw_cl_shutdown_ctlr(), do the
>> resource deallocation and just at the end destroy mutexes. That is how
>> generally our detach handler works.
>>
>> All these solutions would mean refactoring the tw_osli_free_resource()
>> and tw_cl_shutdown_ctlr(). I don't know very well the twa code, but it
>> seems to me that we want to keep the driver very compatible with any
>> vendor version or such? If yes this may be a problem because the
>> failing patterns are all located into the shared code and an ideal
>> solution could be more difficult to find out. Otherwise a fix might be
>> simple to hammer down.
>
> Forgot to tell: twe might have the same problem even if it doesn't
> expose just for luckiness.

    Hmmm... ok. I don't have a twe enabled card so I can't verify
whether or not this problem exists :/.
Thanks for the comments!
-Garrett