From owner-freebsd-current@FreeBSD.ORG Tue Apr 13 14:44:32 2010 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9B79C1065673 for ; Tue, 13 Apr 2010 14:44:32 +0000 (UTC) (envelope-from yanefbsd@gmail.com) Received: from mail-qy0-f175.google.com (mail-qy0-f175.google.com [209.85.221.175]) by mx1.freebsd.org (Postfix) with ESMTP id 00BAD8FC13 for ; Tue, 13 Apr 2010 14:44:31 +0000 (UTC) Received: by qyk5 with SMTP id 5so7854036qyk.3 for ; Tue, 13 Apr 2010 07:44:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:received:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=TxqVmICc2y9yejiyemRoddl0uXCPXMW3NIAzJhkBGu8=; b=jj9wHz36yI3MLKZmtnItXK3TF8s2+AsDS60o0HVkGA/Yt1R5ibKLpN+F9D7/0pmOHb f8jyO9SIt5ppW33zFZQzJw59+31b+EhSRERDBXWsvdkauDKr+bkquBMlHugzCpN1uB9l 5xpHxsGlfx6ZwcgD1sZyUgOFlDRz8+h0HHTfU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=s6nknyBj3McSnQencS9Na6xJ9Ilxd9i/SK7I5vqqHGXFHitiJ0NAaBOize0flFJslm EzeEHqlr4Er7oxBQ+ZwBcM/+YyWk5DmGPgZmYm0lUtwJiimrEeq+HnRfQhuGD+FxaY2s ERO22mxgyNKR836UwvtfwVP5a6iyKTqDom2pg= MIME-Version: 1.0 Received: by 10.229.28.85 with HTTP; Tue, 13 Apr 2010 07:44:27 -0700 (PDT) In-Reply-To: References: <7d6fde3d1003070207q621e69ado2cb64e431feacd76@mail.gmail.com> <7d6fde3d1003070224k3626a9b5y98c11a43eef1bed4@mail.gmail.com> <4e6cba831003101356i534341ffr2961b983854ab788@mail.gmail.com> <7dc40bd01003101407m605e41ey2d8ace0049cf5e61@mail.gmail.com> <7d6fde3d1003102158o7834ca67lce3eca23aa723fd1@mail.gmail.com> <7d6fde3d1003121933s4ba7b57fw6542628c16edf723@mail.gmail.com> Date: Tue, 13 Apr 2010 07:44:27 -0700 Received: by 10.229.35.80 with SMTP id o16mr1728203qcd.93.1271169868074; Tue, 13 Apr 2010 07:44:28 -0700 (PDT) Message-ID: From: Garrett Cooper To: Attilio Rao Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: Tom Couch , Giovanni Trematerra , FreeBSD Current , Scott Long , Pyun YongHyeon Subject: Re: Removing USB keyboard after filesystems synced causes panic with destroyed mutex twa(4)? X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 13 Apr 2010 14:44:32 -0000 On Tue, Apr 13, 2010 at 3:50 AM, Attilio Rao wrote: > 2010/4/13 Attilio Rao : >> 2010/3/13 Garrett Cooper : >>> On Wed, Mar 10, 2010 at 9:58 PM, Garrett Cooper wr= ote: >>>> On Wed, Mar 10, 2010 at 2:07 PM, Tom Couch wrote: >>>>> Hi FreeBSD-current, >>>>> =A0 =A0 My name is Tom Couch, >>>>> I am part of the 3ware driver team recently acquired by LSI. >>>>> I believe Giovanni's patch, below, is the correct fix for this bug. >>>>> >>>>> I am available to maintain the twa driver, now that I am on this list= . >>>>> Let me know how I can help, >>>>> >>>>> Tom >>>>> >>>>> >>>>> On Wed, Mar 10, 2010 at 1:56 PM, Giovanni Trematerra < >>>>> giovanni.trematerra@gmail.com> wrote: >>>>> >>>>>> On Sun, Mar 7, 2010 at 11:24 AM, Garrett Cooper >>>>>> wrote: >>>>>> > On Sun, Mar 7, 2010 at 2:07 AM, Garrett Cooper >>>>>> wrote: >>>>>> >> Hi Alexander and Hans, >>>>>> >> =A0 =A0I recently did the following which generated a panic on a >>>>>> >> 9-CURRENT kernel compiled on the 26th: >>>>>> >> >>>>>> >> 1. Executed reboot >>>>>> >> 2. Removed keyboard. >>>>>> >> 3. Some time after `All buffers synced\nUptime: ...' was displaye= d, >>>>>> >> the keyboard was registered disconnected. >>>>>> >> 4. The interrupt was delivered to my twa(4) enabled card and the >>>>>> >> kernel panicked, like so: >>>>>> >> >>>>>> >> ugen2.2: at usbus2 (disconnected) >>>>>> >> uhub8: at uhub2, port 1, addr 2 (disconnected) >>>>>> >> ugen2.3: at usbus2 (disconnected) >>>>>> >> ukbd0: at uhub8, port 3, addr 3 (disconnected) >>>>>> >> uhid0: at uhub8, port 3, addr 3 (disconnected) >>>>>> >> panic: mtx_lock_spin() of destroyed mutex @ >>>>>> /usr/src/sys/dev/twa/tw_cl_intr.c:88 >>>>>> >> >>>>>> >> cpuid =3D 1 >>>>>> >> KDB: enter: panic >>>>>> >> [thread pid 12 tid 100025 ] >>>>>> >> Stopped at =A0 =A0 =A0 =A0 kdb_enter+0x3d: movq =A0 =A0 $0,0x4028= 9c(%rip) >>>>>> >> db> >>>>>> >> >>>>>> >> =A0 =A0I wish I could provide you with more details, but unfortun= ately I >>>>>> >> the USB bus isn't registering the fact that I'm reattaching the >>>>>> >> keyboard right now and the box won't reboot automatically :( (did= n't >>>>>> >> set the right sysctl beforehand to panic automatically). I'll try= and >>>>>> >> reproduce the issue again, but I was just wondering whether or no= t you >>>>>> >> guys had seen this problem before. >>>>>> > >>>>>> > =A0 =A0Phew... it's reproducible with that kernel. Here's what I d= id >>>>>> > exactly (because my original directions were incorrect): >>>>>> > =A0 =A01. Hit power button (for S5). >>>>>> > =A0 =A02. Disconnect keyboard RIGHT as `Uptime: ...' is displayed. >>>>>> > =A0 =A0Kernel panicked on my system again. Now to figure out if it= still >>>>>> > exists with a kernel compiled today, and also how to debug it if i= t >>>>>> > still does exist :/... >>>>>> > Thanks, >>>>>> > -Garrett >>>>>> >>>>>> Hi Garrett, >>>>>> Could you please try the patch below and report back? >>>>>> >>>>>> Thank you >>>>>> >>>>>> diff -r cab6489de66d sys/dev/twa/tw_cl_intr.c >>>>>> --- a/sys/dev/twa/tw_cl_intr.c =A0 =A0 =A0 =A0Wed Mar 03 04:51:13 20= 10 -0500 >>>>>> +++ b/sys/dev/twa/tw_cl_intr.c =A0 =A0 =A0 =A0Wed Mar 10 06:29:05 20= 10 -0500 >>>>>> @@ -75,9 +75,12 @@ tw_cl_interrupt(struct tw_cl_ctlr_handle >>>>>> =A0 =A0 =A0if (ctlr =3D=3D NULL) >>>>>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto out; >>>>>> >>>>>> - =A0 =A0 /* If we get an interrupt while resetting, it is a shared >>>>>> - =A0 =A0 =A0 =A0one for another device, so just bail */ >>>>>> - =A0 =A0 if (ctlr->state & TW_CLI_CTLR_STATE_RESET_IN_PROGRESS) >>>>>> + =A0 =A0 /* >>>>>> + =A0 =A0 =A0* =A0If we get an interrupt while resetting or shutting= down, >>>>>> + =A0 =A0 =A0* =A0it is a shared one for another device, so just bai= l >>>>>> + =A0 =A0 =A0*/ >>>>>> + =A0 =A0 if (ctlr->state & TW_CLI_CTLR_STATE_RESET_IN_PROGRESS || >>>>>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 (ctrl->state & TW_CLI_CTLR= _STATE_ACTIVE) =3D=3D 0) >>>>>> =A0 =A0 =A0 =A0 =A0 =A0 =A0goto out; >>>>>> >>>>>> =A0 =A0 =A0/* >>> >>> Apart from the typo above (s/ctrl/ctlr/), things work appropriately >>> now at reboot. The only problem is that bootup is really wonky now, >>> because the RAID had a LOT of issues attaching to cam(4) (failed in >>> 2/3 cold boot attempts); an additional branch condition may need to be >>> added to the above if-statement if this change didn't take that into >>> account. However, if the old behavior was incorrect and the new >>> behavior is correct, s.t. the RAID controller demonstrating bus >>> detection timeout issue that is occurring with a lot of USB devices >>> and some RAID controllers today, this could be extremely problematic. >>> >>> So, while it looks better than before at reboot, it's not ready yet >>> for prime time; I'd rather that the bug was filed with the patch you >>> provided after the typo fixed, with the caveat mentioned and NOT >>> committed, because the adverse affect(s) seem a bit more annoying than >>> the previous panic issue described. >> >> I looked briefly at it and I think there are 2 bugs, one in >> twa_detach() and another one in twa_shutdown(). >> Basically, locks get destroyed in tw_cl_shutdown_ctlr() which is >> called by twa_shutdown() while interrupts are teared down in >> tw_osli_free_resource(). twa_shutdown() is called in twa_detach() >> before than tw_osli_free_resource(). >> tw_cl_shutdown_ctlr() also takes care to disable the interrupts for >> twa but a problem can arise with shared IRQ. Infact, the handler will >> remain on the IRQ until the bus_intr_teardown() takes place and it may >> run, trying to acknowledge the interrupt, but with destroyed lock, if >> an interrupt is sent by a shared source between twa_shutdown() and >> tw_osli_free_resource() call in twa_detach() or just after a simple >> call to twa_shutdown(). >> >> Problems I see here: >> - twa_shutdown() should not destroy the mutex at all, it is not >> something our shutdown handlers generally do and it might be kept in >> sync >> - twa_detach() might do a first half of tw_cl_shutdown_ctlr(), do the >> resource deallocation and just at the end destroy mutexes. That is how >> generally our detach handler works. >> >> All these solutions would mean refactoring the tw_osli_free_resource() >> and tw_cl_shutdown_ctlr(). I don't know very well the twa code, but it >> seems to me that we want to keep the driver very compatible with any >> vendor version or such? If yes this may be a problem because the >> failing patterns are all located into the shared code and an ideal >> solution could be more difficult to find out. Otherwise a fix might be >> simple to hammer down. > > Forgot to tell: twe might have the same problem even if it doesn't > expose just for luckiness. Hmmm... ok. I don't have a twe enabled card so I can't verify whether or not this problem exists :/. Thanks for the comments! -Garrett