From owner-freebsd-current@FreeBSD.ORG Tue Apr 13 10:50:31 2010 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E09991065677; Tue, 13 Apr 2010 10:50:31 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-ew0-f224.google.com (mail-ew0-f224.google.com [209.85.219.224]) by mx1.freebsd.org (Postfix) with ESMTP id 185EE8FC14; Tue, 13 Apr 2010 10:50:30 +0000 (UTC) Received: by ewy24 with SMTP id 24so1378379ewy.33 for ; Tue, 13 Apr 2010 03:50:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:received:in-reply-to :references:date:x-google-sender-auth:received:message-id:subject :from:to:cc:content-type:content-transfer-encoding; bh=knXy/Y55AU9UfjG2BJO/JGG+NTu1joN2wuoAeG49BcI=; b=vp44Vju17pSUjwhaf7UUvcInT/bzllmv0Sl87B0UhK29nQdwTttzoVpzyZBcftoIGT G+7mhxC/2LLl0MJ6+0sVtIGqZBOVsCuRvns0In1TLq8NvV+CEYL0nrwdLKjs2nUMxPnU tMlR/V5vuqe7+36t5KtvttWx8xPrVw2fSqml8= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; b=iMrDjbgTWe8F9EhFh3x0cWVLC7/quYz4jusVY5jfwMJZZqZoCxyIHbiHuhIUhW4hsM PskwCTOPlUtF/HKq/K4KxccyBWpAc4xihZ7SyZ+2Hw1QeW6sg77IWC6aqB9VlfUqJMQZ Uxda+Nx7vwSKo0E7/ejWxdSKOu4KhgBJALWvs= MIME-Version: 1.0 Sender: asmrookie@gmail.com Received: by 10.103.193.19 with HTTP; Tue, 13 Apr 2010 03:50:29 -0700 (PDT) In-Reply-To: References: <7d6fde3d1003070207q621e69ado2cb64e431feacd76@mail.gmail.com> <7d6fde3d1003070224k3626a9b5y98c11a43eef1bed4@mail.gmail.com> <4e6cba831003101356i534341ffr2961b983854ab788@mail.gmail.com> <7dc40bd01003101407m605e41ey2d8ace0049cf5e61@mail.gmail.com> <7d6fde3d1003102158o7834ca67lce3eca23aa723fd1@mail.gmail.com> <7d6fde3d1003121933s4ba7b57fw6542628c16edf723@mail.gmail.com> Date: Tue, 13 Apr 2010 12:50:29 +0200 X-Google-Sender-Auth: 5fc5e29874d86b21 Received: by 10.102.166.8 with SMTP id o8mr2985914mue.13.1271155829760; Tue, 13 Apr 2010 03:50:29 -0700 (PDT) Message-ID: From: Attilio Rao To: Garrett Cooper Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: Tom Couch , Giovanni Trematerra , FreeBSD Current , Scott Long , Pyun YongHyeon Subject: Re: Removing USB keyboard after filesystems synced causes panic with destroyed mutex twa(4)? X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 13 Apr 2010 10:50:32 -0000 2010/4/13 Attilio Rao : > 2010/3/13 Garrett Cooper : >> On Wed, Mar 10, 2010 at 9:58 PM, Garrett Cooper wro= te: >>> On Wed, Mar 10, 2010 at 2:07 PM, Tom Couch wrote: >>>> Hi FreeBSD-current, >>>> =C2=A0 =C2=A0 My name is Tom Couch, >>>> I am part of the 3ware driver team recently acquired by LSI. >>>> I believe Giovanni's patch, below, is the correct fix for this bug. >>>> >>>> I am available to maintain the twa driver, now that I am on this list. >>>> Let me know how I can help, >>>> >>>> Tom >>>> >>>> >>>> On Wed, Mar 10, 2010 at 1:56 PM, Giovanni Trematerra < >>>> giovanni.trematerra@gmail.com> wrote: >>>> >>>>> On Sun, Mar 7, 2010 at 11:24 AM, Garrett Cooper >>>>> wrote: >>>>> > On Sun, Mar 7, 2010 at 2:07 AM, Garrett Cooper >>>>> wrote: >>>>> >> Hi Alexander and Hans, >>>>> >> =C2=A0 =C2=A0I recently did the following which generated a panic = on a >>>>> >> 9-CURRENT kernel compiled on the 26th: >>>>> >> >>>>> >> 1. Executed reboot >>>>> >> 2. Removed keyboard. >>>>> >> 3. Some time after `All buffers synced\nUptime: ...' was displayed= , >>>>> >> the keyboard was registered disconnected. >>>>> >> 4. The interrupt was delivered to my twa(4) enabled card and the >>>>> >> kernel panicked, like so: >>>>> >> >>>>> >> ugen2.2: at usbus2 (disconnected) >>>>> >> uhub8: at uhub2, port 1, addr 2 (disconnected) >>>>> >> ugen2.3: at usbus2 (disconnected) >>>>> >> ukbd0: at uhub8, port 3, addr 3 (disconnected) >>>>> >> uhid0: at uhub8, port 3, addr 3 (disconnected) >>>>> >> panic: mtx_lock_spin() of destroyed mutex @ >>>>> /usr/src/sys/dev/twa/tw_cl_intr.c:88 >>>>> >> >>>>> >> cpuid =3D 1 >>>>> >> KDB: enter: panic >>>>> >> [thread pid 12 tid 100025 ] >>>>> >> Stopped at =C2=A0 =C2=A0 =C2=A0 =C2=A0 kdb_enter+0x3d: movq =C2=A0= =C2=A0 $0,0x40289c(%rip) >>>>> >> db> >>>>> >> >>>>> >> =C2=A0 =C2=A0I wish I could provide you with more details, but unf= ortunately I >>>>> >> the USB bus isn't registering the fact that I'm reattaching the >>>>> >> keyboard right now and the box won't reboot automatically :( (didn= 't >>>>> >> set the right sysctl beforehand to panic automatically). I'll try = and >>>>> >> reproduce the issue again, but I was just wondering whether or not= you >>>>> >> guys had seen this problem before. >>>>> > >>>>> > =C2=A0 =C2=A0Phew... it's reproducible with that kernel. Here's wha= t I did >>>>> > exactly (because my original directions were incorrect): >>>>> > =C2=A0 =C2=A01. Hit power button (for S5). >>>>> > =C2=A0 =C2=A02. Disconnect keyboard RIGHT as `Uptime: ...' is displ= ayed. >>>>> > =C2=A0 =C2=A0Kernel panicked on my system again. Now to figure out = if it still >>>>> > exists with a kernel compiled today, and also how to debug it if it >>>>> > still does exist :/... >>>>> > Thanks, >>>>> > -Garrett >>>>> >>>>> Hi Garrett, >>>>> Could you please try the patch below and report back? >>>>> >>>>> Thank you >>>>> >>>>> diff -r cab6489de66d sys/dev/twa/tw_cl_intr.c >>>>> --- a/sys/dev/twa/tw_cl_intr.c =C2=A0 =C2=A0 =C2=A0 =C2=A0Wed Mar 03 = 04:51:13 2010 -0500 >>>>> +++ b/sys/dev/twa/tw_cl_intr.c =C2=A0 =C2=A0 =C2=A0 =C2=A0Wed Mar 10 = 06:29:05 2010 -0500 >>>>> @@ -75,9 +75,12 @@ tw_cl_interrupt(struct tw_cl_ctlr_handle >>>>> =C2=A0 =C2=A0 =C2=A0if (ctlr =3D=3D NULL) >>>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 goto out; >>>>> >>>>> - =C2=A0 =C2=A0 /* If we get an interrupt while resetting, it is a sh= ared >>>>> - =C2=A0 =C2=A0 =C2=A0 =C2=A0one for another device, so just bail */ >>>>> - =C2=A0 =C2=A0 if (ctlr->state & TW_CLI_CTLR_STATE_RESET_IN_PROGRESS= ) >>>>> + =C2=A0 =C2=A0 /* >>>>> + =C2=A0 =C2=A0 =C2=A0* =C2=A0If we get an interrupt while resetting = or shutting down, >>>>> + =C2=A0 =C2=A0 =C2=A0* =C2=A0it is a shared one for another device, = so just bail >>>>> + =C2=A0 =C2=A0 =C2=A0*/ >>>>> + =C2=A0 =C2=A0 if (ctlr->state & TW_CLI_CTLR_STATE_RESET_IN_PROGRESS= || >>>>> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 (ctrl->state & TW_CLI_CTLR_STATE_ACTIVE) =3D=3D 0) >>>>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0goto out; >>>>> >>>>> =C2=A0 =C2=A0 =C2=A0/* >> >> Apart from the typo above (s/ctrl/ctlr/), things work appropriately >> now at reboot. The only problem is that bootup is really wonky now, >> because the RAID had a LOT of issues attaching to cam(4) (failed in >> 2/3 cold boot attempts); an additional branch condition may need to be >> added to the above if-statement if this change didn't take that into >> account. However, if the old behavior was incorrect and the new >> behavior is correct, s.t. the RAID controller demonstrating bus >> detection timeout issue that is occurring with a lot of USB devices >> and some RAID controllers today, this could be extremely problematic. >> >> So, while it looks better than before at reboot, it's not ready yet >> for prime time; I'd rather that the bug was filed with the patch you >> provided after the typo fixed, with the caveat mentioned and NOT >> committed, because the adverse affect(s) seem a bit more annoying than >> the previous panic issue described. > > I looked briefly at it and I think there are 2 bugs, one in > twa_detach() and another one in twa_shutdown(). > Basically, locks get destroyed in tw_cl_shutdown_ctlr() which is > called by twa_shutdown() while interrupts are teared down in > tw_osli_free_resource(). twa_shutdown() is called in twa_detach() > before than tw_osli_free_resource(). > tw_cl_shutdown_ctlr() also takes care to disable the interrupts for > twa but a problem can arise with shared IRQ. Infact, the handler will > remain on the IRQ until the bus_intr_teardown() takes place and it may > run, trying to acknowledge the interrupt, but with destroyed lock, if > an interrupt is sent by a shared source between twa_shutdown() and > tw_osli_free_resource() call in twa_detach() or just after a simple > call to twa_shutdown(). > > Problems I see here: > - twa_shutdown() should not destroy the mutex at all, it is not > something our shutdown handlers generally do and it might be kept in > sync > - twa_detach() might do a first half of tw_cl_shutdown_ctlr(), do the > resource deallocation and just at the end destroy mutexes. That is how > generally our detach handler works. > > All these solutions would mean refactoring the tw_osli_free_resource() > and tw_cl_shutdown_ctlr(). I don't know very well the twa code, but it > seems to me that we want to keep the driver very compatible with any > vendor version or such? If yes this may be a problem because the > failing patterns are all located into the shared code and an ideal > solution could be more difficult to find out. Otherwise a fix might be > simple to hammer down. Forgot to tell: twe might have the same problem even if it doesn't expose just for luckiness. Attilio --=20 Peace can only be achieved by understanding - A. Einstein