Date:      Fri, 12 Aug 2011 23:26:49 +0200
From:      Hans Petter Selasky <hselasky@c2i.net>
To:        Andrew Boyer <aboyer@averesystems.com>
Cc:        freebsd-stable@freebsd.org, Eugene Grosbein <egrosbein@rdtc.ru>, Vishal.Shah@netapp.com, Andriy Gapon <avg@freebsd.org>, Jeremiah Lott <jlott@averesystems.com>, Steven Hartland <killing@multiplay.co.uk>
Subject:   Re: USB/coredump hangs in 8 and 9
Message-ID:  <201108122326.49597.hselasky@c2i.net>
In-Reply-To: <DA1FD6FD-2E57-4EC4-899D-2C1CBB769456@averesystems.com>
References:  <DA1FD6FD-2E57-4EC4-899D-2C1CBB769456@averesystems.com>

On Friday 12 August 2011 21:59:21 Andrew Boyer wrote:
> Re: panic: bufwrite: buffer is not busy??? (originally on freebsd-net)
> Re: debugging frequent kernel panics on 8.2-RELEASE (originally on freebsd-stable)
> Re: System hang in USB umass module while processing panic (originally on freebsd-usb)
> 
> Hello Andriy and Hans,
> 
> Sorry for tying in so many discussions on this topic, but I think I have an
> explanation for the problems we have been reporting* with hanging
> coredumps on multicore systems on 8.2-RELEASE, and it has implications for
> Andriy's proposed scheduler patch** and for USB.
> 
> In today's 8.X and 9.X branches, nothing that I can find stops the other
> CPUs when the kernel panics, but many parts of the locking code get
> disabled (grep on 'panicstr').  The 'bufwrite: buffer is not busy???'
> panic is caused by the syncer encountering an error.  If that happens when
> it's on the dumping CPU everything hangs.  If it's running on a different
> CPU, it will be blocked and hidden by the panic_cpu spinlock in panic(),
> and the dump continues, polling every attached keyboard for a Ctrl-C.
> 
> But, the new 8.X USB stack relies on multithreading.  (The new stack is the
> variable that broke coredumps for us in the 7.1->8.2 transition, I think.)
>  SVN 224223 fixes a hang that would happen when dumpsys() polls the USB
> keyboard (IPMI KVM, in our case).  That helps, but it only gets as far as
> usb_process(), where it hangs in a loop around a cv_wait() call.  This is
> easy to reproduce by adding code to the watchdog to break into the
> debugger if panicstr is set.
> 
> I am experimenting with Andriy's patch** to stop the scheduler and it seems
> to be most of the way there, stopping the CPUs and disabling the rest of
> locking.  There are a few places that still reference panicstr, but that's
> minor.  These are the changes I made to the patch:
> 
> * Changed ukbd_do_poll() to return immediately if SCHEDULER_STOPPED() is
>   true, so that we don't hang up in USB.  ukbd_yield() locks up in
>   DROP_GIANT(), and if you skip ukbd_yield(), usbd_transfer_poll() locks
>   up trying to drop mutexes.
> * Changed the call to spinlock_enter() back to critical_enter(), so that
>   interrupts stay enabled and the hardclock still functions.
> * Added code in the beginning of panic() to switch to CPU 0, so that we're
>   able to service the hardclock interrupts and so that watchdog panics get
>   through.
> 
> This has worked 100% for me so far, although anyone using a USB keyboard or
> dump device would still be out of luck.
> 
> Thoughts?  It seems like stopping all of the other CPUs is the right thing
> to do on a panic (what are they doing otherwise?).  Are the USB issues
> fixable?  If Andriy's patch gets committed it might just involve
> short-circuiting all of the locking in the polling path, but I haven't
> gotten that far yet.  I bet dumping to NFS will have the same problem.

Hi.

USB does not rely on multithreading when polling. It bypasses the
processing thread and calls the function directly. I can also add that USB has
recursion-checking flags, so that if an important function is already
running, the code will simply return.
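
To illustrate the two points above (polling calls the handler directly instead of waking a thread, and a flag stops re-entry), here is a minimal userland sketch in C. All names (xfer_queue, queue_handle, queue_poll, on_complete) are hypothetical and are not taken from the FreeBSD USB stack; the real code is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy queue; not a FreeBSD USB data structure. */
struct xfer_queue {
	bool in_handler;        /* recursion guard: handler is already running */
	int  pending;           /* work items still waiting */
	int  handled;           /* work items completed */
	int  reentries_refused; /* nested calls that returned early */
};

void queue_poll(struct xfer_queue *q);

/* A completion callback that (carelessly) tries to poll again. */
void
on_complete(struct xfer_queue *q)
{
	queue_poll(q);
}

/* The work a processing thread would normally perform. */
void
queue_handle(struct xfer_queue *q)
{
	assert(q != NULL);
	if (q->in_handler) {
		/* Already inside the handler: simply return. */
		q->reentries_refused++;
		return;
	}
	q->in_handler = true;
	while (q->pending > 0) {
		q->pending--;
		q->handled++;
		on_complete(q);
	}
	q->in_handler = false;
}

/* Polling path: no thread wakeup, call the handler directly. */
void
queue_poll(struct xfer_queue *q)
{
	queue_handle(q);
}
```

Calling queue_poll() on a queue with three pending items handles all three in the caller's context, and each nested poll attempt from the callback returns immediately instead of recursing.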

USB does not rely on locking after a panic, except perhaps for mtx_owned()
returning the correct value. Your approach of having the mtx_lock() /
mtx_unlock() functions simply do nothing will affect USB's ability to poll if
mtx_owned() does not return true when the lock is locked. So maybe, when
SCHEDULER_STOPPED() is true, we should steal the lock instead of just
returning. I also assume that all interrupts and all other processes are
blocked at the moment of panic or dump.
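
The difference between "do nothing" and "steal the lock" can be sketched with a toy, single-threaded model in C. Everything here is an assumption made for illustration: toy_mtx, toy_mtx_owned(), the two lock variants, and the scheduler_stopped flag standing in for SCHEDULER_STOPPED() are hypothetical and are not the kernel's mutex(9) implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy mutex: records only who owns it. Not the kernel's struct mtx. */
struct toy_mtx {
	const void *owner;      /* NULL when unlocked */
};

/* Stand-in for SCHEDULER_STOPPED(); set once the "panic" has begun. */
bool scheduler_stopped = false;

/* Analogue of mtx_owned(): does 'self' hold the lock? */
bool
toy_mtx_owned(const struct toy_mtx *m, const void *self)
{
	return (m->owner == self);
}

/* Variant A: after the scheduler stops, locking does nothing. */
void
toy_mtx_lock_noop(struct toy_mtx *m, const void *self)
{
	if (scheduler_stopped)
		return;                 /* owner is left untouched */
	assert(m->owner == NULL);       /* a real mutex would block here */
	m->owner = self;
}

/* Variant B: after the scheduler stops, take the lock over. */
void
toy_mtx_lock_steal(struct toy_mtx *m, const void *self)
{
	if (scheduler_stopped) {
		m->owner = self;        /* steal from whoever held it */
		return;
	}
	assert(m->owner == NULL);
	m->owner = self;
}
```

If another context held the mutex when the scheduler stopped, variant A leaves toy_mtx_owned() false for the polling code, so any ownership check in the polling path would fail. Variant B makes the same check succeed.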

--HPS


