Date:      Thu, 07 Jan 2016 15:28:37 -0700
From:      Ian Lepore <ian@freebsd.org>
To:        Mark Millard <markmi@dsl-only.net>, Hans Petter Selasky <hps@selasky.org>
Cc:        freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: FYI: various 11.0-CURRENT -r293227 (and older) hangs on arm (rpi2): a description of sorts
Message-ID:  <1452205717.1215.25.camel@freebsd.org>
In-Reply-To: <B7E8D0FD-B3A9-40DF-B0ED-9D3041F8B2A2@dsl-only.net>
References:  <E0379BE9-308A-4219-A8AE-A5FFE828BA93@dsl-only.net> <1452183170.1215.4.camel@freebsd.org> <FB0D5486-AD27-44A7-86CA-68989AE08EC7@dsl-only.net> <1452196099.1215.12.camel@freebsd.org> <568EC4D8.7010106@selasky.org> <8B728C93-9C90-4821-A607-5D157F028812@dsl-only.net> <568ED810.8010309@selasky.org> <568ED92C.9070602@selasky.org> <B7E8D0FD-B3A9-40DF-B0ED-9D3041F8B2A2@dsl-only.net>

On Thu, 2016-01-07 at 14:04 -0800, Mark Millard wrote:
> On 2016-Jan-7, at 1:31 PM, Hans Petter Selasky <hps at selasky.org>
> wrote:
> > 
> > On 01/07/16 22:26, Hans Petter Selasky wrote:
> > > On 01/07/16 21:20, Mark Millard wrote:
> > > > 
> > > > On 2016-Jan-7, at 12:04 PM, Hans Petter Selasky <hps at
> > > > selasky.org>
> > > > wrote:
> > > > > 
> > > > > On 01/07/16 20:48, Ian Lepore wrote:
> > > > > > If the filesystems and swap space are on a usb drive, then
> > > > > > maybe it's
> > > > > > the usb subsystem that's hanging.  The wait states you
> > > > > > showed for those
> > > > > > > processes are consistent with what I've seen when all
> > > > > > buffers get
> > > > > > > backed up in a queue on one non-responsive or slow device.
> > > > > > > It may be
> > > > > > that there's a way to get the system deadlocked when it's
> > > > > > low on
> > > > > > buffers and there is memory pressure causing the swap to be
> > > > > > used (I
> > > > > > > generally run arm systems without any swap configured).
> > > > > > 
> > > > > > Running gstat in another window while this is going on may
> > > > > > give you
> > > > > > some insight into the situation.  Beyond that I don't know
> > > > > > what to look
> > > > > > at, especially since you generally can't launch any new
> > > > > > tools once the
> > > > > > system gets into this kind of state.
> > > > > > 
> > > > > > -- Ian
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > All USB transfers towards disk devices have timeouts, so if
> > > > > something
> > > > > is hanging at USB level, you'll get a printout eventually.
> > > > 
> > > > What sort of timescale after deadlock/live-lock is observed to
> > > > apparently have started does one have to wait in order to
> > > > conclude
> > > > that the timeouts would have happened and so they do not apply
> > > > to the
> > > > deadlock/live-lock?
> > > > 
> > > > > The USB kernel processes needed for doing I/O transfers are
> > > > > not
> > > > > pinned to RAM. Can it happen if a USB process is swapped to
> > > > > disk,
> > > > > that the system cannot wakeup a swapped out process to get
> > > > > more swap?
> > > > > 
> > > > > --HPS
> > > > 
> > > 
> > > Hi,
> > > 
> > > > Wow. Could I use ddb to somehow check on the "USB kernel
> > > > processes"
> > > > swap status when the overall context is deadlocked/live-locked?
> > > 
> > > Are you able to run something like:
> > > 
> > > ps auxwwH | grep usb
> > > 
> > > > > If yes, how? Otherwise something in top or some such display
> > > > > that I'd left running over the serial console would have to
> > > > > present useful information on the subject. Is there anything
> > > > > that would?
> > > 
> > 
> > Are you able to SSH into the box or ping it?
> > 
> > --HPS
> 
> Once the live-lock condition is reached no new processes can be
> created as far as I can tell: the attempt will hang any process that
> attempts the creation.
> 
> I'd need "ps auxwwH" to be internally repeating to even get that
> much: I'd have to start it before the live-lock happened and it would
> have to be still running when the hang occurs, no on-going process
> creations involved.
> 
> I'm not so sure that two communicating processes (ps and grep over a
> pipe) would work but I can not get to even one new process so far.
> 
> ssh sessions also hang, input and output stop for them fairly
> generally. (Sometimes the context is such that ^t still works but
> shows no progress in what it reports.) No new ssh connections are
> possible: "Operation timed out".
> 
> ping does respond normally: it is more of a live-lock status than a
> true deadlock one overall.
> 
> The serial console still outputs what it was already running if that
> process does nothing that locks up. Changing what it is doing
> generally locks it up too.
> 
> Doing something like unplugging a usb keyboard or mouse or plugging
> one in does show the expected messages via the console: it is more of
> a live-lock status than a true deadlock one overall.
> 
> I can get to ddb after the hang. But I do not know what I'd do with
> it to find any useful information.
> 
> 
> As noted in another message: I used gstat instead of top on the
> serial console:
> 
> > gstat shows everything zero during a hang, even L(q) column.
> > (Length of queue?)
> > 
> > I used:
> > 
> > gstat -cod
> > 
> > and had it running over the serial console port during the
> > attempted portmaster activity.

All of those symptoms sound consistent with the deadlock being
IO-related.  You can't ssh in because creating an ssh session for you
requires reading a variety of files and it locks at that point.  USB
insert/remove events lead to devd events which can lead to doing IO (to
load driver modules for example) so that might lead to lockups or not.

Since ddb is still usable when the hangs occur, you can break into that
and use its 'ps' command (no args) to find out what various threads are
waiting for (wmesg column).  The fact that your original output
included processes in a 'wswbuf' state is what makes me think it's
swap-related IO that's causing everything else to back up behind it.
(Unfortunately, there are 'wswbuf0' and 'wswbuf1' waits in the kernel
that really should be named 'wsw0buf' and 'wsw1buf' to allow for the
6-char truncation of the display).

There are probably ddb commands to look at a variety of other
interesting things (the 'show' command has a lot of options), but I
don't know what to look at really, other than some guesses (show pageq
might be interesting, show freepages maybe?).
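
As a rough sketch only, a session along those lines might look like
the following.  The first three commands are the ones named above;
'show lockedvnods' and 'alltrace' are extra stock ddb commands added
here on the assumption that locked vnodes and per-thread stack traces
would help pin down an IO-related hang.  None of this has been tried
on the rpi2 in question, and the output will be long over a serial
console.

    db> ps
    db> show pageq
    db> show freepages
    db> show lockedvnods
    db> alltrace

In the 'ps' output the thing to look for is which wmesg values most
of the sleeping threads share.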

-- Ian



