Date:      Thu, 7 Jan 2016 16:10:26 -0800
From:      Mark Millard <markmi@dsl-only.net>
To:        Ian Lepore <ian@freebsd.org>, Hans Petter Selasky <hps@selasky.org>
Cc:        freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: FYI: various 11.0-CURRENT -r293227 (and older) hangs on arm (rpi2): a description of sorts
Message-ID:  <A46F1EFD-C8B9-4665-A36C-F161BE45CC2F@dsl-only.net>
In-Reply-To: <1452205717.1215.25.camel@freebsd.org>
References:  <E0379BE9-308A-4219-A8AE-A5FFE828BA93@dsl-only.net> <1452183170.1215.4.camel@freebsd.org> <FB0D5486-AD27-44A7-86CA-68989AE08EC7@dsl-only.net> <1452196099.1215.12.camel@freebsd.org> <568EC4D8.7010106@selasky.org> <8B728C93-9C90-4821-A607-5D157F028812@dsl-only.net> <568ED810.8010309@selasky.org> <568ED92C.9070602@selasky.org> <B7E8D0FD-B3A9-40DF-B0ED-9D3041F8B2A2@dsl-only.net> <1452205717.1215.25.camel@freebsd.org>


On 2016-Jan-7, at 2:28 PM, Ian Lepore <ian at freebsd.org> wrote:
> 
> On Thu, 2016-01-07 at 14:04 -0800, Mark Millard wrote:
>> On 2016-Jan-7, at 1:31 PM, Hans Petter Selasky <hps at selasky.org>
>> wrote:
>>> 
>>> On 01/07/16 22:26, Hans Petter Selasky wrote:
>>>> On 01/07/16 21:20, Mark Millard wrote:
>>>>> 
>>>>> On 2016-Jan-7, at 12:04 PM, Hans Petter Selasky <hps at selasky.org>
>>>>> wrote:
>>>>>> 
>>>>>> On 01/07/16 20:48, Ian Lepore wrote:
>>>>>>> If the filesystems and swap space are on a usb drive, then maybe
>>>>>>> it's the usb subsystem that's hanging.  The wait states you showed
>>>>>>> for those processes are consistent with what I've seen when all
>>>>>>> buffers get backed up in a queue on one non-responsive or slow
>>>>>>> device.  It may be that there's a way to get the system deadlocked
>>>>>>> when it's low on buffers and there is memory pressure causing the
>>>>>>> swap to be used (I generally run arm systems without any swap
>>>>>>> configured).
>>>>>>> 
>>>>>>> Running gstat in another window while this is going on may give
>>>>>>> you some insight into the situation.  Beyond that I don't know
>>>>>>> what to look at, especially since you generally can't launch any
>>>>>>> new tools once the system gets into this kind of state.
>>>>>>> 
>>>>>>> -- Ian
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> All USB transfers toward disk devices have timeouts, so if
>>>>>> something is hanging at the USB level, you'll get a printout
>>>>>> eventually.
>>>>> 
>>>>> How long after the deadlock/live-lock apparently starts does one
>>>>> have to wait before concluding that the timeouts would have fired,
>>>>> and so do not apply to the deadlock/live-lock?
>>>>> 
>>>>>> The USB kernel processes needed for doing I/O transfers are not
>>>>>> pinned to RAM. If a USB process is swapped out to disk, can it
>>>>>> happen that the system cannot wake up the swapped-out process to
>>>>>> get more swap?
>>>>>> 
>>>>>> --HPS
>>>>> 
>>>> 
>>>> Hi,
>>>> 
>>>>> Wow. Could I use ddb to somehow check on the "USB kernel processes"
>>>>> swap status when the overall context is deadlocked/live-locked?
>>>> 
>>>> Are you able to run something like:
>>>> 
>>>> ps auxwwH | grep usb
>>>> 
>>>>> If yes, how? Otherwise something in top or some such display that
>>>>> I'd left running over the serial console would have to present
>>>>> useful information on the subject. Is there anything that would?
>>>> 
>>> 
>>> Are you able to SSH into the box or ping it?
>>> 
>>> --HPS
>> 
>> Once the live-lock condition is reached, no new processes can be
>> created as far as I can tell: any process that attempts to create one
>> hangs.
>> 
>> I'd need "ps auxwwH" to be internally repeating to get even that
>> much: I'd have to start it before the live-lock happened and it would
>> have to still be running when the hang occurs, with no ongoing process
>> creation involved.
>> 
>> I'm not so sure that two communicating processes (ps and grep over a
>> pipe) would work, but so far I cannot get even one new process started.
>> 
>> ssh sessions also hang: input and output generally stop for them.
>> (Sometimes the context is such that ^t still works but shows no
>> progress in what it reports.) No new ssh connections are possible:
>> "Operation timed out".
>> 
>> ping does respond normally: overall it is more of a live-lock state
>> than a true deadlock.
>> 
>> The serial console still shows output from whatever it was already
>> running, as long as that process does nothing that locks up. Changing
>> what it is doing generally locks it up too.
>> 
>> Doing something like unplugging a usb keyboard or mouse, or plugging
>> one in, does show the expected messages via the console: overall it is
>> more of a live-lock state than a true deadlock.
>> 
>> I can get to ddb after the hang, but I do not know how to use it to
>> find any useful information.
>> 
>> 
>> As noted in another message: I used gstat instead of top on the
>> serial console:
>> 
>>> gstat shows everything zero during a hang, even the L(q) column.
>>> (Length of queue?)
>>> 
>>> I used:
>>> 
>>> gstat -cod
>>> 
>>> and had it running over the serial console port during the
>>> attempted portmaster activity.
> 
> All of those symptoms sound consistent with the deadlock being
> IO-related.  You can't ssh in because creating an ssh session for you
> requires reading a variety of files and it locks at that point.  USB
> insert/remove events lead to devd events which can lead to doing IO (to
> load driver modules, for example), so that might lead to lockups or not.
> 
> Since ddb is still usable when the hangs occur, you can break into that
> and use its 'ps' command (no args) to find out what various threads are
> waiting for (wmesg column).  The fact that your original output
> included processes in a 'wswbuf' state is what makes me think it's
> swap-related IO that's causing everything else to back up behind it.
> (Unfortunately, there are 'wswbuf0' and 'wswbuf1' waits in the kernel
> that really should be named 'wsw0buf' and 'wsw1buf' to allow for the
> 6-char truncation of the display.)
> 
> There are probably ddb commands to look at a variety of other
> interesting things (the 'show' command has a lot of options), but I
> don't know what to look at really, other than some guesses (show pageq
> might be interesting, show freepages maybe?).
> 
> -- Ian

FYI. . .

ddb's "ps" showed (my presentation order and formatting):

[pagedaemon] had wmesg wswbuf0 and state D
[swapper]    had wmesg vmwait  and state D
[md0]        had wmesg vmwait  and state DL

[usb]'s threads:
  [usb0]     had wmesg -       and state D (all 5 such lines did)
  [smsc0]    had wmesg -       and state D

"show pageq" listed:

pq_free 2 pq_cache 0
dom 0 page_cnt 234761 free 2 pq_act 164873 pq_inact 18563 pass 2

"show freepages" listed only one non-zero entry in the "NUMBER POOL 0"
column:

ORDER (SIZE) NUMBER
             POOL 0
01 (000008k) 000001



===
Mark Millard
markmi at dsl-only.net



