From: Ian Lepore
To: Mark Millard, Hans Petter Selasky
Cc: freebsd-arm
Subject: Re: FYI: various 11.0-CURRENT -r293227 (and older) hangs on arm (rpi2): a description of sorts
Date: Thu, 07 Jan 2016 15:28:37 -0700
Message-ID: <1452205717.1215.25.camel@freebsd.org>

On Thu, 2016-01-07 at 14:04 -0800, Mark Millard wrote:
> On 2016-Jan-7, at 1:31 PM, Hans Petter Selasky wrote:
> >
> > On 01/07/16 22:26, Hans Petter Selasky wrote:
> > > On 01/07/16 21:20, Mark Millard wrote:
> > > >
> > > > On 2016-Jan-7, at 12:04 PM, Hans Petter Selasky wrote:
> > > >
> > > > > On 01/07/16 20:48, Ian Lepore wrote:
> > > > > > If the filesystems and swap space are on a usb drive, then
> > > > > > maybe it's the usb subsystem that's hanging. The wait
> > > > > > states you showed for those processes are consistent with
> > > > > > what I've seen when all buffers get backed up in a queue
> > > > > > on one non-responsive or slow device. It may be that
> > > > > > there's a way to get the system deadlocked when it's low
> > > > > > on buffers and there is memory pressure causing the swap
> > > > > > to be used (I generally run arms systems without any swap
> > > > > > configured).
> > > > > >
> > > > > > Running gstat in another window while this is going on may
> > > > > > give you some insight into the situation. Beyond that I
> > > > > > don't know what to look at, especially since you generally
> > > > > > can't launch any new tools once the system gets into this
> > > > > > kind of state.
> > > > > >
> > > > > > -- Ian
> > > > >
> > > > > Hi,
> > > > >
> > > > > All USB transfers towards disk devices have timeouts, so if
> > > > > something is hanging at USB level, you'll get a printout
> > > > > eventually.
> > > >
> > > > What sort of timescale after deadlock/live-lock is observed to
> > > > apparently have started does one have to wait in order to
> > > > conclude that the timeouts would have happened and so they do
> > > > not apply to the deadlock/live-lock?
> > > >
> > > > > The USB kernel processes needed for doing I/O transfers are
> > > > > not pinned to RAM. Can it happen if a USB process is swapped
> > > > > to disk, that the system cannot wakeup a swapped out process
> > > > > to get more swap?
> > > > >
> > > > > --HPS
> > > >
> > >
> > > Hi,
> > >
> > > > Wow. Could I use ddb to somehow check on the "USB kernel
> > > > processes" swap status when the overall context is
> > > > deadlocked/live-locked?
> > >
> > > Are you able to run something like:
> > >
> > > ps auxwwH | grep usb
> > >
> > > > If yes, how? Otherwise something in top or some such display
> > > > that I'd left running over the serial console would have to
> > > > present useful information on the subject. Is there anything
> > > > that would?
> > >
> >
> > Are you able to SSH into the box or ping it?
> >
> > --HPS
>
> Once the live-lock condition is reached no new processes can be
> created as far as I can tell: the attempt will hang any process that
> attempts the creation.
>
> I'd need "ps auxwwH" to be internally repeating to even get that
> much: I'd have to start it before the live-lock happened and it would
> have to be still running when the hang occurs, no on-going process
> creations involved.
>
> I'm not so sure that two communicating processes (ps and grep over a
> pipe) would work but I can not get to even one new process so far.
>
> ssh sessions also hang, input and output stop for them fairly
> generally. (Sometimes the context is such that ^t still works but
> shows no progress in what it reports.) No new ssh connections are
> possible: "Operation timed out".
>
> ping does respond normally: it is more of a live-lock status than a
> true deadlock one overall.
>
> The serial console still outputs what it was already running if that
> process does nothing that locks up. Changing what it is doing
> generally locks it up too.
>
> Doing something like unplugging a usb keyboard or mouse or plugging
> one in does show the expected messages via the console: it is more of
> a live-lock status than a true deadlock one overall.
>
> I can get to ddb after the hang. But I do not know what I'd do with
> it to find any useful information.
>
>
> As noted in another message: I used gstat instead of top on the
> serial console:
>
> > gstat shows everything zero during a hang, even L(q) column.
> > (Length of queue?)
> >
> > I used:
> >
> > gstat -cod
> >
> > and had it running over the serial console port during the
> > attempted portmaster activity.

All of those symptoms sound consistent with the deadlock being
IO-related. You can't ssh in because creating an ssh session for you
requires reading a variety of files and it locks at that point. USB
insert/remove events lead to devd events which can lead to doing IO (to
load driver modules for example) so that might lead to lockups or not.

Since ddb is still usable when the hangs occur, you can break into that
and use its 'ps' command (no args) to find out what various threads are
waiting for (wmesg column). The fact that your original output included
processes in a 'wswbuf' state is what makes me think it's swap-related
IO that's causing everything else to back up behind it. (Unfortunately,
there are 'wswbuf0' and 'wswbuf1' waits in the kernel that really should
be named 'wsw0buf' and 'wsw1buf' to allow for the 6-char truncation of
the display).

There are probably ddb commands to look at a variety of other
interesting things (the 'show' command has a lot of options), but I
don't know what to look at really, other than some guesses (show pageq
might be interesting, show freepages maybe?).

-- Ian
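
To illustrate the truncation remark about the wait-channel names, here is a minimal Python sketch. It is not kernel code: the helper names `truncate_wmesg` and `collisions` are hypothetical, and the 6-character display width is simply taken from the message above. It shows why 'wswbuf0' and 'wswbuf1' become indistinguishable once truncated, while 'wsw0buf' and 'wsw1buf' would not.

```python
from collections import defaultdict

# Assumption: display width of the wmesg column, taken from the
# "6-char truncation" mentioned in the message (not read from the kernel).
DISPLAY_WIDTH = 6


def truncate_wmesg(name, width=DISPLAY_WIDTH):
    """Return the name as it would appear when cut to `width` characters."""
    return name[:width]


def collisions(names, width=DISPLAY_WIDTH):
    """Group names by their truncated form; keep only ambiguous groups."""
    groups = defaultdict(list)
    for name in names:
        groups[truncate_wmesg(name, width)].append(name)
    return {shown: full for shown, full in groups.items() if len(full) > 1}


if __name__ == "__main__":
    # Current names: both are displayed as "wswbuf".
    print(collisions(["wswbuf0", "wswbuf1"]))
    # Names suggested in the message: no ambiguity after truncation.
    print(collisions(["wsw0buf", "wsw1buf"]))
```

With the current names, a 'wswbuf' entry in the display could be either wait, which is why the message refers to the state only as 'wswbuf'.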