From owner-freebsd-current@FreeBSD.ORG Tue Jun 17 19:33:12 2003
Message-Id: <200306180233.h5I2WxM7053350@gw.catspoiler.org>
Date: Tue, 17 Jun 2003 19:32:59 -0700 (PDT)
From: Don Lewis
To: chris@shenton.org
In-Reply-To: <8765n4b22w.fsf@PECTOPAH.shenton.org>
cc: current@FreeBSD.org
Subject: Re: 5.1-CURRENT hangs on disk i/o? sysctl_old_user() non-sleepable locks
List-Id: Discussions about the use of FreeBSD-current

On 17 Jun, Chris Shenton wrote:
> Don Lewis writes:
>
>> I doubt it.  I checked in a fix for this problem today so you should get
>> the fix when you next cvsup.
>
> Yup, many thanks.
>
>> Can you break into ddb and do a ps to find out what state all the
>> processes are in?
>
> I'm a newbie to ddb. Was able to get a ps from a hung system but
> didn't know how to capture it to send to you. Any hints?

If you have another machine and a null modem cable you can redirect the
system console of the machine to be debugged to a serial port and run
some comm software on the other machine so that you can capture all the
output from ddb.
Lacking that, there's the pencil and paper method that I used for far
too long.

>
>> You might want to try adding the DEBUG_VFS_LOCKS options to your
>> kernel config to see if that turns up anything.
>
> Oh, man, I'm getting killed here now. Rebuilt the kernel with that
> option (not found in GENERIC or other examples in /usr/src/sys/i386/conf/).
>
> Now the system is dropping into ddb ever minute or so with complaints
> like the following on the screen, and in /var/log/messages:
>
>  Jun 17 21:06:08 PECTOPAH kernel: VOP_GETVOBJECT: 0xc584eb68 is not locked but should be
>  Jun 17 21:08:04 PECTOPAH last message repeated 3 times
>  ...
>  Jun 17 21:18:55 PECTOPAH kernel: VOP_GETVOBJECT: 0xc59346d8 is not locked but should be
>  Jun 17 21:18:59 PECTOPAH last message repeated 5 times
>
> Lots 'n' lots of 'em, with a few of the same hex value then another
> set for a different hex value.

Been there, but that was quite a while ago.  I run with this option
enabled all the time and hardly ever see problems these days.  You must
be exercising some file system code that I don't.

At the ddb prompt, you can do a "tr" command to get a stack trace, which
is likely to be very helpful in pointing out the offending code.

If you're getting a lot of VFS lock violation reports, the underlying
locking violations could be the reason that your machine deadlocks.
Post some representative stack traces.  These problems are generally
easy to fix.

>> There is also ddb command to list the locked vnodes "show
>> lockedvnods".
>
> After I type "cont" at ddb a few times the system runs for a while
> again, only to repeat.  When it drops to ddb again that show command
> doesn't list anything.
>
> I may have to remove that option from my kernel just to get to run a
> bit, even tho eventually the system will hang.  It's (of course) my
> main box which the other systems NFS off, mail server, etc. :-(

At the ddb prompt you should be able to use the write command to tweak a
couple of variables to modify this behavior.
If you set the vfs_badlock_panic variable to zero, the kernel will no
longer drop into ddb when one of these lock violations occurs.  If you
set the vfs_badlock_print variable to zero, the kernel will stop
printing the warnings.

If you are running the NFS *client* code on this machine, there is one
lock assertion that is easy to trigger.  The stack trace will show the
nfsiod process calling nfssvc_iod(), which calls nfs_doio(), which
complains about a lock not being held.  If you run into that problem,
just comment out the line:

	ASSERT_VOP_LOCKED(vp, "nfs_doio");

in nfs_doio(), in the file sys/nfsclient/nfs_bio.c.  I haven't been able
to figure out the correct fix for this problem, and so far I haven't
run into any trouble from leaving it unfixed.

>
>> Are you using nullfs or unionfs which are a bit fragile?
>
> Nope.  I'd be happy to mail you my kernel config if you want.  I've
> posted it to http://chris.shenton.org/PECTOPAH but if the system's
> hung again, naturally it won't be available :-(
>
>
> Thanks for your help.  Any other things I might try?
>
> Dunno if this matters, but I'm using an DELL CERC ATA RAID card with
> disks showing up as amrd* if that matters.  Was flawless at
> 5.0-{CURRENT,RELEASE}.
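
As an illustration of the vfs_badlock_panic and vfs_badlock_print
switches discussed earlier in this message, here is a minimal user-space
sketch in C of how a pair of flags like these could gate a lock
assertion.  It is not the kernel's code: struct fake_vnode,
check_vop_locked(), and the BADLOCK_* return flags are hypothetical
names invented for this example, and the real ASSERT_VOP_LOCKED
implementation may well differ.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel variables named in the message. */
static int vfs_badlock_print = 1;	/* nonzero: print a warning on a violation */
static int vfs_badlock_panic = 1;	/* nonzero: stop in the debugger on a violation */

/* Hypothetical, simplified vnode that only records whether it is locked. */
struct fake_vnode {
	int locked;
};

/* Hypothetical bit flags describing what the check did. */
#define BADLOCK_NONE     0x0	/* nothing was reported */
#define BADLOCK_PRINTED  0x1	/* a warning was printed */
#define BADLOCK_DEBUGGER 0x2	/* a real kernel would have entered ddb */

/*
 * Check that vp is locked.  On a violation, print a warning and/or note
 * that the debugger would be entered, depending on the two flags above.
 * Returns a bitmask of the BADLOCK_* actions that were taken.
 */
static int
check_vop_locked(const struct fake_vnode *vp, const char *str)
{
	int actions = BADLOCK_NONE;

	assert(vp != NULL && str != NULL);
	if (vp->locked)
		return (actions);
	if (vfs_badlock_print) {
		printf("%s: %p is not locked but should be\n",
		    str, (const void *)vp);
		actions |= BADLOCK_PRINTED;
	}
	if (vfs_badlock_panic)
		actions |= BADLOCK_DEBUGGER;	/* this sketch only records it */
	return (actions);
}
```

In this sketch, clearing vfs_badlock_panic alone leaves the warnings in
place but no longer flags a debugger stop, and clearing both variables
makes the check return silently even for an unlocked vnode.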