From owner-freebsd-current@FreeBSD.ORG  Tue Jun 17 19:33:12 2003
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id E0E2537B401
	for <current@FreeBSD.org>; Tue, 17 Jun 2003 19:33:11 -0700 (PDT)
Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 0CE2C43F75
	for <current@FreeBSD.org>; Tue, 17 Jun 2003 19:33:09 -0700 (PDT)
	(envelope-from truckman@FreeBSD.org)
Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2])
	by gw.catspoiler.org (8.12.9/8.12.9) with ESMTP id h5I2WxM7053350;
	Tue, 17 Jun 2003 19:33:03 -0700 (PDT)
	(envelope-from truckman@FreeBSD.org)
Message-Id: <200306180233.h5I2WxM7053350@gw.catspoiler.org>
Date: Tue, 17 Jun 2003 19:32:59 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
To: chris@shenton.org
In-Reply-To: <8765n4b22w.fsf@PECTOPAH.shenton.org>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
cc: current@FreeBSD.org
Subject: Re: 5.1-CURRENT hangs on disk i/o? sysctl_old_user() non-sleepable
 locks
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 18 Jun 2003 02:33:12 -0000

On 17 Jun, Chris Shenton wrote:
> Don Lewis <truckman@FreeBSD.org> writes:
> 
>> I doubt it.  I checked in a fix for this problem today so you should get
>> the fix when you next cvsup.
> 
> Yup, many thanks.
> 
>> Can you break into ddb and do a ps to find out what state all the
>> processes are in?
> 
> I'm a newbie to ddb.  Was able to get a ps from a hung system but
> didn't know how to capture it to send to you.  Any hints?

If you have another machine and a null modem cable you can redirect the
system console of the machine to be debugged to a serial port and run
some comm software on the other machine so that you can capture all the
output from ddb.

Lacking that, there's the pencil and paper method that I used for far
too long.

> 
>> You might want to try adding the DEBUG_VFS_LOCKS options to your
>> kernel config to see if that turns up anything.
> 
> Oh, man, I'm getting killed here now. Rebuilt the kernel with that
> option (not found in GENERIC or other examples in /usr/src/sys/i386/conf/).
> 
> Now the system is dropping into ddb every minute or so with complaints
> like the following on the screen, and in /var/log/messages:
> 
> Jun 17 21:06:08 PECTOPAH kernel: VOP_GETVOBJECT: 0xc584eb68 is not locked but should be
> Jun 17 21:08:04 PECTOPAH last message repeated 3 times
> ...
> Jun 17 21:18:55 PECTOPAH kernel: VOP_GETVOBJECT: 0xc59346d8 is not locked but should be
> Jun 17 21:18:59 PECTOPAH last message repeated 5 times
> 
> Lots 'n' lots of 'em, with a few of the same hex value then another
> set for a different hex value.

Been there, but that was quite a while ago.  I run this way all the time
and hardly ever see problems these days.  You must be exercising some
file system code that I don't.  At the ddb prompt, you can do a "tr"
command to get a stack trace, which is likely to be very helpful in
pointing out the offending code.

If you're getting a lot of VFS lock violation reports, the underlying
locking violations could be the reason that your machine deadlocks.

Post some representative stack traces.  These problems are generally
easy to fix.

>> There is also a ddb command to list the locked vnodes: "show
>> lockedvnods".
> 
> After I type "cont" at ddb a few times the system runs for a while
> again, only to repeat.  When it drops to ddb again that show command
> doesn't list anything. 
> 
> I may have to remove that option from my kernel just to get to run a
> bit, even tho eventually the system will hang.  It's (of course) my
> main box which the other systems NFS off, mail server, etc. :-(

At the ddb prompt you should be able to use the write command to tweak a
couple of variables to modify this behavior.  If you set the
vfs_badlock_panic variable to zero, the kernel will no longer drop into
DDB when one of these lock violations occurs.  If you set the
vfs_badlock_print variable to zero, the kernel will stop printing the
warnings.
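
To make the effect of those two variables concrete, here is a small
user-space sketch.  It is only a model, not the kernel source: the
struct, the counters, and the model_* function names are made up for
illustration, and only the two variable names and the message text come
from this thread.

```c
#include <stdio.h>

/* User-space stand-ins for the two kernel variables named above. */
int vfs_badlock_print = 1;	/* nonzero: print a warning per violation */
int vfs_badlock_panic = 1;	/* nonzero: drop into the debugger */

/* Counters so the effect of each flag can be observed in this model. */
int warnings_printed;
int debugger_entries;

/* Hypothetical stand-in for a vnode; only the lock state matters here. */
struct model_vnode {
	int locked;
};

/* Report one violation, honoring the two flags independently. */
static void
model_badlock(const char *msg, const char *caller,
    const struct model_vnode *vp)
{
	if (vfs_badlock_print) {
		printf("%s: %p %s\n", caller, (const void *)vp, msg);
		warnings_printed++;
	}
	if (vfs_badlock_panic)
		debugger_entries++;	/* a real kernel would stop in ddb here */
}

/* Model of a "must be locked" assertion; returns 1 if the check passed. */
int
model_assert_locked(const struct model_vnode *vp, const char *caller)
{
	if (vp != NULL && !vp->locked) {
		model_badlock("is not locked but should be", caller, vp);
		return (0);
	}
	return (1);
}
```

With both flags zero the violation still happens; it just goes
unreported.  At the db> prompt the change would presumably look like
"write vfs_badlock_panic 0", but check ddb(4) for the exact syntax on
your build.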

If you are running the NFS *client* code on this machine, there is one
lock assertion that is easy to trigger.  The stack trace will show the
nfsiod process calling nfssvc_iod(), which calls nfs_doio(), which
complains about a lock not being held.  If you run into that problem,
just comment out the line:
	 ASSERT_VOP_LOCKED(vp, "nfs_doio");
in nfs_doio(), in the file sys/nfsclient/nfs_bio.c.  I haven't been able
to figure out the correct fix for this problem, and so far I haven't
seen any ill effects from leaving it unfixed.
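
As a rough illustration of what commenting out that line does and does
not change, here is a self-contained user-space sketch.  Everything in
it (MODEL_ASSERT_LOCKED, model_doio_checked, model_doio_unchecked, the
counter) is a hypothetical stand-in, not code from nfs_bio.c.

```c
#include <stdio.h>

int lock_violations;	/* how many times the model assertion fired */

/* Hypothetical stand-in for a lock assertion: report, then carry on. */
#define MODEL_ASSERT_LOCKED(locked, where)				\
	do {								\
		if (!(locked)) {					\
			lock_violations++;				\
			printf("%s: lock not held\n", (where));		\
		}							\
	} while (0)

/* I/O routine with the assertion in place; returns bytes "transferred". */
int
model_doio_checked(int locked, int nbytes)
{
	MODEL_ASSERT_LOCKED(locked, "model_doio");
	return (nbytes);
}

/* Same routine with the assertion commented out. */
int
model_doio_unchecked(int locked, int nbytes)
{
	/* MODEL_ASSERT_LOCKED(locked, "model_doio"); */
	(void)locked;		/* lock state is no longer examined */
	return (nbytes);
}
```

Both variants do the same work on the same unlocked object; the only
difference is whether the violation gets reported.  Commenting out the
assertion is a workaround that silences the report, not a fix for the
locking itself.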

> 
>> Are you using nullfs or unionfs which are a bit fragile?
> 
> Nope.  I'd be happy to mail you my kernel config if you want. I've
> posted it to http://chris.shenton.org/PECTOPAH but if the system's
> hung again, naturally it won't be available :-(
> 
> 
> Thanks for your help.  Any other things I might try?
> 
> Dunno if this matters, but I'm using a DELL CERC ATA RAID card with
> disks showing up as amrd*.  Was flawless at
> 5.0-{CURRENT,RELEASE}.