From owner-freebsd-current@FreeBSD.ORG  Wed Nov 12 16:17:56 2003
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 52AF716A4CE
	for <freebsd-current@freebsd.org>;
	Wed, 12 Nov 2003 16:17:56 -0800 (PST)
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 6243F43FBF
	for <freebsd-current@freebsd.org>;
	Wed, 12 Nov 2003 16:17:55 -0800 (PST)
	(envelope-from robert@fledge.watson.org)
Received: from fledge.watson.org (localhost [127.0.0.1])
	by fledge.watson.org (8.12.9p2/8.12.9) with ESMTP id hAD0G4Mg096195
	for <freebsd-current@freebsd.org>;
	Wed, 12 Nov 2003 19:16:04 -0500 (EST)
	(envelope-from robert@fledge.watson.org)
Received: from localhost (robert@localhost)hAD0G4No096192
	for <freebsd-current@freebsd.org>;
	Wed, 12 Nov 2003 19:16:04 -0500 (EST)
	(envelope-from robert@fledge.watson.org)
Date: Wed, 12 Nov 2003 19:16:04 -0500 (EST)
From: Robert Watson <rwatson@freebsd.org>
X-Sender: robert@fledge.watson.org
To: freebsd-current@freebsd.org
In-Reply-To: <200311121738.13783.x@Vex.Net>
Message-ID: <Pine.NEB.3.96L.1031112190842.96006C-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: Re: Still getting NFS client locking up
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 13 Nov 2003 00:17:56 -0000


On Wed, 12 Nov 2003, Tim Middleton wrote:

> On November 11, 2003 11:36 pm, Janet Sullivan wrote:
> > So far I only have problems in a mixed -STABLE/-CURRENT environment.
> > When the client & server are both -CURRENT I haven't had any problems.
> 
> I just installed another -STABLE box to see if keeping them both -STABLE
> helps. I haven't really tested the NFS yet as I didn't want to risk
> locking the box up in the middle of a buildworld.

If we can demonstrate the problem with both systems as -STABLE, that rules
out a lot of things, and might also raise some questions about the
hardware.

> So i just mounted the NFS drive on the new test box and left it.... 
> Within an hour the NFS server box doing the build world was locked up
> solid. I can't say if it was NFS mount related or not; nfsd wasn't
> really doing anything. Doesn't seem like it would have been. Beginning
> to wonder if it is some strange hardware problem on this box; which
> coincidentally only shows up when there's an nfs mount! But that doesn't
> explain why my normally rock solid desktop system tanked when being
> tested as an NFS client to that STABLE box. Hmmm...

One of the problems that can occur in -STABLE is a cascading failure when
one file system is jammed up (e.g., an NFS mount from another system). 
Processes hang holding locks in NFS because the NFS session is stalled;
other processes try to acquire the held locks while holding additional
locks, and before you know it a lot of very useful locks are held and
can't be released, because the locks at the root of the chain can't be
freed.  Many aspects of this problem are believed to be resolved in
-CURRENT, but it's a tough cookie to crack without redoing VFS locking.
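
As a rough illustration of how such a cascade spreads, here is a minimal
sketch in Python.  It is a toy model, not how the kernel tracks locks: the
process names, lock names, and the blocked_by() helper are all hypothetical.
Starting from one process stuck in a stalled NFS operation, it walks the
"waits for a lock held by" relation to find everything that ends up stuck.

```python
from collections import deque


def blocked_by(root_holder, holds, waits_for):
    """Return the set of processes transitively stuck behind root_holder.

    holds:     dict mapping process -> set of locks it currently holds
    waits_for: dict mapping process -> the single lock it is sleeping on
    """
    # Map each lock to the process that owns it.
    owner = {lock: proc for proc, locks in holds.items() for lock in locks}
    stuck = set()
    frontier = deque([root_holder])
    while frontier:
        holder = frontier.popleft()
        for proc, lock in waits_for.items():
            # A process is stuck if the lock it wants is owned by a
            # process that is itself unable to make progress.
            if proc not in stuck and owner.get(lock) == holder:
                stuck.add(proc)
                frontier.append(proc)
    return stuck


# Hypothetical scenario: "cp" hangs in a stalled NFS operation while
# holding a lock on an NFS vnode; the others queue up behind it.
holds = {
    "cp": {"nfs_vnode"},
    "ls": {"parent_dir"},
    "sh": {"root_dir"},
    "cron": set(),
}
waits_for = {
    "ls": "nfs_vnode",    # wants the lock cp holds
    "sh": "parent_dir",   # wants the lock ls holds
    "cron": "root_dir",   # wants the lock sh holds
}

print(sorted(blocked_by("cp", holds, waits_for)))
```

In this made-up scenario only "cp" touches NFS directly, yet "ls", "sh",
and "cron" all end up stuck behind it.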

If you have a spare system, it might be really interesting to install
-STABLE on it, replicate data from your file server, point the client at
that, and see if the problem still occurs there with the same load.  You
might also try swapping network cards: perhaps we're looking at a network
device driver problem where loss of key packets, or packets over a certain
size, is causing an unrecoverable failure.

> Back to testing.  I'm doing heavy disk I/O tests without any NFS mounts
> now.  If they go okay, back to the NFS mounting and testing... 
> 
> It seems to me there is something desperately wrong with NFS if mixing 
> -CURRENT and -STABLE NFS server/clients causes either side (in my case both 
> sides) to lock up solid. I mean, problems are problems... but solid lockups 
> with no crash messages or anything is ... nasty.

Clearly there's a substantial problem, but it sounds like we're still
having a lot of trouble identifying the circumstances that trigger the
problem, and attempting to narrow things down.  One of the problems with
distributed system debugging is that it's often hard to track the problem
down to a particular source when you catch it partway through a cascading
failure.  For example, it could well be that a server problem is
triggering client symptoms, or it could be that a serious client problem
might consume resources on the server such that other clients couldn't
operate.  Under these circumstances, it can be very difficult to track it
down to a particular cause (a missing unlock on the server, for example). 

> > Are the folks seeing hangs getting any kind of console error messages?
> 
> I see nothing. My server completely locks up. Nothing responds. The
> drive light (the times i've noticed) is frozen "on". On my desktop box
> the mouse is dead as well.

I can't help but wonder if the server isn't suffering an under-reported
hardware failure.  It might be interesting to see how quickly the problem
vanishes when exchanging various elements.

> > I don't see anything - performance just tanks to the point of being
> > unusable.
> 
> When testing with my desktop box as client, i noticed just before or
> just when the NFS locked up the mouse and keyboard response would be
> very erratic ... slow and jerky.

This might suggest a high RPC load, deep queues in processing, or key
locks held for extended periods of time.
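
To make the "deep queues" case concrete, here is a small back-of-the-envelope
sketch in Python.  The queue_depth() helper and the rates are hypothetical
numbers chosen for illustration, not measurements from any system: once
requests arrive even slightly faster than they can be serviced, the backlog
grows without bound and everything behind it waits longer and longer.

```python
def queue_depth(arrivals_per_tick, served_per_tick, ticks):
    """Return the backlog after each tick for fixed arrival/service rates."""
    depth = 0
    history = []
    for _ in range(ticks):
        # The backlog grows by the arrivals, shrinks by what was serviced,
        # and can never go below zero.
        depth = max(0, depth + arrivals_per_tick - served_per_tick)
        history.append(depth)
    return history


# Hypothetical rates: 12 requests arrive per tick, 10 can be serviced.
print(queue_depth(12, 10, 5))   # [2, 4, 6, 8, 10]

# With headroom (8 arrive, 10 can be serviced) the queue stays empty.
print(queue_depth(8, 10, 5))    # [0, 0, 0, 0, 0]
```

A backlog that keeps growing like the first case would be consistent with
response getting progressively slower before the hang, but it does not by
itself distinguish load from a held lock.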

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Network Associates Laboratories