From: Robert Watson
Date: Wed, 12 Nov 2003 19:16:04 -0500 (EST)
To: freebsd-current@freebsd.org
Subject: Re: Still getting NFS client locking up
In-Reply-To: <200311121738.13783.x@Vex.Net>

On Wed, 12 Nov 2003, Tim Middleton wrote:

> On November 11, 2003 11:36 pm, Janet Sullivan wrote:
> > So far I only have problems in a mixed -STABLE/-CURRENT environment.
> > When the client & server are both -CURRENT I haven't had any problems.
>
> I just installed another -STABLE box to see if keeping them both -STABLE
> helps. I haven't really tested the NFS yet as I didn't want to risk
> locking the box up in the middle of a buildworld.

If we can demonstrate the problem with both systems as -STABLE, that
rules out a lot of things, and might also raise some questions about the
hardware.

> So i just mounted the NFS drive on the new test box and left it....
> Within an hour the NFS server box doing the build world was locked up
> solid. I can't say if it was NFS mount related or not; nfsd wasn't
> really doing anything. Doesn't seem like it would have been. Beginning
> to wonder if it is some strange hardware problem on this box; which
> coincidentally only shows up when there's an nfs mount! But that doesn't
> explain why my normally rock solid desktop system tanked when being
> tested as an NFS client to that STABLE box. Hmmm...

One of the problems that can occur in -STABLE is a cascading failure
when one file system is jammed up (e.g., an NFS mount from another
system). Processes hang holding locks in NFS because the NFS session is
stalled; other processes try to acquire the held locks while holding
additional locks of their own, and before you know it a lot of very
useful locks are held and can't be released, because the locks at the
source of the stall can't be freed. Many aspects of this problem are
believed to be resolved in -CURRENT, but it's a tough cookie to crack
without redoing VFS locking.

If you have a spare system, it might be really interesting to install
-STABLE on it, replicate data from your file server, point the client at
that, and see if the problem still occurs there with the same load. You
might also try swapping network cards: perhaps we're looking at a
network device driver problem where loss of key packets, or of packets
over a certain size, is causing an unrecoverable failure.

> Back to testing. I'm doing heavy disk I/O tests without any NFS mounts
> now. If they go okay, back to the NFS mounting and testing...
>
> It seems to me there is something desperately wrong with NFS is mixing
> -CURRENT and -STABLE NFS server/clients causes either side (in my case both
> sides) to lock up solid. I mean, problems are problems... but solid lockups
> with no crash messages or anything is ... nasty.

Clearly there's a substantial problem, but it sounds like we're still
having a lot of trouble identifying the circumstances that trigger it
and narrowing things down. One of the problems with distributed system
debugging is that it's often hard to track a failure down to a
particular source when you catch it partway through a cascade. For
example, it could well be that a server problem is triggering client
symptoms, or it could be that a serious client problem is consuming
resources on the server such that other clients can't operate. Under
these circumstances, it can be very difficult to pin the failure on a
particular cause (a missing unlock on the server, for example).

> > Are the folks seeing hangs getting any kind of console error messages?
>
> I see nothing. My server is completely locks up. Nothing responds. The
> drive light (the times i've noticed) is frozen "on". On my desktop box
> the mouse is dead as well.

I can't help but wonder if the server isn't suffering an under-reported
hardware failure. It might be interesting to see how quickly the problem
vanishes when exchanging various elements.

> > I don't see anything - performance just tanks to the point of being
> > unusable.
>
> When testing with my desktop box as client, i noticed just before or
> just when the NFS locked up the mouse and keyboard response would be
> very erratic ... slow and jerky.

This might suggest a high RPC load, deep queues in processing, or key
locks held for extended periods of time.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Network Associates Laboratories
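
P.S. For anyone trying to picture the cascading lock failure described
above, here is a rough toy model. It is only a sketch under stated
assumptions, not how the kernel tracks locks: the process names and lock
names (nfs_vnode, root_dir, proc_table) are made up for illustration,
and each process is assumed to wait on at most one lock. Starting from
one process stalled on a dead NFS server, it works out which other
processes end up blocked and which locks can never be released.

```python
def stuck_locks(holds, waits_for, stalled):
    """Return (blocked processes, locks that cannot be released).

    holds:     process -> set of locks it currently holds
    waits_for: process -> the single lock it is trying to acquire
    stalled:   processes hung on the unresponsive NFS server
    All names are hypothetical; this is a model, not kernel code.
    """
    blocked = set(stalled)
    changed = True
    while changed:
        changed = False
        # Every lock held by a blocked process is pinned for now.
        pinned = set()
        for proc in blocked:
            pinned |= holds.get(proc, set())
        # A process waiting on a pinned lock becomes blocked as well,
        # which in turn pins whatever locks it is holding.
        for proc, lock in waits_for.items():
            if proc not in blocked and lock in pinned:
                blocked.add(proc)
                changed = True
    # Recompute the pinned set for the final group of blocked processes.
    pinned = set()
    for proc in blocked:
        pinned |= holds.get(proc, set())
    return blocked, pinned


# Hypothetical scenario: "cp" is stuck in an NFS request.
holds = {
    "cp": {"nfs_vnode"},
    "ls": {"root_dir"},
    "sh": {"proc_table"},
    "cron": set(),
}
waits_for = {"ls": "nfs_vnode", "sh": "root_dir", "cron": "proc_table"}

blocked, pinned = stuck_locks(holds, waits_for, stalled={"cp"})
print(sorted(blocked))  # ['cp', 'cron', 'ls', 'sh']
print(sorted(pinned))   # ['nfs_vnode', 'proc_table', 'root_dir']
```

Only one process touched NFS directly, yet in this model every process
ends up wedged, which is why catching the system partway through makes
the original cause so hard to spot.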