From owner-freebsd-stable@FreeBSD.ORG  Wed Dec 14 11:03:22 2005
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
X-Original-To: freebsd-stable@freebsd.org
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 8D75116A41F
	for <freebsd-stable@freebsd.org>; Wed, 14 Dec 2005 11:03:22 +0000 (GMT)
	(envelope-from ob@gruft.de)
Received: from obh.snafu.de (obh.snafu.de [213.73.92.34])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 1A47743D45
	for <freebsd-stable@freebsd.org>; Wed, 14 Dec 2005 11:03:21 +0000 (GMT)
	(envelope-from ob@gruft.de)
Received: from ob by obh.snafu.de with local (Exim 4.60 (FreeBSD))
	(envelope-from <ob@gruft.de>) id 1EmUPp-0000om-3Q
	for freebsd-stable@freebsd.org; Wed, 14 Dec 2005 12:03:21 +0100
Date: Wed, 14 Dec 2005 12:03:21 +0100
From: Oliver Brandmueller <ob@e-Gitt.NET>
To: freebsd-stable@freebsd.org
Message-ID: <20051214110321.GC34429@e-Gitt.NET>
Mail-Followup-To: freebsd-stable@freebsd.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.11
Sender: Oliver Brandmueller <ob@gruft.de>
Subject: NFS locking problem with RELENG_6 client on RELENG_5 server
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 14 Dec 2005 11:03:22 -0000

Hi,

I have a setup with a 5.4-STABLE (July 10th, 2005) NFS server and about 
10 FreeBSD clients. Most of the clients are still running on RELENG_5, 
but I recently started updating to RELENG_6. Shortly after updating the 
first client I ran into a problem with a spinning rpc.lockd on the NFS 
server. While rpc.lockd under normal circumstances runs at about 0.1% to 
0.7% CPU, it then starts using more and more CPU (about 1% more CPU per 
minute in my setup; once it is using about 20 to 25 percent I get 
problems with locking). If I restart rpc.lockd on the server it starts 
spinning again immediately. If I restart rpc.lockd on the RELENG_6 
client everything is fine again for some time. I cannot trigger the 
behaviour through any specific action; it seems to be related to load. 
We have two weekdays on which workload is high and filesystem load on 
the NFS server is also high due to long-running backup processes. I only 
saw the lockd problem on these days ("load" means about 60 MBit/s of 
traffic from the NFS clients to the server, and about 30 MBit/s for the 
backup [which writes with dump to an NFS-mounted partition]).
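
For anyone who wants to watch for the same symptom, here is a minimal 
sketch of how the rpc.lockd CPU usage could be logged once a minute on 
the server. The function names (lockd_cpu, watch_lockd) and the default 
20% warning threshold are assumptions made for illustration, not 
something taken from the systems described here:

```sh
#!/bin/sh
# Sum the %CPU of all rpc.lockd processes.
# Expects "pcpu comm" lines on stdin, as printed by: ps -axo pcpu,comm
lockd_cpu() {
    awk '$2 == "rpc.lockd" { sum += $1 } END { printf "%.1f\n", sum + 0 }'
}

# Print a timestamped reading every 60 seconds and warn on stderr once
# the usage reaches the threshold (assumed default: 20 percent).
watch_lockd() {
    threshold=${1:-20}
    while :; do
        cpu=$(ps -axo pcpu,comm | lockd_cpu)
        echo "$(date '+%Y-%m-%d %H:%M:%S') rpc.lockd ${cpu}%"
        if awk -v c="$cpu" -v t="$threshold" 'BEGIN { exit !(c >= t) }'; then
            echo "warning: rpc.lockd at or above ${threshold}% CPU" >&2
        fi
        sleep 60
    done
}
```

Calling "watch_lockd 20" from a shell that has sourced these functions 
and redirecting its output to a file gives a simple record of when the 
usage starts to climb.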

I looked through the sources and updated my RELENG_6 clients with 
downgraded versions of:

src/sys/nfsclient/nfs_lock.c	(1.40 now instead of 1.40.2.1)
src/sys/nfsclient/nlminfo.h	(1.2  now instead of 1.2.14.1)
src/sys/sys/lockf.h		(1.18 now instead of 1.18.2.1)

since these seem to be the changes from RELENG_5 on the NFS clients that 
make a difference for the locking.

We had the problem about once or twice a week. Now everything has been 
fine for about one week (the second "high load" day is today). I'm not a 
programmer and, in particular, I can only do very limited debugging on 
the prod systems (and I did not manage to reproduce the NFS and locking 
load on our test systems). This means: I cannot be 100% sure that 
this commit is the root of the problem, but I have enough evidence to 
believe so.

If someone is willing and interested in debugging, I have (from the NFS 
server) a few minutes of debugging output after a restart of rpc.lockd. 
Since it is long and I don't know exactly what to look for, it's not 
attached, but I can grep (or even make it available) if it's of any 
help. I don't have debugging output of the NFS client rpc.lockd, though, 
because I cannot let it run with debugging on all the time and 
restarting the client's rpc.lockd fixed the problem :-/

Thanx,

	Oliver

-- 
| Oliver Brandmueller | Offenbacher Str. 1  | Germany       D-14197 Berlin |
| Fon +49-172-3130856 | Fax +49-172-3145027 | WWW:   http://the.addict.de/ |
|               Ich bin das Internet. Sowahr ich Gott helfe.               |
| Eine gewerbliche Nutzung aller enthaltenen Adressen ist nicht gestattet! |