From: Oliver Brandmueller
To: freebsd-stable@freebsd.org
Date: Wed, 14 Dec 2005 12:03:21 +0100
Message-ID: <20051214110321.GC34429@e-Gitt.NET>
Subject: NFS locking problem with RELENG_6 client on RELENG_5 server

Hi,

I have a setup with a 5.4-STABLE NFS server (built July 10th, 2005) and about 10 FreeBSD clients. Most of the clients are still running RELENG_5, but I recently started updating them to RELENG_6.

Shortly after updating the first client I ran into a problem with a spinning rpc.lockd on the NFS server. Under normal circumstances rpc.lockd runs at about 0.1% to 0.7% CPU, but then it starts using more and more CPU (about 1% more per minute in my setup). When it reaches about 20 to 25 percent, I get problems with locking. If I restart rpc.lockd on the server, it starts spinning again immediately.
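At roughly 1% more CPU per minute, rpc.lockd on the server climbs from its normal 0.1-0.7% into the 20-25% range in about 20 minutes. A minimal sketch of how one might watch for that trend, assuming %CPU samples taken about once a minute on the server (for example with `ps -o %cpu= -p <pid>`); the function name and the 20% default threshold are hypothetical choices, not anything provided by rpc.lockd:

```python
from typing import Optional, Sequence, Tuple


def estimate_minutes_to_threshold(
    samples: Sequence[Tuple[float, float]],
    threshold: float = 20.0,  # hypothetical alert level, in % CPU
) -> Optional[float]:
    """Fit a least-squares line to (minute, %CPU) samples and return the
    estimated minutes, counted from the last sample, until the fitted line
    reaches `threshold`. Returns None if CPU usage is not rising."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples need distinct timestamps")
    # Slope of the fitted line: growth in % CPU per minute.
    slope = sum((t - mean_t) * (c - mean_c) for t, c in samples) / var_t
    if slope <= 0:
        return None  # flat or falling: no sign of the spinning pattern
    intercept = mean_c - slope * mean_t
    # Minute at which the fitted line crosses the threshold.
    t_hit = (threshold - intercept) / slope
    # Already past the threshold counts as zero minutes left.
    return max(0.0, t_hit - samples[-1][0])


if __name__ == "__main__":
    # Growth of about 1% per minute, starting near 0.5%.
    print(estimate_minutes_to_threshold([(0, 0.5), (1, 1.5), (2, 2.5), (3, 3.5), (4, 4.5)]))
```

With these five samples the fitted slope is 1% per minute, so the line hits 20% at minute 19.5, i.e. 15.5 minutes after the last sample.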
If I restart rpc.lockd on the RELENG_6 client, everything is fine again for some time.

I cannot reproduce the behaviour with any particular action; it seems to be related to load. We have two weekdays on which the workload is high and the filesystem load on the NFS server is also high due to long-running backup processes. I have only seen the lockd problem on these days ("load" means about 60 MBit/s of traffic from the NFS clients to the server, and about 30 MBit/s for the backup [which writes with dump to an NFS-mounted partition]).

I looked through the sources and updated my RELENG_6 clients with downgraded versions of:

  src/sys/nfsclient/nfs_lock.c (1.40 now instead of 1.40.2.1)
  src/sys/nfsclient/nlminfo.h (1.2 now instead of 1.2.14.1)
  src/sys/sys/lockf.h (1.18 now instead of 1.18.2.1)

since these seem to be the changes relative to RELENG_5 on the NFS clients that make a difference for locking. We used to have the problem about once or twice a week. Since the downgrade, everything has been fine for about one week (today is the second "high load" day).

I'm not a programmer, and I can only do very limited debugging on the production systems (and I did not manage to reproduce this NFS and locking load on our test systems). This means I cannot be 100% sure that this commit is the root of the problem, but I have enough evidence to believe so.

If someone is willing and interested in debugging this, I have a few minutes of debugging output from rpc.lockd on the NFS server, taken after a restart. Since it is long and I don't know exactly what to look for, it's not attached, but I can grep it (or even make it available) if that's of any help. I don't have debugging output from rpc.lockd on the NFS client, though, because I cannot leave debugging on all the time, and restarting rpc.lockd on the client fixed the problem :-/

Thanx,
	Oliver

-- 
| Oliver Brandmueller | Offenbacher Str. 1 | Germany D-14197 Berlin |
| Phone +49-172-3130856 | Fax +49-172-3145027 | WWW: http://the.addict.de/ |
| I am the Internet. As surely as I help God. |
| Commercial use of any addresses contained herein is not permitted! |