Date:      Mon, 03 Jul 2006 15:40:01 -0700
From:      Michael Collette <Michael.Collette@TestEquity.com>
To:        User Freebsd <freebsd@hub.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: NFS Locking Issue
Message-ID:  <44A99CC1.7070501@TestEquity.com>
In-Reply-To: <20060702162942.D1103@ganymede.hub.org>
References:  <20060629230309.GA12773@lpthe.jussieu.fr>	<20060630041733.GA4941@zibbi.meraka.csir.co.za>	<cone.1151802806.162227.42680.1000@zoraida.natserv.net> <20060702162942.D1103@ganymede.hub.org>

User Freebsd wrote:
> On Sat, 1 Jul 2006, Francisco Reyes wrote:
> 
>> John Hay writes:
>>
>>> I only started to see the lockd problems when upgrading the server side
>>> to FreeBSD 6.x and later. I had various FreeBSD clients, between 4.x
>>> and 7-current and the lockd problem only showed up when upgrading the
>>> server from 5.x to 6.x.
>>
>> It confirms the same we are experiencing.. constant freezing/locking 
>> issues.
>> I guess no more 6.X for us.. for the foreseeable future..
> 
> Since there are several of us experiencing what looks to be the same 
> sort of deadlock issue, I beseech you not to give up

Honestly trying not to.  To tell ya the truth, I've been giving a real 
hard look at Ubuntu for my serving needs.  This NFS thing has got me 
seriously questioning FreeBSD right at the moment.

>... right now, all 
> we've been able to get to the developers is virtually useless 
> information (vmstat and such shows the problem, but it doesn't allow 
> developers to identify the problem) ...
> 
> Is this a problem that you can easily recreate, even on a non-production 
> machine?

Oh yeah.  I've got a couple of ways I'm able to get this to fail.

Method #1:
---------------------------------------------------------------------
Let's start with the simplest.  The scenario here involves 2 machines, 
mach01 and mach02.  Both are running 6-STABLE, and both are running 
rpcbind, rpc.statd, and rpc.lockd.  mach01 has exported /documents and 
mach02 is mounting that export under /mnt.  Simple enough?

The /documents directory has multiple subdirectories and files of 
various sizes.  The actual amount of data doesn't really matter to 
produce a failure.  All you need to do at this point is to try to copy 
files from that mount point to somewhere else on the hard drive.

cp -Rp /mnt/* /tmp/documents/

You may or may not see that a couple of subdirectories were created, but 
no files actually moved over.  The cp command is now locked up, and no 
traffic moves.  This usually takes a second or two to show up as a 
problem.  I can repeat this with multiple 6-STABLE boxes.

Turn off rpc.lockd on either the server or client before the cp command, 
and things work.
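
For anyone who wants to repeat Method #1 without leaving a terminal wedged, 
here is a rough sketch of a watchdog around the same kind of cp.  The 
function name copy_with_watchdog, the 124 return code and the time limit 
are my own assumptions for illustration, not part of the original test.  It 
copies "$src"/. rather than "$src"/*, so dot files are included.

```shell
#!/bin/sh
# Hypothetical helper: run a recursive copy in the background and give up
# after a deadline, so a hung NFS copy does not block the shell forever.
# Usage: copy_with_watchdog SRC_DIR DEST_DIR SECONDS
# Returns cp's exit status, or 124 if cp was still running at the deadline.
copy_with_watchdog() {
    src=$1
    dest=$2
    limit=$3

    mkdir -p "$dest" || return 1

    # Same flags as the manual test: recursive, preserve attributes.
    cp -Rp "$src"/. "$dest"/ &
    cp_pid=$!

    elapsed=0
    # kill -0 only checks whether the process still exists.
    while kill -0 "$cp_pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$limit" ]; then
            echo "cp still running after ${limit}s; possible lockd hang" >&2
            # A process stuck in an uninterruptible NFS wait may ignore this.
            kill "$cp_pid" 2>/dev/null
            return 124
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done

    # Collect cp's own exit status.
    wait "$cp_pid"
}
```

Something like "copy_with_watchdog /mnt /tmp/documents 30" on mach02 would 
then either finish or report a suspected hang after 30 seconds.  Running it 
once with rpc.lockd up and once with it stopped gives the same comparison 
as the manual test.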

Method #2:
---------------------------------------------------------------------
Booting to a diskless work station.  The server (mach01) has exported 
/usr, /usr/local, /usr/X11R6 and enough other stuff to get a diskless 
workstation up and running.  Not going to get into all the details here 
other than to say that I have a fully functioning setup like this on 5.4 
boxes now.

I've knocked the boot up of the diskless client (mach02) down to console 
only.  Once at the console I startx with a regular user, taking me into 
twm.  From there I try to launch a KDE application, which in my test 
case is kwrite.  The same situation is true with launching a GTK app, 
such as Gimp.

X and twm start up.  I've got all the rest of the system reasonably 
functional.  When I try to run kwrite, none of the KDE subsystems start 
up.  kwrite just sits there in a lockd state.  Same is true of Gimp.

If I shut down rpc.lockd on either machine I'm able to bring up a full 
KDE desktop, with all applications able to run.

Other Testing:
---------------------------------------------------------------------
At one point we had in our test network a 6.1 NFS server providing files 
to 5.4 diskless clients without any problems.  We first got to noticing 
the bulk of the glitches when I moved the diskless setup to use a 6.1 
kernel.

As I said, I've been looking at Linux alternatives.  Especially after 
reading about Michel Talon's experiences with Fedora.  I initially tried 
CentOS, but wasn't able to get NFS working properly on that thing.  I 
had an Ubuntu CD handy, so I installed it on a test box.  Wow, does that 
NFS server boogie!

Using Ubuntu as the server I connected a FreeBSD 5.4 and 6-stable box as 
clients on a 100Mb/s network.  The time trial used a dummy 100Meg file 
transferred from the server to the client.  We measured 90Mb/s transfer, 
which was FAR faster than I had ever been able to get 2 FreeBSD boxes to 
perform doing similar tests.
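
To make the time trial above repeatable, here is a rough sketch of how the 
numbers could be produced.  The helper names throughput_mbps and time_copy 
are hypothetical, and I'm assuming megabits means 10^6 bits.  On that 
assumption a 100,000,000 byte file at 90Mb/s should take roughly 8.9 
seconds.

```shell
#!/bin/sh
# Hypothetical helper: convert a byte count and elapsed seconds into
# megabits per second (1 megabit taken as 1,000,000 bits).
throughput_mbps() {
    awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", (bytes * 8) / (secs * 1000000) }'
}

# Hypothetical helper: time a single file copy and print its throughput.
# Usage: time_copy SRC_FILE DEST_FILE
time_copy() {
    src=$1
    dest=$2

    size=$(wc -c < "$src" | tr -d ' ')

    start=$(date +%s)
    cp "$src" "$dest" || return 1
    end=$(date +%s)

    secs=$((end - start))
    # date +%s has whole-second resolution; avoid dividing by zero.
    [ "$secs" -gt 0 ] || secs=1

    throughput_mbps "$size" "$secs"
}
```

With only one-second resolution this is only meaningful for files large 
enough to take several seconds to copy, and a repeat run may be served 
from the client's cache rather than over the wire.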

I then used Ubuntu to connect to a 5.4 server we have in production.  I 
don't recall the exact stats, but it was close to 10x slower.  No 
lockups here though.

After the 4th of July I intend to test Ubuntu as a client to a FreeBSD 
6-STABLE server on a gigabit lan to run similar time trials.  I'm 
looking to confirm what I can only suspect at this point, which is that 
the NFS server on FreeBSD is mucked up, but the client is okay.

As time allows I hope to run similar tests between two Ubuntu boxes, 
then run it all again with Fedora.  Seriously debating whether to move 
some or all of our infrastructure to Linux after all this.  A 3-4 month 
old known bug like this gives me a great deal of concern about FreeBSD.
That, and Ubuntu's NFS server speed just about knocked me over!

>  In my case, I have one machine fully configured for debugging, 
> but, of course, since re-configuring it, it hasn't exhibited the problem 
> ... if most of us get our machines configured properly to give useful 
> information to the developers to debug this, the faster it will get 
> fixed ...
> 
> My experience with most of the developers is that if you can get into 
> DDB and give them 'internal traces' of the code, bugs tend to get fixed 
> very quickly ... vmstat/ps give "external views", more summaries than 
> anything ... it's the details "under the hood" that they need ... it's not 
> much different than your auto-mechanic ... try telling him there is a 
> 'knocking under the hood, please tell me how to fix it, but you can't 
> have my car', and he'll brush you off ... give him 30 minutes under the 
> hood, and not only will he have identified it, but he'll probably fix it 
> too ...

Marc, the car is starting but won't move at all.  I don't know if this 
is the transmission, the steering wheel, or the radio.  I am feeling 
pretty certain that this car should never have left the lot in this 
condition though.

Again, these are problems that have been around for a while...
http://www.freebsd.org/cgi/query-pr.cgi?pr=84953
http://www.freebsd.org/cgi/query-pr.cgi?pr=80389

Later on,
-- 
Michael Collette
IT Manager
TestEquity Inc
Michael.Collette@TestEquity.com


