Date:      Mon, 28 Jun 2010 20:06:01 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        "Rick C. Petty" <rick-freebsd2009@kiwi-computer.com>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: Why is NFSv4 so slow?
Message-ID:  <Pine.GSO.4.63.1006281950260.13834@muncher.cs.uoguelph.ca>
In-Reply-To: <20100628140054.GA52174@kay.kiwi-computer.com>
References:  <20100627221607.GA31646@kay.kiwi-computer.com> <Pine.GSO.4.63.1006271949220.3233@muncher.cs.uoguelph.ca> <20100628031401.GA45282@kay.kiwi-computer.com> <Pine.GSO.4.63.1006280017190.2680@muncher.cs.uoguelph.ca> <20100628140054.GA52174@kay.kiwi-computer.com>



On Mon, 28 Jun 2010, Rick C. Petty wrote:

>
>> Make sure you don't have multiple entries for the same uid, such as "root"
>> and "toor" both for uid 0 in your /etc/passwd. (ie. get rid of one of
>> them, if you have both)
>
> Hmm, that's a strange requirement, since FreeBSD by default comes with
> both.  That should probably be documented in the nfsv4 man page.
>

Well, if the mapping from uid->name is not unique, getpwuid() will just
return one of them and it probably won't be the expected one. Having
both "root" and "toor" only causes weird behaviour when "root" tries to
use a mount point. I had thought it was in the man pages, but I now
see it isn't mentioned. I'll try and remember to add it.
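
A quick way to spot the non-unique uid->name case is sketched below in
Python. This is only an illustration, not part of any NFS tooling: the
function name duplicate_uids is made up here, and it assumes the usual
colon-separated passwd layout with the login name in field 1 and the
uid in field 3.

```python
from collections import defaultdict


def duplicate_uids(passwd_text):
    """Return {uid: [names...]} for each uid shared by more than one login name.

    Hypothetical helper: expects text in /etc/passwd format
    (name:password:uid:gid:...). Comment and malformed lines are skipped.
    """
    names_by_uid = defaultdict(list)
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split(":")
        if len(fields) < 3:
            continue  # not a passwd entry
        try:
            uid = int(fields[2])
        except ValueError:
            continue  # uid field is not numeric
        names_by_uid[uid].append(fields[0])
    # Keep only the uids that map to more than one name.
    return {uid: names for uid, names in names_by_uid.items() if len(names) > 1}
```

Passing it the contents of /etc/passwd on a stock install would be
expected to report uid 0 with both "root" and "toor", which is the
case described above.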

>>
>> This error indicates that there wasn't a valid FH for the server. I
>> suspect that the mount failed. (It does a loop of Lookups from "/" in
>> the kernel during the mount and it somehow got confused part way through.)
>
> If the mount failed, why would it allow me to "ls /vol/a" and see both "b"
> and "c" directories as well as other files/directories on /vol/ ?
>
>> I don't know why these empty dirs would confuse it. I'll try a test
>> here, but I suspect the real problem was that the mount failed and
>> then happened to succeed after you deleted the empty dirs.
>
> It doesn't seem likely.  I spent an hour mounting and unmounting and each
> mount looked successful in that there were files and directories besides
> the two I was trying to descend into.
>

My theory was that, since you used "soft", one of the Lookups during
the mounting process in the kernel failed with ETIMEDOUT. It isn't
coded to handle that. There are lots of things that will break in
the NFSv4 client if "soft" or "intr" are used. (That is in the mount_nfs
man page, but right at the end, so it could get missed.)
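
Since the warning is easy to miss, a small check on the option string
can flag it before mounting. The Python sketch below is an assumption
about how one might do that, not something the client does itself:
risky_nfsv4_options is a made-up name, and it only looks for the
literal "soft" and "intr" options alongside "nfsv4" in a
comma-separated list.

```python
def risky_nfsv4_options(options):
    """Return the options from a comma-separated mount option string that
    are known to cause trouble on an NFSv4 mount ("soft" and "intr").

    Hypothetical helper: returns an empty list when "nfsv4" is not among
    the options, since the warning only concerns NFSv4 mounts.
    """
    opts = [o.strip() for o in options.split(",") if o.strip()]
    if "nfsv4" not in opts:
        return []
    return [o for o in opts if o in ("soft", "intr")]
```

For example, the string "nfsv4,soft,rsize=32768,wsize=32768" would be
flagged for "soft", while the same options without "soft" would pass.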

Maybe "broken mount" would have been a better term than "failed mount".

If more recent mount attempts are without "soft", then I would expect
them to work reliably. (If you feel daring, add the empty subdirs back
and see if it fails?)

I will try a case with empty subdirs on the client, to see if there is
a problem when I do it. (It should just cover them up until umount, but
it could certainly be broken:-)

>> It still smells like some sort of transport/net interface/... issue
>> is at the bottom of this. (see response to your next post)
>
> It's possible.  I just had another NFSv4 client (with the same server) lock
> up:
>
> load: 0.00  cmd: ls 17410 [nfsv4lck] 641.87r 0.00u 0.00s 0% 1512k
>
> and:
>
> load: 0.00  cmd: make 87546 [wait] 37095.09r 0.01u 0.01s 0% 844k
>
> That make has been hung for hours, and the ls(1) was executed during that
> lockup.  I wish there was a way I could unhang these processes and unmount
> the NFS mount without panicking the kernel, but alas even this fails:
>
> # umount -f /sw
> load: 0.00  cmd: umount 17479 [nfsclumnt] 1.27r 0.00u 0.04s 0% 788k
>

The plan is to implement a "hard forced" umount (something like -ff)
that will throw away data but get the umount done. It hasn't been
coded yet. (For 8.2 maybe?)

> A "shutdown -p now" resulted in a panic with the speaker beeping
> constantly and no console output.
>
> It's possible the NICs are all suspect, but all of this worked fine a
> couple of days ago when I was only using NFSv3.
>
Yea, if NFSv3 worked fine with the same kernel, it seems more likely
to be an experimental NFS server issue, possibly related to scheduling
the busy CPUs. (If it was a NIC-related problem, it is most likely
related to the driver, but if the NFSv3 case was using the same driver,
that doesn't seem likely.)

You are now using "rsize=32768,wsize=32768", aren't you?
(If you aren't yet using that, try it, since larger bursts of
traffic can definitely "tickle" NIC driver problems, to borrow
Jeremy's term.)

rick


