Date:      Fri, 19 Mar 2010 11:05:21 -0400
From:      Steve Polyack <korvus@comcast.net>
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-fs@freebsd.org, User Questions <freebsd-questions@freebsd.org>, bseklecki@noc.cfi.pgh.pa.us
Subject:   Re: FreeBSD NFS client goes into infinite retry loop
Message-ID:  <4BA392B1.4050107@comcast.net>
In-Reply-To: <4BA37AE9.4060806@comcast.net>
References:  <4BA3613F.4070606@comcast.net> <201003190831.00950.jhb@freebsd.org> <4BA37AE9.4060806@comcast.net>

On 03/19/10 09:23, Steve Polyack wrote:
> On 03/19/10 08:31, John Baldwin wrote:
>> On Friday 19 March 2010 7:34:23 am Steve Polyack wrote:
>>> Hi, we use a FreeBSD 8-STABLE (from shortly after release) system as an
>>> NFS server to provide user home directories which get mounted across a
>>> few machines (all 6.3-RELEASE).  For the past few weeks we have been
>>> running into problems where one particular client will go into an
>>> infinite loop where it is repeatedly trying to write data which causes
>>> the NFS server to return "reply ok 40 write ERROR: Input/output error
>>> PRE: POST:".  This retry loop can cause between 20mbps and 500mbps of
>>> constant traffic on our network, depending on the size of the data
>>> associated with the failed write.
>>>
>>> We spent some time on the issue and determined that something on one of
>>> the clients is deleting a file as it is being written to by another NFS
>>> client.  We were able to enable the NFS lockmgr and use lockf(1) to fix
>>> most of these conditions, and the frequency of this problem has dropped
>>> from once a night to once a week.  However, it's still a problem and we
>>> can't necessarily force all of our users to "play nice" and use lockf/flock.
>>>
>>> Has anyone seen this before?  No errors are being logged on the NFS
>>> server itself, but the "Server Ret-Failed" counter begins to increase
>>> rapidly whenever a client gets stuck in this infinite retry loop:
>>> Server Ret-Failed
>>>           224768961
>>>
>>> I have a feeling that using NFS in such a manner may simply be prone to
>>> such problems, but what confuses me is why the NFS client system is
>>> infinitely retrying the write operation and causing itself so much grief.
>> Yes, your feeling is correct.  This sort of race is inherent to NFS if you
>> do not use some sort of locking protocol to resolve the race.  The infinite
>> retries sound like a client-side issue.  Have you been able to try a newer
>> OS version on a client to see if it still causes the same behavior?
>>
> I can't try a newer FBSD version on the client where we are seeing the 
> problems, but I can recreate the problem fairly easily.  Perhaps I'll 
> try it with an 8.0 client.  If I remember correctly, one of the 
> strange things is that it doesn't seem to hit "critical mass" until a 
> few hours after the operation first fails.  I may be wrong, but I'll 
> double check that when I check vs. 8.0-release.
>
> I forgot to add this in the first post, but these are all TCP NFS v3 
> mounts.
>
> Thanks for the response.

Ok, so I'm still able to trigger what appears to be the same retry loop 
with an 8.0-RELEASE nfsv3 client (going on 1.5 hours now):
client$ cat nfs.sh
#!/usr/local/bin/bash
for a in {1..15} ; do
   sleep 1;
   echo "$a$a$";
done
client$ ./nfs.sh >~/output

Then, on the server, while the above is running:
server$ rm ~/output

tcpdump then shows the client repeating the same write attempt 3-4 times 
per minute.  Our previous logs show that this is how it starts, and then 
~4 hours later it begins to spiral out of control, throwing out up to 
3,000 of the same failed write requests per second.
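
For anyone who wants to repeat the test, here is a rough sketch that wraps 
the two steps above into one script.  TARGET_DIR, COUNT and REMOVE_CMD are 
names made up for this sketch, not part of the original test.  By default 
everything runs against a local temporary directory, which only gives a 
baseline: on a local filesystem the writer keeps its open file after the 
unlink and finishes cleanly.  To exercise the NFS case, TARGET_DIR would 
need to point at the NFS mount and REMOVE_CMD would need to delete the 
file on the server (or from another client), since an rm issued from the 
writing client itself goes through that client's own handling of open 
files and is not the same race.

```shell
#!/bin/sh
# Sketch only: TARGET_DIR, COUNT and REMOVE_CMD are hypothetical knobs.
TARGET_DIR="${TARGET_DIR:-$(mktemp -d)}"   # set to an NFS-mounted dir to test NFS
COUNT="${COUNT:-5}"                        # the original test used 15 iterations
OUT="$TARGET_DIR/output"

# By default remove the file locally; for the NFS case this would have to
# run on the server instead (for example via a remote shell).
REMOVE_CMD="${REMOVE_CMD:-rm -f $OUT}"

# Same shape as nfs.sh: one short write per second.
writer() {
    a=1
    while [ "$a" -le "$COUNT" ]; do
        sleep 1
        echo "$a$a"
        a=$((a + 1))
    done
}

writer > "$OUT" &
WRITER_PID=$!

sleep 2                 # let a couple of writes land first
sh -c "$REMOVE_CMD"     # delete the file while the writer still has it open

wait "$WRITER_PID"
WRITER_STATUS=$?
echo "writer exit status: $WRITER_STATUS"
```

Running tcpdump on the client while the script is in its NFS configuration 
should show whether the repeated write attempts start after the removal.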