Date:      Tue, 17 Aug 2010 10:46:00 -0700
From:      Mark Morley <mark@islandnet.com>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        FreeBSD Stable <freebsd-stable@freebsd.org>, Mark Morley <mark@islandnet.com>
Subject:   Re: NFS stalling on 8.1-STABLE
Message-ID:  <9c4ecm1t.1282067160@helpdesk.islandnet.com>


On Sun, 15 Aug 2010 17:11:01 -0400 (EDT) Rick Macklem <rmacklem@uoguelph.ca> wrote:
>> Hi all,
>>
>> I have five front-end web servers that all mount their content from the same server via NFS.  If I stress the link on any one of the machines (e.g., copy a large directory with a lot of files to/from the mounted file system) the client will pause.  That is, all processes trying to access that mount will freeze.  The log files fill with hundreds or thousands of "nfs server not responding" / "is alive again" messages.  After 60 seconds it returns to normal, unless the load is still there, in which case it continues to pause.
>>
>
>The 60sec delay suggests that the client is doing a TCP reconnect. I'd suggest that you
>look at a packet trace in wireshark (it knows how to decode NFS packets) and see if
>there are new TCP connections (SYN, SYN-ACK,...) being made. If that is what is
>happening, I suspect it is NIC driver related, but it is really hard to say.

I'll try this if/when it happens again.
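
A rough way to spot reconnects without reading the whole trace by eye is to count bare SYNs sent to the NFS port. The sketch below assumes the capture was saved as tcpdump 4.x text output taken with -n (numeric ports), and that NFS is on the usual port 2049. The function name and the regular expression are illustrative, not part of any tool.

```python
import re

# Assumed line format (tcpdump 4.x with -n), for example:
# "10:46:00.000001 IP 10.0.0.5.1023 > 10.0.0.1.2049: Flags [S], seq 100, ..."
# "[S]" is a bare SYN; a SYN-ACK prints as "[S.]" and is not matched.
SYN_RE = re.compile(r"IP6? (\S+) > (\S+): Flags \[S\]")


def count_nfs_syns(lines, nfs_port=2049):
    """Count SYNs (new TCP connection attempts) addressed to nfs_port."""
    count = 0
    for line in lines:
        m = SYN_RE.search(line)
        if m is None:
            continue
        # tcpdump prints host.port, so the port follows the last dot.
        port = m.group(2).rsplit(".", 1)[-1]
        if port == str(nfs_port):
            count += 1
    return count
```

More than one SYN per mount over the life of a capture, especially spaced about 60 seconds apart, would be consistent with the reconnect theory. It would not by itself say why the connection dropped.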

>If you can try a network interface of a different type (not em) that will check to
>see if it is an em(4) issue.

Unfortunately I don't have any non-em cards around.

>Alternately, you could try turning off the TSO and checksum offload stuff for the
>em(4) and see if that helps.

Hmm, interesting.  The four machines that seem to be working (so far) have both TSO and checksum offload enabled by default.  The fifth one has checksum offload enabled, but not TSO; its card doesn't appear to support it.
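
For comparing the clients, a small sketch like the following can list which of these offloads an interface currently reports. It assumes the text comes from `ifconfig em0` on FreeBSD, where capabilities appear on an `options=...<...>` line; the function names and the sample interface name em0 are assumptions for illustration. Anything it reports could then be switched off for a test with `ifconfig em0 -tso -rxcsum -txcsum`.

```python
import re


def enabled_offloads(ifconfig_output):
    """Return the capability names listed on the 'options=' line, if any."""
    m = re.search(r"options=[0-9a-fA-F]+<([^>]*)>", ifconfig_output)
    if m is None:
        return set()
    return {name for name in m.group(1).split(",") if name}


def offloads_to_disable(ifconfig_output):
    """List which of TSO4/RXCSUM/TXCSUM are currently enabled."""
    wanted = ("TSO4", "RXCSUM", "TXCSUM")
    present = enabled_offloads(ifconfig_output)
    return [name for name in wanted if name in present]
```

Running this against each of the five clients would show at a glance whether the odd machine out differs only in TSO or in something else as well.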

I also tried switching from TCP to UDP.  This seems to be working (so far) on four of the clients (which happen to be identical load-balanced machines), but on the fifth one (which serves a different purpose) I'm getting something really weird.  Instead of locking up periodically as before, it's actually losing the mount.  For example, 'df' doesn't list the mounted file system.  If I try to access it (with 'ls' for example) I get an "Input/output error" message.  I can remount it, but only after I force an unmount.

Mark


