Date:      Fri, 12 Nov 2010 07:26:23 +0100
From:      Andre Oppermann <andre@freebsd.org>
To:        Lawrence Stewart <lstewart@freebsd.org>
Cc:        freebsd-net@freebsd.org, Christopher Penney <penney@msu.edu>
Subject:   Re: NFS + FreeBSD TCP Behavior with Linux NAT
Message-ID:  <4CDCDE0F.8010501@freebsd.org>
In-Reply-To: <4CDCA679.7020401@freebsd.org>
References:  <AANLkTikmpXDsi9N36D+M1ZFfyNGAZ3A-asaTNm5U7PwK@mail.gmail.com> <4CDC5490.7030109@freebsd.org> <4CDCA679.7020401@freebsd.org>

On 12.11.2010 03:29, Lawrence Stewart wrote:
> On 11/12/10 07:39, Julian Elischer wrote:
>> On 11/11/10 6:36 AM, Christopher Penney wrote:
>>> Hi,
>>>
>>> I have a curious problem I'm hoping someone can help with or at least
>>> educate me on.
>>>
>>> I have several large Linux clusters and for each one we hide the compute
>>> nodes behind a head node using NAT.  Historically, this has worked
>>> very well
>>> for us and any time a NAT gateway (the head node) reboots everything
>>> recovers within a minute or two of it coming back up.  This includes NFS
>>> mounts from Linux and Solaris NFS servers, license server connections,
>>> etc.
>>>
>>> Recently, we added a FreeBSD-based NFS server to our cluster resources and
>>> have had significant issues with NFS mounts hanging if the head node
>>> reboots.  We don't have this happen much, but it does occasionally
>>> happen.
>>>    I've explored this and it seems the behavior of FreeBSD differs a
>>> bit from
>>> at least Linux and Solaris with respect to TCP recovery.  I'm curious if
>>> someone can explain this or offer any workarounds.
>>>
>>> Here are some specifics from a test I ran:
>>>
>>> Before the reboot two Linux clients were mounting the FreeBSD server.
>>> They
>>> were both using port 903 locally.  On the head node clientA:903 was
>>> remapped
>>> to headnode:903 and clientB:903 was remapped to headnode:601.  There
>>> is no
>>> activity when the reboot occurs.  The head node takes a few minutes to
>>> come
>>> back up (we kept it down for several minutes).
>>>
>>> When it comes back up clientA and clientB try to reconnect to the FreeBSD
>>> NFS server.  They both use the same source port, but since the head
>>> node's
>>> conntrack table is cleared it's a race to see who gets what port and this
>>> time clientA:903 appears as headnode:601 and clientB:903 appears as
>>> headnode:903 (*they essentially switch places as far as the FreeBSD
>>> server would see*).
>>>
>>> The FreeBSD NFS server, since there were no outstanding ACKs it was
>>> waiting on, thinks things are fine, so when it gets a SYN from the two
>>> clients it responds with only an ACK.  The ACK it sends each client is
>>> bogus (invalid seq number) because it's using the return path the other
>>> client was using before the reboot, so the client sends a RST back, but
>>> the RST never reaches the FreeBSD system since the head node's NAT hasn't
>>> yet seen the full handshake (which would allow return packets).  The end
>>> result is a "permanent" hang (at least until the server would otherwise
>>> clean up idle TCP connections).
>>>
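As a workaround sketch on the FreeBSD side, assuming the stale 4-tuple is
known (the server address 1.2.3.5 below is a placeholder, not taken from the
report, and port 2049 assumes NFS over TCP on the standard port), tcpdrop(8)
can discard the stale ESTABLISHED state so the next SYN from the client is
answered with a SYN/ACK rather than a bare ACK:

  # Drop the stale connection state on the server: local address/port
  # first, then foreign address/port.  The stale 4-tuple can be read
  # from "netstat -an" on the server.
  tcpdrop 1.2.3.5 2049 1.2.3.4 601
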
>>> This is in stark contrast to the behavior of the other systems we have.
>>> Other systems respond with a SYN/ACK to the SYN used to reconnect.  They
>>> appear to implicitly tear down the existing connection state upon getting
>>> a SYN on a seemingly already established connection.
>>>
>>> I'm assuming this is one of the grey areas where there is no specific
>>> behavior outlined in an RFC?  Is there any way to make the FreeBSD system
>>> more reliable in this situation (like making it implicitly tear down the
>>> return)?  Or is there a way to adjust the NAT setup to allow the RST to
>>> return to the FreeBSD system?  Currently, NAT is set up simply with:
>>>
>>> iptables -t nat -A POSTROUTING -s 10.1.0.0/16 -o bond0 -j SNAT --to 1.2.3.4
>>>
>>> where 1.2.3.4 is the intranet address and 10.1.0.0/16 is the cluster
>>> network.
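One knob that may help on the Linux head node (assuming a netfilter/conntrack
NAT; sysctl names are as in 2.6-era kernels and worth verifying locally with
"sysctl -a") is to make conntrack liberal about out-of-window TCP segments,
so the client's RST is not classified INVALID and still gets NATed back to
the server:

  # Treat out-of-window TCP packets (such as the RST answering the
  # stale ACK) as valid so they still traverse the SNAT rule.
  sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1

  # Let conntrack pick up already-established flows without having
  # seen the handshake (usually the default; shown for completeness).
  sysctl -w net.netfilter.nf_conntrack_tcp_loose=1
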
>>
>> I just added NFS to the subject because the NFS people are those you
>> need to connect with.
>
> Skimming Chris' problem description, I don't think this is an NFS issue;
> I agree with Chris that it's netstack-related behaviour as opposed to
> application-related behaviour.
>
> Chris, I have minimal cycles at the moment and your scenario is bending
> my brain a little bit too much to give a quick response. A tcpdump
> excerpt showing such an exchange would be very useful. I'll try to come
> back to it when I have a sec. Andre, do you have a few cycles to
> digest this in more detail?
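For a capture like the one requested, something along these lines on the
head node's outside interface should show the reconnect SYN, the server's
bare ACK, and whether the client's RST makes it out (bond0 is taken from the
iptables rule above; port 2049 assumes NFS over TCP on the standard port):

  # Capture all traffic between the head node's NATed address and the
  # NFS service, with full packets and no name resolution.
  tcpdump -n -s 0 -i bond0 'host 1.2.3.4 and port 2049'
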

I've had very few cycles since EuroBSDCon as well, but this weekend my
little son has a sleepover at my mother-in-law's and my wife is at
work.  So I'm going to reduce my FreeBSD backlog.  A few things have
queued up, and I should get enough time to take care of this one.

-- 
Andre


