Date:      Tue, 23 Mar 2021 00:49:57 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Jason Breitman <jbreitman@tildenparkcapital.com>, Youssef GHORBAL <youssef.ghorbal@pasteur.fr>
Cc:        "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject:   Re: NFS Mount Hangs
Message-ID:  <YQXPR0101MB0968FB1FF0FC481CE37E9A81DD649@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <F6C37433-93B9-4E80-A963-49FC151C6FE8@tildenparkcapital.com>
References:  <C643BB9C-6B61-4DAC-8CF9-CE04EA7292D0@tildenparkcapital.com> <3750001D-3F1C-4D9A-A9D9-98BCA6CA65A4@tildenparkcapital.com> <33693DE3-7FF8-4FAB-9A75-75576B88A566@tildenparkcapital.com> <D67AF317-D238-4EC0-8C7F-22D54AD5144C@pasteur.fr> <B5E47AF2-5F24-4BD8-B228-A03246C03A6D@tildenparkcapital.com> <E652D260-BC35-462B-A53B-E728CF972F09@pasteur.fr>, <F6C37433-93B9-4E80-A963-49FC151C6FE8@tildenparkcapital.com>

I am going to create a FreeBSD PR for this, so that this
does not get forgotten.

If anyone has a problem with me cutting/pasting their
comments in this thread into the PR, please email me soon.
(If I don't hear from you soon, I'll assume you are ok with it.)

Same goes for a post to linux-nfs@vger.kernel.org at some point.

I think the recently posted patch *might* work around the problem.
The underlying cause will likely be a mystery for some time, I think?

Thanks everyone for your comments, rick

________________________________________
From: owner-freebsd-net@freebsd.org <owner-freebsd-net@freebsd.org> on behalf of Jason Breitman <jbreitman@tildenparkcapital.com>
Sent: Monday, March 22, 2021 9:24 AM
To: Youssef GHORBAL
Cc: freebsd-net@freebsd.org
Subject: Re: NFS Mount Hangs



Agreed.  I had made the changes on the FreeBSD Server side and was suggesting that a new TCP connection needed to be established between the client and server for the settings to take effect.
I rebooted all of my Debian clients on Sunday to achieve that goal, establishing a new NFSv4 TCP connection with the file server, and will let the group know if I see another hang.

Jason Breitman


On Mar 22, 2021, at 7:27 AM, Youssef GHORBAL <youssef.ghorbal@pasteur.fr> wrote:



> On 21 Mar 2021, at 14:41, Jason Breitman <jbreitman@tildenparkcapital.com> wrote:
>
> Thanks for sharing as this sounds exactly like my issue.
>
> I had implemented the change below on 3/8/2021 and have experienced the NFS hang after that.
> Do I need to reboot or umount / mount all of the clients and then I will be ok?
>
> I had not rebooted the clients, but would do so to get out of this situation.
> It is logical that a new TCP session over 2049 needs to be reestablished for the changes to take effect.
>
> net.inet.tcp.fast_finwait2_recycle=1
> net.inet.tcp.finwait2_timeout=1000

In my case, those were implemented on the server (FreeBSD side), since the BSD box was the one closing the connection and the FIN_WAIT_2 state was on its side.
In your case the FIN_WAIT_2 is on the client side. I don't know if these sysctls are even available on Linux.
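
For reference, a minimal sketch of how the two settings quoted above might be applied on the FreeBSD server. The helper names finwait2_conf and apply_finwait2 are made up for illustration, and the default value is simply the one mentioned in this thread, not a recommendation.

```shell
#!/bin/sh
# Sketch only: finwait2_conf and apply_finwait2 are hypothetical helpers.

# Print the two settings discussed in this thread in sysctl.conf format.
finwait2_conf() {
    timeout="${1:-1000}"   # value used earlier in this thread
    printf 'net.inet.tcp.fast_finwait2_recycle=1\n'
    printf 'net.inet.tcp.finwait2_timeout=%s\n' "$timeout"
}

# Apply the settings to the running kernel (FreeBSD server, as root).
# Defined here but deliberately not called.
apply_finwait2() {
    finwait2_conf "$@" | while IFS= read -r setting; do
        sysctl "$setting"
    done
}
```

Running `apply_finwait2 1000` as root would change the running kernel, and appending the output of `finwait2_conf 1000` to /etc/sysctl.conf should keep the settings across reboots.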

> I can also confirm that the iptables solution that you use on the client to get out of the hung mount without a reboot works for me.
> #!/bin/sh
>
> progName="nfsClientFix"
> delay=15
> nfs_ip=NFS.Server.IP.X
>
> # Returns 0 if a connection to the NFS server is in FIN_WAIT2.
> nfs_fin_wait2_state() {
>   /usr/bin/netstat -an | /usr/bin/grep -F "${nfs_ip}:2049" | /usr/bin/grep FIN_WAIT2 > /dev/null 2>&1
>   return $?
> }
>
>
> nfs_fin_wait2_state
> result=$?
> if [ ${result} -eq 0 ] ; then
>   /usr/bin/logger -s -i -p local7.error -t ${progName} "NFS Connection is in FIN_WAIT2!"
>   /usr/bin/logger -s -i -p local7.error -t ${progName} "Enabling firewall to block ${nfs_ip}!"
>   /usr/sbin/iptables -A INPUT -s "${nfs_ip}" -j DROP
>
>   # Poll until the FIN_WAIT2 socket is gone, then remove the block.
>   while true
>   do
>       /usr/bin/sleep ${delay}
>       nfs_fin_wait2_state
>       result=$?
>       if [ ${result} -ne 0 ] ; then
>           /usr/bin/logger -s -i -p local7.notice -t ${progName} "NFS Connection is OK."
>           /usr/bin/logger -s -i -p local7.error -t ${progName} "Disabling firewall to allow access to ${nfs_ip}!"
>           /usr/sbin/iptables -D INPUT -s "${nfs_ip}" -j DROP
>           break
>       fi
>   done
> fi
>
>
> Jason Breitman
>
>
> On Mar 19, 2021, at 8:40 PM, Youssef GHORBAL <youssef.ghorbal@pasteur.fr> wrote:
>
> Hi Jason,
>
>> On 17 Mar 2021, at 18:17, Jason Breitman <jbreitman@tildenparkcapital.com> wrote:
>>
>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue.
>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side.
>>
>> Issue
>> NFSv4 mounts periodically hang on the NFS Client.
>>
>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues.
>> Also, other NFS Clients are successfully mounting from the NFS Server in question.
>> Rebooting the NFS Client appears to be the only solution.
>
> I had experienced a similar weird situation with periodically stuck Linux NFS clients mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have their own nfsd).
> We've had better luck and we did manage to get packet captures on both sides during the issue. The gist of it goes as follows:
>
> - Data flows correctly between SERVER and the CLIENT.
> - At some point SERVER starts decreasing its TCP Receive Window until it reaches 0.
> - The client (eager to send data) can only ACK data sent by SERVER.
> - When SERVER is done sending data, the client starts sending TCP Window Probes, hoping that the TCP Window opens again so it can flush its buffers.
> - SERVER responds with a TCP Zero Window to those probes.
> - After 6 minutes (the NFS server default idle timeout) SERVER gracefully closes the TCP connection, sending a FIN packet (still with a TCP Window of 0).
> - CLIENT ACKs that FIN.
> - SERVER goes into the FIN_WAIT_2 state.
> - CLIENT closes its half of the socket and goes into the LAST_ACK state.
> - FIN is never sent by the client since there is still data in its SendQ and the receiver's TCP Window is still 0. At this stage the client starts sending TCP Window Probes again and again, hoping that the server opens its TCP Window so it can flush its buffers and terminate its side of the socket.
> - SERVER keeps responding with a TCP Zero Window to those probes.
> => The last two steps go on and on for hours/days, freezing the NFS mount bound to that TCP session.
>
> If we had a situation where CLIENT was responsible for closing the TCP Window (and initiating the TCP FIN first) and the server wanted to send data, we'll end up in the same state as you, I think.
>
> We never found the root cause of why the SERVER decided to close the TCP Window and no longer accept data. The fix on the Isilon side was to recycle the FIN_WAIT_2 sockets more aggressively (net.inet.tcp.fast_finwait2_recycle=1 & net.inet.tcp.finwait2_timeout=5000). Once the socket is recycled, at the next occurrence of a CLIENT TCP Window probe SERVER sends a RST, triggering the teardown of the session on the client side, a new TCP handshake, etc., and traffic flows again (NFS starts responding).
>
> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 recycling was implemented on the Isilon side) we've added a check script on the client that detects LAST_ACK sockets on the client and, through an iptables rule, enforces a TCP RST. Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT --reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK disappears).
>
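
The check script itself was not posted, so what follows is only a rough sketch of the LAST_ACK detection and rule construction described above. The function names are hypothetical, the input is assumed to be Linux `netstat -tan` output, and the iptables arguments are the ones quoted in the paragraph above.

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical; the rule arguments come from
# the message above. Input is assumed to be Linux `netstat -tan` output.

# Print the local port of each socket to <server>:2049 that is in LAST_ACK.
last_ack_ports() {
    awk -v peer="$1:2049" '$6 == "LAST_ACK" && $5 == peer {
        n = split($4, parts, ":")
        print parts[n]
    }'
}

# Print the rule specification (without -A/-D) for one stuck local port.
reset_rule() {
    printf 'OUTPUT -p tcp -d %s --sport %s -j REJECT --reject-with tcp-reset\n' "$1" "$2"
}

# Add a reset rule for every stuck port and print the ports that were handled.
# Needs root; defined here but deliberately not called.
add_resets() {
    server="$1"
    netstat -tan | last_ack_ports "$server" | while read -r port; do
        # Word splitting of the rule specification is intentional.
        iptables -A $(reset_rule "$server" "$port") && echo "$port"
    done
}
```

The ports printed by `add_resets` would have to be remembered so that the same specification can later be passed to `iptables -D` once the socket has left LAST_ACK, mirroring the cleanup step described above.
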
> The bottom line would be to get a packet capture during the outage (client and/or server side); it will show you at least the shape of the TCP exchange when NFS is stuck.
>
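
One possible way to take such a capture, sketched with tcpdump. The function names, interface name and file name are placeholders rather than anything used in this thread, and the helpers only print the command lines so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch only: function names, interface and file names are placeholders.

# Print a tcpdump command that records all NFS (TCP 2049) traffic with one peer.
capture_cmd() {
    # $1 = interface, $2 = peer address, $3 = output file
    printf 'tcpdump -i %s -s 0 -w %s host %s and tcp port 2049\n' "$1" "$3" "$2"
}

# Print a tcpdump command that lists zero-window segments from a saved capture.
# tcp[14:2] is the 16-bit window field of the TCP header; RST segments are
# excluded because they commonly advertise a zero window anyway.
zero_window_cmd() {
    # $1 = capture file
    printf "tcpdump -nn -r %s 'tcp[14:2] = 0 and tcp[tcpflags] & tcp-rst = 0'\n" "$1"
}
```

For example, `capture_cmd em0 192.0.2.10 /tmp/nfs.pcap` prints the capture command for a hypothetical interface and peer, and `zero_window_cmd /tmp/nfs.pcap` prints a command that would show the zero-window replies to the window probes, if the hang follows the pattern described above.
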
> Youssef
>
>


_______________________________________________
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"


