Date:      Tue, 28 Jul 2015 19:39:20 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Ahmed Kamal <email.ahmedkamal@googlemail.com>
Cc:        Graham Allan <allan@physics.umn.edu>,  Ahmed Kamal via freebsd-fs <freebsd-fs@freebsd.org>
Subject:   Re: Linux NFSv4 clients are getting (bad sequence-id error!)
Message-ID:  <1089316279.4709692.1438126760802.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <CANzjMX5Q4TNLBxrAm6R2F6oUdfgRD8dX1LRZiniJA4M4HTN_=w@mail.gmail.com>
References:  <684628776.2772174.1435793776748.JavaMail.zimbra@uoguelph.ca> <184170291.10949389.1437161519387.JavaMail.zimbra@uoguelph.ca> <CANzjMX4NmxBErtEu=e5yEGJ6gAJBF4_ar_aPdNDO2-tUcePqTQ@mail.gmail.com> <55B12EB7.6030607@physics.umn.edu> <1935759160.2320694.1437688383362.JavaMail.zimbra@uoguelph.ca> <CANzjMX48F1gAVwqq64q=yALfTBNEc7iMbKAK1zi6aUfoF3WpOw@mail.gmail.com> <576106597.2326662.1437688749018.JavaMail.zimbra@uoguelph.ca> <CANzjMX5Q4TNLBxrAm6R2F6oUdfgRD8dX1LRZiniJA4M4HTN_=w@mail.gmail.com>

------=_Part_4709690_1367398941.1438126760800
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Ahmed Kamal wrote:
> Hi again Rick,
> 
> Seems that I'm still being unlucky with nfs :/ I caught one of the newly
> installed RHEL6 boxes having high CPU usage, and bombarding the BSD NFS box
> with 10Mbps traffic .. I caught a tcpdump as you mentioned .. You can
> download it here:
> 
> https://dl.dropboxusercontent.com/u/51939288/nfs41-high-client-cpu.pcap.bz2
> 
Ok, the packet trace suggests that the NFSv4 server is broken (it is replying
with NFS4ERR_STALE_CLIENTID for a recently generated ClientID).
Now, I can't be sure, but the only explanation I can come up with is...
- For some arches (I only have i386, so I wouldn't have seen this during testing),
  time_t is 64 bits (a signed int64_t).
  --> If time_seconds somehow doesn't fit in the low order 32 bits, then the code
      would be busted for these arches, because nfsrvboottime is set to time_seconds
      when the server is started and then there are comparisons like:
      if (nfsrvboottime != clientid.lval[0])
           return (NFSERR_STALECLIENTID);
       /* where clientid.lval[0] is a uint32_t */
      With a 64-bit nfsrvboottime that has any of the high order bits set, the
      uint32_t side gets widened for the comparison and can never match.
Anyhow, if this is what is happening, the attached simple patch should fix it.
(I don't know how time_seconds would exceed 4 billion, but the clock code is
 pretty convoluted, so I can't say whether it can happen.)

rick
ps: Hmm, on i386 time_seconds ends up at 1438126486, so maybe it can exceed
    4*1024*1024*1024 - 1 on amd64?
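
A minimal sketch of the suspected failure mode, assuming the boot-time stamp is
held in a signed 64-bit integer. The names boottime_t, stale_unpatched and
stale_patched are hypothetical and only for illustration; they are not
identifiers from the FreeBSD source. The first helper compares the way the
unpatched checks do, the second truncates to 32 bits first, as the attached
patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit boot-time stamp (assumption). */
typedef int64_t boottime_t;

/*
 * Unpatched style: the uint32_t operand is widened to 64 bits, so any
 * boot time with high-order bits set can never compare equal.
 */
static bool
stale_unpatched(boottime_t boottime, uint32_t clientid_lval0)
{
	return (boottime != clientid_lval0);
}

/*
 * Patched style: truncate the boot time to its low-order 32 bits, which
 * is what was stored in the clientid in the first place.
 */
static bool
stale_patched(boottime_t boottime, uint32_t clientid_lval0)
{
	return ((uint32_t)boottime != clientid_lval0);
}
```

For a value that fits in 32 bits, such as the 1438126486 seen on i386, both
helpers agree. They only diverge once the stamp exceeds 4*1024*1024*1024 - 1,
where the unpatched form reports the clientid as stale even though it was
derived from the same boot time.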

> I didn't restart the client yet .. so if you catch me in the next few hours
> and want me to run any diagnostics, let me know. Thanks a lot all for
> helping
> 
> On Thu, Jul 23, 2015 at 11:59 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> 
> > Ahmed Kamal wrote:
> > > Can you please let me know the ultimate packet trace command I'd need to
> > > run in case of any nfs4 troubles .. I guess this should be comprehensive
> > > even at the expense of a larger output size (which we can trim later)..
> > > Thanks a lot for the help!
> > >
> > tcpdump -s 0 -w <file>.pcap host <client-host-name>
> > (<file> refers to a file name you choose and <client-host-name> refers to
> >  the host name of a client generating traffic.)
> > --> But you won't be able to allow this to run for long during the storm
> >     or the file will be huge.
> >
> > Then you look at <file>.pcap in wireshark, which knows NFS.
> >
> > rick
> >
> > > On Thu, Jul 23, 2015 at 11:53 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> > >
> > > > Graham Allan wrote:
> > > > > For our part, the user whose code triggered the pathological behaviour
> > > > > on SL5 reran it on SL6 without incident - I still see lots of
> > > > > sequence-id errors in the logs, but nothing bad happened.
> > > > >
> > > > > I'd still like to ask them to rerun again on SL5 to see if the "accept
> > > > > skipped seqid" patch had any effect, though I think we expect not. Maybe
> > > > > it would be nice if I could get set up to capture rolling tcpdumps of
> > > > > the nfs traffic before they run that though...
> > > > >
> > > > > Graham
> > > > >
> > > > > On 7/20/2015 10:26 PM, Ahmed Kamal wrote:
> > > > > > Hi folks,
> > > > > >
> > > > > > I've upgraded a test client to rhel6 today, and I'll keep an eye on it
> > > > > > to see what happens.
> > > > > >
> > > > > > During the process, I made the (I guess mistake) of zfs send | recv to a
> > > > > > locally attached usb disk for backup purposes .. long story short,
> > > > > > sharenfs property on the received filesystem was causing some nfs/mountd
> > > > > > errors in logs .. I wasn't too happy with what I got .. I destroyed the
> > > > > > backup datasets and the whole pool eventually .. and then rebooted the
> > > > > > whole nas box .. After reboot my logs are still flooded with
> > > > > >
> > > > > > Jul 21 05:12:36 nas kernel: nfsrv_cache_session: no session
> > > > > > Jul 21 05:13:07 nas last message repeated 7536 times
> > > > > > Jul 21 05:15:08 nas last message repeated 29664 times
> > > > > >
> > > > > > Not sure what that means .. or how it can be stopped .. Anyway, will
> > > > > > keep you posted on progress.
> > > > >
> > > > Oh, I didn't see the part about "reboot" before. Unfortunately, it sounds
> > > > like the client isn't recovering after the session is lost. When the server
> > > > reboots, the client(s) will get NFS4ERR_BAD_SESSION errors back because the
> > > > server reboot has deleted all sessions. The NFS4ERR_BAD_SESSION should
> > > > trigger state recovery on the client.
> > > > (It doesn't sound like the clients went into recovery, starting with a
> > > >  Create_session operation, but without a packet trace, I can't be sure?)
> > > >
> > > > rick
> > > >
> > > > >
> > > > > --
> > > > > -------------------------------------------------------------------------
> > > > > Graham Allan - gta@umn.edu - allan@physics.umn.edu
> > > > > School of Physics and Astronomy - University of Minnesota
> > > > > -------------------------------------------------------------------------
> > > > >
> > > > >
> > > >
> > >
> >
> 

------=_Part_4709690_1367398941.1438126760800
Content-Type: text/x-patch; name=64bitboottime.patch
Content-Disposition: attachment; filename=64bitboottime.patch
Content-Transfer-Encoding: 7bit

--- fs/nfsserver/nfs_nfsdstate.c.sav	2015-07-28 18:54:06.561454000 -0400
+++ fs/nfsserver/nfs_nfsdstate.c	2015-07-28 19:00:20.351089000 -0400
@@ -487,7 +487,7 @@ nfsrv_getclient(nfsquad_t clientid, int 
 	if (clpp)
 		*clpp = NULL;
 	if ((nd == NULL || (nd->nd_flag & ND_NFSV41) == 0 ||
-	    opflags != CLOPS_RENEW) && nfsrvboottime != clientid.lval[0]) {
+	    opflags != CLOPS_RENEW) && (uint32_t)nfsrvboottime != clientid.lval[0]) {
 		error = NFSERR_STALECLIENTID;
 		goto out;
 	}
@@ -683,7 +683,7 @@ nfsrv_destroyclient(nfsquad_t clientid, 
 	struct nfsclienthashhead *hp;
 	int error = 0, i, igotlock;
 
-	if (nfsrvboottime != clientid.lval[0]) {
+	if ((uint32_t)nfsrvboottime != clientid.lval[0]) {
 		error = NFSERR_STALECLIENTID;
 		goto out;
 	}
@@ -3996,11 +3996,11 @@ nfsrv_checkrestart(nfsquad_t clientid, u
 	 */
 	if (flags &
 	    (NFSLCK_OPEN | NFSLCK_TEST | NFSLCK_RELEASE | NFSLCK_DELEGPURGE)) {
-		if (clientid.lval[0] != nfsrvboottime) {
+		if (clientid.lval[0] != (uint32_t)nfsrvboottime) {
 			ret = NFSERR_STALECLIENTID;
 			goto out;
 		}
-	} else if (stateidp->other[0] != nfsrvboottime &&
+	} else if (stateidp->other[0] != (uint32_t)nfsrvboottime &&
 		specialid == 0) {
 		ret = NFSERR_STALESTATEID;
 		goto out;
------=_Part_4709690_1367398941.1438126760800--


