Date:      Sun, 23 Mar 2014 20:38:59 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Christopher Forgeron <csforgeron@gmail.com>
Cc:        FreeBSD Net <freebsd-net@freebsd.org>, Garrett Wollman <wollman@freebsd.org>, Jack Vogel <jfvogel@gmail.com>, Markus Gebert <markus.gebert@hostpoint.ch>
Subject:   Re: 9.2 ixgbe tx queue hang
Message-ID:  <1850411724.1687820.1395621539316.JavaMail.root@uoguelph.ca>
In-Reply-To: <CAB2_NwAEzgs1u7GkueKrhMT7iSRqZqkHObrOrXeaLC_EW7Nnwg@mail.gmail.com>

Christopher Forgeron wrote:
> Hi Rick, very helpful as always.
> 
> 
> On Sat, Mar 22, 2014 at 6:18 PM, Rick Macklem <rmacklem@uoguelph.ca>
> wrote:
> 
> > Christopher Forgeron wrote:
> >
> > Well, you could try making if_hw_tsomax somewhat smaller. (I can't
> > see
> > how the packet including ethernet header would be more than 64K
> > with the
> > patch, but?? For example, the ether_output() code can call
> > ng_output()
> > and I have no idea if that might grow the data size of the packet?)
> >
> 
> That's what I was thinking - I was going to drop it down to 32k,
> which is
> extreme, but I wanted to see if it cured it or not. Something would
> have to
> be very broken to be adding nearly 32k to a packet.
> 
> 
> > To be honest, the optimum for NFS would be setting if_hw_tsomax ==
> > 56K,
> > since that would avoid the overhead of the m_defrag() calls.
> > However,
> > it is suboptimal for other TCP transfers.
> >
> 
Ok, here is the critical code snippet from tcp_output():
	/*
 774	 * Limit a burst to t_tsomax minus IP,
 775	 * TCP and options length to keep ip->ip_len
 776	 * from overflowing or exceeding the maximum
 777	 * length allowed by the network interface.
 778	 */
 779	if (len > tp->t_tsomax - hdrlen) {
 780		len = tp->t_tsomax - hdrlen;
 781		sendalot = 1;
 782	}
 783
 784	/*
 785	 * Prevent the last segment from being
 786	 * fractional unless the send sockbuf can
 787	 * be emptied.
 788	 */
 789	if (sendalot && off + len < so->so_snd.sb_cc) {
 790		len -= len % (tp->t_maxopd - optlen);
 791		sendalot = 1;
 792	}
The first "if" at #779 limits the len to if_hw_tsomax - hdrlen.
(tp->t_tsomax == if_hw_tsomax and hdrlen == size of TCP/IP header)
The second "if" at #789 reduces the len to an exact multiple of the output
MTU if it won't empty the send queue.

Here's how I think things work:
- For a full 64K of read/write data, NFS generates an mbuf list with
  32 MCLBYTES clusters of data and two small header mbufs prepended to
  them (one for the RPC header plus one for the NFS args that come
  before the data).
  Total data length is a little over 64K (something like 65600 bytes).
  - When the above code processes this, it reduces the length to
    if_hw_tsomax (65535 by default). { "if" at #779 }
  - The second "if" at #789 reduces it further (roughly 63000 for a
    9000 byte MTU).
  tcp_output() prepends an mbuf with the TCP/IP header in it, resulting
  in a total data length somewhat less than 64K, and passes this to the
  ixgbe.c driver.
- The ixgbe.c driver prepends an ethernet header (14 or maybe 18 bytes in
  length) by calling ether_output() and then hands it (a little less than
  64K bytes of data in 35 mbufs) to ixgbe_xmit().
  ixgbe_xmit() calls bus_dmamap_load_mbuf_sg() which fails, returning
  EFBIG, because the list has more than 32 mbufs in it.
  - then it calls m_defrag(), which copies the slightly less than 64K
    of data to a list of 32 mbuf clusters.
  - bus_dmamap_load_mbuf_sg() is called again and succeeds this time
    because the list is only 32 mbufs long.
   (The call to m_defrag() adds some overhead and does have the potential
    to fail if mbuf clusters are exhausted, so this works, but isn't ideal;
    a stripped down sketch of the retry logic follows below.)
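
For reference, here is a stripped down sketch of that load/defrag/retry
dance. This is not the actual ixgbe code, just the shape of it, using the
busdma and mbuf KPIs as I understand them:

	/*
	 * Sketch only: txr, map, m_head, segs and nsegs are declared by the
	 * caller; queueing, stats and error unwinding are omitted.
	 */
	error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
	    segs, &nsegs, BUS_DMA_NOWAIT);
	if (error == EFBIG) {
		/* Chain has too many mbufs; copy it into 2K clusters. */
		struct mbuf *m = m_defrag(m_head, M_NOWAIT);
		if (m == NULL)
			return (ENOBUFS);	/* clusters exhausted */
		m_head = m;
		/* Retry once; still EFBIG if the copy needs > 32 clusters. */
		error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
		    segs, &nsegs, BUS_DMA_NOWAIT);
	}
	if (error != 0)
		return (error);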

The problem case happens when the size of the I/O is a little less than
the full 64K (EOF hit for a read, or a smaller-than-64K dirty region in a
buffer cache block for a write).
- Now, for example, the total data length for the mbuf chain (including
  RPC, NFS and TCP/IP headers) could be 65534 (slightly less than 64K).
The first "if" doesn't change the "len", since it is less than if_hw_tsomax.
The second "if" doesn't change the "len" if there is no additional data in
the send queue.
--> Now the ixgbe driver prepends an ethernet header, increasing the total
    data length to 65548 (a little over 64K).
   - First call to bus_dmamap_load_mbuf_sg() fails with EFBIG because the
     mbuf list has more than 32 entries.
   - calls m_defrag(), which copies the data to a list of 33 mbuf clusters.
     (> 64K requires 33 * 2K clusters)
   - Second call to bus_dmamap_load_mbuf_sg() fails again with EFBIG, because
     the list has 33 mbufs in it.
   --> Returns EFBIG and throws away the TSO segment without sending it.
       (A quick sanity check of the arithmetic follows below.)
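
A quick sanity check of the cluster arithmetic for that case (the 14 byte
ethernet header is an assumption; a vlan tag adds 4 more):

	#include <stdio.h>

	int
	main(void)
	{
		int tso_len = 65534;	/* TSO segment handed to the driver */
		int ether_hdr = 14;	/* assumed; 18 with a vlan tag */
		int mclbytes = 2048;	/* MCLBYTES */
		int total = tso_len + ether_hdr;
		int clusters = (total + mclbytes - 1) / mclbytes;

		/* Prints: 65548 bytes -> 33 clusters (driver limit is 32) */
		printf("%d bytes -> %d clusters (driver limit is 32)\n",
		    total, clusters);
		return (0);
	}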

For NFS, the ideal would be to not only never fail with EFBIG, but to not
have the overhead of calling m_defrag().
- One way is to use pagesize (4K) clusters, so that the mbuf list only has
  19 entries (sketched after this list).
- Another way is to teach tcp_output() to limit the mbuf list to 32 mbufs
  as well as 65535 bytes in length.
- Yet another is to make if_hw_tsomax small enough that the mbuf list
  doesn't exceed 32 mbufs. (56K would do this for NFS, but is suboptimal
  for other traffic.)
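
To give a feel for the first option, here is a minimal sketch of building
the data portion of a 64K reply out of page size clusters via m_getjcl().
This is not what the NFS code does today, just an illustration:

	/* Sketch: 64K of reply data in 4K (MJUMPAGESIZE) clusters. */
	struct mbuf *top = NULL, **mpp = &top;
	int resid = 65536;

	while (resid > 0) {
		struct mbuf *m = m_getjcl(M_NOWAIT, MT_DATA, 0, MJUMPAGESIZE);
		if (m == NULL)
			break;	/* caller must handle allocation failure */
		m->m_len = (resid > MJUMPAGESIZE) ? MJUMPAGESIZE : resid;
		resid -= m->m_len;
		*mpp = m;
		mpp = &m->m_next;
	}
	/* 16 data mbufs + 2 RPC/NFS header mbufs + TCP/IP header == 19. */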

I am hoping that the only reason you still saw a few EFBIGs with the
patch that reduced if_hw_tsomax by ETHER_HDR_LEN was that some packets had
the additional 4 bytes for a vlan header. If that is the case, the
slightly modified patch should make all EFBIG error returns go away.
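
If it helps to see it spelled out, the modified patch amounts to something
like this in the driver attach routine (a sketch; IP_MAXPACKET is 65535,
and ETHER_HDR_LEN/ETHER_VLAN_ENCAP_LEN are 14 and 4 from net/ethernet.h):

	/* Leave room for an ethernet header plus a possible vlan tag. */
	ifp->if_hw_tsomax = IP_MAXPACKET -
	    (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);	/* 65535 - 18 = 65517 */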

> I'm very interested in NFS performance, so this is interesting to me
> - Do
> you have the time to educate me on this? I was going to spend this
> week
> hacking out the NFS server cache, as I feel ZFS does a better job,
> and my
> cache stats are always terrible, as to be expected when I have such a
> wide
> data usage on these sans.
> 
The DRC (duplicate request cache) is not for performance, it is for
correctness. It exists because Sun RPC has "at least once" semantics,
so a retried non-idempotent operation (one that modifies the file
system) can be executed twice, corrupting the file system on the server.

The risk is lower when using TCP vs UDP, but is still non-zero.

If you had a "perfect network fabric that never lost packets", the
hit rate of the DRC is 0 and could safely be disabled. However, each
hit implies a case where file system corruption has been avoided, so
I think most environments want it, despite the overhead.
--> The better your network environment, the lower the hit rate.
    (Which means you want to see "terrible" cache stats.;-)
--> It is always some amount of overhead for the sake of correctness
    and never improves performance (well, technically it does avoid
    redoing file system ops when there is a hit, but the performance
    effect is minuscule and not relevant). A simplified sketch of a
    DRC entry follows below.
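
If you want a mental model of the DRC, it is just a cache of recently
answered requests, keyed well enough to recognize a retry. A simplified
sketch (not the actual svc code, which keeps more state):

	struct drc_entry {
		uint32_t		xid;	/* RPC transaction id */
		struct sockaddr_storage	client;	/* who sent the request */
		uint32_t		proc;	/* NFS procedure number */
		struct mbuf		*reply;	/* cached reply to resend */
		time_t			tstamp;	/* for cache expiry */
	};

	/*
	 * On each request: if (xid, client, proc) matches a cached entry,
	 * resend the cached reply (a "hit") instead of re-executing a
	 * non-idempotent operation; otherwise execute the request and
	 * cache the reply after it is sent.
	 */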

This was all fixed by NFSv4.1, which uses a mechanism called Sessions
to provide "exactly once" RPC semantics. As such, the NFSv4.1 server
(still in a projects branch on svn) doesn't use the DRC.

> >
> > One other thing you could do (if you still have them) is scan the
> > logs
> > for the code with my previous printf() patch and see if there is
> > ever
> > a size > 65549 in it. If there is, then if_hw_tsomax needs to be
> > smaller
> > by at least that size - 65549. (65535 + 14 == 65549)
> >
> 
> There were some 65548's for sure. Interestingly enough, the amount
> that it
> ruptures by seems to be increasing slowly. I should possibly let it
> rupture
> and run for a long time to see if there is a steadily increasing
> pattern...
> perhaps something is accidentally incrementing the packet by say 4
> bytes in
> a heavily loaded error condition.
> 
I doubt it, since people run other network interfaces that don't have
the 32-mbuf (transmit segment) limitation without difficulties, as far
as I know.

> >
> 
> > I'm not familiar enough with the mbuf/uma allocators to "confirm"
> > it,
> > but I believe the "denied" refers to cases where m_getjcl() fails
> > to get
> > a jumbo mbuf and returns NULL.
> >
> > If this were to happen in m_defrag(), it would return NULL and the
> > ix
> > driver returns ENOBUFS, so this is not the case for EFBIG errors.
> >
> > BTW, the loop that your original printf code is in, just before the
> > retry:
> goto label: That's an error loop, and it looks to me that all/most
> packets
> traverse it at some time?
> 
It does a second try at calling bus_dmamap_load_mbuf_sg() after compacting
the mbuf list via m_defrag(), and returns EFBIG if the second attempt also
fails.

> 
> > I don't know if increasing the limits for the jumbo mbufs via
> > sysctl
> > will help. If you are using the code without Jack's patch, which
> > uses
> > 9K mbufs, then I think it can fragment the address space and result
> > in no 9K contiguous areas to allocate from. (I'm just going by what
> > Garrett and others have said about this.)
> >
> >
> I never seem to be running out of mbufs - 4k or 9k. Unless it's
> possible
> for a starvation to occur without incrementing the counters.
> Additionally,
> netstat -m is recording denied mbufs on boot, so on a 96 Gig system
> that is
> just starting up, I don't think I am.. but a large increase in the
> buffers
> is on my list of desperation things to try.
> 
> Thanks for the hint on m_getjcl().. I'll dig around and see if I can
> find
> what's happening there. I guess it's time for me to learn basic
> dtrace as
> well. :-)
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscribe@freebsd.org"
> 


