From: Yonghyeon PYUN
Date: Mon, 31 Mar 2014 11:32:53 +0900
To: Rick Macklem
Cc: FreeBSD Filesystems, FreeBSD Net, Alexander Motin
Reply-To: pyunyh@gmail.com
Subject: Re: RFC: How to fix the NFS/iSCSI vs TSO problem
Message-ID: <20140331023253.GC3548@michelle.cdnetworks.com>
In-Reply-To: <1903781266.1237680.1395880068597.JavaMail.root@uoguelph.ca>
References: <20140326023334.GB2973@michelle.cdnetworks.com>
 <1903781266.1237680.1395880068597.JavaMail.root@uoguelph.ca>

On Wed, Mar 26, 2014 at 08:27:48PM -0400, Rick Macklem wrote:
> pyunyh@gmail.com wrote:
> > On Tue, Mar 25, 2014 at 07:10:35PM -0400, Rick Macklem wrote:
> > > Hi,
> > >
> > > First off, I hope you don't mind that I cross-posted this, but I
> > > wanted to make sure both the NFS/iSCSI and networking types see it.
> > > If you look in this mailing list thread:
> > > http://docs.FreeBSD.org/cgi/mid.cgi?1850411724.1687820.1395621539316.JavaMail.root
> > > you'll see that several people have been working hard at testing,
> > > and thanks to them, I think I now know what is going on.
> > >
> >
> > Thanks for your hard work on narrowing down that issue. I'm too busy
> > with $work these days, so I couldn't find time to investigate the
> > issue.
> >
> > > (This applies to network drivers that support TSO and are limited
> > > to 32 transmit segments -> 32 mbufs in chain.)
> > > Doing a quick search I found the following drivers that appear to
> > > be affected (I may have missed some):
> > > jme, fxp, age, sge, msk, alc, ale, ixgbe/ix, nfe, e1000/em, re
> > >
> >
> > The magic number 32 was chosen a long time ago when I implemented TSO
> > in non-Intel drivers. I tried to find an optimal number to reduce
> > kernel stack usage at that time. bus_dma(9) will coalesce with the
> > previous segment if possible, so I thought the number 32 was not an
> > issue. Not sure whether current bus_dma(9) still has the same code,
> > though. The number 32 is an arbitrary one, so you can increase it if
> > you want.
> >
> Well, in the case of "ix" Jack Vogel says it is a hardware limitation.
> I can't change drivers that I can't test and don't know anything about
> the hardware. Maybe replacing m_collapse() with m_defrag() is an
> exception, since I know what that is doing and it isn't hardware
> related, but I would still prefer a review by the driver
> author/maintainer before making such a change.
>
> If there are drivers that you know can be increased from 32->35, please
> do so, since that will not only avoid the EFBIG failures but also avoid
> a lot of calls to m_defrag().
>
> > > Further, of these drivers, the following use m_collapse() and not
> > > m_defrag() to try and reduce the # of mbufs in the chain.
> > > m_collapse() is not going to get the 35 mbufs down to 32 mbufs, as
> > > far as I can see, so these ones are more badly broken:
> > > jme, fxp, age, sge, alc, ale, nfe, re
> >
> > I guess m_defrag(9) is more optimized for non-TSO packets. You don't
> > want to waste CPU cycles copying the full frame just to reduce the
> > number of mbufs in the chain. For TSO packets, m_defrag(9) looks
> > better, but if we always have to copy a full TSO packet to make TSO
> > work, driver writers will have to invent a better scheme rather than
> > blindly relying on m_defrag(9), I guess.
> >
> Yes, avoiding m_defrag() calls would be nice. For this issue, increasing
> the transmit segment limit from 32->35 does that, if the change can be
> done easily/safely.
>
> Otherwise, all I can think of is my suggestion to add something like
> if_hw_tsomaxseg which the driver can use to tell tcp_output() the
> driver's limit for # of mbufs in the chain.
>
> > > The long description is in the above thread, but the short version
> > > is:
> > > - NFS generates a chain with 35 mbufs in it (for read/readdir
> > >   replies and write requests) made up of (tcpip header, RPC header,
> > >   NFS args, 32 clusters of file data)
> > > - tcp_output() usually trims the data size down to tp->t_tsomax
> > >   (65535) and then some more to make it an exact multiple of the
> > >   TCP transmit data size.
> > > - the net driver prepends an ethernet header, growing the length by
> > >   14 (or sometimes 18 for vlans), but in the first mbuf and not
> > >   adding one to the chain.
> > > - m_defrag() copies this to a chain of 32 mbuf clusters (because the
> > >   total data length is <= 64K) and it gets sent
> > >
> > > However, if the data length is a little less than 64K when passed to
> > > tcp_output(), so that the length including headers is in the range
> > > 65519->65535...
> > > - tcp_output() doesn't reduce its size.
> > > - the net driver adds an ethernet header, making the total data
> > >   length slightly greater than 64K
> > > - m_defrag() copies it to a chain of 33 mbuf clusters, which fails
> > >   with EFBIG
> > > --> trainwrecks NFS performance, because the TSO segment is dropped
> > >   instead of sent.
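> > > To put rough numbers on that (assuming the usual 2K mbuf clusters
> > > and an 18 byte vlan ethernet header): if tcp_output() hands down
> > > 65520 bytes of tcpip header plus data, that falls in 65519->65535
> > > and is not trimmed; the driver prepends 18 bytes, giving 65538, and
> > > 65538 > 32 * MCLBYTES = 65536, so m_defrag() needs a 33rd cluster
> > > and the bus_dma load fails with EFBIG. (65518 + 18 = 65536 still
> > > fits in 32 clusters, which is where the 65518 figure below comes
> > > from.)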
> > >
> > > A tester also stated that the problem could be reproduced using
> > > iSCSI. Maybe Edward Napierala might know some details w.r.t. what
> > > kind of mbuf chain iSCSI generates?
> > >
> > > Also, one tester has reported that setting if_hw_tsomax in the
> > > driver before the ether_ifattach() call didn't make the value of
> > > tp->t_tsomax smaller. However, reducing IP_MAXPACKET (which is what
> > > it is set to by default) did reduce it. I have no idea why this
> > > happens or how to fix it, but it implies that setting if_hw_tsomax
> > > in the driver isn't a solution until this is resolved.
> > >
> > > So, what to do about this?
> > > First, I'd like a simple fix/workaround that can go into 9.3 (which
> > > has its code freeze in May). The best thing I can think of is
> > > setting if_hw_tsomax to a smaller default value (line# 658 of
> > > sys/net/if.c in head).
> > >
> > > Version A:
> > > replace
> > >     ifp->if_hw_tsomax = IP_MAXPACKET;
> > > with
> > >     ifp->if_hw_tsomax = min(32 * MCLBYTES -
> > >         (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN), IP_MAXPACKET);
> > > plus
> > > replace m_collapse() with m_defrag() in the drivers listed above.
> > >
> > > This would only reduce the default from 65535->65518, so it only
> > > impacts the uncommon case where the output size (with tcpip header)
> > > is within this range. (As such, I don't think it would have a
> > > negative impact for drivers that handle more than 32 transmit
> > > segments.)
> > > From the testers, it seems that this is sufficient to get rid of the
> > > EFBIG errors. (The total data length including ethernet header
> > > doesn't exceed 64K, so m_defrag() fits it into 32 mbuf clusters.)
> > >
> > > The main downside of this is that there will be a lot of m_defrag()
> > > calls being done, and they do quite a bit of bcopy()'ng.
> > >
> > > Version B:
> > > replace
> > >     ifp->if_hw_tsomax = IP_MAXPACKET;
> > > with
> > >     ifp->if_hw_tsomax = min(29 * MCLBYTES, IP_MAXPACKET);
> > >
> > > This one would avoid the m_defrag() calls, but might have a negative
> > > impact on TSO performance for drivers that can handle 35 transmit
> > > segments, since the maximum TSO segment size is reduced by about 6K.
> > > (Because of the second size reduction to an exact multiple of the
> > > TCP transmit data size, the exact amount varies.)
> > >
> > > Possible longer term fixes:
> > > One longer term fix might be to add something like if_hw_tsomaxseg
> > > so that a driver can set a limit on the number of transmit segments
> > > (mbufs in chain) and tcp_output() could use that to limit the size
> > > of the TSO segment, as required. (I have a first stab at such a
> > > patch, but no way to test it, so I can't see that being done by May.
> > > Also, it would require changes to a lot of drivers to make it work.
> > > I've attached this patch, in case anyone wants to work on it.)
> > >
> > > Another might be to increase the size of MCLBYTES (I don't see this
> > > as practical for 9.3, although the actual change is simple).
> > > I do think that increasing MCLBYTES might be something to consider
> > > doing in the future, for reasons beyond fixing this.
> > >
> > > So, what do others think should be done? rick
> > >
> >
> > AFAIK all the TSO capable drivers you mentioned above have no limit on
> > the number of TX segments in the TSO path. Not sure about Intel
> > controllers, though. Increasing the number of segments will consume a
> > lot of kernel stack in those drivers. Given that ixgbe, which seems to
> > use 100, didn't show any kernel stack shortage, I think bumping the
> > number of segments would be a quick way to address the issue.
> >
> Well, bumping it from 32->35 is all it would take for NFS (can't
> comment w.r.t. iSCSI). ixgbe uses 100 for the 82598 chip and 32 for the
> 82599 (just so others aren't confused by the above comment). I
> understand your point was w.r.t. using 100 without blowing the kernel
> stack, but since the testers have been using "ix" with the 82599 chip,
> which is limited to 32 transmit segments...
>
> However, please increase any you know can be safely done from 32->35,
> rick

Done in r263957.
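For anyone following along without a driver in front of them, the transmit
path being discussed has roughly this shape (a simplified sketch with
placeholder "foo" names, not code lifted from any particular driver):

    /*
     * Sketch of a TSO-capable driver's encap routine.  FOO_MAXTXSEGS is
     * the per-packet segment limit (32 in the affected drivers today).
     */
    static int
    foo_encap(struct foo_softc *sc, struct mbuf **m_head)
    {
            bus_dma_segment_t segs[FOO_MAXTXSEGS];
            struct mbuf *m;
            int error, nsegs;

            error = bus_dmamap_load_mbuf_sg(sc->foo_tx_tag, sc->foo_tx_map,
                *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
            if (error == EFBIG) {
                    /*
                     * More than FOO_MAXTXSEGS mbufs in the chain (e.g. the
                     * 35 mbuf NFS case).  m_defrag() copies the whole
                     * packet into the minimum number of clusters.
                     */
                    m = m_defrag(*m_head, M_NOWAIT);
                    if (m == NULL) {
                            m_freem(*m_head);
                            *m_head = NULL;
                            return (ENOBUFS);
                    }
                    *m_head = m;
                    error = bus_dmamap_load_mbuf_sg(sc->foo_tx_tag,
                        sc->foo_tx_map, *m_head, segs, &nsegs,
                        BUS_DMA_NOWAIT);
            }
            if (error != 0) {
                    /* Still too many segments: the TSO packet is dropped. */
                    m_freem(*m_head);
                    *m_head = NULL;
                    return (error);
            }
            /* ... fill the TX descriptors from segs[0..nsegs-1] ... */
            return (0);
    }

m_collapse() in the same spot only squeezes data into adjacent mbufs that
have room, so it cannot turn the 35 mbuf NFS chain into 32 clusters;
m_defrag() always copies into a fresh minimal chain, which is why it works
here, at the cost of a large bcopy().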