From: Yonghyeon PYUN
Date: Mon, 31 Mar 2014 11:32:53 +0900
To: Rick Macklem
Cc: FreeBSD Filesystems, FreeBSD Net, Alexander Motin
Reply-To: pyunyh@gmail.com
Subject: Re: RFC: How to fix the NFS/iSCSI vs TSO problem
Message-ID: <20140331023253.GC3548@michelle.cdnetworks.com>
In-Reply-To: <1903781266.1237680.1395880068597.JavaMail.root@uoguelph.ca>
References: <20140326023334.GB2973@michelle.cdnetworks.com>
 <1903781266.1237680.1395880068597.JavaMail.root@uoguelph.ca>

On Wed, Mar 26, 2014 at 08:27:48PM -0400, Rick Macklem wrote:
> pyunyh@gmail.com wrote:
> > On Tue, Mar 25, 2014 at 07:10:35PM -0400, Rick Macklem wrote:
> > > Hi,
> > >
> > > First off, I hope you don't mind that I cross-posted this, but I
> > > wanted to make sure both the NFS/iSCSI and networking types see it.
> > > If you look in this mailing list thread:
> > > http://docs.FreeBSD.org/cgi/mid.cgi?1850411724.1687820.1395621539316.JavaMail.root
> > > you'll see that several people have been working hard at testing,
> > > and thanks to them, I think I now know what is going on.
> > >
> >
> > Thanks for your hard work on narrowing down that issue. I'm too busy
> > with $work these days, so I couldn't find time to investigate the
> > issue.
> >
> > > (This applies to network drivers that support TSO and are limited
> > > to 32 transmit segments -> 32 mbufs in chain.)
> > > Doing a quick search I found the following drivers that appear to
> > > be affected (I may have missed some):
> > > jme, fxp, age, sge, msk, alc, ale, ixgbe/ix, nfe, e1000/em, re
> > >
> >
> > The magic number 32 was chosen a long time ago when I implemented TSO
> > in non-Intel drivers. I tried to find an optimal number to reduce
> > kernel stack usage at that time. bus_dma(9) will coalesce with the
> > previous segment if possible, so I thought the number 32 was not an
> > issue. Not sure whether current bus_dma(9) still has the same code,
> > though. The number 32 is an arbitrary one, so you can increase it if
> > you want.
> >
> Well, in the case of "ix" Jack Vogel says it is a hardware limitation.
> I can't change drivers that I can't test and don't know anything about
> the hardware. Maybe replacing m_collapse() with m_defrag() is an
> exception, since I know what that is doing and it isn't hardware
> related, but I would still prefer a review by the driver
> author/maintainer before making such a change.
>
> If there are drivers that you know can be increased from 32->35, please
> do so, since that will not only avoid the EFBIG failures but also avoid
> a lot of calls to m_defrag().
>
> > > Further, of these drivers, the following use m_collapse() and not
> > > m_defrag() to try and reduce the # of mbufs in the chain.
> > > m_collapse() is not going to get the 35 mbufs down to 32 mbufs, as
> > > far as I can see, so these ones are more badly broken:
> > > jme, fxp, age, sge, alc, ale, nfe, re
> >
> > I guess m_defrag(9) is more optimized for non-TSO packets. You don't
> > want to waste CPU cycles copying the full frame just to reduce the
> > number of mbufs in the chain. For TSO packets, m_defrag(9) looks
> > better, but if we always have to copy a full TSO packet to make TSO
> > work, driver writers will have to invent a better scheme rather than
> > blindly relying on m_defrag(9), I guess.
> >
> Yes, avoiding m_defrag() calls would be nice. For this issue, increasing
> the transmit segment limit from 32->35 does that, if the change can be
> done easily/safely.
>
> Otherwise, all I can think of is my suggestion to add something like
> if_hw_tsomaxseg which the driver can use to tell tcp_output() the
> driver's limit for # of mbufs in the chain.
>
> > > The long description is in the above thread, but the short version
> > > is:
> > > - NFS generates a chain with 35 mbufs in it (for read/readdir
> > >   replies and write requests) made up of (tcpip header, RPC header,
> > >   NFS args, 32 clusters of file data)
> > > - tcp_output() usually trims the data size down to tp->t_tsomax
> > >   (65535) and then some more to make it an exact multiple of the
> > >   TCP transmit data size.
> > > - the net driver prepends an ethernet header, growing the length by
> > >   14 (or sometimes 18 for vlans), but in the first mbuf and not
> > >   adding one to the chain.
> > > - m_defrag() copies this to a chain of 32 mbuf clusters (because the
> > >   total data length is <= 64K) and it gets sent
> > >
> > > However, if the data length is a little less than 64K when passed to
> > > tcp_output(), so that the length including headers is in the range
> > > 65519->65535...
> > > - tcp_output() doesn't reduce its size.
> > > - the net driver adds an ethernet header, making the total data
> > >   length slightly greater than 64K
> > > - m_defrag() copies it to a chain of 33 mbuf clusters, which fails
> > >   with EFBIG
> > > --> trainwrecks NFS performance, because the TSO segment is dropped
> > >   instead of sent.
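> > > To put rough numbers on that (assuming the usual 2K mbuf clusters
> > > and an 18 byte vlan ethernet header): if tcp_output() hands down
> > > 65520 bytes of tcpip header plus data, that falls in 65519->65535
> > > and is not trimmed; the driver prepends 18 bytes, giving 65538, and
> > > 65538 > 32 * MCLBYTES = 65536, so m_defrag() needs a 33rd cluster
> > > and the bus_dma load fails with EFBIG. (65518 + 18 = 65536 still
> > > fits in 32 clusters, which is where the 65518 figure below comes
> > > from.)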
> > >
> > > A tester also stated that the problem could be reproduced using
> > > iSCSI. Maybe Edward Napierala might know some details w.r.t. what
> > > kind of mbuf chain iSCSI generates?
> > >
> > > Also, one tester has reported that setting if_hw_tsomax in the
> > > driver before the ether_ifattach() call didn't make the value of
> > > tp->t_tsomax smaller. However, reducing IP_MAXPACKET (which is what
> > > it is set to by default) did reduce it. I have no idea why this
> > > happens or how to fix it, but it implies that setting if_hw_tsomax
> > > in the driver isn't a solution until this is resolved.
> > >
> > > So, what to do about this?
> > > First, I'd like a simple fix/workaround that can go into 9.3 (which
> > > has its code freeze in May). The best thing I can think of is
> > > setting if_hw_tsomax to a smaller default value (line# 658 of
> > > sys/net/if.c in head).
> > >
> > > Version A:
> > > replace
> > >     ifp->if_hw_tsomax = IP_MAXPACKET;
> > > with
> > >     ifp->if_hw_tsomax = min(32 * MCLBYTES -
> > >         (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN), IP_MAXPACKET);
> > > plus
> > > replace m_collapse() with m_defrag() in the drivers listed above.
> > >
> > > This would only reduce the default from 65535->65518, so it only
> > > impacts the uncommon case where the output size (with tcpip header)
> > > is within this range. (As such, I don't think it would have a
> > > negative impact for drivers that handle more than 32 transmit
> > > segments.)
> > > From the testers, it seems that this is sufficient to get rid of the
> > > EFBIG errors. (The total data length including ethernet header
> > > doesn't exceed 64K, so m_defrag() fits it into 32 mbuf clusters.)
> > >
> > > The main downside of this is that there will be a lot of m_defrag()
> > > calls being done, and they do quite a bit of bcopy()'ng.
> > >
> > > Version B:
> > > replace
> > >     ifp->if_hw_tsomax = IP_MAXPACKET;
> > > with
> > >     ifp->if_hw_tsomax = min(29 * MCLBYTES, IP_MAXPACKET);
> > >
> > > This one would avoid the m_defrag() calls, but might have a negative
> > > impact on TSO performance for drivers that can handle 35 transmit
> > > segments, since the maximum TSO segment size is reduced by about 6K.
> > > (Because of the second size reduction to an exact multiple of the
> > > TCP transmit data size, the exact amount varies.)
> > >
> > > Possible longer term fixes:
> > > One longer term fix might be to add something like if_hw_tsomaxseg
> > > so that a driver can set a limit on the number of transmit segments
> > > (mbufs in chain) and tcp_output() could use that to limit the size
> > > of the TSO segment, as required. (I have a first stab at such a
> > > patch, but no way to test it, so I can't see that being done by May.
> > > Also, it would require changes to a lot of drivers to make it work.
> > > I've attached this patch, in case anyone wants to work on it.)
> > >
> > > Another might be to increase the size of MCLBYTES (I don't see this
> > > as practical for 9.3, although the actual change is simple).
> > > I do think that increasing MCLBYTES might be something to consider
> > > doing in the future, for reasons beyond fixing this.
> > >
> > > So, what do others think should be done? rick
> > >
> >
> > AFAIK all the TSO capable drivers you mentioned above have no limit on
> > the number of TX segments in the TSO path. Not sure about Intel
> > controllers, though. Increasing the number of segments will consume a
> > lot of kernel stack in those drivers. Given that ixgbe, which seems to
> > use 100, didn't show any kernel stack shortage, I think bumping the
> > number of segments would be a quick way to address the issue.
> >
> Well, bumping it from 32->35 is all it would take for NFS (can't
> comment w.r.t. iSCSI). ixgbe uses 100 for the 82598 chip and 32 for the
> 82599 (just so others aren't confused by the above comment). I
> understand your point was w.r.t. using 100 without blowing the kernel
> stack, but since the testers have been using "ix" with the 82599 chip,
> which is limited to 32 transmit segments...
>
> However, please increase any you know can be safely done from 32->35,
> rick

Done in r263957.
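For anyone following along without a driver in front of them, the transmit
path being discussed has roughly this shape (a simplified sketch with
placeholder "foo" names, not code lifted from any particular driver):

    /*
     * Sketch of a TSO-capable driver's encap routine.  FOO_MAXTXSEGS is
     * the per-packet segment limit (32 in the affected drivers today).
     */
    static int
    foo_encap(struct foo_softc *sc, struct mbuf **m_head)
    {
            bus_dma_segment_t segs[FOO_MAXTXSEGS];
            struct mbuf *m;
            int error, nsegs;

            error = bus_dmamap_load_mbuf_sg(sc->foo_tx_tag, sc->foo_tx_map,
                *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
            if (error == EFBIG) {
                    /*
                     * More than FOO_MAXTXSEGS mbufs in the chain (e.g. the
                     * 35 mbuf NFS case).  m_defrag() copies the whole
                     * packet into the minimum number of clusters.
                     */
                    m = m_defrag(*m_head, M_NOWAIT);
                    if (m == NULL) {
                            m_freem(*m_head);
                            *m_head = NULL;
                            return (ENOBUFS);
                    }
                    *m_head = m;
                    error = bus_dmamap_load_mbuf_sg(sc->foo_tx_tag,
                        sc->foo_tx_map, *m_head, segs, &nsegs,
                        BUS_DMA_NOWAIT);
            }
            if (error != 0) {
                    /* Still too many segments: the TSO packet is dropped. */
                    m_freem(*m_head);
                    *m_head = NULL;
                    return (error);
            }
            /* ... fill the TX descriptors from segs[0..nsegs-1] ... */
            return (0);
    }

m_collapse() in the same spot only squeezes data into adjacent mbufs that
have room, so it cannot turn the 35 mbuf NFS chain into 32 clusters;
m_defrag() always copies into a fresh minimal chain, which is why it works
here, at the cost of a large bcopy().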