From: Rick Macklem
To: pyunyh@gmail.com
Date: Mon, 27 Jan 2014 18:27:19 -0500 (EST)
Message-ID: <1168237133.17228249.1390865239175.JavaMail.root@uoguelph.ca>
In-Reply-To: <20140127055047.GA1368@michelle.cdnetworks.com>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Cc: Daniel Braniss, freebsd-net@freebsd.org, Adam McDougall

pyunyh@gmail.com wrote:
> On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
> > Adam McDougall wrote:
> > > Also try rsize=32768,wsize=32768 in your mount options, made a huge
> > > difference for me. I've noticed slow file transfers on NFS in 9 and
> > > finally did some searching a couple months ago, someone suggested it
> > > and they were on to something.
> > >
> > I have a "hunch" that might explain why 64K NFS reads/writes perform
> > poorly for some network environments.
> > A 64K NFS read reply/write request consists of a list of 34 mbufs when
> > passed to TCP via sosend() and a total data length of around 65680 bytes.
> > Looking at a couple of drivers (virtio and ixgbe), they seem to expect
> > no more than 32-33 mbufs in a list for a 65535 byte TSO xmit. I think
> > (I don't have anything that does TSO to confirm this) that NFS will pass
> > a list that is longer (34 plus a TCP/IP header).
> > At a glance, it appears that the drivers call m_defrag() or m_collapse()
> > when the mbuf list won't fit in their scatter table (32 or 33 elements)
> > and if this fails, just silently drop the data without sending it.
> > If I'm right, there would be considerable overhead from
> > m_defrag()/m_collapse() and near disaster if they fail to fix the
> > problem and the data is silently dropped instead of xmited.
> >
>
> I think the actual number of DMA segments allocated for the mbuf
> chain is determined by bus_dma(9). bus_dma(9) will coalesce
> current segment with previous segment if possible.
>
Ok, I'll have to take a look, but I thought that an array sized by
"num_segs" is passed in as an argument. (And num_segs is set to either
IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).)
It looked to me that the ixgbe driver called itself ix, so it isn't
obvious to me which we are talking about. (I know that Daniel Braniss
had an ix0 and ix1, which were fixed for NFS by disabling TSO.)

I'll admit I mostly looked at virtio's network driver, since that was
the one being used by J David.

Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have been
cropping up for quite a while, and I am just trying to find out why.
(I have no hardware/software that exhibits the problem, so I can only
look at the sources and ask others to try testing stuff.)

> I'm not sure whether you're referring to ixgbe(4) or ix(4) but I
> see the total length of all segment size of ix(4) is 65535 so
> it has no room for ethernet/VLAN header of the mbuf chain. The
> driver should be fixed to transmit a 64KB datagram.
Well, if_hw_tsomax is set to 65535 by the generic code (the driver
doesn't set it) and the code in tcp_output() seems to subtract the
size of a tcp/ip header from that before passing data to the driver,
so I think the mbuf chain passed to the driver will fit in one ip
datagram. (I'd assume all sorts of stuff would break for TSO enabled
drivers if that wasn't the case?)

> I think the use of m_defrag(9) in TSO is suboptimal. All TSO
> capable controllers are able to handle multiple TX buffers so it
> should have used m_collapse(9) rather than copying entire chain
> with m_defrag(9).
>
I haven't looked at these closely yet (plan on doing so to-day), but
even m_collapse() looked like it copied data between mbufs and that
is certainly suboptimal, imho.
I don't see why a driver can't split the mbuf list, if there are too
many entries for the scatter/gather and do it in two iterations (much
like tcp_output() does already, since the data length exceeds
65535 - tcp/ip header size).

However, at this point, I just want to find out if the long chain of
mbufs is why TSO is problematic for these drivers, since I'll admit
I'm getting tired of telling people to disable TSO (and I suspect some
don't believe me and never try it).

> > Anyhow, I have attached a patch that makes NFS use MJUMPAGESIZE
> > clusters, so the mbuf count drops from 34 to 18.
> >
>
> Could we make it conditional on size?
>
Not sure what you mean? If you mean "the size of the read/write",
that would be possible for NFSv3, but less so for NFSv4. (The read/write
is just one Op. in the compound for NFSv4 and there is no way to
predict how much more data is going to be generated by subsequent Ops.)

If by "size" you mean amount of memory in the machine then, yes, it
certainly could be conditional on that. (I plan to try and look at the
allocator to-day as well, but if others know of disadvantages with
using MJUMPAGESIZE instead of MCLBYTES, please speak up.)

Garrett Wollman already alluded to the MCLBYTES case being
pre-allocated, but I'll admit I have no idea what the implications of
that are at this time.

> > If anyone has a TSO scatter/gather enabled net interface and can test
> > this patch on it with NFS I/O (default of 64K rsize/wsize) when TSO is
> > enabled and see what effect it has, that would be appreciated.
> >
> > Btw, thanks go to Garrett Wollman for suggesting the change to
> > MJUMPAGESIZE clusters.
> >
> > rick
> > ps: If the attachment doesn't make it through and you want the
> > patch, just email me and I'll send you a copy.
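
The mbuf counts quoted in this thread (34 with standard clusters, 18
with MJUMPAGESIZE clusters) can be reproduced with simple arithmetic.
The following is only a rough sketch in C, not code from the NFS or
driver sources. It assumes 2048-byte MCLBYTES clusters, 4096-byte
MJUMPAGESIZE clusters, and two non-data mbufs per request; that last
figure is inferred from the quoted counts rather than read from the
code. All names prefixed with sketch_ or SKETCH_ are hypothetical.

```c
/* Assumed sizes for illustration; real values come from sys/param.h. */
#define SKETCH_MCLBYTES      2048  /* standard mbuf cluster (assumed) */
#define SKETCH_MJUMPAGESIZE  4096  /* page-sized cluster, 4K page assumed */
#define SKETCH_EXTRA_MBUFS   2     /* non-data mbufs, inferred from 34 and 18 */
#define SKETCH_SCATTER_82599 32    /* IXGBE_82599_SCATTER value quoted above */

/*
 * Estimate the length of the mbuf chain NFS hands to sosend() for a
 * read reply/write request carrying `payload` bytes of file data,
 * when the data is stored in clusters of `cluster_size` bytes.
 */
int
sketch_chain_len(int payload, int cluster_size)
{
	/* Round up: a partly filled cluster still occupies one mbuf. */
	int data_mbufs = (payload + cluster_size - 1) / cluster_size;

	return (data_mbufs + SKETCH_EXTRA_MBUFS);
}

/*
 * Return 1 if the chain plus one extra mbuf for the TCP/IP header fits
 * in a scatter table of `scatter_max` entries, 0 otherwise. This
 * ignores any coalescing of adjacent segments that bus_dma(9) may do.
 */
int
sketch_fits_scatter(int chain_len, int scatter_max)
{
	return (chain_len + 1 <= scatter_max);
}
```

With these assumptions, sketch_chain_len(65536, SKETCH_MCLBYTES) gives
34 and sketch_chain_len(65536, SKETCH_MJUMPAGESIZE) gives 18. Only the
second fits a 32-entry scatter table once the header mbuf is added,
which is the situation the MJUMPAGESIZE patch is meant to test.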