From owner-freebsd-net@FreeBSD.ORG  Mon Jan 27 23:27:29 2014
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 4F99E9CE
 for <freebsd-net@freebsd.org>; Mon, 27 Jan 2014 23:27:29 +0000 (UTC)
Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca
 [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id 125001D18
 for <freebsd-net@freebsd.org>; Mon, 27 Jan 2014 23:27:28 +0000 (UTC)
Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
 ([131.104.91.222])
 by esa-annu.net.uoguelph.ca with ESMTP; 27 Jan 2014 18:27:21 -0500
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
 by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 2F446B40EF;
 Mon, 27 Jan 2014 18:27:19 -0500 (EST)
Date: Mon, 27 Jan 2014 18:27:19 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: pyunyh@gmail.com
Message-ID: <1168237133.17228249.1390865239175.JavaMail.root@uoguelph.ca>
In-Reply-To: <20140127055047.GA1368@michelle.cdnetworks.com>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [172.17.91.202]
X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790)
Cc: Daniel Braniss <danny@cs.huji.ac.il>, freebsd-net@freebsd.org,
 Adam McDougall <mcdouga9@egr.msu.edu>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.17
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net/>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 27 Jan 2014 23:27:29 -0000

pyunyh@gmail.com wrote:
> On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
> > Adam McDougall wrote:
> > > Also try rsize=32768,wsize=32768 in your mount options; it made a huge
> > > difference for me.  I've noticed slow file transfers on NFS in 9 and
> > > finally did some searching a couple of months ago. Someone suggested it
> > > and they were on to something.
> > > 
> > I have a "hunch" that might explain why 64K NFS reads/writes perform
> > poorly for some network environments.
> > A 64K NFS read reply/write request consists of a list of 34 mbufs when
> > passed to TCP via sosend() and a total data length of around 65680 bytes.
> > Looking at a couple of drivers (virtio and ixgbe), they seem to expect
> > no more than 32-33 mbufs in a list for a 65535 byte TSO xmit. I think
> > (I don't have anything that does TSO to confirm this) that NFS will pass
> > a list that is longer (34 plus a TCP/IP header).
> > At a glance, it appears that the drivers call m_defrag() or m_collapse()
> > when the mbuf list won't fit in their scatter table (32 or 33 elements)
> > and if this fails, just silently drop the data without sending it.
> > If I'm right, there would be considerable overhead from
> > m_defrag()/m_collapse() and near disaster if they fail to fix the problem
> > and the data is silently dropped instead of xmited.
> > 
> 
> I think the actual number of DMA segments allocated for the mbuf
> chain is determined by bus_dma(9).  bus_dma(9) will coalesce the
> current segment with the previous segment if possible.
> 
Ok, I'll have to take a look, but I thought that an array sized by
"num_segs" is passed in as an argument. (And num_segs is set to
either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).)
It looked to me that the ixgbe driver called itself ix, so it isn't
obvious to me which we are talking about. (I know that Daniel Braniss
had an ix0 and ix1, which were fixed for NFS by disabling TSO.)
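
To make the worst case concrete, here is a minimal sketch of the check
described above. It assumes each mbuf ends up in its own DMA segment (that is,
bus_dma(9) coalesces nothing) and that the TCP/IP header adds one more mbuf to
the 34 that NFS hands to sosend(). The function name is hypothetical; only the
two SCATTER values come from the driver source mentioned above.

```python
# Scatter/gather limits quoted in this thread for the ixgbe driver.
IXGBE_82598_SCATTER = 100
IXGBE_82599_SCATTER = 32


def fits_scatter_table(mbuf_count, num_segs):
    """Return True if every mbuf can take its own slot in the segment array.

    Worst-case assumption: one DMA segment per mbuf, no coalescing.
    """
    return mbuf_count <= num_segs


# 34 mbufs from NFS plus one assumed mbuf for the TCP/IP header.
chain_len = 34 + 1

for name, limit in (("82598", IXGBE_82598_SCATTER),
                    ("82599", IXGBE_82599_SCATTER)):
    if fits_scatter_table(chain_len, limit):
        print(name, "chain fits")
    else:
        print(name, "chain too long, driver falls back to m_defrag()/m_collapse()")
```

If bus_dma(9) does merge adjacent segments, the real count could be lower, so
this only bounds the problem from above.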

I'll admit I mostly looked at virtio's network driver, since that
was the one being used by J David.

Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have been
cropping up for quite a while, and I am just trying to find out why.
(I have no hardware/software that exhibits the problem, so I can
only look at the sources and ask others to try testing stuff.)

> I'm not sure whether you're referring to ixgbe(4) or ix(4) but I
> see the total size of all segments for ix(4) is 65535, so
> it has no room for the ethernet/VLAN header of the mbuf chain.  The
> driver should be fixed to transmit a 64KB datagram.
Well, if_hw_tsomax is set to 65535 by the generic code (the driver
doesn't set it) and the code in tcp_output() seems to subtract the
size of a TCP/IP header from that before passing data to the driver,
so I think the mbuf chain passed to the driver will fit in one
IP datagram. (I'd assume all sorts of stuff would break for TSO
enabled drivers if that wasn't the case?)
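
The arithmetic behind both points can be sketched as follows. This is only an
illustration under stated assumptions: 20-byte IP and TCP headers with no
options, and a link-layer header of 14 bytes (18 with a VLAN tag) prepended
after tcp_output() is done. The function names are hypothetical and do not
correspond to kernel code.

```python
IF_HW_TSOMAX = 65535       # generic default mentioned above
IP_HDR_LEN = 20            # assumed: IPv4 header, no options
TCP_HDR_LEN = 20           # assumed: TCP header, no options
ETHER_HDR_LEN = 14         # ethernet header
ETHER_VLAN_ENCAP_LEN = 4   # extra bytes for a VLAN tag


def tso_payload_limit(tsomax, hdr_len):
    """Largest TCP payload once the TCP/IP header is subtracted."""
    return tsomax - hdr_len


def frame_len(payload, hdr_len, link_hdr_len):
    """Total bytes in the chain once the link-layer header is prepended."""
    return payload + hdr_len + link_hdr_len


hdr_len = IP_HDR_LEN + TCP_HDR_LEN
payload = tso_payload_limit(IF_HW_TSOMAX, hdr_len)

print("payload limit:", payload)
print("IP datagram length:", payload + hdr_len)
print("with ethernet header:", frame_len(payload, hdr_len, ETHER_HDR_LEN))
print("with ethernet + VLAN:",
      frame_len(payload, hdr_len, ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN))
```

Under these assumptions the IP datagram fits in 65535 bytes, but the chain
including the link-layer header does not. Whether that matters depends on
whether the driver's 65535 limit counts the ethernet/VLAN header.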

> I think the use of m_defrag(9) in TSO is suboptimal. All TSO
> capable controllers are able to handle multiple TX buffers so it
> should have used m_collapse(9) rather than copying entire chain
> with m_defrag(9).
> 
I haven't looked at these closely yet (plan on doing so today), but
even m_collapse() looked like it copied data between mbufs and that
is certainly suboptimal, imho. I don't see why a driver can't split
the mbuf list, if there are too many entries for the scatter/gather
and do it in two iterations (much like tcp_output() does already,
since the data length exceeds 65535 - tcp/ip header size).
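
A rough sketch of the splitting idea, not of how any existing driver works:
the chain is modelled as a plain list of segment lengths, and split_chain() is
a hypothetical helper. The segment sizes are assumed, chosen only so that they
add up to the 34 entries and roughly 65680 bytes mentioned earlier. A real
driver would also have to give each piece its own headers, which is ignored
here.

```python
def split_chain(seg_lens, max_segs):
    """Yield consecutive slices of seg_lens with at most max_segs entries each.

    No data is copied between segments; the list is only partitioned.
    """
    if max_segs <= 0:
        raise ValueError("max_segs must be positive")
    for start in range(0, len(seg_lens), max_segs):
        yield seg_lens[start:start + max_segs]


# Hypothetical 34-entry chain: 32 full 2K clusters plus two small mbufs.
chain = [2048] * 32 + [100, 44]

batches = list(split_chain(chain, 32))
for i, batch in enumerate(batches):
    print("xmit", i, "segments:", len(batch), "bytes:", sum(batch))
```

With a 32-entry scatter table this yields two transmissions and avoids the
data copying that m_defrag()/m_collapse() do.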

However, at this point, I just want to find out if the long chain
of mbufs is why TSO is problematic for these drivers, since I'll
admit I'm getting tired of telling people to disable TSO (and I
suspect some don't believe me and never try it).

> > Anyhow, I have attached a patch that makes NFS use MJUMPAGESIZE clusters,
> > so the mbuf count drops from 34 to 18.
> > 
> 
> Could we make it conditional on size?
> 
Not sure what you mean? If you mean "the size of the read/write",
that would be possible for NFSv3, but less so for NFSv4. (The read/write
is just one Op. in the compound for NFSv4 and there is no way to
predict how much more data is going to be generated by subsequent Ops.)

If by "size" you mean amount of memory in the machine then, yes, it
certainly could be conditional on that. (I plan to try to look at
the allocator today as well, but if others know of disadvantages with
using MJUMPAGESIZE instead of MCLBYTES, please speak up.)
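
For reference, the drop from 34 to 18 mbufs can be reproduced with a short
calculation. It assumes MCLBYTES is 2048, MJUMPAGESIZE is 4096 (one page on
common platforms), and that two mbufs beyond the data clusters carry the
RPC/NFS header and trailer. That last figure is a guess chosen to match the
counts above, and mbufs_for() is a hypothetical helper.

```python
MCLBYTES = 2048            # standard mbuf cluster size
MJUMPAGESIZE = 4096        # assumed: page-sized jumbo cluster (4K pages)
NFS_IO_SIZE = 64 * 1024    # default 64K rsize/wsize
EXTRA_MBUFS = 2            # assumed: mbufs for RPC/NFS header and trailer


def mbufs_for(data_len, cluster_size, extra=EXTRA_MBUFS):
    """Mbufs needed for data_len bytes of data plus the assumed extras."""
    clusters = -(-data_len // cluster_size)  # ceiling division
    return clusters + extra


print("MCLBYTES clusters:    ", mbufs_for(NFS_IO_SIZE, MCLBYTES))
print("MJUMPAGESIZE clusters:", mbufs_for(NFS_IO_SIZE, MJUMPAGESIZE))
```

With the larger clusters the chain stays well under a 32-entry scatter table,
even with an extra mbuf for the TCP/IP header.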

Garrett Wollman already alluded to the MCLBYTES case being pre-allocated,
but I'll admit I have no idea what the implications of that are at this
time.

> > If anyone has a TSO scatter/gather enabled net interface and can test this
> > patch on it with NFS I/O (default of 64K rsize/wsize) when TSO is enabled
> > and see what effect it has, that would be appreciated.
> > 
> > Btw, thanks go to Garrett Wollman for suggesting the change to MJUMPAGESIZE
> > clusters.
> > 
> > rick
> > ps: If the attachment doesn't make it through and you want the patch, just
> >     email me and I'll send you a copy.
> > 