From owner-freebsd-net@FreeBSD.ORG  Mon Aug 11 21:29:18 2014
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id D0684B43;
 Mon, 11 Aug 2014 21:29:18 +0000 (UTC)
Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 90C62271F;
 Mon, 11 Aug 2014 21:29:18 +0000 (UTC)
Received: from jhbbsd.localnet (unknown [209.249.190.124])
 by bigwig.baldwin.cx (Postfix) with ESMTPSA id 63AE0B972;
 Mon, 11 Aug 2014 17:29:17 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: freebsd-net@freebsd.org
Subject: Re: NFS client READ performance on -current
Date: Mon, 11 Aug 2014 16:53:42 -0400
User-Agent: KMail/1.13.5 (FreeBSD/8.4-CBSD-20140415; KDE/4.5.5; amd64; ; )
References: <2136988575.13956627.1405199640153.JavaMail.root@uoguelph.ca>
 <53C7B774.60304@freebsd.org> <1780417.KfjTWjeQCU@pippin.baldwin.cx>
In-Reply-To: <1780417.KfjTWjeQCU@pippin.baldwin.cx>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201408111653.42283.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
 (bigwig.baldwin.cx); Mon, 11 Aug 2014 17:29:17 -0400 (EDT)
Cc: pyunyh@gmail.com, "Russell L. Carter" <rcarter@pinyon.org>,
 Rick Macklem <rmacklem@uoguelph.ca>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net/>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 11 Aug 2014 21:29:18 -0000

On Saturday, July 19, 2014 1:28:19 pm John Baldwin wrote:
> On Thursday 17 July 2014 19:45:56 Julian Elischer wrote:
> > On 7/15/14, 10:34 PM, John Baldwin wrote:
> > > On Saturday, July 12, 2014 5:14:00 pm Rick Macklem wrote:
> > >> Yonghyeon Pyun wrote:
> > >>> On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote:
> > >>>> On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote:
> > >>>>> John Baldwin wrote:
> > >>>>>> On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
> > >>>>>>> Russell L. Carter wrote:
> > >>>>>>>> On 07/02/14 19:09, Rick Macklem wrote:
> > >>>>>>>>> Could you please post the dmesg stuff for the network
> > >>>>>>>>> interface,
> > >>>>>>>>> so I can tell what driver is being used? I'll take a look
> > >>>>>>>>> at
> > >>>>>>>>> it,
> > >>>>>>>>> in case it needs to be changed to use m_defrag().
> > >>>>>>>> 
> > >>>>>>>> em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port
> > >>>>>>>> 0xd020-0xd03f
> > >>>>>>>> mem
> > >>>>>>>> 0xfe4a0000-0xfe4bffff,0xfe480000-0xfe49ffff irq 44 at
> > >>>>>>>> device 0.0
> > >>>>>>>> on
> > >>>>>>>> pci2
> > >>>>>>>> em0: Using an MSI interrupt
> > >>>>>>>> em0: Ethernet address: 00:15:17:bc:29:ba
> > >>>>>>>> 001.000007 [2323] netmap_attach             success for em0
> > >>>>>>>> tx
> > >>>>>>>> 1/1024
> > >>>>>>>> rx
> > >>>>>>>> 1/1024 queues/slots
> > >>>>>>>> 
> > >>>>>>>> This is one of those dual nic cards, so there is em1 as
> > >>>>>>>> well...
> > >>>>>>> 
> > >>>>>>> Well, I took a quick look at the driver and it does use
> > >>>>>>> m_defrag(),
> > >>>>>>> but
> > >>>>>>> I think that the "retry:" label it does a goto after doing so
> > >>>>>>> might
> > >>>>>>> be in
> > >>>>>>> the wrong place.
> > >>>>>>> 
> > >>>>>>> The attached untested patch might fix this.
> > >>>>>>> 
> > >>>>>>> Is it convenient to build a kernel with this patch applied
> > >>>>>>> and then
> > >>>>>>> try
> > >>>>>>> it with TSO enabled?
> > >>>>>>> 
> > >>>>>>> rick
> > >>>>>>> ps: It does have the transmit segment limit set to 32. I have
> > >>>>>>> no
> > >>>>>>> idea if
> > >>>>>>> 
> > >>>>>>>      this is a hardware limitation.
> > >>>>>> 
> > >>>>>> I think the retry is not in the wrong place, but the overhead
> > >>>>>> of all
> > >>>>>> those
> > >>>>>> pullups is apparently quite severe.
> > >>>>> 
> > >>>>> The m_defrag() call after the first failure will just barely
> > >>>>> squeeze
> > >>>>> the just under 64K TSO segment into 32 mbuf clusters. Then I
> > >>>>> think any
> > >>>>> m_pullup() done during the retry will allocate an mbuf
> > >>>>> (at a glance it seems to always do this when the old mbuf is a
> > >>>>> cluster)
> > >>>>> and prepend that to the list.
> > >>>>> --> Now the list is > 32 mbufs again and the
> > >>>>> bus_dmammap_load_mbuf_sg()
> > >>>>> 
> > >>>>>      will fail again on the retry, this time fatally, I think?
> > >>>>> 
> > >>>>> I can't see any reason to re-do all the stuff using m_pullup()
> > >>>>> and Russell
> > >>>>> reported that moving the "retry:" fixed his problem, from what I
> > >>>>> understood.
> > >>>> 
> > >>>> Ah, I had assumed (incorrectly) that the m_pullup()s would all be
> > >>>> nops in this
> > >>>> case.  It seems the NIC would really like to have all those things
> > >>>> in a single
> > >>>> segment, but it is not required, so I agree that your patch is
> > >>>> fine.
> > >>> 
> > >>> I recall em(4) controllers have various limitation in TSO. Driver
> > >>> has to update IP header to make TSO work so driver has to get a
> > >>> writable mbufs.  bpf(4) consumers will see IP packet length is 0
> > >>> after this change.  I think tcpdump has a compile time option to
> > >>> guess correct IP packet length.  The firmware of controller also
> > >>> should be able to access complete IP/TCP header in a single buffer.
> > >>> I don't remember more details in TSO limitation but I guess you may
> > >>> be able to get more details TSO limitation from publicly available
> > >>> Intel data sheet.
> > >> 
> > >> I think that the patch should handle this ok. All of the m_pullup()
> > >> stuff gets done the first time. Then, if the result is more than 32
> > >> mbufs in the list, m_defrag() is called to copy the chain. This should
> > >> result in all the header stuff in the first mbuf cluster and the map
> > >> call is done again with this list of clusters. (Without the patch,
> > >> m_pullup() would allocate another prepended mbuf and make the chain
> > >> more than 32mbufs again.)
> > > 
> > > Hmm, I am surprised by the m_pullup() behavior that it doesn't just
> > > notice that the first mbuf with a cluster has the desired data already
> > > and returns without doing anything.  That is, I'm surprised the first
> > > 
> > > statement in m_pullup() isn't just:
> > > 	if (n->m_len >= len)
> > > 	
> > > 		return (n);
> > 
> > I seem to remember that the standard behaviour is for the caller to do
> > exactly that.
> 
> Huh, the manpage doesn't really state that, and it does check in one case.
> However, I think that means that the code in em(4) is busted and should be
> checking m_len before all the calls to m_pullup().  I think this will fix
> the issue the same as Rick's change but it might also avoid unnecessary
> pullups in some cases when defrag isn't needed in the first place.

FYI, I still think this patch is worth testing if someone is up for it.

-- 
John Baldwin