From owner-freebsd-net@FreeBSD.ORG Mon Aug 11 21:29:18 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id D0684B43; Mon, 11 Aug 2014 21:29:18 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 90C62271F; Mon, 11 Aug 2014 21:29:18 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 63AE0B972; Mon, 11 Aug 2014 17:29:17 -0400 (EDT) From: John Baldwin To: freebsd-net@freebsd.org Subject: Re: NFS client READ performance on -current Date: Mon, 11 Aug 2014 16:53:42 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.4-CBSD-20140415; KDE/4.5.5; amd64; ; ) References: <2136988575.13956627.1405199640153.JavaMail.root@uoguelph.ca> <53C7B774.60304@freebsd.org> <1780417.KfjTWjeQCU@pippin.baldwin.cx> In-Reply-To: <1780417.KfjTWjeQCU@pippin.baldwin.cx> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201408111653.42283.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Mon, 11 Aug 2014 17:29:17 -0400 (EDT) Cc: pyunyh@gmail.com, "Russell L. Carter" , Rick Macklem X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 11 Aug 2014 21:29:18 -0000 On Saturday, July 19, 2014 1:28:19 pm John Baldwin wrote: > On Thursday 17 July 2014 19:45:56 Julian Elischer wrote: > > On 7/15/14, 10:34 PM, John Baldwin wrote: > > > On Saturday, July 12, 2014 5:14:00 pm Rick Macklem wrote: > > >> Yonghyeon Pyun wrote: > > >>> On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote: > > >>>> On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote: > > >>>>> John Baldwin wrote: > > >>>>>> On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote: > > >>>>>>> Russell L. Carter wrote: > > >>>>>>>> On 07/02/14 19:09, Rick Macklem wrote: > > >>>>>>>>> Could you please post the dmesg stuff for the network > > >>>>>>>>> interface, > > >>>>>>>>> so I can tell what driver is being used? I'll take a look > > >>>>>>>>> at > > >>>>>>>>> it, > > >>>>>>>>> in case it needs to be changed to use m_defrag(). > > >>>>>>>> > > >>>>>>>> em0: port > > >>>>>>>> 0xd020-0xd03f > > >>>>>>>> mem > > >>>>>>>> 0xfe4a0000-0xfe4bffff,0xfe480000-0xfe49ffff irq 44 at > > >>>>>>>> device 0.0 > > >>>>>>>> on > > >>>>>>>> pci2 > > >>>>>>>> em0: Using an MSI interrupt > > >>>>>>>> em0: Ethernet address: 00:15:17:bc:29:ba > > >>>>>>>> 001.000007 [2323] netmap_attach success for em0 > > >>>>>>>> tx > > >>>>>>>> 1/1024 > > >>>>>>>> rx > > >>>>>>>> 1/1024 queues/slots > > >>>>>>>> > > >>>>>>>> This is one of those dual nic cards, so there is em1 as > > >>>>>>>> well... > > >>>>>>> > > >>>>>>> Well, I took a quick look at the driver and it does use > > >>>>>>> m_defrag(), > > >>>>>>> but > > >>>>>>> I think that the "retry:" label it does a goto after doing so > > >>>>>>> might > > >>>>>>> be in > > >>>>>>> the wrong place. > > >>>>>>> > > >>>>>>> The attached untested patch might fix this. > > >>>>>>> > > >>>>>>> Is it convenient to build a kernel with this patch applied > > >>>>>>> and then > > >>>>>>> try > > >>>>>>> it with TSO enabled? > > >>>>>>> > > >>>>>>> rick > > >>>>>>> ps: It does have the transmit segment limit set to 32. I have > > >>>>>>> no > > >>>>>>> idea if > > >>>>>>> > > >>>>>>> this is a hardware limitation. > > >>>>>> > > >>>>>> I think the retry is not in the wrong place, but the overhead > > >>>>>> of all > > >>>>>> those > > >>>>>> pullups is apparently quite severe. > > >>>>> > > >>>>> The m_defrag() call after the first failure will just barely > > >>>>> squeeze > > >>>>> the just under 64K TSO segment into 32 mbuf clusters. Then I > > >>>>> think any > > >>>>> m_pullup() done during the retry will allocate an mbuf > > >>>>> (at a glance it seems to always do this when the old mbuf is a > > >>>>> cluster) > > >>>>> and prepend that to the list. > > >>>>> --> Now the list is > 32 mbufs again and the > > >>>>> bus_dmammap_load_mbuf_sg() > > >>>>> > > >>>>> will fail again on the retry, this time fatally, I think? > > >>>>> > > >>>>> I can't see any reason to re-do all the stuff using m_pullup() > > >>>>> and Russell > > >>>>> reported that moving the "retry:" fixed his problem, from what I > > >>>>> understood. > > >>>> > > >>>> Ah, I had assumed (incorrectly) that the m_pullup()s would all be > > >>>> nops in this > > >>>> case. It seems the NIC would really like to have all those things > > >>>> in a single > > >>>> segment, but it is not required, so I agree that your patch is > > >>>> fine. > > >>> > > >>> I recall em(4) controllers have various limitation in TSO. Driver > > >>> has to update IP header to make TSO work so driver has to get a > > >>> writable mbufs. bpf(4) consumers will see IP packet length is 0 > > >>> after this change. I think tcpdump has a compile time option to > > >>> guess correct IP packet length. The firmware of controller also > > >>> should be able to access complete IP/TCP header in a single buffer. > > >>> I don't remember more details in TSO limitation but I guess you may > > >>> be able to get more details TSO limitation from publicly available > > >>> Intel data sheet. > > >> > > >> I think that the patch should handle this ok. All of the m_pullup() > > >> stuff gets done the first time. Then, if the result is more than 32 > > >> mbufs in the list, m_defrag() is called to copy the chain. This should > > >> result in all the header stuff in the first mbuf cluster and the map > > >> call is done again with this list of clusters. (Without the patch, > > >> m_pullup() would allocate another prepended mbuf and make the chain > > >> more than 32mbufs again.) > > > > > > Hmm, I am surprised by the m_pullup() behavior that it doesn't just > > > notice that the first mbuf with a cluster has the desired data already > > > and returns without doing anything. That is, I'm surprised the first > > > > > > statement in m_pullup() isn't just: > > > if (n->m_len >= len) > > > > > > return (n); > > > > I seem to remember that the standard behaviour is for the caller to do > > exactly that. > > Huh, the manpage doesn't really state that, and it does check in one case. > However, I think that means that the code in em(4) is busted and should be > checking m_len before all the calls to m_pullup(). I think this will fix > the issue the same as Rick's change but it might also avoid unnecessary > pullups in some cases when defrag isn't needed in the first place. FYI, I still think this patch is worth testing if someone is up for it. -- John Baldwin