Date: Thu, 14 Aug 2014 11:28:44 -0700
From: "Russell L. Carter"
To: freebsd-net@freebsd.org
Subject: Re: NFS client READ performance on -current
Message-ID: <53ECFFDC.3000406@pinyon.org>
References: <2136988575.13956627.1405199640153.JavaMail.root@uoguelph.ca>
 <53C7B774.60304@freebsd.org> <1780417.KfjTWjeQCU@pippin.baldwin.cx>
 <201408111653.42283.jhb@freebsd.org>
In-Reply-To: <201408111653.42283.jhb@freebsd.org>

I measured some transfer rates, and have appended them.

On 08/11/14 13:53, John Baldwin wrote:
> On Saturday, July 19, 2014 1:28:19 pm John Baldwin wrote:
>> On Thursday 17 July 2014 19:45:56 Julian Elischer wrote:
>>> On 7/15/14, 10:34 PM, John Baldwin wrote:
>>>> On Saturday, July 12, 2014 5:14:00 pm Rick Macklem wrote:
>>>>> Yonghyeon Pyun wrote:
>>>>>> On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote:
>>>>>>> On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote:
>>>>>>>> John Baldwin wrote:
>>>>>>>>> On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
>>>>>>>>>> Russell L. Carter wrote:
>>>>>>>>>>> On 07/02/14 19:09, Rick Macklem wrote:
>>>>>>>>>>>> Could you please post the dmesg stuff for the network
>>>>>>>>>>>> interface, so I can tell what driver is being used? I'll
>>>>>>>>>>>> take a look at it, in case it needs to be changed to use
>>>>>>>>>>>> m_defrag().
>>>>>>>>>>>
>>>>>>>>>>> em0: port 0xd020-0xd03f
>>>>>>>>>>> mem 0xfe4a0000-0xfe4bffff,0xfe480000-0xfe49ffff irq 44 at
>>>>>>>>>>> device 0.0 on pci2
>>>>>>>>>>> em0: Using an MSI interrupt
>>>>>>>>>>> em0: Ethernet address: 00:15:17:bc:29:ba
>>>>>>>>>>> 001.000007 [2323] netmap_attach success for em0 tx 1/1024
>>>>>>>>>>> rx 1/1024 queues/slots
>>>>>>>>>>>
>>>>>>>>>>> This is one of those dual nic cards, so there is em1 as
>>>>>>>>>>> well...
>>>>>>>>>>
>>>>>>>>>> Well, I took a quick look at the driver and it does use
>>>>>>>>>> m_defrag(), but I think that the "retry:" label it does a
>>>>>>>>>> goto to after doing so might be in the wrong place.
>>>>>>>>>>
>>>>>>>>>> The attached untested patch might fix this.
>>>>>>>>>>
>>>>>>>>>> Is it convenient to build a kernel with this patch applied
>>>>>>>>>> and then try it with TSO enabled?
>>>>>>>>>>
>>>>>>>>>> rick
>>>>>>>>>> ps: It does have the transmit segment limit set to 32. I
>>>>>>>>>> have no idea if this is a hardware limitation.
>>>>>>>>>
>>>>>>>>> I think the retry is not in the wrong place, but the overhead
>>>>>>>>> of all those pullups is apparently quite severe.
>>>>>>>>
>>>>>>>> The m_defrag() call after the first failure will just barely
>>>>>>>> squeeze the just-under-64K TSO segment into 32 mbuf clusters.
>>>>>>>> Then I think any m_pullup() done during the retry will
>>>>>>>> allocate an mbuf (at a glance it seems to always do this when
>>>>>>>> the old mbuf is a cluster) and prepend that to the list.
>>>>>>>> --> Now the list is > 32 mbufs again and the
>>>>>>>>     bus_dmamap_load_mbuf_sg() will fail again on the retry,
>>>>>>>>     this time fatally, I think?
>>>>>>>>
>>>>>>>> I can't see any reason to re-do all the stuff using
>>>>>>>> m_pullup(), and Russell reported that moving the "retry:"
>>>>>>>> fixed his problem, from what I understood.
>>>>>>>
>>>>>>> Ah, I had assumed (incorrectly) that the m_pullup()s would all
>>>>>>> be nops in this case. It seems the NIC would really like to
>>>>>>> have all those things in a single segment, but it is not
>>>>>>> required, so I agree that your patch is fine.
>>>>>>
>>>>>> I recall em(4) controllers have various limitations in TSO. The
>>>>>> driver has to update the IP header to make TSO work, so it has
>>>>>> to get writable mbufs. bpf(4) consumers will see an IP packet
>>>>>> length of 0 after this change; I think tcpdump has a
>>>>>> compile-time option to guess the correct IP packet length. The
>>>>>> controller's firmware also has to be able to access the complete
>>>>>> IP/TCP header in a single buffer. I don't remember more details
>>>>>> of the TSO limitations, but I guess you may be able to get more
>>>>>> from the publicly available Intel data sheets.
>>>>>
>>>>> I think that the patch should handle this ok. All of the
>>>>> m_pullup() stuff gets done the first time. Then, if the result is
>>>>> more than 32 mbufs in the list, m_defrag() is called to copy the
>>>>> chain. This should result in all the header stuff in the first
>>>>> mbuf cluster, and the map call is done again with this list of
>>>>> clusters. (Without the patch, m_pullup() would allocate another
>>>>> prepended mbuf and make the chain more than 32 mbufs again.)
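
To check that I follow the above: as I read it, the point of Rick's
patch is that the m_pullup() header work is done exactly once, and
after m_defrag() only the bus_dmamap_load_mbuf_sg() call is retried.
Here's a rough, untested sketch of that shape; the names (xx_softc,
xx_xmit, XX_MAX_SCATTER, tx_tag, tx_map) are made up, this is not the
actual if_em.c code:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/mbuf.h>
#include <sys/bus.h>
#include <machine/bus.h>
#include <net/ethernet.h>

#define	XX_MAX_SCATTER	32	/* the 32-segment limit discussed above */

struct xx_softc {
	bus_dma_tag_t	tx_tag;		/* hypothetical per-ring DMA tag */
	bus_dmamap_t	tx_map;		/* hypothetical per-packet DMA map */
};

static int
xx_xmit(struct xx_softc *sc, struct mbuf **m_headp)
{
	bus_dma_segment_t segs[XX_MAX_SCATTER];
	struct mbuf *m;
	int error, nsegs, defragged = 0;

	/*
	 * All of the m_pullup() header work happens exactly once,
	 * before the retry label, so a retry cannot prepend a fresh
	 * mbuf and push the chain back over XX_MAX_SCATTER.
	 */
	if ((*m_headp)->m_len < (int)sizeof(struct ether_header)) {
		*m_headp = m_pullup(*m_headp, sizeof(struct ether_header));
		if (*m_headp == NULL)
			return (ENOBUFS);
	}
	/* (the real driver does the same for the IP and TCP headers) */

retry:
	error = bus_dmamap_load_mbuf_sg(sc->tx_tag, sc->tx_map,
	    *m_headp, segs, &nsegs, BUS_DMA_NOWAIT);
	if (error == EFBIG && !defragged) {
		/*
		 * Too many segments: copy the chain into clusters and
		 * retry only the DMA load, not the pullups.
		 */
		m = m_defrag(*m_headp, M_NOWAIT);
		if (m == NULL) {
			m_freem(*m_headp);
			*m_headp = NULL;
			return (ENOBUFS);
		}
		*m_headp = m;
		defragged = 1;
		goto retry;
	}
	if (error != 0) {
		m_freem(*m_headp);
		*m_headp = NULL;
		return (error);
	}

	/* ...descriptor setup from segs[0..nsegs-1] goes here... */
	return (0);
}

If the label instead sits before the pullups, each retry can prepend
another header mbuf and push the chain back over the 32-segment limit,
which is exactly the failure mode Rick describes above.
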
>>>>
>>>> Hmm, I am surprised by the m_pullup() behavior: it doesn't just
>>>> notice that the first mbuf with a cluster has the desired data
>>>> already and return without doing anything. That is, I'm surprised
>>>> the first statement in m_pullup() isn't just:
>>>>
>>>> 	if (n->m_len >= len)
>>>> 		return (n);
>>>
>>> I seem to remember that the standard behaviour is for the caller
>>> to do exactly that.
>>
>> Huh, the manpage doesn't really state that, and it does check in
>> one case. However, I think that means that the code in em(4) is
>> busted and should be checking m_len before all the calls to
>> m_pullup(). I think this will fix the issue the same as Rick's
>> change, but it might also avoid unnecessary pullups in some cases
>> when defrag isn't needed in the first place.
>
> FYI, I still think this patch is worth testing if someone is up for it.

I realize that it would be better to run e.g. netperf. However, I
originally noticed the problem by running NFS read tests, and since
that's trivial to do with rsync I'm doing that again. The source file
is sitting on a 6-drive zfs raidz2 and it's being written to a fast
ssd. The same hardware with linux is a trifle faster, so the test
setup is probably revealing enough. I like to run rsync -avP so that I
can watch the behavior over time. The transfer rates have been fairly
steady, +/-5MB/s. Eyeballing it, I'm not seeing much in the way of
cache effects.

r269700 2014-08-07:

  10G transfer:                          65MB/s nfs read

After the patch, installing if_em.ko* on both sides, and rebooting:

  10G transfer:                          62MB/s nfs read
  immediately read a different 5G file:  67MB/s
  then reread the original 10G file:     236MB/s  ?? faster than wire?
    (that's a cache effect, but still, I find it surprising)
  then read a different 12G file:        62.9MB/s

So, previously I was seeing ~65MB/s with the original patch, which is
what I see today. JHB's patch seems to be slightly slower, and that is
repeatable, but given the variance of the transfer rates it's not
really much different. (A sketch of how I read the m_len check his
patch is based on is in the pps below.)

HTH,
Russell

ps: A quick question about quickly building modules: suppose I have a
fully populated /usr/obj from buildworld and buildkernel (and have
installed it). What's the most efficient method for rebuilding a
single module and getting it installed into /boot/kernel?
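
pps: In case it helps anyone testing jhb's variant, here is how I read
the "check m_len before calling m_pullup()" suggestion quoted above.
The helper name xx_pullup_hdr() is made up and this is only a sketch,
not the actual em(4) code; it assumes the usual <sys/param.h> and
<sys/mbuf.h> includes:

/*
 * Hypothetical helper: only call m_pullup() when the first mbuf does
 * not already hold 'len' contiguous bytes, so a chain that m_defrag()
 * just packed into clusters is left alone.
 */
static __inline struct mbuf *
xx_pullup_hdr(struct mbuf *m, int len)
{

	if (m->m_len >= len)
		return (m);		/* already contiguous, nothing to do */
	return (m_pullup(m, len));	/* may allocate and prepend a new mbuf */
}

The idea being that a chain m_defrag() has just compacted already has
the headers contiguous in the first cluster, so the pullup (and the
extra prepended mbuf) is skipped entirely.
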