Date: Thu, 14 Aug 2014 11:28:44 -0700
From: "Russell L. Carter"
To: freebsd-net@freebsd.org
Subject: Re: NFS client READ performance on -current
Message-ID: <53ECFFDC.3000406@pinyon.org>
References: <2136988575.13956627.1405199640153.JavaMail.root@uoguelph.ca>
 <53C7B774.60304@freebsd.org> <1780417.KfjTWjeQCU@pippin.baldwin.cx>
 <201408111653.42283.jhb@freebsd.org>
In-Reply-To: <201408111653.42283.jhb@freebsd.org>

I measured some transfer rates, and have appended them.

On 08/11/14 13:53, John Baldwin wrote:
> On Saturday, July 19, 2014 1:28:19 pm John Baldwin wrote:
>> On Thursday 17 July 2014 19:45:56 Julian Elischer wrote:
>>> On 7/15/14, 10:34 PM, John Baldwin wrote:
>>>> On Saturday, July 12, 2014 5:14:00 pm Rick Macklem wrote:
>>>>> Yonghyeon Pyun wrote:
>>>>>> On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote:
>>>>>>> On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote:
>>>>>>>> John Baldwin wrote:
>>>>>>>>> On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
>>>>>>>>>> Russell L. Carter wrote:
>>>>>>>>>>> On 07/02/14 19:09, Rick Macklem wrote:
>>>>>>>>>>>> Could you please post the dmesg stuff for the network
>>>>>>>>>>>> interface, so I can tell what driver is being used? I'll
>>>>>>>>>>>> take a look at it, in case it needs to be changed to use
>>>>>>>>>>>> m_defrag().
>>>>>>>>>>>
>>>>>>>>>>> em0: port 0xd020-0xd03f
>>>>>>>>>>> mem 0xfe4a0000-0xfe4bffff,0xfe480000-0xfe49ffff irq 44 at
>>>>>>>>>>> device 0.0 on pci2
>>>>>>>>>>> em0: Using an MSI interrupt
>>>>>>>>>>> em0: Ethernet address: 00:15:17:bc:29:ba
>>>>>>>>>>> 001.000007 [2323] netmap_attach success for em0 tx 1/1024
>>>>>>>>>>> rx 1/1024 queues/slots
>>>>>>>>>>>
>>>>>>>>>>> This is one of those dual nic cards, so there is em1 as
>>>>>>>>>>> well...
>>>>>>>>>>
>>>>>>>>>> Well, I took a quick look at the driver and it does use
>>>>>>>>>> m_defrag(), but I think that the "retry:" label it does a
>>>>>>>>>> goto to after doing so might be in the wrong place.
>>>>>>>>>>
>>>>>>>>>> The attached untested patch might fix this.
>>>>>>>>>>
>>>>>>>>>> Is it convenient to build a kernel with this patch applied
>>>>>>>>>> and then try it with TSO enabled?
>>>>>>>>>>
>>>>>>>>>> rick
>>>>>>>>>> ps: It does have the transmit segment limit set to 32. I
>>>>>>>>>> have no idea if this is a hardware limitation.
>>>>>>>>>
>>>>>>>>> I think the retry is not in the wrong place, but the overhead
>>>>>>>>> of all those pullups is apparently quite severe.
>>>>>>>>
>>>>>>>> The m_defrag() call after the first failure will just barely
>>>>>>>> squeeze the just-under-64K TSO segment into 32 mbuf clusters.
>>>>>>>> Then I think any m_pullup() done during the retry will
>>>>>>>> allocate an mbuf (at a glance it seems to always do this when
>>>>>>>> the old mbuf is a cluster) and prepend that to the list.
>>>>>>>> --> Now the list is > 32 mbufs again and the
>>>>>>>>     bus_dmamap_load_mbuf_sg() will fail again on the retry,
>>>>>>>>     this time fatally, I think?
>>>>>>>>
>>>>>>>> I can't see any reason to re-do all the stuff using
>>>>>>>> m_pullup(), and Russell reported that moving the "retry:"
>>>>>>>> fixed his problem, from what I understood.
>>>>>>>
>>>>>>> Ah, I had assumed (incorrectly) that the m_pullup()s would all
>>>>>>> be nops in this case. It seems the NIC would really like to
>>>>>>> have all those things in a single segment, but it is not
>>>>>>> required, so I agree that your patch is fine.
>>>>>>
>>>>>> I recall em(4) controllers have various limitations in TSO. The
>>>>>> driver has to update the IP header to make TSO work, so it has
>>>>>> to get writable mbufs. bpf(4) consumers will see an IP packet
>>>>>> length of 0 after this change; I think tcpdump has a
>>>>>> compile-time option to guess the correct IP packet length. The
>>>>>> controller's firmware also has to be able to access the complete
>>>>>> IP/TCP header in a single buffer. I don't remember more details
>>>>>> of the TSO limitations, but I guess you may be able to get more
>>>>>> from the publicly available Intel data sheets.
>>>>>
>>>>> I think that the patch should handle this ok. All of the
>>>>> m_pullup() stuff gets done the first time. Then, if the result is
>>>>> more than 32 mbufs in the list, m_defrag() is called to copy the
>>>>> chain. This should result in all the header stuff in the first
>>>>> mbuf cluster, and the map call is done again with this list of
>>>>> clusters. (Without the patch, m_pullup() would allocate another
>>>>> prepended mbuf and make the chain more than 32 mbufs again.)
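
To check that I follow the above: as I read it, the point of Rick's
patch is that the m_pullup() header work is done exactly once, and
after m_defrag() only the bus_dmamap_load_mbuf_sg() call is retried.
Here's a rough, untested sketch of that shape; the names (xx_softc,
xx_xmit, XX_MAX_SCATTER, tx_tag, tx_map) are made up, this is not the
actual if_em.c code:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/mbuf.h>
#include <sys/bus.h>
#include <machine/bus.h>
#include <net/ethernet.h>

#define	XX_MAX_SCATTER	32	/* the 32-segment limit discussed above */

struct xx_softc {
	bus_dma_tag_t	tx_tag;		/* hypothetical per-ring DMA tag */
	bus_dmamap_t	tx_map;		/* hypothetical per-packet DMA map */
};

static int
xx_xmit(struct xx_softc *sc, struct mbuf **m_headp)
{
	bus_dma_segment_t segs[XX_MAX_SCATTER];
	struct mbuf *m;
	int error, nsegs, defragged = 0;

	/*
	 * All of the m_pullup() header work happens exactly once,
	 * before the retry label, so a retry cannot prepend a fresh
	 * mbuf and push the chain back over XX_MAX_SCATTER.
	 */
	if ((*m_headp)->m_len < (int)sizeof(struct ether_header)) {
		*m_headp = m_pullup(*m_headp, sizeof(struct ether_header));
		if (*m_headp == NULL)
			return (ENOBUFS);
	}
	/* (the real driver does the same for the IP and TCP headers) */

retry:
	error = bus_dmamap_load_mbuf_sg(sc->tx_tag, sc->tx_map,
	    *m_headp, segs, &nsegs, BUS_DMA_NOWAIT);
	if (error == EFBIG && !defragged) {
		/*
		 * Too many segments: copy the chain into clusters and
		 * retry only the DMA load, not the pullups.
		 */
		m = m_defrag(*m_headp, M_NOWAIT);
		if (m == NULL) {
			m_freem(*m_headp);
			*m_headp = NULL;
			return (ENOBUFS);
		}
		*m_headp = m;
		defragged = 1;
		goto retry;
	}
	if (error != 0) {
		m_freem(*m_headp);
		*m_headp = NULL;
		return (error);
	}

	/* ...descriptor setup from segs[0..nsegs-1] goes here... */
	return (0);
}

If the label instead sits before the pullups, each retry can prepend
another header mbuf and push the chain back over the 32-segment limit,
which is exactly the failure mode Rick describes above.
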
>>>>
>>>> Hmm, I am surprised by the m_pullup() behavior: it doesn't just
>>>> notice that the first mbuf with a cluster has the desired data
>>>> already and return without doing anything. That is, I'm surprised
>>>> the first statement in m_pullup() isn't just:
>>>>
>>>> 	if (n->m_len >= len)
>>>> 		return (n);
>>>
>>> I seem to remember that the standard behaviour is for the caller
>>> to do exactly that.
>>
>> Huh, the manpage doesn't really state that, and it does check in
>> one case. However, I think that means that the code in em(4) is
>> busted and should be checking m_len before all the calls to
>> m_pullup(). I think this will fix the issue the same as Rick's
>> change, but it might also avoid unnecessary pullups in some cases
>> when defrag isn't needed in the first place.
>
> FYI, I still think this patch is worth testing if someone is up for it.

I realize that it would be better to run e.g. netperf. However, I
originally noticed the problem by running NFS read tests, and since
that's trivial to do with rsync I'm doing that again. The source file
is sitting on a 6-drive zfs raidz2 and it's being written to a fast
ssd. The same hardware with linux is a trifle faster, so the test
setup is probably revealing enough. I like to run rsync -avP so that I
can watch the behavior over time. The transfer rates have been fairly
steady, +/-5MB/s. Eyeballing it, I'm not seeing much in the way of
cache effects.

r269700 2014-08-07:

  10G transfer:                          65MB/s nfs read

After the patch, installing if_em.ko* on both sides, and rebooting:

  10G transfer:                          62MB/s nfs read
  immediately read a different 5G file:  67MB/s
  then reread the original 10G file:     236MB/s  ?? faster than wire?
    (that's a cache effect, but still, I find it surprising)
  then read a different 12G file:        62.9MB/s

So, previously I was seeing ~65MB/s with the original patch, which is
what I see today. JHB's patch seems to be slightly slower, and that is
repeatable, but given the variance of the transfer rates it's not
really much different. (A sketch of how I read the m_len check his
patch is based on is in the pps below.)

HTH,
Russell

ps: A quick question about quickly building modules: suppose I have a
fully populated /usr/obj from buildworld and buildkernel (and have
installed it). What's the most efficient method for rebuilding a
single module and getting it installed into /boot/kernel?
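
pps: In case it helps anyone testing jhb's variant, here is how I read
the "check m_len before calling m_pullup()" suggestion quoted above.
The helper name xx_pullup_hdr() is made up and this is only a sketch,
not the actual em(4) code; it assumes the usual <sys/param.h> and
<sys/mbuf.h> includes:

/*
 * Hypothetical helper: only call m_pullup() when the first mbuf does
 * not already hold 'len' contiguous bytes, so a chain that m_defrag()
 * just packed into clusters is left alone.
 */
static __inline struct mbuf *
xx_pullup_hdr(struct mbuf *m, int len)
{

	if (m->m_len >= len)
		return (m);		/* already contiguous, nothing to do */
	return (m_pullup(m, len));	/* may allocate and prepend a new mbuf */
}

The idea being that a chain m_defrag() has just compacted already has
the headers contiguous in the first cluster, so the pullup (and the
extra prepended mbuf) is skipped entirely.
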