From owner-freebsd-net@FreeBSD.ORG  Tue Jul 15 07:24:59 2014
From: Yonghyeon PYUN <pyunyh@gmail.com>
Date: Tue, 15 Jul 2014 16:24:49 +0900
To: Rick Macklem <rmacklem@uoguelph.ca>
Subject: Re: NFS client READ performance on -current
Message-ID: <20140715072449.GA1488@michelle.fasterthan.com>
Reply-To: pyunyh@gmail.com
References: <20140712060538.GA3649@michelle.fasterthan.com>
 <2136988575.13956627.1405199640153.JavaMail.root@uoguelph.ca>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <2136988575.13956627.1405199640153.JavaMail.root@uoguelph.ca>
User-Agent: Mutt/1.4.2.3i
Cc: "Russell L. Carter" <rcarter@pinyon.org>, freebsd-net@freebsd.org,
 John Baldwin <jhb@freebsd.org>

On Sat, Jul 12, 2014 at 05:14:00PM -0400, Rick Macklem wrote:
> Yonghyeon Pyun wrote:
> > On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote:
> > > On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote:
> > > > John Baldwin wrote:
> > > > > On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
> > > > > > Russell L. Carter wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 07/02/14 19:09, Rick Macklem wrote:
> > > > > > > 
> > > > > > > > > Could you please post the dmesg stuff for the network interface,
> > > > > > > > > so I can tell what driver is being used? I'll take a look at it,
> > > > > > > > > in case it needs to be changed to use m_defrag().
> > > > > > > 
> > > > > > > > em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xd020-0xd03f mem 0xfe4a0000-0xfe4bffff,0xfe480000-0xfe49ffff irq 44 at device 0.0 on pci2
> > > > > > > > em0: Using an MSI interrupt
> > > > > > > > em0: Ethernet address: 00:15:17:bc:29:ba
> > > > > > > > 001.000007 [2323] netmap_attach             success for em0 tx 1/1024 rx 1/1024 queues/slots
> > > > > > > 
> > > > > > > > This is one of those dual NIC cards, so there is em1 as well...
> > > > > > > 
> > > > > > > Well, I took a quick look at the driver and it does use m_defrag(),
> > > > > > > but I think that the "retry:" label it jumps to after doing so might
> > > > > > > be in the wrong place.
> > > > > > 
> > > > > > The attached untested patch might fix this.
> > > > > > 
> > > > > > > Is it convenient to build a kernel with this patch applied and then
> > > > > > > try it with TSO enabled?
> > > > > > 
> > > > > > rick
> > > > > > > ps: It does have the transmit segment limit set to 32. I have no
> > > > > > >     idea if this is a hardware limitation.
> > > > > 
> > > > > > I think the retry is not in the wrong place, but the overhead of all
> > > > > > those pullups is apparently quite severe.
> > > > > The m_defrag() call after the first failure will just barely squeeze
> > > > > the just under 64K TSO segment into 32 mbuf clusters. Then I think any
> > > > > m_pullup() done during the retry will allocate an mbuf (at a glance it
> > > > > seems to always do this when the old mbuf is a cluster) and prepend
> > > > > that to the list.
> > > > > --> Now the list is > 32 mbufs again and the bus_dmamap_load_mbuf_sg()
> > > > >     will fail again on the retry, this time fatally, I think?
> > > > 
> > > > > I can't see any reason to re-do all the stuff using m_pullup(), and
> > > > > Russell reported that moving the "retry:" fixed his problem, from what
> > > > > I understood.
> > > 
> > > > Ah, I had assumed (incorrectly) that the m_pullup()s would all be nops
> > > > in this case.  It seems the NIC would really like to have all those
> > > > things in a single segment, but it is not required, so I agree that
> > > > your patch is fine.
> > > 
> > 
> > I recall em(4) controllers have various limitations in TSO. The driver
> > has to update the IP header to make TSO work, so it has to get a
> > writable mbuf.  bpf(4) consumers will see an IP packet length of 0
> > after this change.  I think tcpdump has a compile-time option to
> > guess the correct IP packet length.  The controller's firmware should
> > also be able to access the complete IP/TCP header in a single buffer.
> > I don't remember more details of the TSO limitations, but I guess you
> > may be able to get them from the publicly available Intel data sheet.
> I think that the patch should handle this ok. All of the m_pullup()
> stuff gets done the first time. Then, if the result is more than 32
> mbufs in the list, m_defrag() is called to copy the chain. This should
> result in all the header stuff in the first mbuf cluster and the map
> call is done again with this list of clusters. (Without the patch,
> m_pullup() would allocate another prepended mbuf and make the chain
> more than 32 mbufs again.)
> 

Yes, your patch looks right.
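
To make the failure mode above concrete, here is a toy model of the
segment counting, written in Python rather than driver C.  It is only a
sketch under stated assumptions: the function names, the 40-mbuf starting
chain and the 65535-byte payload are made up for illustration, and the
"pullup always prepends one mbuf" rule is the behaviour described in the
quoted text, not something checked against the em(4) source.

```python
import math

MCLBYTES = 2048   # size of a standard mbuf cluster on FreeBSD
MAX_SEGS = 32     # transmit segment limit mentioned in this thread


def defrag(total_len):
    """Model m_defrag(): repack the data into as few clusters as possible."""
    return math.ceil(total_len / MCLBYTES)


def pullup(nsegs):
    """Model m_pullup() as described above: when the head of the chain is
    a cluster, a fresh mbuf is allocated and prepended (assumption)."""
    return nsegs + 1


def transmit(initial_segs, total_len, retry_redoes_pullup):
    """Return True if the DMA load succeeds, allowing a single retry.

    initial_segs is the chain length after the first round of pullups.
    retry_redoes_pullup=True models "retry:" placed before the pullups,
    False models "retry:" moved to just before the DMA load.
    """
    if initial_segs <= MAX_SEGS:
        return True                    # first load succeeds, no retry needed
    segs = defrag(total_len)           # first load failed, compact the chain
    if retry_redoes_pullup:
        segs = pullup(segs)            # pullup prepends one more mbuf
    return segs <= MAX_SEGS            # second failure is fatal


if __name__ == "__main__":
    tso_len = 65535                    # hypothetical "just under 64K" segment
    print("clusters after defrag:", defrag(tso_len))
    print("retry before pullups :", transmit(40, tso_len, True))
    print("retry after pullups  :", transmit(40, tso_len, False))
```

In this model a just-under-64K segment compacts to exactly 32 clusters,
so a single extra prepended mbuf is enough to push the retry over the
limit, which is why moving the label matters.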

> Russell seemed to confirm that the patch fixed the problem for him,
> but since I don't have em(4) hardware, it would be nice to have someone
> with commit privilege and access to em(4) hardware test and commit it.
> 

Because the power supply failed on my box with an em(4) controller, I
can't test the patch.  But I guess it's ok to commit it, since Russell
already tested it.

Thanks for your patch.