Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 12 Jul 2014 17:14:00 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        pyunyh@gmail.com
Cc:        "Russell L. Carter" <rcarter@pinyon.org>, freebsd-net@freebsd.org, John Baldwin <jhb@freebsd.org>
Subject:   Re: NFS client READ performance on -current
Message-ID:  <2136988575.13956627.1405199640153.JavaMail.root@uoguelph.ca>
In-Reply-To: <20140712060538.GA3649@michelle.fasterthan.com>

next in thread | previous in thread | raw e-mail | index | archive | help
Yonghyeon Pyun wrote:
> On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote:
> > On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote:
> > > John Baldwin wrote:
> > > > On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
> > > > > Russell L. Carter wrote:
> > > > > > 
> > > > > > 
> > > > > > On 07/02/14 19:09, Rick Macklem wrote:
> > > > > > 
> > > > > > > Could you please post the dmesg stuff for the network
> > > > > > > interface,
> > > > > > > so I can tell what driver is being used? I'll take a look
> > > > > > > at
> > > > > > > it,
> > > > > > > in case it needs to be changed to use m_defrag().
> > > > > > 
> > > > > > em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port
> > > > > > 0xd020-0xd03f
> > > > > > mem
> > > > > > 0xfe4a0000-0xfe4bffff,0xfe480000-0xfe49ffff irq 44 at
> > > > > > device 0.0
> > > > > > on
> > > > > > pci2
> > > > > > em0: Using an MSI interrupt
> > > > > > em0: Ethernet address: 00:15:17:bc:29:ba
> > > > > > 001.000007 [2323] netmap_attach             success for em0
> > > > > > tx
> > > > > > 1/1024
> > > > > > rx
> > > > > > 1/1024 queues/slots
> > > > > > 
> > > > > > This is one of those dual nic cards, so there is em1 as
> > > > > > well...
> > > > > > 
> > > > > Well, I took a quick look at the driver and it does use
> > > > > m_defrag(),
> > > > > but
> > > > > I think that the "retry:" label it does a goto after doing so
> > > > > might
> > > > > be in
> > > > > the wrong place.
> > > > > 
> > > > > The attached untested patch might fix this.
> > > > > 
> > > > > Is it convenient to build a kernel with this patch applied
> > > > > and then
> > > > > try
> > > > > it with TSO enabled?
> > > > > 
> > > > > rick
> > > > > ps: It does have the transmit segment limit set to 32. I have
> > > > > no
> > > > > idea if
> > > > >     this is a hardware limitation.
> > > > 
> > > > I think the retry is not in the wrong place, but the overhead
> > > > of all
> > > > those
> > > > pullups is apparently quite severe.
> > > The m_defrag() call after the first failure will just barely
> > > squeeze
> > > the just under 64K TSO segment into 32 mbuf clusters. Then I
> > > think any
> > > m_pullup() done during the retry will allocate an mbuf
> > > (at a glance it seems to always do this when the old mbuf is a
> > > cluster)
> > > and prepend that to the list.
> > > --> Now the list is > 32 mbufs again and the
> > > bus_dmammap_load_mbuf_sg()
> > >     will fail again on the retry, this time fatally, I think?
> > > 
> > > I can't see any reason to re-do all the stuff using m_pullup()
> > > and Russell
> > > reported that moving the "retry:" fixed his problem, from what I
> > > understood.
> > 
> > Ah, I had assumed (incorrectly) that the m_pullup()s would all be
> > nops in this
> > case.  It seems the NIC would really like to have all those things
> > in a single
> > segment, but it is not required, so I agree that your patch is
> > fine.
> > 
> 
> I recall em(4) controllers have various limitation in TSO. Driver
> has to update IP header to make TSO work so driver has to get a
> writable mbufs.  bpf(4) consumers will see IP packet length is 0
> after this change.  I think tcpdump has a compile time option to
> guess correct IP packet length.  The firmware of controller also
> should be able to access complete IP/TCP header in a single buffer.
> I don't remember more details in TSO limitation but I guess you may
> be able to get more details TSO limitation from publicly available
> Intel data sheet.
I think that the patch should handle this ok. All of the m_pullup()
stuff gets done the first time. Then, if the result is more than 32
mbufs in the list, m_defrag() is called to copy the chain. This should
result in all the header stuff in the first mbuf cluster and the map
call is done again with this list of clusters. (Without the patch,
m_pullup() would allocate another prepended mbuf and make the chain
more than 32mbufs again.)

Russell seemed to confirm that the patch fixed the problem for him,
but since I don't have em(4) hardware, it would be nice to have someone
with commit privilege and access to em(4) hardware test and commit it.

rick

> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscribe@freebsd.org"
> 



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?2136988575.13956627.1405199640153.JavaMail.root>