Date: Mon, 3 Feb 2014 09:04:24 +0200
From: Daniel Braniss <danny@cs.huji.ac.il>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: Pyun YongHyeon <pyunyh@gmail.com>, FreeBSD Net <freebsd-net@freebsd.org>,
    Adam McDougall <mcdouga9@egr.msu.edu>, Jack Vogel <jfvogel@gmail.com>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <4AA2405B-8C52-49E1-AC33-F92762156152@cs.huji.ac.il>
In-Reply-To: <906704123.1485103.1391357730899.JavaMail.root@uoguelph.ca>
References: <906704123.1485103.1391357730899.JavaMail.root@uoguelph.ca>
On Feb 2, 2014, at 6:15 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:

> Daniel Braniss wrote:
>> hi Rick, et al.
>>
>> tried your patch but it didn't help; the server is stuck.
> Oh well. I was hoping that was going to make TSO work reliably.
> Just to confirm it, this server works reliably when TSO is disabled?
>
absolutely, with TSO disabled there is no problem, and it's slightly faster.
the host is 'server class'; a PC might be different.

cheers,
	danny

> Thanks for doing the testing, rick
>
>> just for fun, I tried a different client/host, this one has a
>> Broadcom NetXtreme II that was MFC'ed lately, and the results are
>> worse than the Intel (5 hrs instead of 4 hrs) but faster without TSO.
>>
>> with TSO enabled and bs=32k:
>> 5.09 hrs   18325.62 real   1109.23 user   4591.60 sys
>>
>> without TSO:
>> 4.75 hrs   17120.40 real   1114.08 user   3537.61 sys
>>
>> So what is the advantage of using TSO? (no complaint here, just
>> curious)
>>
>> I'll try to see if as a server it has the same TSO-related issues.
>>
>> cheers,
>> 	danny
>>
>> On Jan 28, 2014, at 3:51 AM, Rick Macklem <rmacklem@uoguelph.ca>
>> wrote:
>>
>>> Jack Vogel wrote:
>>>> That header file is for the VF driver :) which I don't believe is
>>>> being used in this case.
>>>> The driver is capable of handling 256K but it is limited by the
>>>> stack to 64K (look in ixgbe.h), so it's not a few bytes off due to
>>>> the vlan header.
>>>>
>>>> The scatter size is not an arbitrary one; it's due to hardware
>>>> limitations in Niantic (82599). Turning off TSO in the 10G
>>>> environment is not practical; you will have trouble getting good
>>>> performance.
>>>>
>>>> Jack
>>>>
>>> Well, if you look at this thread, Daniel got much better performance
>>> by turning off TSO. However, I agree that this is not an ideal
>>> solution.
>>> http://docs.FreeBSD.org/cgi/mid.cgi?2C287272-7B57-4AAD-B22F-6A65D9F8677B
>>>
>>> rick
>>>
>>>>
>>>> On Mon, Jan 27, 2014 at 4:58 PM, Yonghyeon PYUN <pyunyh@gmail.com>
>>>> wrote:
>>>>
>>>>> On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
>>>>>> pyunyh@gmail.com wrote:
>>>>>>> On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
>>>>>>>> Adam McDougall wrote:
>>>>>>>>> Also try rsize=32768,wsize=32768 in your mount options; it
>>>>>>>>> made a huge difference for me. I've noticed slow file
>>>>>>>>> transfers on NFS in 9 and finally did some searching a couple
>>>>>>>>> of months ago; someone suggested it and they were on to
>>>>>>>>> something.
>>>>>>>>>
>>>>>>>> I have a "hunch" that might explain why 64K NFS reads/writes
>>>>>>>> perform poorly in some network environments.
>>>>>>>> A 64K NFS read reply/write request consists of a list of 34
>>>>>>>> mbufs when passed to TCP via sosend(), with a total data
>>>>>>>> length of around 65680 bytes.
>>>>>>>> Looking at a couple of drivers (virtio and ixgbe), they seem
>>>>>>>> to expect no more than 32-33 mbufs in a list for a 65535 byte
>>>>>>>> TSO xmit. I think (I don't have anything that does TSO to
>>>>>>>> confirm this) that NFS will pass a list that is longer (34
>>>>>>>> plus a TCP/IP header).
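
[danny interjecting: to make the 34-mbuf arithmetic concrete, here is my
own back-of-envelope (a sketch; the 2-header-mbuf figure is my guess from
the ~65680 total, not something Rick stated):

    #include <stdio.h>

    #define MCLBYTES     2048   /* standard mbuf cluster */
    #define MJUMPAGESIZE 4096   /* page-size cluster on 4K-page machines */

    int
    main(void)
    {
            int data = 65536;   /* 64K NFS read reply / write request */
            int hdr_mbufs = 2;  /* RPC/NFS header mbufs, my assumption */

            /* 65536/2048 = 32 data clusters + 2 header mbufs = 34 */
            printf("MCLBYTES:     %d mbufs\n", data / MCLBYTES + hdr_mbufs);
            /* 65536/4096 = 16 data clusters + 2 header mbufs = 18 */
            printf("MJUMPAGESIZE: %d mbufs\n", data / MJUMPAGESIZE + hdr_mbufs);
            return (0);
    }

which lines up with the 34 -> 18 drop Rick mentions further down.]
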
>>>>>>>> At a glance, it appears that the drivers call m_defrag() or
>>>>>>>> m_collapse() when the mbuf list won't fit in their scatter
>>>>>>>> table (32 or 33 elements) and, if this fails, just silently
>>>>>>>> drop the data without sending it.
>>>>>>>> If I'm right, there would be considerable overhead from
>>>>>>>> m_defrag()/m_collapse() and near disaster if they fail to fix
>>>>>>>> the problem and the data is silently dropped instead of
>>>>>>>> xmited.
>>>>>>>>
>>>>>>>
>>>>>>> I think the actual number of DMA segments allocated for the
>>>>>>> mbuf chain is determined by bus_dma(9). bus_dma(9) will
>>>>>>> coalesce the current segment with the previous segment if
>>>>>>> possible.
>>>>>>>
>>>>>> Ok, I'll have to take a look, but I thought that an array sized
>>>>>> by "num_segs" is passed in as an argument. (And num_segs is set
>>>>>> to either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER
>>>>>> (32).)
>>>>>> It looked to me that the ixgbe driver called itself ix, so it
>>>>>> isn't obvious to me which we are talking about. (I know that
>>>>>> Daniel Braniss had an ix0 and ix1, which were fixed for NFS by
>>>>>> disabling TSO.)
>>>>>>
>>>>>
>>>>> It's ix(4). ixgb(4) is a different driver.
>>>>>
>>>>>> I'll admit I mostly looked at virtio's network driver, since
>>>>>> that was the one being used by J David.
>>>>>>
>>>>>> Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have
>>>>>> been cropping up for quite a while, and I am just trying to find
>>>>>> out why. (I have no hardware/software that exhibits the problem,
>>>>>> so I can only look at the sources and ask others to try testing
>>>>>> stuff.)
>>>>>>
>>>>>>> I'm not sure whether you're referring to ixgbe(4) or ix(4), but
>>>>>>> I see that the total length of all segments in ix(4) is limited
>>>>>>> to 65535, so it has no room for the ethernet/VLAN header of the
>>>>>>> mbuf chain. The driver should be fixed to transmit a 64KB
>>>>>>> datagram.
>>>>>> Well, if_hw_tsomax is set to 65535 by the generic code (the
>>>>>> driver doesn't set it) and the code in tcp_output() seems to
>>>>>> subtract the size of a tcp/ip header from that before passing
>>>>>> data to the driver, so I think the mbuf chain passed to the
>>>>>> driver will fit in one ip datagram. (I'd assume all sorts of
>>>>>> stuff would break for TSO enabled drivers if that wasn't the
>>>>>> case?)
>>>>>
>>>>> I believe the generic code is doing the right thing. I'm under
>>>>> the impression that non-working TSO indicates a bug in the
>>>>> driver. Some drivers didn't account for the additional
>>>>> ethernet/VLAN header, so the total size of the DMA segments
>>>>> exceeded 65535. I've attached a diff for ix(4). It wasn't tested
>>>>> at all, as I don't have hardware to test.
>>>>>
>>>>>>
>>>>>>> I think the use of m_defrag(9) in TSO is suboptimal. All TSO
>>>>>>> capable controllers are able to handle multiple TX buffers, so
>>>>>>> it should have used m_collapse(9) rather than copying the
>>>>>>> entire chain with m_defrag(9).
>>>>>>>
>>>>>> I haven't looked at these closely yet (plan on doing so to-day),
>>>>>> but even m_collapse() looked like it copied data between mbufs
>>>>>> and that is certainly suboptimal, imho.
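
[danny: for readers following along, the "defrag or drop" pattern Rick
describes looks roughly like this in a driver's transmit path. This is a
composite sketch of what drivers like ixgbe do, not a diff against any of
them; the xx_ names and softc fields are made up:

    static int
    xx_encap(struct xx_softc *sc, struct mbuf **m_head)
    {
            bus_dma_segment_t segs[XX_MAX_SCATTER]; /* 32 on an 82599 */
            struct mbuf *m;
            int error, nsegs;

            error = bus_dmamap_load_mbuf_sg(sc->xx_tx_tag, sc->xx_tx_map,
                *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
            if (error == EFBIG) {
                    /* Too many segments: copy the chain into as few
                     * mbufs as possible.  A 34-mbuf 64K NFS chain hits
                     * this path on a 32-entry scatter table. */
                    m = m_defrag(*m_head, M_NOWAIT);
                    if (m == NULL) {
                            /* The "near disaster" case: the data is
                             * freed and never transmitted. */
                            m_freem(*m_head);
                            *m_head = NULL;
                            return (ENOBUFS);
                    }
                    *m_head = m;
                    error = bus_dmamap_load_mbuf_sg(sc->xx_tx_tag,
                        sc->xx_tx_map, *m_head, segs, &nsegs,
                        BUS_DMA_NOWAIT);
            }
            return (error);
    }

Pyun's suggestion, as I read it, would replace the m_defrag() call with
m_collapse(*m_head, M_NOWAIT, XX_MAX_SCATTER), which only squeezes the
chain down to the segment limit instead of copying the whole thing.]
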
>>>>>> I don't see why a driver can't split the mbuf list, if there
>>>>>> are too many entries for the scatter/gather, and do it in two
>>>>>> iterations (much like tcp_output() does already when the data
>>>>>> length exceeds 65535 - tcp/ip header size).
>>>>>>
>>>>>
>>>>> It can split the mbuf list if the controller supports an
>>>>> increased number of TX buffers. Because the controller will
>>>>> consume the same number of DMA descriptors for the mbuf list,
>>>>> drivers tend to impose a limit on the number of TX buffers to
>>>>> save resources.
>>>>>
>>>>>> However, at this point, I just want to find out if the long
>>>>>> chain of mbufs is why TSO is problematic for these drivers,
>>>>>> since I'll admit I'm getting tired of telling people to disable
>>>>>> TSO (and I suspect some don't believe me and never try it).
>>>>>>
>>>>>
>>>>> TSO capable controllers tend to have various limitations (the
>>>>> first TX buffer should contain the complete ethernet/IP/TCP
>>>>> header, the ip_len of the IP header should be reset to 0, the
>>>>> TCP pseudo checksum should be recomputed, etc.), and cheap
>>>>> controllers need more assistance from the driver to let their
>>>>> firmware know the various IP/TCP header offset locations in the
>>>>> mbuf. Because this requires IP/TCP header parsing, it's error
>>>>> prone and very complex.
>>>>>
>>>>>>>> Anyhow, I have attached a patch that makes NFS use
>>>>>>>> MJUMPAGESIZE clusters, so the mbuf count drops from 34 to 18.
>>>>>>>>
>>>>>>>
>>>>>>> Could we make it conditional on size?
>>>>>>>
>>>>>> Not sure what you mean? If you mean "the size of the
>>>>>> read/write", that would be possible for NFSv3, but less so for
>>>>>> NFSv4. (The read/write is just one Op. in the compound for
>>>>>> NFSv4 and there is no way to predict how much more data is
>>>>>> going to be generated by subsequent Ops.)
>>>>>>
>>>>>
>>>>> Sorry, I should have been clearer. You already answered my
>>>>> question. Thanks.
>>>>>
>>>>>> If by "size" you mean the amount of memory in the machine then,
>>>>>> yes, it certainly could be conditional on that. (I plan to try
>>>>>> and look at the allocator to-day as well, but if others know of
>>>>>> disadvantages to using MJUMPAGESIZE instead of MCLBYTES, please
>>>>>> speak up.)
>>>>>>
>>>>>> Garrett Wollman already alluded to the MCLBYTES case being
>>>>>> pre-allocated, but I'll admit I have no idea what the
>>>>>> implications of that are at this time.
>>>>>>
>>>>>>>> If anyone has a TSO scatter/gather enabled net interface and
>>>>>>>> can test this patch on it with NFS I/O (default of 64K
>>>>>>>> rsize/wsize) when TSO is enabled and see what effect it has,
>>>>>>>> that would be appreciated.
>>>>>>>>
>>>>>>>> Btw, thanks go to Garrett Wollman for suggesting the change
>>>>>>>> to MJUMPAGESIZE clusters.
>>>>>>>>
>>>>>>>> rick
>>>>>>>> ps: If the attachment doesn't make it through and you want
>>>>>>>> the patch, just email me and I'll send you a copy.
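
[danny: my reading of what Rick's MJUMPAGESIZE patch amounts to. This is
a sketch of the idea, not the actual patch; the stock code grabs 2K
MCLBYTES clusters, and the nfsm_ function name here is invented:

    #include <sys/param.h>
    #include <sys/mbuf.h>

    /* Allocate a cluster mbuf for NFS reply/request data.  Page-size
     * clusters mean 16 data mbufs for 64K instead of 32, so the whole
     * chain stays inside a 32-33 entry TSO scatter table. */
    static struct mbuf *
    nfsm_getcluster(void)
    {
            return (m_getjcl(M_WAITOK, MT_DATA, 0, MJUMPAGESIZE));
    }

Garrett's caveat applies: the MCLBYTES zone is pre-allocated, and I don't
know how the MJUMPAGESIZE zone behaves under memory pressure.]
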