Date: Wed, 21 Aug 2013 20:40:36 +0200
From: Andre Oppermann <andre@freebsd.org>
To: Luigi Rizzo <rizzo@iet.unipi.it>
Cc: Lawrence Stewart <lstewart@freebsd.org>, FreeBSD Net <net@freebsd.org>
Subject: Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
Message-ID: <521509A4.8020601@freebsd.org>
In-Reply-To: <20130814102109.GA63246@onelab2.iet.unipi.it>
References: <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org>
    <520B24A0.4000706@freebsd.org> <520B3056.1000804@freebsd.org>
    <20130814102109.GA63246@onelab2.iet.unipi.it>

On 14.08.2013 12:21, Luigi Rizzo wrote:
> On Wed, Aug 14, 2013 at 05:23:02PM +1000, Lawrence Stewart wrote:
>> I think (check the driver code in question as I'm not sure) that if you
>> "ifconfig <if> lro" and the driver has hardware support or has been made
>> aware of our software implementation, it should DTRT.
>
> The "lower throughput than linux" that julian was seeing is either
> because of a slow (CPU-bound) sender or slow receiver. Given that
> the FreeBSD tx path is quite expensive (redoing route and arp lookups
> on every packet, etc.) I highly suspect the sender side is at fault.
>
> Then the problem remains that we should keep a copy of route and
> arp information in the socket instead of redoing the lookups on
> every single transmission, as they consume some 25% of the time of
> a sendto(), and probably even more when it comes to large tcp
> segments, sendfile() and the like.

It's the locking and ref-counting overhead in the routing table and
ARP table that causes a lot of cache thrashing and bus lock cycles.

The fix is rather simple. The routing table gets protected by an
rm_lock instead of a normal lock. Individual routes no longer have
their own lock, and there is no more ref-counting. All pointers to
routes and into the routing table are prohibited. Upon lookup the
sought information is copied out (ifp, ifaddr, nexthop) without
retaining any reference to the routing entry. Ditto for the ARP table.

Because changes to the routing and ARP tables are very infrequent
compared to the number of lookups performed on them, this exhibits
very good cache behavior across multiple cores and CPUs. No shared
routing table memory is dirtied during lookup.

Approaches that do NOT work (well):

 - flow caching, where a separate entry is generated for every active
   connection containing direct pointers to the rtentry, arp entry
   and interface. Besides the pointer validity and refcounting issues,
   it scales very poorly for a large number of "flows", exhibiting a
   large lookup overhead. The routing table (default and interface
   routes) and ARP table (a few hosts) stay at the same size and have
   a "constant" lookup time.

 - per-CPU copies of the routing and arp tables, which have increased
   memory consumption and synchronization issues on updates,
   especially with high core counts.

 - storing the rtentry and arp entry pointers in the inpcb, which has
   similar issues to the flow table approach while periodically
   having to check whether the route or arp entry changed.

The rm_lock is the fastest, cheapest and most SMP-scalable approach
shown so far.

I have patches against a roughly 12-month-old current lying around
if someone wants to brush them up and work out the final kinks. The
speedup and reduction in overhead are significant.

-- 
Andre