Date:      Sat, 17 Aug 2013 06:14:04 -0700 (PDT)
From:      Barney Cordoba <barney_cordoba@yahoo.com>
To:        Luigi Rizzo <rizzo@iet.unipi.it>, Lawrence Stewart <lstewart@freebsd.org>
Cc:        FreeBSD Net <net@freebsd.org>
Subject:   Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
Message-ID:  <1376745244.6575.YahooMailNeo@web121606.mail.ne1.yahoo.com>
In-Reply-To: <20130814102109.GA63246@onelab2.iet.unipi.it>
References:  <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org> <520B24A0.4000706@freebsd.org> <520B3056.1000804@freebsd.org> <20130814102109.GA63246@onelab2.iet.unipi.it>

Horsehockey. What are you guys running with, P4s?

Modern cpus are magnificently fast. The triviality of lookups is a non-issue
in almost all cases. The ability of modern cpus to fill a transmit queue faster
than the data can be transmitted is incontrovertible.

With TCP you have windows and things; trying to drill down to hardware
inefficiencies as if you're running on a 200Mhz P4 is just silly.

I abandoned hardware offloads back when someone tried to sell me on data
compression boards; the truth is that the IO overhead of copying to and from
the board was higher than the cpu cycles needed to compress the data.


The failure to understand how IO and locks interfere with traffic flow on
multicore systems is the biggest problem with driver development; all of this
chatter about moderation is simply a waste of time; such things are completely
tunable; a task that gets far too little attention IMO. Tuning can make a world
of difference if you understand what you're doing.

The idea that having 400K ints/second to gain a tock of throughput is an acceptable
trade-off is patently absurd.

EFFICIENCY is tantamount. Throughput is almost always a tuning issue.


BC

________________________________
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Lawrence Stewart <lstewart@freebsd.org>
Cc: FreeBSD Net <net@freebsd.org>
Sent: Wednesday, August 14, 2013 6:21 AM
Subject: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)


On Wed, Aug 14, 2013 at 05:23:02PM +1000, Lawrence Stewart wrote:
> On 08/14/13 16:33, Julian Elischer wrote:
> > On 8/14/13 11:39 AM, Lawrence Stewart wrote:
> >> On 08/14/13 03:29, Julian Elischer wrote:
> >>> I have been tracking down a performance embarrassment on AMAZON EC2 and
> >>> have found it I think.
> >> Let us please avoid conflating performance with throughput. The
> >> behaviour you go on to describe as a performance embarrassment is
> >> actually a throughput difference, and the FreeBSD behaviour you're
> >> describing is essentially sacrificing throughput and CPU cycles for
> >> lower latency. That may not be a trade-off you like, but it is an
> >> important factor in this discussion.
...
> Sure, there's nothing wrong with holding throughput up as a key
> performance metric for your use case.
>
> I'm just trying to pre-empt a discussion that focuses on one metric and
> fails to consider the bigger picture.
...
> > I could see no latency reversion.
>
> You wouldn't because it would be practically invisible in the sorts of
> tests/measurements you're doing. Our good friends over at HRT on the
> other hand would be far more likely to care about latency on the order
> of microseconds. Again, the use case matters a lot.
...
> > so, does "Software LRO" mean that LRO on hte NIC should be ON or OFF to
> > see this?
>
> I think (check the driver code in question as I'm not sure) that if you
> "ifconfig <if> lro" and the driver has hardware support or has been made
> aware of our software implementation, it should DTRT.

The "lower throughput than linux" that julian was seeing is either
because of a slow (CPU-bound) sender or slow receiver. Given that
the FreeBSD tx path is quite expensive (redoing route and arp lookups
on every packet, etc.) I highly suspect the sender side is at fault.

Ack coalescing, LRO, GRO are limited to the set of packets that you
receive in the same batch, which in turn is upper bounded by the
interrupt moderation delay. Apart from simple benchmarks with only
a few flows, it is very hard that ack/lro/gro can coalesce more
than a few segments for the same flow.

    But the real fix is in tcp_output.

In fact, it has never been the case that an ack (single or coalesced)
triggers an immediate transmission in the output path.  We had this
in the past (Silly Window Syndrome) and there is code that avoids
sending less than 1-mtu under appropriate conditions (there is more
data to push out anyways, no NODELAY, there are outstanding acks,
the window can open further).  In all these cases there is no
reasonable way to experience the difference in terms of latency.

If one really cares, e.g. the High Speed Trading example, this is
a non issue because any reasonable person would run with TCP_NODELAY
(and possibly disable interrupt moderation), and optimize for latency
even on a per flow basis.

In terms of coding effort, i suspect that by replacing the 1-mtu
limit (t_maxseg i believe is the variable that we use in the SWS
avoidance code) with 1-max-tso-segment we can probably achieve good
results with little programming effort.

Then the problem remains that we should keep a copy of route and
arp information in the socket instead of redoing the lookups on
every single transmission, as they consume some 25% of the time of
a sendto(), and probably even more when it comes to large tcp
segments, sendfile() and the like.

    cheers
    luigi
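
The threshold change Luigi proposes in the quoted message (defer output until a
max-TSO-sized burst is available instead of a single MSS) can be sketched
roughly as below. This is a toy model and not FreeBSD's tcp_output(): the
struct, its field names and the function are invented for illustration, and the
real code weighs many more conditions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, heavily simplified send decision. This is NOT FreeBSD's
 * tcp_output(); the struct and its field names are invented for illustration.
 */
struct send_state {
    size_t sendable;   /* bytes queued that the current window allows */
    size_t min_burst;  /* deferral threshold: 1 MSS, or a max TSO burst */
    bool   nodelay;    /* TCP_NODELAY is set on the socket */
    bool   in_flight;  /* unacknowledged data exists, so more ACKs will come */
};

/* Return true if the stack should transmit now rather than wait. */
bool should_send_now(const struct send_state *s)
{
    assert(s != NULL);
    if (s->sendable == 0)
        return false;               /* nothing to send */
    if (s->sendable >= s->min_burst)
        return true;                /* a full burst is ready */
    if (s->nodelay)
        return true;                /* latency-sensitive flow: never defer */
    if (!s->in_flight)
        return true;                /* no later ACK would restart output */
    return false;                   /* defer until a later ACK opens the window */
}
```

With min_burst set to one MSS (say 1448 bytes), an ACK that opens three
segments of window triggers a send at once; with min_burst set to a
TSO-sized value the same ACK is absorbed and output waits for a larger burst,
unless the flow has TCP_NODELAY set or nothing is in flight.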

Date:      Sat, 17 Aug 2013 07:02:50 -0700 (PDT)
From:      Barney Cordoba <barney_cordoba@yahoo.com>
To:        Luigi Rizzo <rizzo@iet.unipi.it>, Lawrence Stewart <lstewart@freebsd.org>
Cc:        FreeBSD Net <net@freebsd.org>
Subject:   Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
Message-ID:  <1376748170.66110.YahooMailNeo@web121601.mail.ne1.yahoo.com>
In-Reply-To: <1376745244.6575.YahooMailNeo@web121606.mail.ne1.yahoo.com>
References:  <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org> <520B24A0.4000706@freebsd.org> <520B3056.1000804@freebsd.org> <20130814102109.GA63246@onelab2.iet.unipi.it> <1376745244.6575.YahooMailNeo@web121606.mail.ne1.yahoo.com>

>>EFFICIENCY is tantamount. Throughput is almost always a tuning issue.


Of course I meant paramount. Coffee matters :-|

Date:      Sat, 17 Aug 2013 08:59:11 -0700
From:      Adrian Chadd <adrian@freebsd.org>
To:        Barney Cordoba <barney_cordoba@yahoo.com>
Cc:        Lawrence Stewart <lstewart@freebsd.org>, Luigi Rizzo <rizzo@iet.unipi.it>, FreeBSD Net <net@freebsd.org>
Subject:   Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
Message-ID:  <CAJ-VmonGeqn5qqbfvF9xWaFPYNMNSVb6VwMx+oEVSGXVid98ag@mail.gmail.com>
In-Reply-To: <1376748170.66110.YahooMailNeo@web121601.mail.ne1.yahoo.com>
References:  <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org> <520B24A0.4000706@freebsd.org> <520B3056.1000804@freebsd.org> <20130814102109.GA63246@onelab2.iet.unipi.it> <1376745244.6575.YahooMailNeo@web121606.mail.ne1.yahoo.com> <1376748170.66110.YahooMailNeo@web121601.mail.ne1.yahoo.com>

... we get perfectly good throughput without 400k ints a second on the
ixgbe driver.

As in, I can easily saturate 2 x 10GE on ixgbe hardware with a handful of
flows. That's not terribly difficult.

However, there are a few interesting problems that need addressing:

* There's lock contention between the transmit side from userland and the
TCP timers, and the receive side with ACK processing. Under very high
traffic load a lot of lock contention stalls things. We (the royal "we",
I'm mostly just doing tooling at the moment) are working on that.
* There's lock contention on the ARP, routing table and PCB lookups. The
latter will go away when we've finally implemented RSS for transmit and
receive and then moved things over to using PCB groups on CPUs which have
NIC driver threads bound to them.
* There's increasing cache thrashing from a larger workload, causing the
expensive lookups to be even more expensive.
* All the list walks suck. We need to be batching things so we use CPU
caches much more efficiently.
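
As a rough illustration of the batching point in the last bullet, the sketch
below compares taking a lock once per packet with taking it once per batch. It
is a userland toy using a pthread mutex, not kernel code; the queue structure,
function names and counters are all hypothetical and exist only to make the
difference countable.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical queue; the counters exist only to make the effect visible. */
struct pkt_queue {
    pthread_mutex_t lock;
    unsigned long   lock_acquisitions;
    unsigned long   packets;
};

void pq_init(struct pkt_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->lock_acquisitions = 0;
    q->packets = 0;
}

/* One lock round trip per packet. */
void pq_enqueue_each(struct pkt_queue *q, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pthread_mutex_lock(&q->lock);
        q->lock_acquisitions++;
        q->packets++;
        pthread_mutex_unlock(&q->lock);
    }
}

/* One lock round trip per batch of up to `batch` packets. */
void pq_enqueue_batched(struct pkt_queue *q, size_t n, size_t batch)
{
    assert(batch > 0);
    while (n > 0) {
        size_t chunk = n < batch ? n : batch;
        pthread_mutex_lock(&q->lock);
        q->lock_acquisitions++;
        q->packets += chunk;
        pthread_mutex_unlock(&q->lock);
        n -= chunk;
    }
}
```

Queuing 64 packets one at a time takes the lock 64 times; queuing them in
batches of 16 takes it 4 times. The same amortisation applies to cache misses
on the shared structure, which is the part that matters most under contention.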

The idea of using TSO on the transmit side and generic LRO on the receive
side is to reduce the per-packet overhead. I think we can be much more
efficient in general in packet processing, but that's a big task. :-) So,
using at least TSO is a big benefit, if only to avoid decomposing things
into smaller mbufs and contending on those locks in a very big way.
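
To put a number on the per-packet overhead: the helper below counts how many
MSS-sized segments a single write turns into when the stack segments it in
software. The 64 KB write size and the 1448-byte MSS (a 1500-byte MTU with TCP
timestamps) are assumptions chosen for illustration; actual TSO limits depend
on the driver and hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Number of MSS-sized segments needed to carry `len` bytes of payload. */
size_t segments_for(size_t len, size_t mss)
{
    assert(mss > 0);
    return (len + mss - 1) / mss;   /* ceiling division */
}
```

For example, segments_for(65536, 1448) is 46, so without TSO the stack does
its per-packet work (mbuf handling, lookups, locking) 46 times for one 64 KB
write, whereas with TSO it can hand the NIC one large buffer and let the
hardware cut it up.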

I'm working on PMC to make it easier to use to find these bottlenecks and
make the code and data more efficient. Then, likely, I'll end up hacking on
generic TSO/LRO, TX/RX RSS queue management and make the PCB group thing
default on for SMP machines. I may even take a knife to some of the packet
processing overhead.



-adrian


