Date:      Tue, 1 May 2012 11:13:03 -0700 (PDT)
From:      Barney Cordoba <barney_cordoba@yahoo.com>
To:        Sean Bruno <seanbru@yahoo-inc.com>, Juli Mallett <jmallett@FreeBSD.org>
Cc:        "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject:   Re: igb(4) at peak in big purple
Message-ID:  <1335895983.68943.YahooMailClassic@web126001.mail.ne1.yahoo.com>
In-Reply-To: <CACVs6=9RzaZAHx6RC4AGywTzpuc8hNrY4eD-e-AJoV32OEMVgg@mail.gmail.com>

--- On Fri, 4/27/12, Juli Mallett <jmallett@FreeBSD.org> wrote:

> From: Juli Mallett <jmallett@FreeBSD.org>
> Subject: Re: igb(4) at peak in big purple
> To: "Sean Bruno" <seanbru@yahoo-inc.com>
> Cc: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
> Date: Friday, April 27, 2012, 4:00 PM
> On Fri, Apr 27, 2012 at 12:29, Sean Bruno <seanbru@yahoo-inc.com> wrote:
> > On Thu, 2012-04-26 at 11:13 -0700, Juli Mallett wrote:
> >> Queue splitting in Intel cards is done using a hash of protocol
> >> headers, so this is expected behavior.  This also helps with TCP and
> >> UDP performance, in terms of keeping packets for the same protocol
> >> control block on the same core, but for other applications it's not
> >> ideal.  If your application does not require that kind of locality,
> >> there are things that can be done in the driver to make it easier to
> >> balance packets between all queues about-evenly.
> >
> > Oh? :-)
> >
> > What should I be looking at to balance more evenly?
>
> Dirty hacks are involved :)  I've sent some code to Luigi that I think
> would make sense in netmap (since for many tasks one's going to do with
> netmap, you want to use as many cores as possible, and maybe don't care
> about locality so much) but it could be useful in conjunction with the
> network stack, too, for tasks that don't need a lot of locality.
>
> Basically this is the deal: the Intel NICs hash various header fields.
> Then, some bits from that hash are used to index a table.  That table
> indicates what queue the received packet should go to.  Ideally you'd
> want to use some sort of counter to index that table and get round-robin
> queue usage if you wanted to evenly saturate all cores.  Unfortunately
> there doesn't seem to be a way to do that.
>
> What you can do, though, is regularly update the table that is indexed
> by hash.  Very frequently, in fact; it's a pretty fast operation.  So
> what I've done, for example, is to go through and rotate all of the
> entries every N packets, where N is something like the number of receive
> descriptors per queue divided by the number of queues.  So bucket 0 goes
> to queue 0 and bucket 1 goes to queue 1 at first.  Then a few hundred
> packets are received, and the table is reprogrammed, so now bucket 0
> goes to queue 1 and bucket 1 goes to queue 0.
>
> I can provide code to do this, but I don't want to post it publicly
> (unless it is actually going to become an option for netmap) for fear
> that people will use it in scenarios where it's harmful and then
> complain.  It's potentially one more painful variation for the Intel
> drivers that Intel can't support, and that just makes everyone
> miserable.
>
> Thanks,
> Juli.
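
For concreteness, the rotation described above boils down to roughly the
standalone model below. It is only a sketch: a plain array stands in for
the NIC's redirection table (the RETA registers an igb(4) driver would
actually write), and the queue count, rotation interval, and skewed "hash"
are made-up values for illustration.

/*
 * Model of the table-rotation trick: the NIC hashes each packet, low bits
 * of the hash pick one of 128 buckets, and each bucket maps to a receive
 * queue.  Rotating the bucket->queue table every N packets spreads load
 * across queues even when the hash distribution is badly skewed.
 */
#include <stdint.h>
#include <stdio.h>

#define RETA_ENTRIES 128   /* 82576-style redirection table size */
#define NQUEUES      4     /* assumed number of RX queues */
#define ROTATE_EVERY 256   /* e.g. descriptors per queue / number of queues */

static uint8_t reta[RETA_ENTRIES];

static void
reta_init(void)
{
	for (int i = 0; i < RETA_ENTRIES; i++)
		reta[i] = i % NQUEUES;          /* bucket i -> queue i % N */
}

/* Rotate every bucket to the next queue; cheap, done every N packets. */
static void
reta_rotate(void)
{
	for (int i = 0; i < RETA_ENTRIES; i++)
		reta[i] = (reta[i] + 1) % NQUEUES;
}

int
main(void)
{
	uint64_t queue_pkts[NQUEUES] = { 0 };

	reta_init();
	/* Pretend all traffic hashes into the same three buckets (worst case). */
	for (uint64_t pkt = 0; pkt < 1000000; pkt++) {
		unsigned bucket = pkt % 3;      /* badly skewed "hash" */
		queue_pkts[reta[bucket]]++;
		if (pkt % ROTATE_EVERY == 0)
			reta_rotate();
	}
	for (int q = 0; q < NQUEUES; q++)
		printf("queue %d: %llu packets\n", q,
		    (unsigned long long)queue_pkts[q]);
	return (0);
}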
That seems like a pretty naive approach. First, you want all of the packets
in the same flows/connections to use the same channels, otherwise you'll be
sending a lot of stuff out of sequence. You want to balance your flows, yes,
but not balance based on packets, unless all of your traffic is ICMP. You
also want to balance bits, not packets; sending 50 60-byte packets to queue 1
and 50 1500-byte packets to queue 2 isn't balancing. They'll be wildly out of
order as well.
Also, using as many cores as possible isn't necessarily what you want to do,
depending on your architecture. If you have 8 cores on 2 CPUs, then you
probably want to do all of your networking on four cores on one CPU. There's
a big price to pay for shuffling memory between the caches of separate CPUs,
so splitting transactions that use the same memory space is counterproductive.
More queues mean more locks, and in the end, lock contention is your biggest
enemy, not CPU cycles.
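
On FreeBSD that kind of confinement can be done with cpuset: cpuset(1) can
steer an interrupt onto a set of cores (its -x irq form), and a userland
worker can pin itself with cpuset_setaffinity(2), roughly as in the sketch
below. Whether cores 0-3 really sit on one package depends on the machine's
topology; that, like the rest of the sketch, is an assumption for
illustration.

/*
 * Sketch: pin the calling thread to cores 0-3 (assumed to be one package)
 * so the packet-handling work and its memory stay in one CPU's caches.
 */
#include <sys/param.h>
#include <sys/cpuset.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
	cpuset_t mask;

	CPU_ZERO(&mask);
	for (int core = 0; core < 4; core++)
		CPU_SET(core, &mask);

	/* -1 with CPU_WHICH_TID means "the current thread". */
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
	    sizeof(mask), &mask) != 0)
		err(1, "cpuset_setaffinity");

	printf("pinned to cores 0-3\n");
	/* ... do the networking work here ... */
	return (0);
}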
The idea of splitting packets that use the same memory and code space among
CPUs isn't a very good one; a better approach, assuming you can micromanage,
is to allocate X cores (as many as you need for your peaks) to networking,
and use the other cores for user space to minimize the interruptions.

BC


