Date:      Thu, 03 Jan 2008 01:01:38 +0100
From:      Andre Oppermann <andre@freebsd.org>
To:        "Bruce M. Simpson" <bms@FreeBSD.org>
Cc:        freebsd-net@freebsd.org, Tiffany Snyder <tiffany.snyder@gmail.com>
Subject:   Re: Routing SMP benefit
Message-ID:  <477C25E2.4080303@freebsd.org>
In-Reply-To: <477C1776.2080002@FreeBSD.org>
References:  <43B45EEF.6060800@x-trader.de> <43B47CB5.3C0F1632@freebsd.org>	<b63e753b0712281551u52894ed9mb0dd55a988bc9c7a@mail.gmail.com> <477C1434.80106@freebsd.org> <477C1776.2080002@FreeBSD.org>

Bruce M. Simpson wrote:
> Andre Oppermann wrote:
>> So far the PPS rate limit has primarily been the cache-miss penalties
>> on packet access.  Multiple CPUs can help here of course for bi-
>> directional traffic.  Hardware-based packet header cache prefetching as
>> done by some embedded MIPS-based network processors at least doubles the
>> performance.  Intel has something like this for a couple of chipset and
>> network chip combinations.  We don't support that feature yet though.
> 
> What sort of work is needed in order to support header prefetch?

Extracting the documentation out of Intel is the first step.  It's
called Direct Cache Access (DCA).  At least in the Linux implementation
it has been intermingled with I/OAT, which is an asynchronous memory-
controller based DMA copy mechanism.  I don't know if the two really
have to go together.  The idea of DCA is to have the memory controller,
upon DMA'ing a packet into main memory, also load it into the CPU
cache(s) right away.  For packet forwarding the first 128 bytes are
sufficient.  For server applications and TCP it may be beneficial to
prefetch the whole packet.  That may cause considerable cache pollution
though, depending on usage.

Some pointers:

http://www.stanford.edu/group/comparch/papers/huggahalli05.pdf
http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=tree;f=drivers/dca;hb=HEAD
http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=tree;f=drivers/dma;hb=HEAD
http://download.intel.com/technology/comms/perfnet/download/ServerNetworkIOAccel.pdf
http://download.intel.com/design/network/prodbrf/317796.pdf
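
Absent DCA support, the software-only version of the same idea is to
issue prefetch hints for the first 128 bytes (two 64-byte cache lines)
of the packet as early as possible in the forwarding path.  Just a
rough sketch; m_prefetch() is a made-up helper, not an existing mbuf
API:

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Hint the first two cache lines of the packet data into the CPU
 * cache before the forwarding path starts poking at the headers.
 * Assumes 64-byte cache lines; purely advisory, no correctness impact.
 */
static __inline void
m_prefetch(struct mbuf *m)
{
        const char *p = mtod(m, const char *);

        __builtin_prefetch(p, 0, 3);            /* first 64 bytes */
        __builtin_prefetch(p + 64, 0, 3);       /* next 64 bytes  */
}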

>> Many of the things you mention here are planned for FreeBSD 8.0 in the
>> same or different form.  Work in progress is the separation of the ARP
>> table from the kernel routing table.  If we can prevent references to
>> radix nodes, almost all locking can generally be done away with.
>> Instead only a global rmlock (read-mostly) could govern the entire
>> routing table.
>> Obtaining the rmlock for reading is essentially free.
> 
> This is exactly what I'm thinking; this feels like the right way forward.
> 
> A single rwlock should be fine; route table updates should generally 
> only be happening from one process, and thus a single thread, at any 
> given time.

rmlocks are even faster, and the change-to-use ratio here suits them well.
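
To make the intent concrete, a minimal sketch of what an rmlock-guarded
lookup path could look like, per rmlock(9).  The rt_rmlock name and the
lookup/update helpers are made up, not the actual routing table symbols:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>
#include <sys/socket.h>
#include <net/route.h>

static struct rmlock rt_rmlock;         /* rm_init(&rt_rmlock, "rtable")
                                           at table setup time */

/* Hot path: readers pay next to nothing. */
static struct rtentry *
rt_lookup_locked(struct sockaddr *dst)
{
        struct rm_priotracker tracker;
        struct rtentry *rt;

        rm_rlock(&rt_rmlock, &tracker);
        rt = rt_lookup(dst);            /* hypothetical lookup helper */
        rm_runlock(&rt_rmlock, &tracker);
        return (rt);
}

/* Rare path: table changes take the lock exclusively. */
static void
rt_change_locked(struct rtentry *rt)
{
        rm_wlock(&rt_rmlock);
        rt_change(rt);                  /* hypothetical update helper */
        rm_wunlock(&rt_rmlock);
}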

>> Table changes
>> are very infrequent compared to lookups (like 700,000 to 300-400) in
>> default-free Internet routing.  The radix trie nodes are rather big
>> and could use some more trimming to make them fit a single cache line.
>> I've already removed some stuff a couple of years ago and more can be
>> done.
>>
>> It's very important to keep this in mind: "profile, don't speculate".
> Beware though that functionality isn't sacrificed for the sake of this.
> 
> For example it would be very, very useful to be able to merge the 
> multicast routing implementation with the unicast -- with the proviso of 
> course that mBGP requires that RPF can be performed with a separate set 
> of FIB entries from the unicast FIB.
> 
> Of course this works if next-hops themselves are held in a container
> separately referenced from the radix node, such as a simple linked list
> as per the OpenBSD code.

Haven't looked at the multicast code so I can't comment.  The other
stuff is just talk so far.  No work in progress, at least from my side.
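
For what it's worth, the shape of the next-hop container you describe
would be roughly this (names made up for illustration, not the OpenBSD
layout):

#include <sys/queue.h>

struct nhop {
        LIST_ENTRY(nhop)         nh_link;       /* sibling next-hops  */
        struct sockaddr         *nh_gateway;    /* gateway address    */
        struct ifnet            *nh_ifp;        /* outgoing interface */
};

/*
 * The radix node itself stays small (ideally one cache line) and only
 * carries a reference to the list; multipath means more than one entry.
 */
struct rnode_ext {
        LIST_HEAD(, nhop)        rn_nhops;
};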

> If we ensure the parent radix trie node object fits in a cache line, 
> then that's fine.
> 
> [I am looking at some stuff in the dynamic/ad-hoc/mesh space which is 
> really going to need support for multipath similar to this.]

I was looking at a parallel forwarding table for fastforward that is
highly optimized for IPv4 and cache efficiency.  It was supposed to be
8-bit stride based (256-ary) with SSE-based multi-segment longest
prefix match updates.  I never managed to get it past the design stage
though.  And it's not one of the pressing issues.
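
The lookup side of that design is simple enough; a rough sketch of an
8-bit stride (256-ary) longest prefix match, assuming prefixes have
been expanded to byte boundaries and leaving out the SSE-assisted
update machinery entirely (nothing like this exists in the tree):

#include <stddef.h>
#include <stdint.h>

struct stride_node {
        struct stride_node      *sn_child[256]; /* next 8-bit stride        */
        void                    *sn_nexthop;    /* covering prefix, or NULL */
};

/* Walk one byte of the destination per level, remembering the best match. */
static void *
stride_lookup(const struct stride_node *root, uint32_t dst /* host order */)
{
        const struct stride_node *n = root;
        void *best = NULL;
        int shift;

        for (shift = 24; shift >= 0 && n != NULL; shift -= 8) {
                if (n->sn_nexthop != NULL)
                        best = n->sn_nexthop;
                n = n->sn_child[(dst >> shift) & 0xff];
        }
        if (n != NULL && n->sn_nexthop != NULL)
                best = n->sn_nexthop;
        return (best);
}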

The radix trie is pretty efficient though for being architecture
independent.  Even though the depth and variety in destination
addresses matter, it never really turned out to become a bottleneck
in my profiles at the time.  It does have its limitations though,
which become more apparent at very high PPS and with very large
routing tables as in the DFZ.

-- 
Andre



