Date:      Thu, 6 May 2004 14:05:18 -0400 
From:      Gerrit Nagelhout <gnagelhout@sandvine.com>
To:        'Luigi Rizzo' <rizzo@icir.org>, freebsd-current@freebsd.org, 'Robert Watson' <rwatson@freebsd.org>
Subject:   RE: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID:  <FE045D4D9F7AED4CBFF1B3B813C85337021AB394@mail.sandvine.com>

Luigi Rizzo wrote:

> On Wed, May 05, 2004 at 07:38:38PM -0400, Gerrit Nagelhout wrote:
> > Robert Watson wrote:
> ...
> > > Getting polling and SMP to play nicely would be a very good thing,
> > > but isn't something I currently have the bandwidth to work on.
> > > Don't suppose we could interest you in that? :-)
> ...
> > I won't be able to work on that feature anytime soon, but if some
> > prototyping turns out to have good results, and the mutex cost issues
> > are worked out, it's quite likely that we'll try to implement it.  The
> > original author of the polling code (Luigi?) may have some input on
> > this as well.
> 
> ENOTIME at the moment, but surely i would like to figure out
> first some locking issues e.g. related to the clustering of
> packets to reduce the number of locking ops.
> The other issue is the partitioning of work -- no point
> in having multiple polling loops work on the same interface.
> Possibly we might try to restructure the processing in the network
> stack by using one processor/polling loop that quickly determines
> the tasks that need work and then posts the results so that
> other processors can grab one task each. Kind of a dataflow
> architecture, in a sense.
> 
> In any case, I am really impressed by the numbers Gerrit achieved
> in the UP/4.7 case -- i never went above 800kpps though in
> my case i think i was limited by the PCI bus (64 bit, 100MHz)
> or possibly even the chipset
> 

The numbers I achieved on 4.7 were on a 2.8 GHz Xeon with 64-bit, 100 MHz PCI-X.
The setup was a 2-port bridge, with each port receiving & transmitting
up to 600kpps (1.2Mpps aggregate).  On this particular system, I've seen 
it as high as 700kpps when using 133 MHz and enabling hardware prefetch 
in the em driver (this is not a supported feature though and might hang 
the chip).
Using a 4 port bridge, the aggregate can go a little higher (~1.6Mpps).
In order to get this performance, I had to make quite a few tweaks
to both the em driver and the way mbufs/clusters are used.
The first bottleneck (the one 4.7 currently has) is due to the PCI
accesses in the em driver.  The following changes will make this better:
1) Don't update the tail pointer for every packet added to the receive
   ring.  Only doing this once every 64 packets or so reduces a lot
   of PCI accesses.  (I later noticed that the Linux driver supplied
   by Intel already does this.)  A rough sketch of this is shown below.
2) Same as 1), but for the transmit ring.  This one is a little 
   trickier because it may add some latency.  I only updated it after
   "n" packets, and at the end of em_poll.
3) Have the transmitter only write back every nth descriptor instead
   of every one.  This makes the transmit ring cleanup code a bit more
   expensive, but it's well worth it.
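
For 1) and 2), the idea is just to accumulate refilled descriptors and
hit the tail register once per batch.  Something like the following
compile-and-run sketch (not the actual driver diff; RX_TAIL_BATCH,
write_rx_tail, etc. are made-up names, and the stub just prints
instead of doing the real CSR write):

#include <stdio.h>
#include <stdint.h>

#define RX_RING_SIZE    256     /* placeholder ring size */
#define RX_TAIL_BATCH    64     /* only touch the hardware tail every 64 */

/*
 * Stand-in for the real register write in the driver; here it just
 * logs so the sketch compiles and runs.  The exact off-by-one
 * convention for the RDT value is left to the real driver.
 */
static void
write_rx_tail(uint32_t tail)
{
	printf("RDT <- %u\n", tail);
}

static uint32_t rx_next;        /* next descriptor to hand to the NIC */
static uint32_t rx_pending;     /* descriptors refilled since last write */

/* Called once per refilled receive descriptor. */
static void
refill_rx_desc(void)
{
	/* ...fill in the descriptor at rx_next here... */
	rx_next = (rx_next + 1) % RX_RING_SIZE;
	if (++rx_pending >= RX_TAIL_BATCH) {
		/* Only now do we cross the PCI bus. */
		write_rx_tail(rx_next);
		rx_pending = 0;
	}
}

int
main(void)
{
	int i;

	for (i = 0; i < 200; i++)
		refill_rx_desc();
	/* Flush whatever is left over at the end of the poll pass. */
	if (rx_pending != 0) {
		write_rx_tail(rx_next);
		rx_pending = 0;
	}
	return (0);
}

The transmit side in 2) looks the same, except the flush also has to
happen at the end of em_poll so a lightly loaded queue doesn't sit on
packets.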

After making these changes, the bottleneck will typically become 
non-cached memory accesses.  Changing em_get_buf to use the mcl_pool
cache in uipc_mbuf.c makes the receive path a little faster but it's
still not optimal.  Ideally, when a new buffer is added to the receive
ring, it (mbuf & cluster) shouldn't have to be pulled into the cache
until the packet has been filled in and is ready to be processed.  But
because free mbufs are kept on a linked list, each one gets pulled into
the cache just to read its next pointer.  To avoid this, I created a
cached array (stack) of cluster pointers (not attached to any mbuf).
When adding a cluster to the receive ring, the cluster itself doesn't
need to be touched at all, saving a memory read.  Once the packet is
ready to be processed, an mbuf is allocated to attach the cluster to.
In the bridging code, the number of mbufs was small enough to always
stay in cache, so there was only one random memory lookup per packet
by the CPU.
This also made it possible to create much larger receive rings (2048)
in order to avoid dropping packets under high loads when the 
processor got "distracted" for a bit.
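
The cluster stack is nothing fancy; roughly like this (a simplified,
hypothetical sketch with made-up names like cl_stack, ignoring locking
and the fall-back path to the normal allocator):

#include <stddef.h>

#define CL_STACK_SIZE 4096              /* made-up size */

/*
 * The array itself is small and hot, so it stays in the CPU cache;
 * the clusters it points to are never dereferenced here.
 */
static void *cl_stack[CL_STACK_SIZE];
static int   cl_top;

int
cl_push(void *cluster)
{
	if (cl_top >= CL_STACK_SIZE)
		return (-1);            /* full; caller frees normally */
	cl_stack[cl_top++] = cluster;
	return (0);
}

void *
cl_pop(void)
{
	if (cl_top == 0)
		return (NULL);          /* empty; fall back to allocator */
	return (cl_stack[--cl_top]);
}

The receive refill path only pops a pointer off this (cached) array and
writes the cluster's physical address into the descriptor; the cluster
memory itself isn't read until the NIC has filled it in and an mbuf is
attached to it.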

Also, this kind of performance was only possible using polling.  One
advantage that UP has (at least on 4.7) is the idle_poll feature.
One more optimization that I hacked together once, but have not been
able to properly implement is to use 4M pages for the mbufs and 
clusters.  I found that using software prefetches can help 
performance significantly, but on the i386 you can only prefetch
something that is already in the TLB.  Since there are only 64 entries,
the odds of a cluster being in the TLB when it is received are very
small.  With 4M pages, it only takes a few pages to map all the mbufs &
clusters in the system.  The advantages are that you can prefetch, and
avoid TLB misses & thrashing under high loads.
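
By software prefetch I mean something along these lines (sketch only:
rx_entry and prefetch_ahead are made-up names, and I'm showing the gcc
__builtin_prefetch here; on the 4.x-era compiler this would be an
inline "prefetchnta" asm instead):

/* Made-up stand-in for the real ring entry. */
struct rx_entry {
	void	*cluster;       /* data buffer the NIC filled in */
};

/*
 * While processing descriptor i, prefetch the cluster of descriptor
 * i + 2 so its first cache line is (hopefully) resident by the time
 * we get to it.  The prefetch only does anything useful if the target
 * page is already in the TLB, which is why the 4M-page mapping matters.
 */
static inline void
prefetch_ahead(struct rx_entry *ring, int i, int ring_size)
{
	int ahead = (i + 2) % ring_size;

	__builtin_prefetch(ring[ahead].cluster, 0 /* read */, 0 /* non-temporal */);
}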

After all these changes, the bottleneck becomes raw cpu cycles.  That's
why I'd like to get multiple processors doing polling simultaneously.

I've always meant to submit some of these changes back to FreeBSD again,
but wasn't sure how much interest there would be.  If anyone is
interested in helping me out with this, let me know and I will try to
get that process started.

Gerrit


