Date:      Wed, 23 Apr 2008 12:02:37 +0200
From:      "Darren Reed" <darrenr@freebsd.org>
To:        "Robert Watson" <rwatson@FreeBSD.org>
Cc:        arch@freebsd.org, freebsd-current@freebsd.org, "Christian S.J. Peron" <csjp@FreeBSD.org>
Subject:   Re: HEADS UP: zerocopy bpf commits impending
Message-ID:  <1208944957.9641.1249417345@webmail.messagingengine.com>
In-Reply-To: <20080408132058.U10870@fledge.watson.org>
References:  <20080317133029.GA19369@sub.vaned.net> <20080317134335.A3253@fledge.watson.org> <47FB586F.90606@freebsd.org> <20080408132058.U10870@fledge.watson.org>

On Tue, 8 Apr 2008 13:28:18 +0100 (BST), "Robert Watson" <rwatson@FreeBSD.org> said:
> 
> On Tue, 8 Apr 2008, Darren Reed wrote:
> 
> > Is there a performance analysis of the copy vs zerocopy available? (I don't 
> > see one in the paper, just a "to do" item.)
> >
> > The numbers I'm interested in seeing are how many Mb/s you can capture 
> > before you start suffering packet loss.  This needs to be done with 
> > sequenced packets so that you can observe gaps in the sequence captured.
> 
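As an aside on that methodology, the gap-counting side is just
wraparound-safe arithmetic on the recovered sequence numbers.  A
minimal sketch; extract_seq() is a hypothetical helper that depends
entirely on the format of the sequenced test traffic:

    #include <stdint.h>

    /* Hypothetical: recover the 32-bit sequence number that the
     * traffic generator stamped into each payload. */
    uint32_t extract_seq(const unsigned char *payload);

    static uint32_t next_expected;
    static uint64_t lost;

    static void
    account_packet(const unsigned char *payload)
    {
            uint32_t seq = extract_seq(payload);

            /* Unsigned subtraction stays correct across rollover. */
            if (seq != next_expected)
                    lost += (uint32_t)(seq - next_expected);
            next_expected = seq + 1;
    }
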
> We've done some analysis, and a couple of companies have the zero-copy BPF
> code deployed.  I hope to generate a more detailed analysis before the
> developer summit so we can review it at BSDCan.  The basic observation is
> that for quite a few types of network links, the win isn't in packet loss
> per se, but in reduced CPU use, freeing up CPU for other activities.  There
> are a number of sources of win:
> 
> - Reduced system call overhead -- as load increases, # system calls goes
>    down, especially if you get a two-CPU pipeline going.
> 
> - Reduced memory access, especially for larger buffer sizes, avoids filling
>    the cache twice (first in copyout, then again in using the buffer in
>    userspace).
> 
> - Reduced lock contention, as only a single thread, the device driver
>    ithread, is acquiring the bpf descriptor's lock, and it's no longer
>    contending with the user thread.
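For context, all three wins stem from the same structural change: the
driver ithread fills buffers that userspace has mapped, so packet data
is never copied out and no second thread takes the descriptor lock on
the capture path.  A minimal setup sketch, using the ioctl and struct
names as they were later documented in bpf(4); they may not match the
tree exactly as of this mail:

    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <net/bpf.h>

    /*
     * Switch a bpf descriptor to zero-copy mode and register two
     * anonymous buffers; after this the kernel writes packets
     * straight into bz_bufa/bz_bufb with no copyout().
     */
    static int
    zbuf_setup(int fd, size_t buflen, struct bpf_zbuf *zb)
    {
            u_int mode = BPF_BUFMODE_ZBUF;

            if (ioctl(fd, BIOCSETBUFMODE, &mode) < 0)
                    return (-1);
            zb->bz_buflen = buflen;        /* both buffers: same size */
            zb->bz_bufa = mmap(NULL, buflen, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_PRIVATE, -1, 0);
            zb->bz_bufb = mmap(NULL, buflen, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_PRIVATE, -1, 0);
            if (zb->bz_bufa == MAP_FAILED || zb->bz_bufb == MAP_FAILED)
                    return (-1);
            return (ioctl(fd, BIOCSETZBUF, zb));
    }
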
> 
> One interesting, and in retrospect reasonable, side effect is that user CPU
> time goes up in the SMP scenario, as cache misses on the BPF buffer move
> from the read() system call to userspace.  And, as you observe, you have to
> use somewhat larger buffer sizes, as in the previous scenario there were
> three buffers: two kernel buffers and a user buffer, and now there are
> simply two kernel buffers shared directly with user space.
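To put illustrative numbers on the sizing point: with 64 KB buffers the
old path kept roughly 192 KB in flight (two kernel buffers plus the
user buffer being drained), while the shared scheme keeps 128 KB, so
something like 96 KB per shared buffer restores comparable headroom.
The later bpf(4) documentation also provides BIOCGETZMAX to query the
largest per-buffer size the kernel will accept.
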
> 
> The original committed version has a problem in that it allows only one
> kernel buffer to be "owned" by userspace at a time, which can lead to
> excess calls to select(); this has now been corrected, so if people have
> run performance benchmarks, they should update to the new code and re-run
> them.
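The excess-select() issue is visible in how ownership is handed back.
In the interface as later documented in bpf(4), each shared buffer
begins with a struct bpf_zbuf_header whose generation counters say who
currently holds it; a sketch of the consumer side, with names taken
from that documentation rather than from the tree as of this mail:

    #include <sys/types.h>
    #include <machine/atomic.h>
    #include <net/bpf.h>

    /* The kernel bumps bzh_kernel_gen when it hands a buffer over. */
    static int
    buffer_ready(struct bpf_zbuf_header *bzh)
    {
            return (bzh->bzh_user_gen !=
                atomic_load_acq_int(&bzh->bzh_kernel_gen));
    }

    /* Matching the kernel's generation returns the buffer for reuse. */
    static void
    buffer_release(struct bpf_zbuf_header *bzh)
    {
            atomic_store_rel_int(&bzh->bzh_user_gen, bzh->bzh_kernel_gen);
    }
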
> 
> I don't have numbers off-hand, but 5%-25% were numbers that appeared in
> some of the measurements, and I'd like to think that the recent fix will
> further improve that.

Out of curiosity, were those numbers for single cpu/core systems
or systems with more than one cpu/core active/available?

I know the testing I did was all single-threaded, so moving time
from the kernel to userspace couldn't be expected to make a large
overall difference on a non-SMP kernel (NetBSD-something at the time).

Darren
-- 
  Darren Reed
  darrenr@fastmail.net



