FreeBSD Mail Archives

Date:      Mon, 17 Mar 2008 18:45:52 +0000 (GMT)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        Julian Elischer <julian@elischer.org>
Cc:        arch@freebsd.org, freebsd-current@freebsd.org, "Christian S.J. Peron" <csjp@FreeBSD.org>
Subject:   Re: HEADS UP: zerocopy bpf commits impending
Message-ID:  <20080317183024.I80049@fledge.watson.org>
In-Reply-To: <47DEB62A.4030301@elischer.org>
References:  <20080317133029.GA19369@sub.vaned.net> <20080317134335.A3253@fledge.watson.org> <47DEB62A.4030301@elischer.org>

On Mon, 17 Mar 2008, Julian Elischer wrote:

>> Per previous posts, interested parties can find the slides on the design
>> from the BSDCan 2007 developer summit here:
>>
>> http://www.watson.org/~robert/freebsd/2007bsdcan/20070517-devsummit-zerocopybpf.pdf
>
> with the video of the talk at:
>
> http://www.freebsd.org/~julian/BSDCan-2007/rwatson_bpf.mov

The primary design change since that time is that we've eliminated the
ioctl-driven monitoring and ACKing of shared memory buffers from userspace.
All shared memory consumers must use the shared memory ACK model, and our
libpcap changes do that.  This removes redundancy (and complexity) from the
set of ioctls we've added.  I've attached the (new) text from bpf.4 below,
which I think captures the changes best.

Robert N M Watson
Computer Laboratory
University of Cambridge

BUFFER MODES
      bpf devices deliver packet data to the application via memory buffers
      provided by the application.  The buffer mode is set using the
      BIOCSETBUFMODE ioctl, and read using the BIOCGETBUFMODE ioctl.

    Buffered read mode
      By default, bpf devices operate in the BPF_BUFMODE_BUFFER mode, in which
      packet data is copied explicitly from the kernel to user memory using the
      read(2) system call.  The user process will declare a fixed buffer size
      that will be used both for sizing internal buffers and for all read(2)
      operations on the file.  This size is queried using the BIOCGBLEN ioctl,
      and is set using the BIOCSBLEN ioctl.  Note that an individual packet
      larger than the buffer size is necessarily truncated.
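      A minimal sketch of walking the packet records returned by one read(2)
      call.  It assumes the FreeBSD record layout, in which each packet is
      preceded by a struct bpf_hdr and the next record starts at
      BPF_WORDALIGN(bh_hdrlen + bh_caplen).  The demo_bpf_hdr structure and
      DEMO_WORDALIGN macro are simplified, hypothetical stand-ins so the sketch
      builds anywhere; real code would include <net/bpf.h> and use struct
      bpf_hdr and BPF_WORDALIGN directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical, simplified stand-in for struct bpf_hdr; the real structure
 * in <net/bpf.h> stores the timestamp as a struct timeval.
 */
struct demo_bpf_hdr {
	uint32_t bh_tstamp_sec;		/* capture time, seconds */
	uint32_t bh_tstamp_usec;	/* capture time, microseconds */
	uint32_t bh_caplen;		/* bytes of the packet present in the buffer */
	uint32_t bh_datalen;		/* original length of the packet */
	uint16_t bh_hdrlen;		/* length of this header, including padding */
};

/* Assumed to match BPF_WORDALIGN(), which rounds up to sizeof(long). */
#define DEMO_WORDALIGN(x) \
	(((x) + (sizeof(long) - 1)) & ~(sizeof(long) - 1))

typedef void demo_pkt_cb(const struct demo_bpf_hdr *, const unsigned char *,
    void *);

/*
 * Walk the records in the first len bytes of buf (the value returned by
 * read(2)), calling cb, if non-NULL, for each complete record.  Returns the
 * number of records visited; stops early on a malformed record.
 */
static size_t
walk_bpf_buffer(const unsigned char *buf, size_t len, demo_pkt_cb *cb,
    void *arg)
{
	struct demo_bpf_hdr hdr;
	size_t off = 0, count = 0;

	assert(buf != NULL || len == 0);
	while (off + sizeof(hdr) <= len) {
		memcpy(&hdr, buf + off, sizeof(hdr));	/* avoid unaligned reads */
		if (hdr.bh_hdrlen < sizeof(hdr) ||
		    hdr.bh_hdrlen > len - off ||
		    hdr.bh_caplen > len - off - hdr.bh_hdrlen)
			break;
		if (cb != NULL)
			cb(&hdr, buf + off + hdr.bh_hdrlen, arg);
		count++;
		off += DEMO_WORDALIGN((size_t)hdr.bh_hdrlen + hdr.bh_caplen);
	}
	return (count);
}
```

      A record whose bh_datalen exceeds bh_caplen is a packet that did not fit
      and was truncated, as described above.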

    Zero-copy buffer mode
      bpf devices may also operate in the BPF_BUFMODE_ZEROCOPY mode, in which
      packet data is written directly into user memory buffers by the kernel,
      avoiding both system call and copying overhead.  Buffers are of fixed
      (and equal) size, page-aligned, and a whole multiple of the page size.
      The maximum zero-copy buffer size is returned by the BIOCGETZMAX ioctl.
      Note that an individual packet larger than the buffer size is necessarily
      truncated.
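      One way to meet the size and alignment rules, sketched under
      assumptions: the helper names are hypothetical, zmax is passed in as a
      plain parameter where a real program would obtain it from BIOCGETZMAX,
      and posix_memalign(3) is used for page-aligned allocation (mmap(2) with
      MAP_ANON is another common choice).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Choose a zero-copy buffer size: round the request up to whole pages,
 * then clamp it to zmax rounded down to whole pages.  Returns 0 if not even
 * one page fits under zmax.
 */
static size_t
zbuf_size(size_t want, size_t pagesize, size_t zmax)
{
	size_t size, limit;

	assert(pagesize > 0);
	size = (want + pagesize - 1) / pagesize * pagesize;
	limit = zmax / pagesize * pagesize;
	return (size > limit ? limit : size);
}

/* True if p lies on a page boundary. */
static int
is_page_aligned(const void *p, size_t pagesize)
{
	return ((uintptr_t)p % pagesize == 0);
}

/*
 * Allocate one page-aligned buffer of size bytes, which must be a whole
 * number of pages.  Returns NULL on failure; release with free(3).
 */
static void *
zbuf_alloc(size_t size, size_t pagesize)
{
	void *p;

	if (size == 0 || size % pagesize != 0)
		return (NULL);
	if (posix_memalign(&p, pagesize, size) != 0)
		return (NULL);
	return (p);
}
```

      Both buffers must be the same size, so a program would compute the size
      once with zbuf_size() (using sysconf(_SC_PAGESIZE) for the page size)
      and call zbuf_alloc() twice with that value.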

      The user process registers two memory buffers using the BIOCSETZBUF
      ioctl, which accepts a struct bpf_zbuf pointer as an argument:

      struct bpf_zbuf {
              void *bz_bufa;
              void *bz_bufb;
              size_t bz_buflen;
      };

      bz_bufa is a pointer to the userspace address of the first buffer that
      will be filled, and bz_bufb is a pointer to the second buffer.  bpf will
      then cycle between the two buffers starting with bz_bufa.

      Each buffer begins with a fixed-length header to hold synchronization and

      struct bpf_zbuf_header {
              volatile u_int  bzh_kernel_gen; /* Kernel generation number. */
              volatile u_int  bzh_kernel_len; /* Length of data in the buffer. */
              volatile u_int  bzh_user_gen;   /* User generation number. */
              /* ...padding for future use... */
      };

      The header structure of each buffer, including all padding, should be
      zeroed before it is passed to the ioctl.  Remaining space in the buffer
      will be used by the kernel to store packet data, laid out in the same
      format as with buffered read mode.
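      A sketch of preparing the two buffers for registration.  So that it
      builds without <net/bpf.h>, it uses hypothetical local mirrors
      (demo_zbuf_header, demo_zbuf) of the structures shown above; the
      five-word padding is an assumption, so real code should zero
      sizeof(struct bpf_zbuf_header) bytes of the real header and then pass
      the filled-in struct bpf_zbuf to ioctl(fd, BIOCSETZBUF, &zb).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local mirrors of the structures above; the padding size is assumed. */
struct demo_zbuf_header {
	volatile unsigned int bzh_kernel_gen;
	volatile unsigned int bzh_kernel_len;
	volatile unsigned int bzh_user_gen;
	unsigned int bzh_pad[5];		/* assumed padding size */
};

struct demo_zbuf {
	void *bz_bufa;
	void *bz_bufb;
	size_t bz_buflen;
};

/*
 * Zero both headers, padding included, so that the kernel initially owns
 * both buffers, and fill in the descriptor for BIOCSETZBUF.  bufa and bufb
 * must each be buflen bytes long.
 */
static void
zbuf_prepare(struct demo_zbuf *zb, void *bufa, void *bufb, size_t buflen)
{
	assert(buflen > sizeof(struct demo_zbuf_header));
	memset(bufa, 0, sizeof(struct demo_zbuf_header));
	memset(bufb, 0, sizeof(struct demo_zbuf_header));
	zb->bz_bufa = bufa;		/* the kernel fills this buffer first */
	zb->bz_bufb = bufb;
	zb->bz_buflen = buflen;
}

/* Packet records begin immediately after the header. */
static unsigned char *
zbuf_data(void *buf)
{
	return ((unsigned char *)buf + sizeof(struct demo_zbuf_header));
}

/* Bytes available for packet records in each buffer. */
static size_t
zbuf_data_space(const struct demo_zbuf *zb)
{
	return (zb->bz_buflen - sizeof(struct demo_zbuf_header));
}
```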

      The kernel and the user process follow a simple acknowledgement protocol
      via the buffer header to synchronize access to the buffer: when the
      header generation numbers, bzh_kernel_gen and bzh_user_gen, hold the same
      value, the kernel owns the buffer, and when they differ, userspace owns
      the buffer.

      While the kernel owns the buffer, the contents are unstable and may
      change asynchronously; while the user process owns the buffer, its
      contents are stable and will not be changed until the buffer has been
      acknowledged.

      Initializing the buffer headers to all 0's before registering the buffer
      has the effect of assigning initial ownership of both buffers to the
      kernel.  The kernel signals that a buffer has been assigned to userspace
      by modifying bzh_kernel_gen, and userspace acknowledges the buffer and
      returns it to the kernel by setting the value of bzh_user_gen to the
      value of bzh_kernel_gen.

      In order to avoid caching and memory re-ordering effects, the user
      process must use atomic operations and memory barriers when checking for
      and acknowledging buffers:

      #include <sys/types.h>
      #include <net/bpf.h>            /* struct bpf_zbuf_header */
      #include <machine/atomic.h>

      /*
       * Return ownership of a buffer to the kernel for reuse.
       */
      static void
      buffer_acknowledge(struct bpf_zbuf_header *bzh)
      {

              atomic_store_rel_int(&bzh->bzh_user_gen, bzh->bzh_kernel_gen);
      }

      /*
       * Check whether a buffer has been assigned to userspace by the kernel.
       * Return true if userspace owns the buffer, and false otherwise.
       */
      static int
      buffer_check(struct bpf_zbuf_header *bzh)
      {

              return (bzh->bzh_user_gen !=
                  atomic_load_acq_int(&bzh->bzh_kernel_gen));
      }
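      For readers outside FreeBSD, the same protocol can be modeled with C11
      <stdatomic.h>: an acquire load of bzh_kernel_gen when checking and a
      release store of bzh_user_gen when acknowledging, mirroring
      atomic_load_acq_int() and atomic_store_rel_int() above.  This is a
      single-threaded simulation, not the bpf interface: all sim_* names are
      hypothetical, and sim_kernel_assign() assumes the kernel "modifies"
      bzh_kernel_gen by incrementing it.  The consumer visits the buffers in
      order, starting with bz_bufa.

```c
#include <assert.h>
#include <stdatomic.h>

/* Portable model of the header fields, using C11 atomics. */
struct sim_zbuf_header {
	atomic_uint bzh_kernel_gen;
	atomic_uint bzh_kernel_len;
	atomic_uint bzh_user_gen;
};

static void
sim_header_init(struct sim_zbuf_header *h)
{
	atomic_init(&h->bzh_kernel_gen, 0);	/* all zeros: kernel owns it */
	atomic_init(&h->bzh_kernel_len, 0);
	atomic_init(&h->bzh_user_gen, 0);
}

/* Counterpart of buffer_check(): acquire pairs with the kernel's release. */
static int
sim_buffer_check(struct sim_zbuf_header *h)
{
	unsigned int kgen;

	kgen = atomic_load_explicit(&h->bzh_kernel_gen, memory_order_acquire);
	return (atomic_load_explicit(&h->bzh_user_gen,
	    memory_order_relaxed) != kgen);
}

/* Counterpart of buffer_acknowledge(). */
static void
sim_buffer_acknowledge(struct sim_zbuf_header *h)
{
	atomic_store_explicit(&h->bzh_user_gen,
	    atomic_load_explicit(&h->bzh_kernel_gen, memory_order_relaxed),
	    memory_order_release);
}

/* Simulated kernel side; only legal while the kernel owns h. */
static void
sim_kernel_assign(struct sim_zbuf_header *h, unsigned int len)
{
	assert(!sim_buffer_check(h));
	atomic_store_explicit(&h->bzh_kernel_len, len, memory_order_relaxed);
	atomic_fetch_add_explicit(&h->bzh_kernel_gen, 1, memory_order_release);
}

struct sim_consumer {
	struct sim_zbuf_header *bufs[2];	/* [0] = bz_bufa, [1] = bz_bufb */
	int next;				/* index expected next */
};

/*
 * If the expected buffer belongs to userspace, consume it (here: just read
 * its data length), acknowledge it, and move on to the other buffer.
 * Returns the data length consumed, or 0 if the buffer is not ready.
 */
static unsigned int
sim_consume_next(struct sim_consumer *c)
{
	struct sim_zbuf_header *h = c->bufs[c->next];
	unsigned int len;

	if (!sim_buffer_check(h))
		return (0);
	len = atomic_load_explicit(&h->bzh_kernel_len, memory_order_relaxed);
	/* Packet records would be parsed here, before acknowledging. */
	sim_buffer_acknowledge(h);
	c->next ^= 1;
	return (len);
}
```

      Parsing must finish before the acknowledgement, since the release store
      hands the buffer back and the kernel may overwrite it immediately.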

      The user process may force the assignment of the next buffer, if any data
      is pending, to userspace using the BIOCROTZBUF ioctl.  This allows the
      user process to retrieve data in a partially filled buffer before the
      buffer is full, such as following a timeout; the process must check for
      buffer ownership using the header generation numbers, as the buffer will
      not be assigned if no data was present.
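      A single-threaded model of the timeout path, to show why the header must
      be checked after BIOCROTZBUF.  sim_rotate() is a hypothetical stand-in
      for the kernel's handling of that ioctl (assigning the buffer only when
      data is pending, and assuming bzh_kernel_gen is incremented); the
      pending argument represents kernel-private state that a real program
      cannot see.

```c
#include <assert.h>
#include <stdatomic.h>

struct sim_zbuf_header {
	atomic_uint bzh_kernel_gen;
	atomic_uint bzh_kernel_len;
	atomic_uint bzh_user_gen;
};

static void
sim_header_init(struct sim_zbuf_header *h)
{
	atomic_init(&h->bzh_kernel_gen, 0);
	atomic_init(&h->bzh_kernel_len, 0);
	atomic_init(&h->bzh_user_gen, 0);
}

static int
sim_user_owns(struct sim_zbuf_header *h)
{
	unsigned int kgen;

	kgen = atomic_load_explicit(&h->bzh_kernel_gen, memory_order_acquire);
	return (atomic_load_explicit(&h->bzh_user_gen,
	    memory_order_relaxed) != kgen);
}

/*
 * Hypothetical model of BIOCROTZBUF acting on the buffer being filled: it
 * is assigned to userspace only if pending > 0.  Nothing is returned,
 * because a real caller learns the outcome only from the header.
 */
static void
sim_rotate(struct sim_zbuf_header *h, unsigned int pending)
{
	assert(!sim_user_owns(h));	/* the kernel owns the buffer it fills */
	if (pending == 0)
		return;
	atomic_store_explicit(&h->bzh_kernel_len, pending, memory_order_relaxed);
	atomic_fetch_add_explicit(&h->bzh_kernel_gen, 1, memory_order_release);
}

/*
 * User side after a timeout: request a rotation, then consult the header.
 * Returns the number of bytes now available to userspace, or 0.
 */
static unsigned int
sim_flush_after_timeout(struct sim_zbuf_header *h, unsigned int pending)
{
	sim_rotate(h, pending);		/* stands in for the BIOCROTZBUF ioctl */
	if (!sim_user_owns(h))
		return (0);		/* no data was present; not assigned */
	return (atomic_load_explicit(&h->bzh_kernel_len, memory_order_relaxed));
}
```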

      As in the buffered read mode, kqueue(2), poll(2), and select(2) may be
      used to sleep awaiting the availability of a completed buffer.  They will
      return a readable file descriptor when ownership of the next buffer is
      assigned to userspace.
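      A poll(2)-based wait, as a sketch.  It uses only standard POSIX calls and
      the function name is hypothetical; nothing in it is bpf-specific, so it
      can be exercised with a pipe standing in for the bpf descriptor.
      kqueue(2) is the native FreeBSD alternative.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Sleep until fd is readable or timeout_ms milliseconds pass (-1 waits
 * forever).  For a bpf descriptor in zero-copy mode, "readable" means the
 * next buffer has been assigned to userspace; the caller should still
 * confirm ownership from the buffer header before touching its contents.
 * Returns 1 if readable, 0 on timeout, -1 on error.  An interrupted poll(2)
 * is restarted with the full timeout.
 */
static int
wait_for_buffer(int fd, int timeout_ms)
{
	struct pollfd pfd;
	int n;

	assert(fd >= 0);
	pfd.fd = fd;
	pfd.events = POLLIN;
	pfd.revents = 0;
	do {
		n = poll(&pfd, 1, timeout_ms);
	} while (n == -1 && errno == EINTR);
	if (n <= 0)
		return (n);
	return ((pfd.revents & POLLIN) != 0 ? 1 : -1);
}
```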

      In the current implementation, the kernel will assign ownership of at
      most one buffer at a time to the user process.  The user process must
      acknowledge the current buffer in order to be notified that the next
      buffer is ready for processing.  Programs should not rely on this as an
      invariant, as it may change in future versions.
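      To make the one-buffer-at-a-time behavior concrete, a hypothetical
      single-threaded model of the kernel's side: a buffer that fills while
      userspace still holds the other one stays with the kernel until the
      outstanding buffer is acknowledged.  The sim_* names and the increment
      of bzh_kernel_gen are assumptions, and what the real kernel does with
      new packets in the meantime is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>

struct sim_zbuf_header {
	atomic_uint bzh_kernel_gen;
	atomic_uint bzh_kernel_len;
	atomic_uint bzh_user_gen;
};

static void
sim_header_init(struct sim_zbuf_header *h)
{
	atomic_init(&h->bzh_kernel_gen, 0);
	atomic_init(&h->bzh_kernel_len, 0);
	atomic_init(&h->bzh_user_gen, 0);
}

static int
sim_user_owns(struct sim_zbuf_header *h)
{
	unsigned int kgen;

	kgen = atomic_load_explicit(&h->bzh_kernel_gen, memory_order_acquire);
	return (atomic_load_explicit(&h->bzh_user_gen,
	    memory_order_relaxed) != kgen);
}

/* User side: hand a buffer back by copying kernel_gen into user_gen. */
static void
sim_user_ack(struct sim_zbuf_header *h)
{
	assert(sim_user_owns(h));
	atomic_store_explicit(&h->bzh_user_gen,
	    atomic_load_explicit(&h->bzh_kernel_gen, memory_order_relaxed),
	    memory_order_release);
}

/*
 * Hypothetical kernel model: bufs[full] has filled up.  It is assigned to
 * userspace only if userspace owns neither buffer.  Returns 1 if the
 * buffer was assigned, 0 if it must wait for an acknowledgement.
 */
static int
sim_try_assign(struct sim_zbuf_header *bufs[2], int full, unsigned int len)
{
	assert(full == 0 || full == 1);
	if (sim_user_owns(bufs[0]) || sim_user_owns(bufs[1]))
		return (0);		/* at most one buffer assigned at a time */
	atomic_store_explicit(&bufs[full]->bzh_kernel_len, len,
	    memory_order_relaxed);
	atomic_fetch_add_explicit(&bufs[full]->bzh_kernel_gen, 1,
	    memory_order_release);
	return (1);
}
```

      A consumer that stays correct if this changes checks the ownership of
      each buffer in turn and acknowledges each one before moving on, rather
      than assuming that only one buffer can ever be user-owned.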
