From owner-freebsd-net  Fri Jun 16 20:26:16 2000
Date: Fri, 16 Jun 2000 21:25:45 -0600
From: "Kenneth D. Merry" <ken@kdm.org>
To: net@FreeBSD.ORG
Subject: zero copy sockets and NFS code for FreeBSD
Message-ID: <20000616212545.A57840@panzer.kdm.org>

[ This message is BCC'ed to -arch and -current so that it reaches a
slightly wider audience, but since it mostly deals with networking, it
should probably be discussed on -net. ]

Thanks to the efforts of a number of people, zero copy sockets and NFS
patches are available for FreeBSD-current at the URL listed below.

These patches include:

 - Two sets of zero copy send code, written by Drew Gallatin
   <gallatin@FreeBSD.ORG> and Robert Picco <picco@wevinc.com>.

 - Zero copy receive code, written by Drew Gallatin.

 - Zero copy NFS code, written by Drew Gallatin.

 - Header splitting firmware for Alteon's Tigon II boards (written by me),
   based on version 12.4.11 of their firmware.  This is used in combination
   with the zero copy receive code to guarantee that the payload of a TCP
   or UDP packet is placed into a page-aligned buffer.

 - Alteon firmware debugging ioctls and supporting routines for the Tigon
   driver (also written by me).  This will help anyone who is doing
   firmware development under FreeBSD for the Tigon boards.

Please note that the code is still in development, and should not be used
in a production system.  It could crash your system, you could lose data,
etc.

The Alteon firmware header splitting and debugging code was written for
Pluto Technologies (www.plutotech.com), which has kindly agreed to let me
release the code.

I'm releasing these patches now, so people can take a look at the code,
test it out, give feedback and hopefully supply patches for things that
are broken.

The code is located here:

http://people.FreeBSD.ORG/~ken/zero_copy/

The patches are based on -current from early in the day on June 13th, i.e.
before Peter's config changes.

Frequently Asked Questions:

1.	Known Problems.
2.	What is "zero copy"?
3.	How does zero copy work?
4.	What hardware does it work with?
5.	Configuration and performance tuning.
6.	Benchmarks.
7.	Possible future directions.

1.  Known Problems:

 - Robert Picco's zero copy send code (options ZCOPY) corrupts data that it
   sends.  You can verify this with the 'nttcp' port from ports/net, like
   this:

	nttcp -c -n 262144 -t -T -w 512k 10.0.0.2
	(assuming 10.0.0.2 is the target machine)

 - Running high volumes of traffic to the local machine can trigger panics.
   If the machine in question is '10.0.0.2', running the following on that
   same machine is enough to panic it:

	netperf -H 10.0.0.2
	(netperf is in ports/benchmarks)

2.  What is "zero copy"?

Zero copy is a misnomer, or an accurate description, depending on how you
look at things.

In the normal case, with network I/O, buffers are copied from the user
process into the kernel on the send side, and from the kernel into the user
process on the receiving side.

That is the copy that is being eliminated in this case.  The DMA or copy
from the kernel into the NIC, or from the NIC into the kernel is not the
copy that is being eliminated.  In fact you can't eliminate that copy
without taking packet processing out of the kernel altogether.  (i.e. the
kernel has to see the packet headers in order to determine what to do with
the payload)

Memory copies between userland and the kernel are one of the largest
bottlenecks in network performance on a BSD system, so eliminating them can
greatly increase network throughput when CPU or memory bandwidth is the
limiting factor, and decrease system load when it isn't.

3.  How does zero copy work?

The send side and receive side zero copy code work in different ways:

The send side code takes pages that the userland program writes to a
socket, and puts a COW (Copy On Write) mapping on each page, and stuffs it
into a mbuf.  The data the user program writes must be page sized and start
on a page boundary in order for it to be run through the zero copy send
code.

If the userland program writes to the page before it has been sent out on
the wire and the mbuf freed (and therefore the COW mapping revoked), the
page will be copied.  For TCP, the mbuf isn't freed until the packet is
acknowledged by the receiver.

So send side zero copy is only better than the standard case, where
userland buffers are copied into kernel buffers, if the userland program
doesn't immediately reuse the buffer.
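
As a rough illustration of the size and alignment rule above, here is a
minimal userland sketch.  The helper names zc_send_eligible() and
alloc_page_buf() are hypothetical and are not part of the patches, and
reading "page sized" as "a whole number of pages" is an assumption.  The
real decision is made inside the kernel; this only mirrors the rule so an
application can check its own buffers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: nonzero if a write of len bytes starting at buf
 * satisfies the rule described above (starts on a page boundary and
 * covers a whole number of pages).
 */
static int
zc_send_eligible(const void *buf, size_t len, size_t pagesize)
{
	/* pagesize must be a nonzero power of two */
	assert(pagesize != 0 && (pagesize & (pagesize - 1)) == 0);
	if (len == 0)
		return (0);
	return ((uintptr_t)buf % pagesize == 0 && len % pagesize == 0);
}

/*
 * Hypothetical helper: allocate npages pages of page-aligned memory.
 * Returns NULL on failure; release the memory with free().
 */
static void *
alloc_page_buf(size_t npages, size_t pagesize)
{
	void *p = NULL;

	if (posix_memalign(&p, pagesize, npages * pagesize) != 0)
		return (NULL);
	return (p);
}
```

An application would typically pass sysconf(_SC_PAGESIZE) as the page size,
and would avoid touching the buffer again until the data has been sent.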

Receive side zero copy works in a slightly different manner, and depends in
part on the capabilities of the network card in question.

One requirement for zero copy receive to work is that the chunks of data
passed up the network and socket layers have to be at least page sized, and
aligned on page boundaries.  This pretty much means that the card has to
have an MTU of at least 4K, or 8K in the case of the Alpha.  Gigabit
Ethernet cards using Jumbo Frames (9000 byte MTU) fall into this category.
More on that below.

Another requirement for zero copy receive to work is that the NIC driver
needs to allocate receive side pages from a "disposable" pool.  This means
allocating memory apart from the normal mbuf memory, and attaching it as an
external buffer to the mbuf.

It also helps if the NIC can receive packets into multiple buffers, and if
the NIC can separate the ethernet, IP, and TCP or UDP headers from the
payload.  The idea is to get the packet payload into one or more page-sized,
page-aligned buffers.
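
To make the MTU requirement concrete, the arithmetic can be sketched as
follows.  The header lengths assume IPv4 and TCP or UDP with no options,
and the function names are made up for this illustration.  The Ethernet
header does not count against the MTU, so it does not appear here.

```c
#include <assert.h>
#include <stddef.h>

/* Header lengths without options (assumption: IPv4, no IP or TCP options) */
enum { IP_HDR_LEN = 20, TCP_HDR_LEN = 20, UDP_HDR_LEN = 8 };

/* Payload bytes carried by one full-sized packet at the given MTU. */
static size_t
payload_bytes(size_t mtu, size_t l4_hdr_len)
{
	assert(mtu > IP_HDR_LEN + l4_hdr_len);
	return (mtu - IP_HDR_LEN - l4_hdr_len);
}

/* Whole pages of payload available once the headers are split off. */
static size_t
whole_pages(size_t mtu, size_t l4_hdr_len, size_t pagesize)
{
	assert(pagesize != 0);
	return (payload_bytes(mtu, l4_hdr_len) / pagesize);
}
```

With a 9000 byte MTU, a TCP packet carries 8960 bytes of payload, which is
two whole 4K pages (or one 8K page on the Alpha) plus a remainder.  With a
1500 byte MTU the payload never fills a page.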

The NIC driver receives data into these buffers allocated from a
disposable pool.  The mbuf with these buffers attached is then passed up
the network stack where the headers are removed.  Finally it reaches the
socket layer, and waits for the user to read it.  Once the user reads the
data, the kernel page is then substituted for the user's page, and the
user's page is then recycled.  This is otherwise known as "page flipping".

The page flip can only occur if both the userland buffer and kernel buffer
are page aligned, and if there is at least a page worth of data in the
source and destination.  Otherwise the data will be copied out using
copyout() in the normal manner.
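
The flip-or-copy decision can be modeled in a few lines.  This is only a
sketch of the rule as stated above, not the kernel code: the struct and
function names are hypothetical, and flipping whole pages while copying
the leftover tail is an assumption about how a partial last page would be
handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of how one read would be satisfied. */
struct rx_plan {
	size_t flipped;		/* bytes moved by page substitution */
	size_t copied;		/* bytes moved by an ordinary copy */
};

/*
 * kaddr/klen describe the kernel buffer holding the payload, uaddr/ulen
 * the user's read buffer.  Pages are flipped only when both addresses
 * are page aligned and at least one full page is available on both sides.
 */
static struct rx_plan
plan_receive(uintptr_t kaddr, uintptr_t uaddr, size_t klen, size_t ulen,
    size_t pagesize)
{
	struct rx_plan p = { 0, 0 };
	size_t n = klen < ulen ? klen : ulen;

	/* pagesize must be a nonzero power of two */
	assert(pagesize != 0 && (pagesize & (pagesize - 1)) == 0);
	if (kaddr % pagesize == 0 && uaddr % pagesize == 0)
		p.flipped = n - (n % pagesize);	/* whole pages only */
	p.copied = n - p.flipped;	/* the tail, or everything */
	return (p);
}
```

For example, an aligned 8960 byte payload read into an aligned 16K buffer
with 4K pages gives 8192 bytes flipped and 768 bytes copied under this
model, while any misalignment sends all 8960 bytes through the copy path.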

4.  What hardware does it work with?

The send side zero copy code should work with most any network adapter.

The receive side code, however, requires an adapter with an MTU that is at
least a page size, due to the alignment restrictions for page substitution
(or "page flipping").

The zero-copy NFS receive-side code also requires an adapter with an
MTU that is at least a page in size and that is capable of splitting the
NFS RPC header off of the payload.  Furthermore, it only works with UDP
mounts.  The server's zero-copy read response code simply maps kernel
memory into mbufs and has no special adapter or protocol requirements.

The Alteon firmware debugging code requires an Alteon Tigon II board.  If
you want the patches to the userland tools and Tigon firmware to debug it
and make it compile under FreeBSD, contact ken@FreeBSD.ORG.

5.  Configuration and performance tuning.

There are a number of options that need to be turned on for various things
to work:

options		ZERO_COPY_SOCKETS	 # Turn on zero copy send code
options		ENABLE_VFS_IOOPT	 # Turn on zero copy receive
options		NMBCLUSTERS=(512+512*32) # lots of mbuf clusters
options		TI_JUMBO_HDRSPLIT	 # Turn on Tigon header splitting

To turn on Robert Picco's zero copy send code, substitute:

options		ZCOPY			 # Robert Picco's zero copy code

for the ZERO_COPY_SOCKETS option above.

The number of mbuf clusters above works for me; your mileage may vary.  It
probably isn't necessary to allocate that many.

To get the maximum performance out of the code, here are some suggestions
on various sysctl and other parameters.  These assume you've got an
Alteon-based board, so if you're using something else, you may want to
experiment and find the optimum values for some of them:

 - Make sure the MTU on your Tigon (or other) board is set to 9000.

 - Enable RFC 1323, which allows your TCP window to go above 64KB:

	sysctl -w net.inet.tcp.rfc1323=1

 - Turn on vfs.ioopt to enable zero copy receive:

	sysctl -w vfs.ioopt=1

 - Increase your socket buffer size and send and receive window size for
   TCP:

	sysctl -w kern.ipc.maxsockbuf=2097152
	sysctl -w net.inet.tcp.sendspace=524288
	sysctl -w net.inet.tcp.recvspace=524288

   A send window of 512K seems to work well with 1MB Tigon boards, and a
   send window of 256K seems to work well with 512K Tigon boards.  Again,
   you may want to experiment to find the best settings for your hardware.

 - Increase UDP receive space and maximum datagram size:

	sysctl -w net.inet.udp.recvspace=65535
	sysctl -w net.inet.udp.maxdgram=57344
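
The sendspace and recvspace sysctls above set system-wide defaults.  An
individual program can also ask for larger buffers on its own socket with
the standard SO_SNDBUF and SO_RCVBUF socket options, as in the sketch
below.  The function names are hypothetical.  How an oversized request is
treated varies by system: it may be refused or silently limited, and on
FreeBSD the ceiling is expected to be the one set by kern.ipc.maxsockbuf.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/*
 * Request send and receive buffers of `bytes` on socket s.
 * Returns 0 on success, -1 on failure with errno set by setsockopt().
 */
static int
set_sock_bufs(int s, int bytes)
{
	assert(bytes > 0);
	if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) == -1)
		return (-1);
	if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) == -1)
		return (-1);
	return (0);
}

/*
 * Create a TCP socket and request both buffers at `bytes`.
 * Returns the descriptor, or -1 on failure.
 */
static int
tuned_tcp_socket(int bytes)
{
	int s = socket(AF_INET, SOCK_STREAM, 0);

	if (s == -1)
		return (-1);
	if (set_sock_bufs(s, bytes) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}
```

Calling tuned_tcp_socket(524288) requests the same 512K buffers that the
sysctl settings above apply by default.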

6.  Benchmarks.

One nice benchmark is netperf (www.netperf.org), which is in the benchmarks
subdirectory of the ports tree.

Netperf isn't exactly a real world benchmark, since it sends page aligned
data that is a multiple of the page size.  It is good for trying to
determine maximum throughput.

Another benchmark to try is nttcp, which is in ports/net.

Here are some netperf numbers for TCP and UDP throughput between two
Pentium II 350's with 128MB RAM and 1MB Alteon ACEnics:

# ./netperf -H 10.0.0.1
TCP STREAM TEST to 10.0.0.1 : histogram
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

524288 524288 524288    10.01     742.46   

# ./netperf -t UDP_STREAM -H 10.0.0.1 -- -m 8192
UDP UNIDIRECTIONAL SEND TEST to 10.0.0.1 : histogram
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

 57344    8192   10.01      140396 585086     919.34
 65535           10.01       93525            612.42


As you can see, the TCP performance is 742Mbits/sec, or about
93MBytes/sec.
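
The conversions behind these figures are simple enough to check by hand
or with a few lines of C.  The function names below are made up for the
illustration.  Recomputing the UDP numbers from the message counts gives
values slightly below the printed ones, presumably because the elapsed
time shown is rounded to 10.01 seconds.

```c
#include <assert.h>

/* Throughput in 10^6 bits/sec, the unit netperf prints. */
static double
mbits_per_sec(double messages, double msg_bytes, double secs)
{
	assert(secs > 0.0);
	return (messages * msg_bytes * 8.0 / secs / 1e6);
}

/* Convert 10^6 bits/sec to 10^6 bytes/sec. */
static double
mbits_to_mbytes(double mbits)
{
	return (mbits / 8.0);
}
```

mbits_to_mbytes(742.46) is about 92.8, which is the 93MBytes/sec quoted
above.  For the UDP test, 140396 messages of 8192 bytes in 10.01 seconds
works out to roughly 919 Mbits/sec sent, and the 93525 messages counted
on the second line to roughly 612 Mbits/sec, so about two thirds of the
messages sent show up in that second count.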

Drew Gallatin has achieved much higher performance with faster hardware:

This is between 2 Dell PowerEdge 4400 servers using prototype 64-bit,
66MHz PCI Myricom Lanai-9 NICs with a 2.56Gb/sec link speed.  The MTU
was 32828 bytes.  They're both uniprocessor 733MHz Xeons running a
heavily patched 4.0-RELEASE & my zero-copy code in conjunction with
Duke's Trapeze software (driver & firmware) for Myrinet adapters.  The
receiver is offloading checksums and is 60% idle, the sender is
calculating checksums and is pegged at 100% CPU.

<9:12am>wrath/gallatin:~>netperf -Hsloth-my
TCP STREAM TEST to sloth-my : histogram
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
 
524288 524288 524288    10.00    1764.50
 

7.  Possible future directions.

Send side zero copy:

One of the obvious problems with the current send side approach is that it
only works if the userland application doesn't immediately reuse the
buffer.

In the case of many system applications, though, the application will reuse
the buffer immediately, and therefore performance will be no better than
the standard case.  Many common applications (like ftp) have been written
with the current system buffer usage in mind, so they function like this:

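	char buf[65536];	/* "buffer y"; the size is arbitrary */
	ssize_t n;		/* "x bytes" */

	/* fd is the open file, s is the connected socket */
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		if (write(s, buf, (size_t)n) != n)
			break;	/* error or short write */
	}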

That makes sense if the kernel is only going to copy the data, but it
doesn't in the zero copy case.

Another problem with the current send side approach is that it requires
page sized and page aligned data in order to apply the COW mapping.  Not
all data sets fit this requirement.

One way to address both of the above problems is to implement an alternate
zero copy send scheme that uses async I/O.  With async I/O semantics, it
will be clear to the userland program that the buffer in question is not to
be used until it is returned from the kernel.

So with that approach, you eliminate the need to map the data
copy-on-write, and therefore also eliminate the need for the data to be
page sized and page aligned.
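
The buffer ownership rule that async I/O gives you can be seen with the
existing POSIX AIO calls.  The sketch below is only a stand-in for the
semantics: it is not the proposed zero copy scheme, it does not avoid the
copy, and aio_write_wait() and aio_demo() are hypothetical names.  The
point is that between aio_write() and completion the program must leave
the buffer alone.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Submit one asynchronous write and wait until the system is done with
 * buf.  Returns the number of bytes written, or -1 with errno set.
 */
static ssize_t
aio_write_wait(int fd, const void *buf, size_t len)
{
	struct aiocb cb;
	const struct aiocb *const list[1] = { &cb };
	ssize_t n;
	int err;

	assert(buf != NULL && len > 0);
	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = (void *)buf;	/* only read from by a write request */
	cb.aio_nbytes = len;

	if (aio_write(&cb) == -1)
		return (-1);
	/* From here until completion, buf must not be modified. */
	while ((err = aio_error(&cb)) == EINPROGRESS)
		(void)aio_suspend(list, 1, NULL);
	n = aio_return(&cb);	/* completion: buf is ours again */
	if (err != 0) {
		errno = err;
		return (-1);
	}
	return (n);
}

/* Send a short message over a local socket pair; returns 0 if it arrives. */
static int
aio_demo(void)
{
	static char msg[] = "hello";
	char in[sizeof(msg)];
	int sv[2], ok;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return (-1);
	ok = aio_write_wait(sv[0], msg, sizeof(msg)) == (ssize_t)sizeof(msg) &&
	    read(sv[1], in, sizeof(in)) == (ssize_t)sizeof(in) &&
	    memcmp(in, msg, sizeof(msg)) == 0;
	close(sv[0]);
	close(sv[1]);
	return (ok ? 0 : -1);
}
```

A real application would do useful work, such as filling a second buffer,
instead of blocking in aio_suspend().  On some systems the POSIX AIO
functions live in a separate library (for example librt).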

Receive side zero copy:

The main issue with the current receive side zero copy code is the size and
alignment restrictions.

One way to get around the restriction would be to make it possible to do
operations similar to a page flip on buffers that are smaller than a
page.

Another way to get around the restriction is to have the receiving client
pass buffers into the kernel (perhaps with an async I/O type interface) and
have the NIC DMA the data directly into the buffers the user has supplied.

One proposal for doing this is called RDMA.  There is a draft of the
specification here:

ftp://ftpeng.cisco.com/pub/rdma/draft-csapuntz-tcprdma-00.txt

Essentially RDMA allows for the sender and receiver to negotiate
destination buffer locations on the receiver.  The sender then includes the
buffer locations in a TCP header option, and the NIC can then extract the
destination location for the payload and DMA it to the appropriate place.

One drawback to this approach is that it requires support for RDMA on both
ends of the connection.

Ken
-- 
Kenneth Merry
ken@kdm.org

