Date:      Wed, 12 Mar 1997 21:21:34 -0500
From:      "David S. Miller" <davem@jenolan.rutgers.edu>
To:        ccsanady@nyx.pr.mcs.net
Cc:        hackers@FreeBSD.ORG
Subject:   Re: Solaris TPC-C benchmarks (with Oracle)
Message-ID:  <199703130221.VAA21577@jenolan.caipgeneral>
In-Reply-To: <199703130142.TAA11137@nyx.pr.mcs.net> (message from Chris Csanady on Wed, 12 Mar 1997 19:42:03 -0600)

   Date: Wed, 12 Mar 1997 19:42:03 -0600
   From: Chris Csanady <ccsanady@nyx.pr.mcs.net>

   For starters, I'd like to get rid of the usage of mbuf chains.  This is mostly
   a simple, if time-consuming, task.  (I think)  It will save a bunch of copying
   around the net code, as well as simplifying things.  The only part I'm not
   really sure about is how to do memory management with the new "pbuf's."  I
   looked at the Linux code, and they call their generic kmalloc() to allocate a
   buffer the size of the packet.  This would be easier, but I don't like it. :)
   In Van Jacobson's slides from his talk, he mentions that routines call the
   output driver to get packet buffers (pbufs), not a generic allocator.

The drivers do the buffer management in Jacobson's pbuf kernel.  So you
go:

tcp_send_fin(struct netdevice *dev, int len)
{
	struct pbuf *p;

	p = dev->alloc(len);
	[ ... ]
}

Later on you'd go:

	p->dev_free(p);
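
Fleshed out a bit, the interface the stack sees is roughly the sketch
below.  The struct layouts and field names are my guesses for
illustration only; treat it as the shape of the idea, not anything
taken from Jacobson's slides:

struct pbuf {
	unsigned char	*data;			/* start of packet data           */
	int		 len;			/* bytes in this buffer           */
	void		*dev_priv;		/* driver's cookie for the buffer */
	void		(*dev_free)(struct pbuf *);	/* hand it back to its owner */
};

struct netdevice {
	/* The driver owns the pool; the stack never calls a generic
	 * allocator, it always asks the device for its buffers.     */
	struct pbuf	*(*alloc)(int len);
	/* ... the usual output/ioctl entry points ... */
};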

One huge tip: do _not_ just implement Jacobson's pbuf code blindly.
Anyone who even glances at those slides immediately goes "Geez, he's
ignoring all issues of flow control."  I find this rather ironic coming
from someone who is effectively the godfather of TCP flow control.

Secondly, his fast paths for input bank on the fact that you can get
right into user context when you detect a header prediction hit.  The
only way to do this effectively on a system you'd ever want anyone to
actually run is the following:

1) Device drivers loan pbufs (i.e. possibly pieces of device memory or
   driver-private fixed DMA buffering areas) to the networking code
   on receive.

2) Once the protocol layer detects that this pbuf can go right into
   user space, it jacks up the receiving application process's
   priority such that it becomes a real-time thread.  This is because
   you must guarantee extremely low latencies to the driver whose
   resources you are holding onto.  If the pbuf cannot be processed
   now, it is copied into a new buffer and the original pbuf is
   given back to the device via dev->free(p) before splnet is left.

3) If we got a hit and this can go right into user space, then when
   splnet is left the kernel sees that whoever is currently on the
   CPU should get off so that any real-time networking processes can
   eat the pbufs.

4) tcp_receive() runs in the application's context, csum_copy()'s the
   pbuf right into user space (or perhaps does a flip, which makes the
   driver-->net pbuf method interface slightly more intricate), and
   then calls p->free(p); the application's priority is then lowered
   back down to what it was before the new pbuf came in.

This is all nontrivial to pull off.  One nice effect is that you
actually then have a chance of doing real networking page flipping
with the device buffer method scheme.  A rough sketch of how the
receive path fits together is below.
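
To make steps 1-4 concrete, here is roughly how the input path could be
arranged.  Every name in it (pbuf_input, tcp_hdr_predict_hit,
boost_to_rt, and so on) is invented for illustration; this is a sketch
of the idea, not code from any kernel:

void
pbuf_input(struct netdevice *dev, struct pbuf *p)
{
	struct socket *so;

	if (tcp_hdr_predict_hit(p, &so) && so_receiver_waiting(so)) {
		/* Step 2: make the blocked reader a real-time thread so
		 * the driver gets its buffer back with very low latency. */
		boost_to_rt(so->so_rcv_proc);
		so_queue_pbuf(so, p);	/* step 4: tcp_receive() will
					 * csum_copy() it into user space
					 * and then p->dev_free() it       */
		need_resched = 1;	/* step 3: preempt when splnet
					 * is left                         */
	} else {
		/* No fast path: copy now and return the driver's buffer
		 * before splnet is left.                                  */
		struct pbuf *copy = pbuf_copy(p);

		p->dev_free(p);
		tcp_input_slow(copy);
	}
}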

   In the new implementation, soreceive() and sosend() go away. :)

See my comments about flow control above; some of that code must stay.

   The new architecture also seems as if it would scale nicely with SMP.  This
   is also one of the reasons I'm interested in doing it.

No one has quantified that pbufs can be made to scale on SMP; it may
(and I think it will) have the same scalability problems that SLAB
allocators can have.  At a minimum you'd have to grab a per-device
lock to keep track of the device pbuf pool properly.  Since any of the
networking code can call upon the code which needs to acquire this
lock, you're probably going to need to make it a sleeping lock to get
decent performance.  Guess what?  Then you need to implement what
Solaris does, which is allow interrupt handlers to sleep, in order for
it to work at all.
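
To make the point concrete, the allocation path would have to look
something like the sketch below.  The lock primitives, pool_take(), and
the field names are placeholders of mine, not any existing API:

struct netdevice {
	struct lock	 pool_lock;	/* protects the free list below   */
	struct pbuf	*pool_free;	/* driver-owned free buffer list  */
};

static struct pbuf *
dev_pbuf_alloc(struct netdevice *dev, int len)
{
	struct pbuf *p;

	/* Every allocation from every CPU, no matter where in the stack
	 * it originates, serializes on this per-device lock.  If it spins
	 * you eat contention; if it sleeps you need sleepable interrupts. */
	lock_acquire(&dev->pool_lock);
	p = pool_take(&dev->pool_free, len);
	lock_release(&dev->pool_lock);
	return p;
}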

I'd suggest fixing the TCP timers first; they are a much larger
scalability problem than the buffering in BSD.  (IRIX scales to 2,000
connections per second, that's real connections, not some bogus Zeus
benchmark exploiting HTTP connection-reuse features etc., and they're
still using mbufs.)  Then go after the time-wait problem (much harder
to solve than the timers, but less painful to fix than redoing the
buffering), then fix select(), then think about pbufs.
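
For what "fix the timers" could mean in practice: one standard way to
avoid walking every PCB on every slow-timeout tick is a hashed timing
wheel (Varghese/Lauck style).  The sketch below is my own illustration
of that technique, not the BSD or IRIX code:

#include <stddef.h>

#define WHEEL_SLOTS 512			/* slot = future tick mod 512     */

struct tcp_timer {
	struct tcp_timer *next;
	unsigned long	  expires;	/* absolute tick at which it fires */
	void		(*fn)(void *);	/* e.g. retransmit or keepalive    */
	void		 *arg;		/* the owning connection           */
};

static struct tcp_timer *wheel[WHEEL_SLOTS];	/* per-slot timer lists   */
static unsigned long	 current_tick;

static void
timer_add(struct tcp_timer *t, unsigned long ticks)
{
	unsigned int slot = (current_tick + ticks) % WHEEL_SLOTS;

	t->expires = current_tick + ticks;
	t->next = wheel[slot];
	wheel[slot] = t;
}

/* Called once per tick: only one slot is scanned, not every connection. */
static void
timer_tick(void)
{
	struct tcp_timer **pp = &wheel[++current_tick % WHEEL_SLOTS];

	while (*pp != NULL) {
		struct tcp_timer *t = *pp;

		if (t->expires <= current_tick) {
			*pp = t->next;		/* unlink, then fire it      */
			t->fn(t->arg);
		} else {
			pp = &t->next;		/* due on a later wraparound */
		}
	}
}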

---------------------------------------------////
Yow! 11.26 MB/s remote host TCP bandwidth & ////
199 usec remote TCP latency over 100Mb/s   ////
ethernet.  Beat that!                     ////
-----------------------------------------////__________  o
David S. Miller, davem@caip.rutgers.edu /_____________/ / // /_/ ><


