Date:      Fri, 21 May 2004 13:23:51 -0400 (EDT)
From:      Robert Watson <rwatson@freebsd.org>
To:        Matthew Dillon <dillon@apollo.backplane.com>
Cc:        arch@freebsd.org
Subject:   Re: Network Stack Locking
Message-ID:  <Pine.NEB.3.96L.1040521122004.4759C-100000@fledge.watson.org>
In-Reply-To: <200405210103.i4L13QWT068012@apollo.backplane.com>


On Thu, 20 May 2004, Matthew Dillon wrote:

>     It should be noted that the biggest advantages of the distributed
>     approach are (1) The ability to operate on individual PCBs without
>     having to do any token/mutex/other locking at all, (2) Cpu locality
>     of reference in regards to cache mastership of the PCBs and related data,
>     and (3) avoidance of data cache pollution across cpus (more cpus == 
>     better utilization of individual L1/L2 caches and far greater
>     scalability).  The biggest disadvantage is the mandatory thread switch
>     (but this is mitigated as load increases since each thread can work on
>     several PCBs without further switches, and because our thread scheduler
>     is extremely lightweight under SMP conditions).  Message-passing
>     overhead is very low since most operations already require some sort of
>     roll-up structure to be passed (e.g. an mbuf in the case of the network).

My primary concern with this approach (and the reason I'm taking somewhat
of a "wait and see what happens" attitude) is the level of inter-component
incestuousness (referred to elsewhere in this thread).  At particular
layers in the stack -- the PCBs are probably the best example -- I see the
opportunity for this sort of per-CPU unsynchronized access offering a very
clean and uncomplicated approach.
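
To make the shape of that concrete, here is roughly what I understand the
per-CPU PCB idea to be, as a toy userland sketch rather than anything
resembling the actual DragonFly or FreeBSD code; all of the names (pcb,
pcb_list, pcb_cpu, NCPU) are made up for illustration.  Connections hash
to an owning CPU, and only the protocol thread bound to that CPU ever
touches that CPU's list, so no mutex, token, or atomic is needed on the
lookup or insert path:

#include <stdio.h>
#include <stdlib.h>

#define NCPU 4

struct pcb {
    unsigned int laddr, faddr;      /* local/foreign address (toy) */
    struct pcb *next;
};

/* One list head per CPU, touched only by the thread bound to that CPU. */
static struct pcb *pcb_list[NCPU];

/* Hash a connection to its owning CPU; the real dispatch would happen at
 * the driver/netisr layer so the packet arrives on the owning CPU. */
static int
pcb_cpu(unsigned int laddr, unsigned int faddr)
{
    return (int)((laddr ^ faddr) % NCPU);
}

/* Called only from the protocol thread running on 'cpu': no locking. */
static struct pcb *
pcb_insert(int cpu, unsigned int laddr, unsigned int faddr)
{
    struct pcb *p = malloc(sizeof(*p));

    if (p == NULL)
        return NULL;
    p->laddr = laddr;
    p->faddr = faddr;
    p->next = pcb_list[cpu];
    pcb_list[cpu] = p;
    return p;
}

static struct pcb *
pcb_lookup(int cpu, unsigned int laddr, unsigned int faddr)
{
    struct pcb *p;

    for (p = pcb_list[cpu]; p != NULL; p = p->next)
        if (p->laddr == laddr && p->faddr == faddr)
            return p;
    return NULL;
}

int
main(void)
{
    int cpu = pcb_cpu(0x0a000001, 0x0a000002);

    pcb_insert(cpu, 0x0a000001, 0x0a000002);
    printf("pcb lives on cpu %d: %p\n", cpu,
        (void *)pcb_lookup(cpu, 0x0a000001, 0x0a000002));
    return 0;
}

The interesting part, of course, is arranging for the packet to arrive on
the owning CPU in the first place, which is where the dispatch layer comes
in.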

However, I'm concerned that along many of the total end-to-end paths,
there are a moderate number of pieces that will require traditional
synchronization or extensive rewriting: the route table, the host cache,
and a variety of "processing" packages such as netgraph, IPSEC, et al.
None of that suggests that per-CPU synchronization-free access in a thread
shouldn't be applied, but I'd like to see it demonstrated to be a useful
technique in a more broad sense.  One of the key implied benefits of the
approach is that it allows you to avoid significant rewriting costs for
existing code, which is appealing, but less appealing if it doesn't fall
out in the general case. 
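
For contrast, a structure like the route table, if it stays shared rather
than replicated per-CPU, still ends up with conventional synchronization
on the packet path.  A minimal sketch of what I mean, using a plain
pthreads reader/writer lock in userland; rtentry, rt_lookup, and rt_insert
are toy names, not the kernel's:

#include <pthread.h>
#include <stddef.h>

struct rtentry {
    unsigned int dst;
    unsigned int gateway;
    struct rtentry *next;
};

static pthread_rwlock_t rt_lock = PTHREAD_RWLOCK_INITIALIZER;
static struct rtentry *rt_head;

/* Any CPU may look up a route, so a read lock is taken per lookup --
 * exactly the per-packet synchronization the per-CPU PCB path avoids. */
static unsigned int
rt_lookup(unsigned int dst)
{
    struct rtentry *rt;
    unsigned int gw = 0;

    pthread_rwlock_rdlock(&rt_lock);
    for (rt = rt_head; rt != NULL; rt = rt->next) {
        if (rt->dst == dst) {
            gw = rt->gateway;
            break;
        }
    }
    pthread_rwlock_unlock(&rt_lock);
    return gw;
}

/* Updates are rare but must exclude every reader on every CPU. */
static void
rt_insert(struct rtentry *rt)
{
    pthread_rwlock_wrlock(&rt_lock);
    rt->next = rt_head;
    rt_head = rt;
    pthread_rwlock_unlock(&rt_lock);
}

int
main(void)
{
    static struct rtentry def = { 0, 0x0a000001, NULL };

    rt_insert(&def);
    return rt_lookup(0) == 0x0a000001 ? 0 : 1;
}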

The other concern I have is whether the message queues get deep or not: 
many of the benefits of message queues come when the queues allow
coalescing of context switches to process multiple packets.  If you're
paying a context switch per packet passing through the stack each time you
cross a boundary, there's a non-trivial operational cost to that.  So what
I'd like to see are the numbers that suggest, on a pretty functional
sample stack, that you get at least an interesting level of queuing and
therefore effective coalescing of synchronization.  I've started looking
at similar issues in the type-specific mbuf queues in the FreeBSD kernel
-- additional context switches are expensive and best avoided even if you
use explicit synchronization primitives such as mutexes.
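
In other words, what I'd hope to see is something like the following
pattern actually holding up under load: the consumer thread takes whatever
has accumulated on its queue in one shot and processes the whole batch
before sleeping again, so one context switch is amortized over N packets
rather than paid per packet.  This is a plain pthreads toy, not the LWKT
or netisr interfaces, and all the names are hypothetical:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct pkt {
    int seq;
    struct pkt *next;
};

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cv = PTHREAD_COND_INITIALIZER;
static struct pkt *q_head;
static int q_done;

/* Consumer: one wakeup drains everything queued so far. */
static void *
proto_thread(void *arg)
{
    struct pkt *batch, *p;

    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (q_head == NULL && !q_done)
            pthread_cond_wait(&q_cv, &q_lock);   /* the context switch */
        batch = q_head;                          /* grab the whole queue */
        q_head = NULL;
        pthread_mutex_unlock(&q_lock);

        /* Process the entire batch with no further synchronization. */
        int n = 0;
        while (batch != NULL) {
            p = batch;
            batch = batch->next;
            free(p);
            n++;
        }
        if (n > 0)
            printf("drained %d packets on one wakeup\n", n);
        else if (q_done)
            return NULL;
    }
}

int
main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, proto_thread, NULL);
    for (int i = 0; i < 100; i++) {
        struct pkt *p = malloc(sizeof(*p));

        p->seq = i;
        pthread_mutex_lock(&q_lock);
        p->next = q_head;            /* toy LIFO; a real queue is FIFO */
        q_head = p;
        pthread_cond_signal(&q_cv);
        pthread_mutex_unlock(&q_lock);
    }
    pthread_mutex_lock(&q_lock);
    q_done = 1;
    pthread_cond_broadcast(&q_cv);
    pthread_mutex_unlock(&q_lock);
    pthread_join(tid, NULL);
    return 0;
}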

>     In any case, if you are seriously considering any sort of distributed
>     methodology, you should also consider formalizing a message-passing
>     API for FreeBSD.  Even if you don't like our LWKT messaging API, I
>     think you would love the DFly IPI messaging subsystem and it would be
>     very easy to port as a first step.  We use it so much now in DFly
>     that I don't think I could live without it.  e.g. for clock distribution,
>     interrupt distribution, thread/cpu isolation, wakeup(), MP-safe messaging
>     at higher levels (and hence packet routing), free()-return-to-
>     originating-cpu (mutexless slab allocator), SMP MMU synchronization
>     (the basic VM/pte-race issue with userland brought up by Alan Cox),
>     basic scheduler operations, signal(), and the list goes on and on.
>     In DFly, IPI messaging and message processing is required to be MP
>     safe (it always occurs outside the BGL, like a cpu-localized fast
>     interrupt), but a critical section still protects against reception
>     processing so code that uses it can be made very clean.

As someone who's worked with Darwin and other Mach-derived operating
systems, I see the clear appeal of message passing systems, as I think
we've discussed in other forums.  They also offer substantial benefits
from a security perspective, since they provide cleaner separation
between components, especially between userspace and the kernel.
However, based on past experience with such systems, I'm also very
cautious about the notion.  The increased level of separation between
components can also make it harder to understand the interactions between
components in a debugging sense: for example, if your stack trace in the
TCP code only goes up to the queue receive primitive, the debugger can't
simply tell you what code originated the mbuf.

In the past, I've explored binding stack traces to messages in message
passing systems when operating in debugging mode so that the debugger
walks up to the message queue, and can then follow the stack trace from
the message to understand more about the calling context.  I've also used
this on FreeBSD in userspace -- we have local modifications to allow the
kernel to attach stack traces of the sending process to messages passed
over UNIX domain sockets so that the receiving code can grab the stack
trace as ancillary data.
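
Sketching the idea in userland terms: in a debug build the send path
captures the sender's call chain and carries it in the message, so the
receiver (or a debugger walking the queue) can print who originated it.
This uses backtrace(3)/backtrace_symbols(3) from <execinfo.h>; the dbg_msg
structure and the MSG_DEBUG knob are invented for the example:

#include <stdio.h>
#include <stdlib.h>
#ifdef MSG_DEBUG
#include <execinfo.h>
#endif

#define MSG_TRACE_DEPTH 16

struct dbg_msg {
    void *payload;
#ifdef MSG_DEBUG
    void *trace[MSG_TRACE_DEPTH];   /* sender's call chain */
    int trace_len;
#endif
};

/* Sender side: record who is enqueueing this message. */
static void
msg_send(struct dbg_msg *m, void *payload)
{
    m->payload = payload;
#ifdef MSG_DEBUG
    m->trace_len = backtrace(m->trace, MSG_TRACE_DEPTH);
#endif
    /* ...the message would be queued to the destination thread here... */
}

/* Receiver (or debugger) side: show the originating call chain. */
static void
msg_print_origin(const struct dbg_msg *m)
{
#ifdef MSG_DEBUG
    char **syms = backtrace_symbols(m->trace, m->trace_len);

    if (syms != NULL) {
        for (int i = 0; i < m->trace_len; i++)
            printf("  from %s\n", syms[i]);
        free(syms);
    }
#else
    (void)m;
    printf("  (built without MSG_DEBUG; no origin trace carried)\n");
#endif
}

int
main(void)
{
    struct dbg_msg m;

    msg_send(&m, NULL);
    msg_print_origin(&m);
    return 0;
}

Build with -DMSG_DEBUG (and link against wherever backtrace(3) lives on
your system) to get the origin trace; without it the message carries only
its payload.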

The trick, though, is to make sure you're not just substituting message
queue operations and context switches for mutexes, because those both have
a moderate cost.  Many of the benefits come in reducing explicit
synchronization and then amortizing the context switch cost over multiple
instances, which helps with the cache and many other things.  So something
I'd very much like to see out of the dfbsd prototype code is a set of
measurements on queue depth at the hand-off points between layers, and
statistics on the number of queue operations, synchronization points, and
so on, amortized over multiple deliveries.
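
Concretely, the counters I have in mind are very simple; the point is just
to be able to compute packets handled per wakeup.  A toy sketch, with all
names hypothetical and none of it tied to any existing queue
implementation:

#include <stdio.h>

struct q_stats {
    unsigned long enqueues;   /* messages placed on the queue */
    unsigned long wakeups;    /* times the consumer was woken */
    unsigned long cur_depth;  /* messages currently queued */
    unsigned long max_depth;  /* deepest the queue has been */
};

static void
q_stats_enqueue(struct q_stats *s)
{
    s->enqueues++;
    if (++s->cur_depth > s->max_depth)
        s->max_depth = s->cur_depth;
}

/* Called once per consumer wakeup, with the number of messages drained. */
static void
q_stats_drain(struct q_stats *s, unsigned long n)
{
    s->wakeups++;
    s->cur_depth -= n;
}

static void
q_stats_report(const struct q_stats *s)
{
    printf("enqueues %lu, wakeups %lu, max depth %lu, "
        "avg batch %.2f msgs/wakeup\n",
        s->enqueues, s->wakeups, s->max_depth,
        s->wakeups ? (double)s->enqueues / s->wakeups : 0.0);
}

int
main(void)
{
    struct q_stats s = { 0, 0, 0, 0 };

    /* Toy workload: 12 packets arrive, the consumer wakes three times. */
    for (int i = 0; i < 12; i++)
        q_stats_enqueue(&s);
    q_stats_drain(&s, 5);
    q_stats_drain(&s, 4);
    q_stats_drain(&s, 3);
    q_stats_report(&s);
    return 0;
}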

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Senior Research Scientist, McAfee Research



