From: Robert Watson <robert@fledge.watson.org>
Date: Sun, 26 Oct 2003 00:40:54 -0400 (EDT)
To: Matthew Dillon
cc: Kip Macy, hackers@freebsd.org, John-Mark Gurney, Marcel Moolenaar
Subject: Synchronization philosophy (was: Re: FreeBSD mail list etiquette)

(Subject changed to reflect the fact that the thread now contains useful
technical content as well as banter, resulting in a hijacking of the
thread; hope no one minds.)

On Sat, 25 Oct 2003, Matthew Dillon wrote:

> Yes.  I'm not worried about BPF, and ucred is easy since it is
> already 95% of the way there, though messing with ucred's ref count
> will require a mutex or an atomic bus-locked instruction even in
> DragonFly!  The route table is our big issue.
> TCP caches routes so we
> can still BGL the route table and achieve 85% of the scaleable
> performance so I am not going to worry about the route table initially.
>
> An example with ucred would be to passively queue it to a particular cpu
> for action.  Lets say instead of using an atomic bus-locked instruction
> to manipulate ucred's ref count, we instead send a passive IPI to the
> cpu 'owning' the ucred, and that ucred is otherwise read-only.  A
> passive IPI, which I haven't implemented yet, is simply queueing an
> IPI message but not actually generating an interrupt on the target cpu
> unless the CPU->CPU software IPI message FIFO is full, so it doesn't
> actually waste any cpu cycles and multiple operations can be executed
> in-batch by the target.  Passive IPIs can be used for things
> that do not require instantanious action and both bumping and releasing
> ref counts can take advantage of it.  I'm not saying that is how
> we will deal with ucred, but it is a definite option.

Actually, the problem isn't so much the data referenced by the ucred as
the references themselves.  Part of the issue in Darwin is that ucred
references are always gained via the p_ucred pointer in the proc
structure.  The proc structure is read and dereferenced fairly deep in
the network code (network funnel), and also in the remainder of the
kernel (kernel funnel).  In addition, there's a lock used to serialize
writes to p->p_ucred, but nothing to protect against reads of stale
data.  Shared structures such as these occur in pretty large quantity
in BSD code, and will be a problem no matter what approach to
synchronization is taken.  Moving towards message passing helps to
structure the code to avoid sharing of this sort, although it's not the
only way to motivate that sort of change.
I'm a big fan of the change in -CURRENT to use td->td_cred as a
read-only thread-local credential reference and avoid synchronization
on the credential reference--it nicely addresses the requirements for
consistency in the referenced data for the read-only cases (which are
the vast majority of uses of a credential).

There are a number of cases where moving towards a message-passing
philosophy would really clean up the synchronization and parallelism
issues in FreeBSD: for example, even the relatively simple accounting
file rotation would benefit from queue-like operation to serialize the
accounting data/event stream and rotation events.  Using locks and
condition variables to perform serialization, as the accounting code
currently does, is unwieldy and bug-prone.

However, when moving to event/message queueing, you also have to be
very careful about data ownership and referencing, as well as proper
error handling.  With accounting, most scheduled vnode operations are
asynchronous and have no need for substantial error handling (when a
process performs execve(), execve() continues regardless of whether
accounting of that operation succeeds or fails).  The start/stop
operation, however, is intended to be synchronous.  Happily, in the
accounting case, all necessary error checking can be performed in
advance of the handoff from the user thread to the accounting thread,
but that won't always be the case...

One of the other risks that has worried me about this approach is that
explicit locking has some nice properties from the perspective of
deadlock and lock order management: monitoring for deadlocks and lock
orders is a well-understood topic, and the tools for tracking deadlocks
and wait conditions, as well as for performing locking and waiting
safely, are mature.  As with the older BSD sleeping interfaces, such as
tsleep(), synchronous waits on messages are harder to mechanically
track, and the resulting deadlocks resemble resource deadlocks more
than lock deadlocks...
On the other hand, some forms of tracing may be made easier.  I've had
some pretty nasty experiences trying to track down deadlocks between
cooperating threads caused by message waits, and have found tools such
as WITNESS much easier to work with.  In some work we're doing for one
of our customers, we make extensive use of handoff between various
submitting threads and a serializing kernel thread, using thread-local
storage to avoid explicit synchronization.  Having dealt both with
lower-level lock/cv primitives for event passing and with message
passing, I have to say I'm leaning far more towards message passing.
However, that work benefits particularly from the approach due to its
asynchronous nature...

> For a protocol, a protocol thread will own a PCB, so the PCB will be
> 'owned' by the cpu the protocol thread is on.  Any manipulation of the
> PCB must occur on that cpu or otherwise be very carefully managed
> (e.g. FIFO rindex/windex for the socket buffer and a memory barrier).
> Our intention is to encapsulate most operations as messages to the
> protocol thread owning the PCB.

I'll be very interested to see how this ends up working out: even in
RELENG_4, FreeBSD has sufficient apparent parallelism/preemption in the
network stack to require synchronization at most levels.  In RELENG_4,
most upward-bound traffic is serialized via the netisr thread, but
downward traffic from user threads passes through the stack in
parallel.  Do you anticipate handing off control to a netisr-like
thread earlier than RELENG_4 does in order to get things "into the
right thread"?

One of the conceptual issues we've been wrestling with is the ordering
of events: from the perspective of performance, guaranteeing the
weakest possible ordering is desirable.  However, with parallelism in
the stack, you risk introducing weaker-than-permitted ordering of
packets (triggering fast retransmits, etc).
In most situations, it's sufficient to maintain source ordering: if two
packets are from the same source, their ordering should be maintained.
If they are from separate sources, maintaining order isn't required.
This raises interesting questions about when you want to defer
processing, and when to try to "catch up" to maintain performance and
ordering.

> :past, there have been a number of exploitable security vulnerabilities due
> :to races opened up in low memory conditions, during paging, etc.  One
> :solution I was exploring was using the compiler to help track the
> :potential for functions to block, similar to the const qualifier, combined
> :with blocking/non-blocking assertions evaluated at compile-time.  However,
> :some of our current APIs (M_NOWAIT, M_WAITOK, et al) make that approach
> :somewhat difficult to apply, and would have to be revised to use a
> :compiler solution.  These potential weaknesses very much exist in an
> :explicit model, but with explicit locking, we have a clearer notion of how
> :to express assertions.
>
> DragonFly is using its LWKT messaging API to abstract blocking verses
> non-blocking.  In particular, if a client sends a message using an
> asynch interface it isn't supposed to block, but can return EASYNC if it
> wound up queueing the message due to not being able to execute it
> synchronous without blocking.  If a client sends a message using a
> synchronous messaging interface then the client is telling the
> messaging subsystem that it is ok to block.

The risk here is not so much that the first level of consumer code for
an interface can't determine whether it will block, but that after a
few levels of abstraction that information has disappeared (or worse,
is conditional on context in an opaque way).
> This combined with the fact that we are using critical sections and
> per-cpu globaldata caches that do not require mutexes to access allows
> code to easily determine whether something might or might not block,
> and the message structure is a convenient placemark to queue and
> return EASYNC deep in the kernel if something would otherwise block
> when it isn't supposed to.
>
> We also have the asynch IPI mechanism and a few other mechanisms at
> our disposal and these cover a surprisingly large number of situations
> in the system.  90% of the 'not sure if we might block' problem
> is related to scheduling or memory allocation and neither of those
> subsystems needs to use extranious mutexes, so managing the blocking
> conditions is actually quite easy.

Again, I think it comes down to the fact that memory allocation APIs
typically offer the consumer a choice: block if the resources aren't
available, or fail.  My mood swings back and forth as to what the ideal
strategy would/should be at the lowest level, but I think that as you
move up the stack, the exact semantics at the bottom matter less.  The
APIs are generally clear; it's the programming layered on top of them
that's sloppy (alternatively: at the lowest-level API the behavior is
well documented, but as you move up the stack the behavior typically
becomes more poorly documented).  While it's probably fair to say "this
is a property of poor programming, or at least of programming done
against a model that we no longer want to hold true," there's still the
issue of cleaning up the legacy code...

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Network Associates Laboratories