Date:      Sun, 26 Oct 2003 00:09:21 -0700 (PDT)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        Robert Watson <rwatson@freebsd.org>
Cc:        Marcel Moolenaar <marcel@xcllnt.net>
Subject:   Re: Synchronization philosophy (was: Re: FreeBSD mail list etiquette)
Message-ID:  <200310260709.h9Q79LEq034660@apollo.backplane.com>
References:  <Pine.NEB.3.96L.1031026001054.74063I-100000@fledge.watson.org>


:...
:Moving towards message passing helps to structure the code to avoid
:sharing of this sort, although it's not the only way to motivate that sort
:of change.  I'm a big fan of the change in -CURRENT to use td->td_cred as
:a read-only thread-local credential reference and avoid synchronization on
:the credential reference--it nicely addresses the requirements for
:consistency in the referenced data for the read-only cases (which are the
:vast majority of uses of a credential).

    I saw td->td_cred in -current, but while figuring out what to do in
    DragonFly I realized that there was something fundamentally wrong
    with pulling a ucred out of a proc *OR* a thread.  What's wrong is
    that the ucred associated with the current process, or current thread,
    or passed process, or passed thread may not be the correct ucred for
    a particular operation, especially in DragonFly where I/O operations
    are going to be messages and may be aggregated into a single thread
    for a particular target, filesystem, or protocol in the case of the
    network.

    There is also the potential of writing buggy code... if the programmer
    has a thread based ucred handy he might use it whether it is the correct
    ucred or not.

    So in DragonFly I kept the ucred in the proc structure but only refer
    to it at the highest level.  The ucred is passed as an argument from
    the highest level and lower levels don't try to get it out of the proc
    or thread.  I've already fixed this for VFS.  I haven't fixed it for
    the network stack yet but it's pretty obvious to me that that is the
    only correct way of doing it, and actually not that hard to do.
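    (Editor's note: a toy Python sketch of the pattern described above --
    invented names, not DragonFly code.  The top level snapshots the
    credential once and passes it down; lower layers never reach back into
    the proc or thread:)

```python
# Toy model: the credential is captured at the top of the call chain and
# passed down explicitly; lower layers never consult "current thread" state.

class Ucred:
    def __init__(self, uid):
        self.uid = uid

def vop_write(vp, data, cred):
    # The low-level layer trusts only the credential it was handed.
    if cred.uid != vp["owner_uid"]:
        raise PermissionError("wrong credential for this operation")
    vp["contents"] += data

def syscall_write(proc, vp, data):
    # Only the highest level touches proc; it snapshots the cred once.
    cred = proc["ucred"]
    vop_write(vp, data, cred)

proc = {"ucred": Ucred(uid=100)}
vp = {"owner_uid": 100, "contents": ""}
syscall_write(proc, vp, "hello")
```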

    VFS was interesting... ucred was either being passed directly or pulled
    out of the proc structure inappropriately for things like VOP_WRITE,
    and indirectly via struct uio's proc/thread pointer.  And then there 
    were a dozen hacks to deal with the case where a flush or VM paging
    I/O might not occur from the originating user process, hence people
    may notice the existence of a ucred pointer in struct buf.  It
    was pretty bad.  

    I ripped it all out.  No ucred is passed for VOP I/O any more at all.
    The ucred is expected to be established to 'open' the file (aka vnode),
    but once open the driver is expected to be able to read and write through
    that vnode without any further ucred info.  And, in fact, most filesystem
    drivers completely ignored the 'ucred' argument for VOP_READ and
    VOP_WRITE anyway.  Only NFS didn't.  For NFS I just used the root
    ucred but later wound up having to record the ucred at open to deal with
    -maproot exports that used something other than root.  But that's a damn
    sight better than strewing ucred references all over the generic I/O
    APIs.
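    (Editor's note: a minimal Python sketch of that open-time model --
    hypothetical names, not actual VFS code.  The credential is checked and
    recorded when the vnode is opened; the read/write paths take no
    credential argument at all:)

```python
# Toy model: the credential is consumed at open() time; read/write on the
# resulting handle need no further ucred info.

class OpenFile:
    def __init__(self, vnode, cred_uid):
        if cred_uid != vnode["owner_uid"]:
            raise PermissionError("open denied")
        self.vnode = vnode
        self.cred_uid = cred_uid   # recorded for drivers (e.g. NFS) that need it later

    def write(self, data):         # no ucred argument at the VOP layer
        self.vnode["contents"] += data

    def read(self):
        return self.vnode["contents"]

vn = {"owner_uid": 0, "contents": ""}
f = OpenFile(vn, cred_uid=0)
f.write("accounting record\n")
```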

:...
:issues in FreeBSD: for example, even the relatively simple accounting file
:rotation would benefit from queue-like operation to serialize the
:accounting data/event stream and rotation events.  Using locks and
:condition variables to perform serialization as is currently done in the
:accounting code is unwieldy and bug-prone.  However, when moving to
:event/message queuing, you also have to be very careful with data
:ownership and referencing, as well as proper error-handling.  With

    The accounting code is a really excellent example.  The solution there
    is not to queue to a single serializing thread, but instead to queue
    to the filesystem driver, which pulls out the appropriate data (which
    it will own in DragonFly's case, hence no synchronization issue) and
    then queues to the accounting file or to a thread that manages the
    accounting file.  All you would be queueing at the highest level is
    a pointer to the vnode; you would not actually dereference the vnode.

:...
:regardless of whether accounting of that operation succeeds or fails,
:execve() continues).  The start/stop operation, however, is intended to be
:synchronous.  Happily, in the accounting case, all necessary error
:checking can be performed in advance of the handoff to the accounting
:thread from the user thread, but that won't always be the case...

    Synchronous ops can be queued as easily as asynchronous ops.  The only
    difference is that the caller has to wait for the synchronizing event
    to complete.  In DragonFly that would be a message that the caller
    waits for a reply on. 

    In fact, this can be advantageous.  Take filesystem snapshots for
    example.  If all VOPs for a filesystem are queued to a single thread
    representing that filesystem (as a simplified view), then you don't
    need fancy snapshot support at what would normally be considered
    the API boundary.  The message port becomes the API boundary and all
    a snapshot has to do is freeze the incoming messages and wait for the
    existing ones (who are undergoing I/O) to complete, then tell the
    underlying block device to snapshot.  Bang!  A poor-man's snapshot 
    that could be adapted to virtually any filesystem.
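    (Editor's note: a toy Python sketch of that poor-man's snapshot --
    invented names, grossly simplified.  All VOPs funnel through one port;
    a snapshot just stops draining the queue, lets in-flight work finish,
    and copies the state:)

```python
import copy
from collections import deque

class MsgPort:
    """Toy message port: all VOPs are queued and executed by one 'thread'."""
    def __init__(self, state):
        self.queue = deque()
        self.state = state          # the filesystem state the port serializes

    def send(self, fn):
        self.queue.append(fn)

    def run_pending(self):          # stand-in for the filesystem thread's loop
        while self.queue:
            self.queue.popleft()(self.state)

    def snapshot(self):
        # Freeze: stop draining the queue (trivially true here, since
        # run_pending is synchronous), then copy the quiescent state.
        return copy.deepcopy(self.state)

port = MsgPort({"blocks": []})
port.send(lambda st: st["blocks"].append("A"))
port.run_pending()                   # in-flight I/O completes
snap = port.snapshot()               # frozen, consistent view
port.send(lambda st: st["blocks"].append("B"))
port.run_pending()                   # later writes don't touch the snapshot
```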

:One of the other risks that has worried me about this approach is that
:explicit locking has some nice benefits from the perspective of
:deadlocking and lock order management: monitoring for deadlocks and lock
:orders is a well-understood topic, and the tools for tracking deadlocks
:and wait conditions, as well as for performing locking and waiting safely,
:are mature.  As with the older BSD sleeping interfaces, such as
:tsleep(), synchronous waits on messages are harder to mechanically track,
:and resulting deadlocks resemble resource deadlocks more than lock
:deadlocks...  On the other hand, some forms of tracing may be made easier. 
:I've had some pretty nasty experiences trying to track deadlocks between
:cooperating threads due to message waits, and found tools such as WITNESS
:much easier to work with. 

    I would agree that tracking a message, which might traverse several
    threads before being returned, is far harder than breaking into
    DDB and doing a backtrace.  On the other hand, if you know the pointer
    to the message, adding debugging fields to the message to track who
    retrieved or sent it last is quite easy.  You only have to shim
    lwkt_sendmsg(), lwkt_domsg(), lwkt_waitmsg(), and so forth.  Then
    from DDB> you just 'print *msg' from the frame of the process that is
    blocked waiting on the message.
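    (Editor's note: a toy Python sketch of those shims -- hypothetical
    fields and names, not the real LWKT API.  The send/wait wrappers stamp
    tracking fields on the message so a debugger can see who touched it
    last:)

```python
class Msg:
    def __init__(self, payload):
        self.payload = payload
        self.last_sender = None     # debug fields the shims maintain
        self.last_receiver = None
        self.done = False

def lwkt_sendmsg_shim(msg, sender):    # shimmed send: stamp the sender
    msg.last_sender = sender
    return msg

def lwkt_waitmsg_shim(msg, receiver):  # shimmed wait: stamp the receiver
    msg.last_receiver = receiver
    msg.done = True
    return msg.payload

m = lwkt_sendmsg_shim(Msg("write 4KB"), sender="user:write()")
payload = lwkt_waitmsg_shim(m, receiver="fs-thread")
# 'print *msg' from DDB would now show who last sent and received it.
```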

    Deadlocking is something that a fine-grained model has to deal with
    a lot more than a serializing threaded model.  In the threaded model
    one has to contend with far *fewer* high level locks.  In DragonFly
    my hope is that the issue becomes even further removed due to the 
    IPI messaging mechanism which devolves many things which have to be
    mutexes in 5.x into simple critical sections (like spl's) in DragonFly.

:...
:thread making use of thread-local storage to avoid explicit
:synchronization.  Having dealt both with lower level lock/cv primitives
:for event passing, and message passing, I have to say I'm leaning far more
:towards the message passing.  However, it benefits particularly from the
:approach due to its asynchronous nature... 

    Message passing is far superior for asynchronous operations, when you
    don't have to wait, because there are a thousand ways to implement an
    asynch messaging backend and a thousand ways to optimize it in ways
    that can completely avoid conflicts, cache issues, mutexes, and so forth.
    With a locking abstraction you always have to execute the locking and
    unlocking primitive which can be far more expensive.

    A messaging backend, hidden by the messaging abstraction, can
    optimize for cases that would be EXTREMELY cumbersome to optimize for
    with a locking abstraction.  Take the example below... it's overkill
    but it is only an illustration.  Here the messaging backend can optimize
    for the same-cpu case, whether the client allows blocking or not (whether
    the client used lwkt_sendmsg() or lwkt_domsg(), basically), and
    potentially other optimizations.  Every single optimization is COMPLETELY
    hidden from the client that dispatched the message.

	(this function is executed in the context of the client, not the
	target, but may queue to the target)

	messaging_backend_for_a_particular_port(msg, datastructure)
	{
	    if (datastructure->cpu == mycpu) {
		crit_enter();	<<< not even this if interrupts don't mess
				<<< with the data structure
		do stuff;
		crit_exit();
		return(code);
	    }
	    if (client_wants_to_send_msg_synchronously) {
		... fine, the client is allowed to block or switch ...
		... just queue it, but without a reply port ...
		lwkt_switch();	/* do a quick blip blip, see if it's done */
		if ((msg->ms_flags & MSGF_DONE) == 0) {
		    ... wasn't a quick blip blip, formally block on the
			message, wait for wakeup ...
		}
		return(msg->ms_retcode);
	    } else {
		... just queue it ...
		return(EASYNC);
	    }
	}

    On the other hand, the performance of locking versus messaging in the
    synchronous case is a lot harder to quantify, because theoretically you
    do not have the overhead of having to build a message in the
    synchronous case if you are using locking.  So even if the message itself
    implements locking internally, degenerating into, well, locking, it
    would still be slower.

    But I would argue that for certain particular inflection points in the
    system, such as VOP_READ() or VOP_WRITE() for example, a messaging
    abstraction will be far superior to an API and/or locking abstraction
    because the tiny difference in overhead is well worth all the cool things
    you can hide behind the abstraction.

    System calls fall under this category as well.  Encapsulating a system
    call in a message is trivial (because DragonFly already does it :-)),
    and only slightly slower than the direct syscall mechanism.  But the
    power one gets out of that sort of abstraction is phenomenal.

:>     For a protocol, a protocol thread will own a PCB, so the PCB will be
:>     'owned' by the cpu the protocol thread is on.  Any manipulation of the
:>     PCB must occur on that cpu or otherwise be very carefully managed
:>     (e.g. FIFO rindex/windex for the socket buffer and a memory barrier).
:>     Our intention is to encapsulate most operations as messages to the
:>     protocol thread owning the PCB.
:
:I'll be very interested to see how this ends up working out: even in
:RELENG_4, FreeBSD has sufficient apparent parallelism/preemption in the
:network stack to require synchronization at most levels.  In RELENG_4,
:most upward bound traffic is serialized via the netisr thread, but
:downward traffic from user threads passes through the stack in parallel.
:Do you anticipate handing off control to a netisr-like thread earlier than
:RELENG_4 does in order to get things "into the right thread"?

    That's a good point.  The real question here is determined by how one
    defines 'parallelism'.  The reality of an N cpu system is that the only
    actual parallelism you have is N, the number of cpus you have in the
    system.  So the only effect that really matters with the serializing
    effect on downward bound network traffic is the cost of the context
    switch.  In DragonFly's case that would be a light weight kernel thread
    context switch, which is far less expensive than a heavy weight
    context switch.  But it *is* still a context switch.

    The question in regards to DragonFly is whether the messaging function
    can deal with the case where, say, a write() results in queueing an output
    packet as efficiently as 4.x would be able to deal with the case by
    calling tcp_output().  (the messaging function that is executed in
    the context of the client, that is).

    My take on the situation is that on a 2xCPU system DragonFly could 
    potentially be faster because the message that wakes up the protocol
    thread can potentially be waking up the protocol thread on a different
    cpu and return immediately to userland on the original cpu without
    userland having to wait for the protocol operation.  On a UP system
    DragonFly would be slower along this path due to the extra context
    switch, at least under low loads.  Under high loads DragonFly ought
    to be around the same speed because the messages would aggregate and
    the number of context switches into the protocol thread would actually
    go down relative to the number of write()'s being performed.

    It is also possible that DragonFly would not be slower under any 
    circumstances, especially if the TCP pipe is full, because the protocol
    thread will be woken up by the ACK packet coming back up the stack from
    the NETIF interrupt (which is effectively a software interrupt), and it
    could process the WRITE-wakeup at the same time.  In fact, this could
    potentially be faster because the WRITE going down the stack wouldn't
    even bother to run tcp_output(), it would just wakeup the protocol thread
    (send a wakeup message to it, anyway), which is a NOP if the
    protocol thread is already awake.  So a WRITE would degenerate to
    appending to the socket buf and returning to userland and the
    protocol thread would primarily be driven by ACKs coming back over the
    wire.  That's a 2-for-1 deal since ACKs are typically one per every two
    data packets.
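    (Editor's note: a toy Python sketch of that coalescing behavior --
    invented names.  write() just appends to the socket buffer and posts a
    wakeup that is a NOP if one is already pending, so many writes cost one
    protocol-thread activation:)

```python
class ProtocolThread:
    def __init__(self):
        self.sockbuf = []
        self.wakeup_pending = False
        self.activations = 0

    def wakeup(self):               # NOP if a wakeup is already pending
        if not self.wakeup_pending:
            self.wakeup_pending = True

    def run(self):                  # one activation drains everything queued
        if self.wakeup_pending:
            self.activations += 1
            self.sockbuf.clear()    # pretend tcp_output() consumed it all
            self.wakeup_pending = False

def write(pt, data):
    pt.sockbuf.append(data)         # the cheap part, done in the caller
    pt.wakeup()                     # coalesces with any pending wakeup

pt = ProtocolThread()
for chunk in ("a", "b", "c"):
    write(pt, chunk)                # three writes...
pt.run()                            # ...one protocol-thread activation
```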

:One of the conceptual issues we've been wrestling with is the issue of
:ordering of events: from the perspective of performance, guaranteeing the
:weakest possible ordering is desirable.  However, with parallelism in the
:stack, you risk introducing weaker than permitted ordering of packets
:(triggering fast retransmits, etc).  In most situations, it's sufficient
:to maintain source ordering: if two packets are from the same source,
:their ordering should be maintained.  If they are from separate sources,
:maintaining order isn't required.  This raises interesting questions about
:when you want to defer processing, and when to try and "catch up" to
:maintain performance and ordering.

    This is a problem if you have multiple protocol threads picking up 
    random work.  I don't think it is a problem if you partition the PCBs
    among protocol threads so a particular PCB is always handled by a
    particular protocol thread.
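    (Editor's note: a toy Python sketch of that partitioning -- invented
    names.  A fixed hash maps each PCB to exactly one protocol thread, so
    packets from one connection are always handled in order by the same
    thread, with no locking:)

```python
NCPUS = 4

def pcb_to_thread(pcb_id):
    # A fixed mapping: the same PCB always lands on the same protocol
    # thread, so per-connection packet order is preserved without locks.
    return pcb_id % NCPUS

queues = [[] for _ in range(NCPUS)]   # one input queue per protocol thread

def enqueue_packet(pcb_id, pkt):
    queues[pcb_to_thread(pcb_id)].append((pcb_id, pkt))

for seq in range(5):
    enqueue_packet(7, seq)            # all of connection 7's packets share a queue

q = queues[pcb_to_thread(7)]
```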

    This is the old 'get a lock, pull work off the queue, execute it' problem,
    when one is using locks to parallelize operations on queues.  It reminds
    me of NFS's nfsiod processes, which until I fixed it had EXACTLY the
    problem you describe.  Several nfsiod processes would trip over each
    other trying to get the same VNODE lock for read-ahead and write ops.

:>     DragonFly is using its LWKT messaging API to abstract blocking versus
:>     non-blocking.  In particular, if a client sends a message using an
:>     asynch interface it isn't supposed to block, but can return EASYNC if it
:>     wound up queueing the message due to not being able to execute it
:>     synchronous without blocking.  If a client sends a message using a
:>     synchronous messaging interface then the client is telling the
:>     messaging subsystem that it is ok to block.
:
:The risk here is not so much that the first level of consumer code for an
:interface can't determine if it will block or not, but that after a few
:levels of abstraction, that information has disappeared (or worse, is
:conditional on context in an opaque way).

    Ah, but that assumes that the levels of abstraction are enclosed within
    the one (originating) thread.  With a messaging layer major inflection
    points are in different threads, so the problem becomes compartmentalized.
    An operation which MUST be async in thread #1 does not have to be async
    in thread #2 which picks it up.

:>     in the system.  90% of the 'not sure if we might block' problem
:>     is related to scheduling or memory allocation and neither of those
:>     subsystems needs to use extraneous mutexes, so managing the blocking
:>     conditions is actually quite easy.
:
:Again, I think it comes down to the fact that memory allocation APIs
:typically offer choices to the consumer: block if the resources aren't
:available, or fail.  My mood swings a bit back and forth as to what the
:ideal strategy would/should be at the lowest level, but I think as you
:move up the stack, the exact semantics at the bottom matter less.  The
:APIs are generally clear, but it's the programming layered on top of it
:that's sloppy (alternatively: at the bottom level API, the behavior is
:well-documented, but as you move up the stack, the behavior typically
:becomes more poorly documented).  While it's probably appropriate to say
:"this is a property of poor programming, or at least, programming done
:against a model that we want no longer to hold true", there's still the
:issue of cleaning up the legacy code...
:
:Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
:robert@fledge.watson.org      Network Associates Laboratories

    In the case of FreeBSD-4.x, and 5.x as well, the problem devolves into
    the unknown quantity that locking the kernel_map represents, and the
    unknown quantity that obtaining and releasing a mutex represents.  Even
    in DragonFly, for allocations, I have to deal with M_NOWAIT and locking
    the kernel_map if the slab allocator has no free memory in the per-cpu
    globaldata area. 

    And in DragonFly I have to deal with the equivalent of mutexes in 
    regards to the IPI messaging.  Even though IPI messaging is effectively
    asynchronous, it can *STILL* stall for a bit.  If the IPI message FIFO
    between two cpus is full the IPI messaging code must spin waiting for
    it to empty a little and at the same time process any incoming IPI
    messages (to avoid IPI messaging deadlocks between two cpus).  This 
    means that any time you send an IPI message you have to assume that
    your cpu will also have processed any pending IPI messages to it.  It
    hasn't created any problems yet, because IPI operations are still
    serialized within the context of the cpu (so we don't have to worry
    about preemption), and since IPI messaging is used to serialize operations
    on a cpu anyway the fact that several IPI messages may execute when you
    send a new one does not really interfere with the abstraction.
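    (Editor's note: a toy Python sketch of that full-FIFO rule -- invented
    names, with the remote cpu's draining simulated inline.  When the
    outgoing FIFO is full, the sender processes its own incoming IPIs while
    it spins, so two cpus flooding each other cannot deadlock:)

```python
from collections import deque

FIFO_DEPTH = 2

class Cpu:
    def __init__(self, name):
        self.name = name
        self.incoming = deque()
        self.processed = []

    def process_incoming(self):
        while self.incoming:
            self.processed.append(self.incoming.popleft())

def ipi_send(src, dst, msg):
    # If dst's FIFO is full we must make progress on our own incoming
    # messages while we wait -- otherwise two cpus flooding each other
    # would spin forever.
    while len(dst.incoming) >= FIFO_DEPTH:
        src.process_incoming()      # the sender drains its own FIFO too
        dst.process_incoming()      # stand-in for dst eventually draining
    dst.incoming.append(msg)

a, b = Cpu("cpu0"), Cpu("cpu1")
for i in range(5):
    ipi_send(a, b, i)               # FIFO depth is 2, yet all 5 sends succeed
b.process_incoming()
```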

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>


