From: Robert Watson <robert@fledge.watson.org>
Date: Sun, 26 Oct 2003 00:40:54 -0400 (EDT)
To: Matthew Dillon
cc: Kip Macy, hackers@freebsd.org, John-Mark Gurney, Marcel Moolenaar
Subject: Synchronization philosophy (was: Re: FreeBSD mail list etiquette)

(Subject changed to reflect the fact that the thread now contains useful
technical content as well as banter, resulting in a hijacking of the
thread; hope no one minds.)

On Sat, 25 Oct 2003, Matthew Dillon wrote:

> Yes.  I'm not worried about BPF, and ucred is easy since it is
> already 95% of the way there, though messing with ucred's ref count
> will require a mutex or an atomic bus-locked instruction even in
> DragonFly!  The route table is our big issue.
> TCP caches routes so we
> can still BGL the route table and achieve 85% of the scaleable
> performance so I am not going to worry about the route table initially.
>
> An example with ucred would be to passively queue it to a particular cpu
> for action.  Lets say instead of using an atomic bus-locked instruction
> to manipulate ucred's ref count, we instead send a passive IPI to the
> cpu 'owning' the ucred, and that ucred is otherwise read-only.  A
> passive IPI, which I haven't implemented yet, is simply queueing an
> IPI message but not actually generating an interrupt on the target cpu
> unless the CPU->CPU software IPI message FIFO is full, so it doesn't
> actually waste any cpu cycles and multiple operations can be executed
> in-batch by the target.  Passive IPIs can be used for things
> that do not require instantanious action and both bumping and releasing
> ref counts can take advantage of it.  I'm not saying that is how
> we will deal with ucred, but it is a definite option.

Actually, the problem isn't so much the data referenced by the ucred as
the references themselves.  Part of the issue in Darwin is that ucred
references are always gained via the p_ucred pointer in the proc
structure.  The proc structure is read and dereferenced fairly deep in
the network code (network funnel), and also in the remainder of the
kernel (kernel funnel).  In addition, there's a lock used to serialize
writes to p->p_ucred, but nothing to protect against reads of stale
data.  Shared structures such as these occur in pretty large quantity
in BSD code, and will be a problem no matter what approach to
synchronization is taken.  Moving towards message passing helps to
structure the code to avoid sharing of this sort, although it's not the
only way to motivate that sort of change.
I'm a big fan of the change in -CURRENT to use td->td_cred as a
read-only thread-local credential reference and avoid synchronization
on the credential reference--it nicely addresses the requirements for
consistency in the referenced data for the read-only cases (which are
the vast majority of uses of a credential).

There are a number of cases where moving towards a message-passing
philosophy would really clean up the synchronization and parallelism
issues in FreeBSD: for example, even the relatively simple accounting
file rotation would benefit from queue-like operation to serialize the
accounting data/event stream and rotation events.  Using locks and
condition variables to perform serialization, as the accounting code
currently does, is unwieldy and bug-prone.

However, when moving to event/message queueing, you also have to be
very careful about data ownership and referencing, as well as proper
error handling.  With accounting, most scheduled vnode operations are
asynchronous and have no need for substantial error handling (when a
process performs execve(), execve() continues regardless of whether
accounting of that operation succeeds or fails).  The start/stop
operation, however, is intended to be synchronous.  Happily, in the
accounting case, all necessary error checking can be performed in
advance of the handoff from the user thread to the accounting thread,
but that won't always be the case...

One of the other risks that has worried me about this approach is that
explicit locking has some nice properties from the perspective of
deadlock and lock order management: monitoring for deadlocks and lock
orders is a well-understood topic, and the tools for tracking deadlocks
and wait conditions, as well as for performing locking and waiting
safely, are mature.  As with the older BSD sleeping interfaces, such as
tsleep(), synchronous waits on messages are harder to mechanically
track, and the resulting deadlocks resemble resource deadlocks more
than lock deadlocks...
On the other hand, some forms of tracing may be made easier.  I've had
some pretty nasty experiences trying to track down deadlocks between
cooperating threads caused by message waits, and have found tools such
as WITNESS much easier to work with.  In some work we're doing for one
of our customers, we make extensive use of handoff between various
submitting threads and a serializing kernel thread, using thread-local
storage to avoid explicit synchronization.  Having dealt both with
lower-level lock/cv primitives for event passing and with message
passing, I have to say I'm leaning far more towards message passing.
However, that work benefits particularly from the approach due to its
asynchronous nature...

> For a protocol, a protocol thread will own a PCB, so the PCB will be
> 'owned' by the cpu the protocol thread is on.  Any manipulation of the
> PCB must occur on that cpu or otherwise be very carefully managed
> (e.g. FIFO rindex/windex for the socket buffer and a memory barrier).
> Our intention is to encapsulate most operations as messages to the
> protocol thread owning the PCB.

I'll be very interested to see how this ends up working out: even in
RELENG_4, FreeBSD has sufficient apparent parallelism/preemption in the
network stack to require synchronization at most levels.  In RELENG_4,
most upward-bound traffic is serialized via the netisr thread, but
downward traffic from user threads passes through the stack in
parallel.  Do you anticipate handing off control to a netisr-like
thread earlier than RELENG_4 does in order to get things "into the
right thread"?

One of the conceptual issues we've been wrestling with is the ordering
of events: from the perspective of performance, guaranteeing the
weakest possible ordering is desirable.  However, with parallelism in
the stack, you risk introducing weaker-than-permitted ordering of
packets (triggering fast retransmits, etc).
In most situations, it's sufficient to maintain source ordering: if two
packets are from the same source, their ordering should be maintained.
If they are from separate sources, maintaining order isn't required.
This raises interesting questions about when you want to defer
processing, and when to try to "catch up" to maintain performance and
ordering.

> :past, there have been a number of exploitable security vulnerabilities due
> :to races opened up in low memory conditions, during paging, etc.  One
> :solution I was exploring was using the compiler to help track the
> :potential for functions to block, similar to the const qualifier, combined
> :with blocking/non-blocking assertions evaluated at compile-time.  However,
> :some of our current APIs (M_NOWAIT, M_WAITOK, et al) make that approach
> :somewhat difficult to apply, and would have to be revised to use a
> :compiler solution.  These potential weaknesses very much exist in an
> :explicit model, but with explicit locking, we have a clearer notion of how
> :to express assertions.
>
> DragonFly is using its LWKT messaging API to abstract blocking verses
> non-blocking.  In particular, if a client sends a message using an
> asynch interface it isn't supposed to block, but can return EASYNC if it
> wound up queueing the message due to not being able to execute it
> synchronous without blocking.  If a client sends a message using a
> synchronous messaging interface then the client is telling the
> messaging subsystem that it is ok to block.

The risk here is not so much that the first level of consumer code for
an interface can't determine whether it will block, but that after a
few levels of abstraction that information has disappeared (or worse,
is conditional on context in an opaque way).
> This combined with the fact that we are using critical sections and
> per-cpu globaldata caches that do not require mutexes to access allows
> code to easily determine whether something might or might not block,
> and the message structure is a convenient placemark to queue and
> return EASYNC deep in the kernel if something would otherwise block
> when it isn't supposed to.
>
> We also have the asynch IPI mechanism and a few other mechanisms at
> our disposal and these cover a surprisingly large number of situations
> in the system.  90% of the 'not sure if we might block' problem
> is related to scheduling or memory allocation and neither of those
> subsystems needs to use extranious mutexes, so managing the blocking
> conditions is actually quite easy.

Again, I think it comes down to the fact that memory allocation APIs
typically offer the consumer a choice: block if the resources aren't
available, or fail.  My mood swings back and forth as to what the ideal
strategy would/should be at the lowest level, but I think that as you
move up the stack, the exact semantics at the bottom matter less.  The
APIs are generally clear; it's the programming layered on top of them
that's sloppy (alternatively: at the lowest-level API the behavior is
well documented, but as you move up the stack the behavior typically
becomes more poorly documented).  While it's probably fair to say "this
is a property of poor programming, or at least of programming done
against a model that we no longer want to hold true," there's still the
issue of cleaning up the legacy code...

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Network Associates Laboratories