Date: Sat, 25 Oct 2003 21:01:08 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200310260401.h9Q41858034072@apollo.backplane.com>
To: Robert Watson
cc: Kip Macy
cc: hackers@freebsd.org
cc: John-Mark Gurney
cc: Marcel Moolenaar
Subject: Re: FreeBSD mail list etiquette

:> It's a lot easier lockup path than the direction 5.x is going, and
:> a whole lot more maintainable IMHO because most of the coding doesn't
:> have to worry about mutexes or LORs or anything like that.
:
:You still have to be pretty careful, though, with relying on implicit
:synchronization, because while it works well deep in a subsystem, it can
:break down on subsystem boundaries.  One of the challenges I've been
:bumping into recently when working with Darwin has been the split between
:their Giant kernel lock, and their network lock.  To give a high-level
:summary of the architecture, basically they have two Funnels, which behave
:similarly to the Giant lock in -STABLE/-CURRENT: when you block, the lock
:is released, allowing other threads to enter the kernel, and regained when
:the thread starts to execute again.  They then have fine-grained locking
:for the Mach-derived components, such as memory allocation, VM, et al.

I recall a presentation at BSDCon that mentioned that... yours, I think.

The interfaces we are contemplating for the NETIF (at the bottom) and
UIPC (at the top) are different.  We probably won't need to use any
mutexes to queue incoming packets to the protocol thread; we will almost
certainly use an async IPI message to queue a message holding the packet
if the protocol thread is on a different cpu.  On the same cpu it's just
a critical section to interlock the queueing operation against the
protocol thread.  Protocol packet output to NETIF would use the same
methodology... an async IPI message if the NETIF is on another cpu, a
critical section if it is on the current cpu.

The protocol itself will change from a softint to a normal thread, or
perhaps a thread at softint priority.  The softint is already a thread,
but we would separate each protocol into its own thread and have the
ability to create several threads for a single protocol (like TCP) when
necessary to take advantage of multiple cpus.
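To make the dispatch idea a little more concrete, here is a rough sketch
of what queueing a packet to the protocol thread could look like.  It is
illustrative only; the structure and function names are invented for the
example and are not the actual DragonFly interfaces.

    /*
     * Sketch only: illustrative names, not the real DragonFly API.
     * Queue a received packet to the protocol thread that owns the PCB.
     * Same cpu: a critical section interlocks against the protocol
     * thread.  Different cpu: hand the packet over with an async
     * (non-blocking) IPI message instead of taking a mutex.
     */
    struct mbuf;                        /* packet data, details omitted */
    struct proto_thread;

    struct netmsg_packet {
        struct netmsg_packet *next;
        struct proto_thread  *td;       /* protocol thread owning the PCB */
        struct mbuf          *m;        /* the packet itself */
    };

    struct proto_thread {
        int                   td_cpuid; /* cpu the protocol thread runs on */
        struct netmsg_packet *td_queue; /* its input queue */
    };

    /* Assumed primitives for the sake of the example. */
    extern int  mycpuid;                /* id of the current cpu */
    extern void crit_enter(void);       /* enter critical section */
    extern void crit_exit(void);        /* leave critical section */
    extern void send_async_ipi(int cpu, void (*func)(void *), void *arg);
    extern void proto_thread_wakeup(struct proto_thread *td);

    static void
    netisr_enqueue_local(void *arg)
    {
        struct netmsg_packet *msg = arg;
        struct proto_thread  *td = msg->td;

        /*
         * This always runs on td's cpu; only that cpu touches td_queue,
         * so a critical section is all the interlock we need.
         */
        crit_enter();
        msg->next = td->td_queue;
        td->td_queue = msg;
        crit_exit();
        proto_thread_wakeup(td);
    }

    void
    netisr_dispatch(struct proto_thread *td, struct netmsg_packet *msg)
    {
        msg->td = td;
        if (td->td_cpuid == mycpuid)
            netisr_enqueue_local(msg);              /* same cpu */
        else
            send_async_ipi(td->td_cpuid,            /* other cpu */
                netisr_enqueue_local, msg);
    }

The point is just that the only cross-cpu synchronization is the IPI
queueing itself; neither side ever takes a mutex.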
On the UIPC side we have a choice of using a mutex to lock the socket
buffer, or passing a message to the protocol thread responsible for the
socket buffer (aka PCB).  There are tradeoffs for both approaches, since
if this is related to a write() it winds up being a synchronous message.
Another option is to COW the memory, but that might be too complex.
Smaller writes could simply copyin() the data as an option, or we could
treat the socket buffer as a FIFO, which would allow the system call UIPC
interface to append to it without holding any locks (other than a memory
barrier after the copy and before updating the index), then simply send a
kick-off message to the protocol thread telling it that more data is
present.

:Deep in a particular subsystem -- say, the network stack -- all works fine.
:The problem is at the boundaries, where structures are shared between
:multiple compartments.  I.e., process credentials are referenced by both
:"halves" of the Darwin BSD kernel code, and are insufficiently protected
:in the current implementation (they have a write lock, but no read lock,
:so it looks like it should be possible to get stale references with
:pointers accessed in a read form under two different locks).  Similarly,
:there's the potential for serious problems at the surprisingly frequently
:occurring boundaries between the network subsystem and remainder of the
:kernel: file descriptor related code, fifos, BPF, et al.  By making use of
:two large subsystem locks, they do simplify locking inside the subsystem,
:but it's based on a web of implicit assumptions and boundary
:synchronization that carries most of the risks of explicit locking.

Yes.  I'm not worried about BPF, and ucred is easy since it is already
95% of the way there, though messing with ucred's ref count will require
a mutex or an atomic bus-locked instruction even in DragonFly!  The route
table is our big issue.  TCP caches routes, so we can still BGL the route
table and achieve 85% of the scalable performance, so I am not going to
worry about the route table initially.

An example with ucred would be to passively queue it to a particular cpu
for action.  Let's say that instead of using an atomic bus-locked
instruction to manipulate ucred's ref count, we send a passive IPI to the
cpu 'owning' the ucred, with the ucred otherwise being read-only.  A
passive IPI, which I haven't implemented yet, simply queues an IPI
message without actually generating an interrupt on the target cpu unless
the CPU->CPU software IPI message FIFO is full, so it doesn't waste any
cpu cycles and multiple operations can be executed in a batch by the
target.  Passive IPIs can be used for things that do not require
instantaneous action, and both bumping and releasing ref counts can take
advantage of that.  I'm not saying that is how we will deal with ucred,
but it is a definite option.
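The shape of that option would be something like the following.  This is
only a sketch of the idea, not the real crhold()/crfree() code, and the
passive IPI interface shown here is invented for illustration.

    /*
     * Sketch only: each ucred is 'owned' by one cpu and only that cpu
     * ever writes the reference count, so no bus-locked instruction is
     * required.  Other cpus queue a passive IPI: the request goes into
     * the per-cpu software IPI FIFO, but no interrupt is generated
     * unless the FIFO fills up, so the owner processes ref count
     * adjustments in batches at its convenience.
     */
    struct ucred {
        int cr_cpuid;           /* cpu that owns (and writes) this ucred */
        int cr_ref;             /* ref count, written only by cr_cpuid */
        /* ... otherwise read-only credential data ... */
    };

    /* Assumed primitives for the sake of the example. */
    extern int  mycpuid;
    extern void send_passive_ipi(int cpu, void (*func)(void *), void *arg);

    static void
    crhold_local(void *arg)
    {
        struct ucred *cr = arg;

        ++cr->cr_ref;           /* runs on cr_cpuid, plain increment */
    }

    static void
    crfree_local(void *arg)
    {
        struct ucred *cr = arg;

        if (--cr->cr_ref == 0) {
            /* Last reference went away; we are on the owning cpu,
               so it is safe to tear the ucred down here. */
        }
    }

    void
    crhold(struct ucred *cr)
    {
        if (cr->cr_cpuid == mycpuid)
            crhold_local(cr);
        else
            send_passive_ipi(cr->cr_cpuid, crhold_local, cr);
    }

    void
    crfree(struct ucred *cr)
    {
        if (cr->cr_cpuid == mycpuid)
            crfree_local(cr);
        else
            send_passive_ipi(cr->cr_cpuid, crfree_local, cr);
    }

The release path has to run on the owning cpu anyway, since only that
cpu can know when the count really hits zero.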
:It's also worth noting that there have been some serious bugs associated
:with a lack of explicit synchronization in the non-concurrent kernel model
:used in RELENG_4 (and a host of other early UNIX systems relying on a
:single kernel lock).  These have to do with unexpected blocking deep in a
:function call stack, where it's not anticipated by a developer writing
:source code higher in the stack, resulting in race conditions.

I've encountered this with softupdates, so I know what you mean.
softupdates (at least in 4.x) is extremely sensitive to blocking in
places where it doesn't expect blocking to happen.  My free() code was
occasionally (and accidentally) blocking in an interrupt thread waiting
on kernel_map (I've already removed kmem_map from DragonFly), and this
was enough to cause softupdates to panic in its IO completion rundown
once in a blue moon due to assumptions about its lock 'lk'.

Synchronization is a bigger problem in 5.x than it is in DragonFly,
because in DragonFly most of the work is shoved over to the cpu that
'owns' the data structure via an async IPI.  E.g. when you want to
schedule thread X on cpu 1 and thread X is owned by cpu 2, cpu 1 will
send an async IPI to cpu 2 and cpu 2 will actually do the scheduling.
If the owning cpu changes while the message is in transit, cpu 2 will
simply chase the owning cpu, forwarding the message along.  It doesn't
matter if the cpuid is out of sync, in fact!  You don't even need a
memory barrier.  The same goes for the slab allocator... DragonFly does
not mess with a slab allocated by another cpu; it forwards the free()
request to the other cpu instead.

For a protocol, a protocol thread will own a PCB, so the PCB will be
'owned' by the cpu the protocol thread is on.  Any manipulation of the
PCB must occur on that cpu or otherwise be very carefully managed (e.g. a
FIFO rindex/windex for the socket buffer plus a memory barrier).  Our
intention is to encapsulate most operations as messages to the protocol
thread owning the PCB.
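The FIFO socket buffer mentioned above is the classic single-producer/
single-consumer arrangement.  A minimal sketch of the writer side, with
invented names and sizes (this is not the actual socket buffer code):

    /*
     * Sketch only.  The UIPC (system call) side is the only writer of
     * sb_windex and the protocol thread is the only writer of sb_rindex,
     * so neither side needs a lock: the producer copies the data in,
     * issues a memory barrier, publishes the new write index, and then
     * sends an asynchronous kick-off message to the protocol thread.
     */
    #include <stddef.h>

    #define SB_SIZE 16384                   /* power of 2 */
    #define SB_MASK (SB_SIZE - 1)

    struct sockfifo {
        volatile unsigned int sb_rindex;    /* advanced only by consumer */
        volatile unsigned int sb_windex;    /* advanced only by producer */
        char                  sb_data[SB_SIZE];
    };

    /* Assumed primitives for the sake of the example. */
    extern void cpu_sfence(void);                    /* store barrier */
    extern void proto_kickoff(struct sockfifo *sb);  /* 'more data' msg */

    /*
     * Producer side, e.g. called from write(): append what fits and
     * return the number of bytes actually copied.
     */
    size_t
    sockfifo_append(struct sockfifo *sb, const char *buf, size_t len)
    {
        unsigned int windex = sb->sb_windex;
        size_t space = SB_SIZE - (windex - sb->sb_rindex);
        size_t n = (len < space) ? len : space;
        size_t i;

        for (i = 0; i < n; ++i)
            sb->sb_data[(windex + i) & SB_MASK] = buf[i];

        cpu_sfence();           /* data must be visible before the index */
        sb->sb_windex = windex + n;
        if (n != 0)
            proto_kickoff(sb);  /* tell the protocol thread to drain */
        return (n);
    }

    /* Consumer side (protocol thread): bytes currently available. */
    size_t
    sockfifo_avail(const struct sockfifo *sb)
    {
        return (sb->sb_windex - sb->sb_rindex);
    }

The protocol thread drains it the same way in reverse (copy out, barrier,
then advance sb_rindex), and never has to take a lock either.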
:In the past, there have been a number of exploitable security
:vulnerabilities due to races opened up in low memory conditions, during
:paging, etc.  One solution I was exploring was using the compiler to help
:track the potential for functions to block, similar to the const
:qualifier, combined with blocking/non-blocking assertions evaluated at
:compile-time.  However, some of our current APIs (M_NOWAIT, M_WAITOK, et
:al) make that approach somewhat difficult to apply, and would have to be
:revised to use a compiler solution.  These potential weaknesses very much
:exist in an explicit model, but with explicit locking, we have a clearer
:notion of how to express assertions.

DragonFly is using its LWKT messaging API to abstract blocking versus
non-blocking.  In particular, if a client sends a message using an async
interface the operation isn't supposed to block in the client's context,
but the send can return EASYNC if the message wound up being queued
because it could not be executed synchronously without blocking.  If a
client sends a message using a synchronous messaging interface, then the
client is telling the messaging subsystem that it is ok to block.  This,
combined with the fact that we are using critical sections and per-cpu
globaldata caches that do not require mutexes to access, allows code to
easily determine whether something might or might not block, and the
message structure is a convenient placeholder with which to queue the
operation and return EASYNC deep in the kernel if something would
otherwise block when it isn't supposed to.  We also have the async IPI
mechanism and a few other mechanisms at our disposal, and these cover a
surprisingly large number of situations in the system.  90% of the 'not
sure if we might block' problem is related to scheduling or memory
allocation, and neither of those subsystems needs to use extraneous
mutexes, so managing the blocking conditions is actually quite easy.

:In -CURRENT, we make use of thread-based serialization in a number of
:places to avoid explicit synchronization costs (such as in GEOM for
:processing work queues), and we should make more use of this practice.
:I'm particularly interested in the use of interface interrupt threads
:performing direct dispatch as a means to maintain interface ordering of
:packets coming in network interfaces while allowing parallelism in
:network processing (you'll find this in use in Sam's netperf branch
:currently).
:
:Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
:robert@fledge.watson.org      Network Associates Laboratories

I definitely think that -current should explore a greater role for
threaded subsystems.  Remember that many operations can be done
asynchronously and thus do not actually require synchronous context
switches or blocking.  A GEOM strategy routine is a good example, since
it must perform I/O, and I/O *ALWAYS* blocks or takes an interrupt at
some point.

However, you need to be careful, because not all operations truly need
to be run in a threaded subsystem's thread context.  This is why
DragonFly's LWKT messaging subsystem uses the Amiga's BeginIo abstraction
for dispatching a message, which allows the target port to execute
messages synchronously in the context of the caller if it happens to be
possible to do so without blocking.  The advantage of this is that we can
start out by always queueing the message (thereby guaranteeing that
queue-mode operation will always be acceptable), and then later on we can
optimize particular messages (such as read()s that are able to lock and
access the VM object's page cache without blocking, in order to avoid
switching to a filesystem thread unnecessarily).  A rough sketch of this
dispatch hook appears at the end of this message.  I'm sure we will hit
issues, but so far it has been smooth sailing.

					-Matt
					Matthew Dillon
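A rough sketch of the BeginIo-style dispatch hook described above.  The
structures and names are simplified stand-ins invented for the example,
not the actual LWKT message and port definitions, and the EASYNC value is
arbitrary.

    /*
     * Sketch only: simplified stand-ins, not the real LWKT structures.
     * Sending a message calls through the target port's putport hook, so
     * the port decides whether to run the message synchronously in the
     * caller's context or queue it and return EASYNC.  A port that always
     * queues gives the baseline behaviour; individual message types can
     * be optimized later without touching any of the callers.
     */
    #define EASYNC  0x1000              /* arbitrary illustrative value */

    struct lwkt_msg;

    struct lwkt_port {
        int (*mp_putport)(struct lwkt_port *port, struct lwkt_msg *msg);
        struct lwkt_msg *mp_head;       /* queued messages (simplified) */
    };

    struct lwkt_msg {
        struct lwkt_msg  *ms_next;
        struct lwkt_port *ms_target;
        int             (*ms_cmd)(struct lwkt_msg *msg); /* the operation */
    };

    /* Assumed predicate: can ms_cmd run right now without blocking? */
    extern int msg_can_run_sync(struct lwkt_msg *msg);
    /* Assumed: wake the thread that owns the port so it drains mp_head. */
    extern void port_thread_wakeup(struct lwkt_port *port);

    /* Sending a message is just a call through the target port's hook. */
    int
    lwkt_sendmsg(struct lwkt_port *port, struct lwkt_msg *msg)
    {
        msg->ms_target = port;
        return (port->mp_putport(port, msg));
    }

    /* Baseline putport: always queue, always return EASYNC. */
    int
    putport_queue_only(struct lwkt_port *port, struct lwkt_msg *msg)
    {
        msg->ms_next = port->mp_head;
        port->mp_head = msg;
        port_thread_wakeup(port);
        return (EASYNC);
    }

    /*
     * Optimized putport: execute the command directly in the caller's
     * context when it is known not to block, else fall back to queueing.
     */
    int
    putport_try_direct(struct lwkt_port *port, struct lwkt_msg *msg)
    {
        if (msg_can_run_sync(msg))
            return (msg->ms_cmd(msg));          /* synchronous completion */
        return (putport_queue_only(port, msg));
    }

A caller that used the asynchronous send interface simply treats an
EASYNC return as "it will complete later"; a caller that used the
synchronous interface blocks waiting for the reply.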