From: Terry Lambert <tlambert2@mindspring.com>
Date: Tue, 28 Oct 2003 03:44:06 -0800
To: Robert Watson
Cc: Kip Macy, Matthew Dillon, hackers@freebsd.org,
    John-Mark Gurney, Marcel Moolenaar
Subject: Re: Fine grained locking (was: FreeBSD mail list etiquitte)
List-Id: Technical Discussions relating to FreeBSD

Robert Watson wrote:
> On Sat, 25 Oct 2003, Matthew Dillon wrote:
> > It's a lot easier lockup path than the direction 5.x is going, and
> > a whole lot more maintainable IMHO because most of the coding
> > doesn't have to worry about mutexes or LORs or anything like that.
>
> You still have to be pretty careful, though, with relying on
> implicit synchronization, because while it works well deep in a
> subsystem, it can break down on subsystem boundaries. One of the
> challenges I've been bumping into recently when working with Darwin
> has been the split between their Giant kernel lock and their network
> lock. To give a high level summary of the architecture: basically,
> they have two Funnels, which behave similarly to the Giant lock in
> -STABLE/-CURRENT: when you block, the lock is released, allowing
> other threads to enter the kernel, and regained when the thread
> starts to execute again. They then have fine-grained locking for
> the Mach-derived components, such as memory allocation, VM, et al.
>
> Deep in a particular subsystem -- say, the network stack -- all
> works fine. The problem is at the boundaries, where structures are
> shared between multiple compartments. I.e., process credentials are
> referenced by both "halves" of the Darwin BSD kernel code, and are
> insufficiently protected in the current implementation (they have a
> write lock, but no read lock, so it looks like it should be possible
> to get stale references with pointers accessed in a read form under
> two different locks). Similarly, there's the potential for serious
> problems at the surprisingly frequently occurring boundaries between
> the network subsystem and the remainder of the kernel: file
> descriptor related code, fifos, BPF, et al. By making use of two
> large subsystem locks, they do simplify locking inside the
> subsystem, but it's based on a web of implicit assumptions and
> boundary synchronization that carries most of the risks of explicit
> locking.

Are you maybe looking at the Jaguar (10.2) code base here? I
personally fixed a number of these types of issues.

Mostly, these types of issues come down to the fact that funnels,
like the BGL in FreeBSD, are actually code locks, not data locks.
FWIW, it looks like the 10.3 (Panther) code is available from the
Darwin (7.0) project now: http://developer.apple.com/darwin/

Mixing code and data locks is where most problems happen; it's one of
the reasons that I think the BGL, and the push-down to fine-grained
locking in FreeBSD, is the wrong way to go. The destination is
right, but the path is probably wrong, and things will get much worse
with the FreeBSD approach before they get any better.

Part of the problem is that FreeBSD's mutex implementation is
complicated, and permits recursion (thereby tacitly encouraging
recursion). I personally think recursion should result in a panic --
though you could maybe make an argument for it on the basis of pool
mutexes, where the same mutex pool is used for all your locks. But
FreeBSD doesn't really do that effectively, so it's kind of a
lose/lose situation.

It's really, really hard to put multiple code locks into a kernel and
get things right. That's the problem with the FreeBSD approach:
having a BGL in the first place implicitly turns all the data locks
that have to coexist with the BGL code lock into code locks as well.

I think the right approach would have been to start with a single
pool mutex implementation, allowing recursion, and a macro wrapper
for grabbing/releasing locks, so that the pool locks could later be
substituted with per-data-item locks, and the per-data-item locks
would *disallow* recursion. A smart approach to that would be to
have a structure prefix on all kernel structures, such that you could
abuse a cast and use a single mutex implementation for all data
items.

> It's also worth noting that there have been some serious bugs
> associated with a lack of explicit synchronization in the
> non-concurrent kernel model used in RELENG_4 (and a host of other
> early UNIX systems relying on a single kernel lock).
> These have to do with unexpected blocking deep in a function call
> stack, where it's not anticipated by a developer writing source code
> higher in the stack, resulting in race conditions. In the past,
> there have been a number of exploitable security vulnerabilities due
> to races opened up in low memory conditions, during paging, etc.

I think this is a general problem. I noted early on that FreeBSD was
going to have issues in this regard once there was real kernel
threading support. The KSE model is much less prone to triggering
the race conditions, but any model where the same process can be in
the kernel multiple times is problematic, particularly since the BSD
code doesn't really support the concept of cancelling a blocking
system call in progress (you don't have a means of doing the wakeup
to fail the tsleep()s with a recognizable error code that could mean
"cancellation: back out your state").

It's likely that libthr could demonstrate the issues much more
quickly, so at least in that respect it's a good test jig for needing
to deal with these issues sooner rather than later.

Really, I think the BSD code needs to be refactored. The problem is
an inability to back out state once you are down one or more calls in
the stack. The Mach code is much better in this regard: all the
locking, if multiple data objects need to be locked, has to happen at
the same level in the call graph. That's some serious hacking that
has to happen.

> One solution I was exploring was using the compiler to help track
> the potential for functions to block, similar to the const
> qualifier, combined with blocking/non-blocking assertions evaluated
> at compile-time. However, some of our current APIs (M_NOWAIT,
> M_WAITOK, et al) make that approach somewhat difficult to apply, and
> would have to be revised to use a compiler solution. These
> potential weaknesses very much exist in an explicit model, but with
> explicit locking, we have a clearer notion of how to express
> assertions.
Absolutely a great idea! Nothing beats proper tools; tools don't
forget to look at some functions, etc.

If you are getting into the asserts (which I think is a great idea),
then you will want to change all the functions where you have them to
single-entry, single-exit (more code refactoring). I tried to do
this with some of the FS patches I did back in 1995/1996, but the
changes were seen as gratuitous. Maybe now it's an idea whose time
has come. This will likely involve some additional complexity (i.e.
use of negative logic), but the payoff would definitely be worth it
(and you'll get better performance on top of that, if you run on a
processor supporting branch prediction, or your compiler supports
tail-call optimization -- and GCC does).

> In -CURRENT, we make use of thread-based serialization in a number
> of places to avoid explicit synchronization costs (such as in GEOM
> for processing work queues), and we should make more use of this
> practice. I'm particularly interested in the use of interface
> interrupt threads performing direct dispatch as a means to maintain
> interface ordering of packets coming in network interfaces while
> allowing parallelism in network processing (you'll find this in use
> in Sam's netperf branch currently).

I don't know if I agree that this type of serialization is the right
thing to do; it basically implies code locks (again), which leaves
you in the same hole as the BGL.

Personally, I'd like to see more reference counting of objects
(effectively, a read lock, as you noted above). I don't think you
can safely discard an object just because you have thread
serialization. For example, I think it's very possible to shoot
your foot off in FreeBSD right now with a simple multithreaded
program that forces a close race against a blocking system call
(e.g. a read on a socket).
This would be a much more frequent occurrence on an MP box, and as
CPU affinity and negaffinity of threads within a single process (to
max out concurrency) go in, the problem will go critical all by
itself, and become totally unlivable for the majority of FreeBSD
users (it doesn't help that most Intel CPUs these days are
Hyperthreaded, so even if you are on a UP system, you're on an MP
system, which is just going to exacerbate the problem).

Unfortunately, there's a lot of threaded code that depends on the Sun
behaviour (full cancellation support; Sun already refactored their
code back in 1993/1994). Java code in particular tends to do this a
lot, and expects an exception to get thrown.

Java is evil.

-- Terry