From: Terry Lambert <tlambert2@mindspring.com>
Date: Tue, 28 Oct 2003 03:44:06 -0800
To: Robert Watson
Cc: Kip Macy, Matthew Dillon, hackers@freebsd.org,
    John-Mark Gurney, Marcel Moolenaar
Subject: Re: Fine grained locking (was: FreeBSD mail list etiquitte)
List-Id: Technical Discussions relating to FreeBSD

Robert Watson wrote:
> On Sat, 25 Oct 2003, Matthew Dillon wrote:
> > It's a lot easier lockup path than the direction 5.x is going, and
> > a whole lot more maintainable IMHO because most of the coding
> > doesn't have to worry about mutexes or LORs or anything like that.
>
> You still have to be pretty careful, though, with relying on
> implicit synchronization, because while it works well deep in a
> subsystem, it can break down on subsystem boundaries. One of the
> challenges I've been bumping into recently when working with Darwin
> has been the split between their Giant kernel lock and their network
> lock. To give a high level summary of the architecture: basically,
> they have two Funnels, which behave similarly to the Giant lock in
> -STABLE/-CURRENT: when you block, the lock is released, allowing
> other threads to enter the kernel, and regained when the thread
> starts to execute again. They then have fine-grained locking for
> the Mach-derived components, such as memory allocation, VM, et al.
>
> Deep in a particular subsystem -- say, the network stack -- all
> works fine. The problem is at the boundaries, where structures are
> shared between multiple compartments. I.e., process credentials are
> referenced by both "halves" of the Darwin BSD kernel code, and are
> insufficiently protected in the current implementation (they have a
> write lock, but no read lock, so it looks like it should be possible
> to get stale references with pointers accessed in a read form under
> two different locks). Similarly, there's the potential for serious
> problems at the surprisingly frequently occurring boundaries between
> the network subsystem and the remainder of the kernel: file
> descriptor related code, fifos, BPF, et al. By making use of two
> large subsystem locks, they do simplify locking inside the
> subsystem, but it's based on a web of implicit assumptions and
> boundary synchronization that carries most of the risks of explicit
> locking.

Are you maybe looking at the Jaguar (10.2) code base here? I
personally fixed a number of these types of issues.

Mostly, these types of issues come down to the fact that funnels,
like the BGL in FreeBSD, are actually code locks, not data locks.
FWIW, it looks like the 10.3 (Panther) code is available from the
Darwin (7.0) project now: http://developer.apple.com/darwin/

Mixing code and data locks is where most problems happen; it's one of
the reasons that I think the BGL, and the push-down to fine-grained
locking in FreeBSD, is the wrong way to go. The destination is
right, but the path is probably wrong, and things will get much worse
with the FreeBSD approach before they get any better.

Part of the problem is that FreeBSD's mutex implementation is
complicated, and permits recursion (thereby tacitly encouraging
recursion). I personally think recursion should result in a panic --
though you could maybe make an argument for it on the basis of pool
mutexes, where the same mutex pool is used for all your locks. But
FreeBSD doesn't really do that effectively, so it's kind of a
lose/lose situation.

It's really, really hard to put multiple code locks into a kernel and
get things right. That's the problem with the FreeBSD approach:
having a BGL in the first place implicitly turns all the data locks
that have to coexist with the BGL code lock into code locks as well.

I think the right approach would have been to start with a single
pool mutex implementation, allowing recursion, and a macro wrapper
for grabbing/releasing locks, so that the pool locks could later be
substituted with per-data-item locks, and the per-data-item locks
would *disallow* recursion. A smart approach to that would be to
have a structure prefix on all kernel structures, such that you could
abuse a cast and use a single mutex implementation for all data
items.

> It's also worth noting that there have been some serious bugs
> associated with a lack of explicit synchronization in the
> non-concurrent kernel model used in RELENG_4 (and a host of other
> early UNIX systems relying on a single kernel lock).
> These have to do with unexpected blocking deep in a function call
> stack, where it's not anticipated by a developer writing source code
> higher in the stack, resulting in race conditions. In the past,
> there have been a number of exploitable security vulnerabilities due
> to races opened up in low memory conditions, during paging, etc.

I think this is a general problem. I noted early on that FreeBSD was
going to have issues in this regard once there was real kernel
threading support. The KSE model is much less prone to triggering
the race conditions, but any model where the same process can be in
the kernel multiple times is problematic, particularly since the BSD
code doesn't really support the concept of cancelling a blocking
system call in progress (you don't have a means of doing the wakeup
to fail the tsleep()s with a recognizable error code that could mean
"cancellation: back out your state").

It's likely that libthr could demonstrate the issues much more
quickly, so at least in that respect it's a good test jig for needing
to deal with these issues sooner rather than later.

Really, I think the BSD code needs to be refactored. The problem is
an inability to back out state once you are down one or more calls in
the stack. The Mach code is much better in this regard: all the
locking, if multiple data objects need to be locked, has to happen at
the same level in the call graph. That's some serious hacking that
has to happen.

> One solution I was exploring was using the compiler to help track
> the potential for functions to block, similar to the const
> qualifier, combined with blocking/non-blocking assertions evaluated
> at compile-time. However, some of our current APIs (M_NOWAIT,
> M_WAITOK, et al) make that approach somewhat difficult to apply, and
> would have to be revised to use a compiler solution. These
> potential weaknesses very much exist in an explicit model, but with
> explicit locking, we have a clearer notion of how to express
> assertions.
Absolutely a great idea! Nothing beats proper tools; tools don't
forget to look at some functions, etc.

If you are getting into the asserts (which I think is a great idea),
then you will want to change all the functions where you have them to
single-entry, single-exit (more code refactoring). I tried to do
this with some of the FS patches I did back in 1995/1996, but the
changes were seen as gratuitous. Maybe now it's an idea whose time
has come. This will likely involve some additional complexity (i.e.
use of negative logic), but the payoff would definitely be worth it
(and you'll get better performance on top of that, if you run on a
processor supporting branch prediction, or your compiler supports
tail-call optimization -- and GCC does).

> In -CURRENT, we make use of thread-based serialization in a number
> of places to avoid explicit synchronization costs (such as in GEOM
> for processing work queues), and we should make more use of this
> practice. I'm particularly interested in the use of interface
> interrupt threads performing direct dispatch as a means to maintain
> interface ordering of packets coming in network interfaces while
> allowing parallelism in network processing (you'll find this in use
> in Sam's netperf branch currently).

I don't know if I agree that this type of serialization is the right
thing to do; it basically implies code locks (again), which leaves
you in the same hole as the BGL.

Personally, I'd like to see more reference counting of objects
(effectively, a read lock, as you noted above). I don't think you
can safely discard an object just because you have thread
serialization. For example, I think it's very possible to shoot
your foot off in FreeBSD right now with a simple multithreaded
program that forces a close race against a blocking system call
(e.g. a read on a socket).
This would be a much more frequent occurrence on an MP box, and as
CPU affinity and negaffinity of threads within a single process (to
max out concurrency) go in, the problem will go critical all by
itself, and become totally unlivable for the majority of FreeBSD
users (it doesn't help that most Intel CPUs these days are
Hyperthreaded, so even if you are on a UP system, you're on an MP
system, which is just going to exacerbate the problem).

Unfortunately, there's a lot of threaded code that depends on the Sun
behaviour (full cancellation support; Sun already refactored their
code back in 1993/1994). Java code in particular tends to do this a
lot, and expects an exception to get thrown.

Java is evil.

-- Terry