Date:      Thu, 1 Jul 1999 01:00:36 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        bright@rush.net (Alfred Perlstein)
Cc:        smp@freebsd.org, tlambert@primenet.com
Subject:   Re: async call gates
Message-ID:  <199907010100.SAA13352@usr09.primenet.com>
In-Reply-To: <Pine.BSF.3.96.990630102649.14320b-100000@cygnus.rush.net> from "Alfred Perlstein" at Jun 30, 99 10:47:54 am

> Mr Lambert, I noticed you've been pushing for async call gate functionality
> in FreeBSD.  I have several assumptions about how this would work that
> make it expensive to accomplish; perhaps you or someone on this list
> can clarify by explanation and/or a URL to some paper or a book reference.

It's my design, from around 1993.  I wanted to think of CPU's as
resources which crawled over code, like spiders over a web.  It
steals liberally from the VMS idea of asynchronous system traps.


Peter's explanation is pretty much spot-on.  The assumption that the
call contexts (you called them 'threads') have to exist for the calls
to be made in the first place is a wrong turn.



At the simplest level, there are three kinds of system calls:

1)	Calls that will never block
2)	Calls that will always block
3)	Calls that may or may not block

It would be trivial to add a flag to the sysent[] structure for
each system call to indicate what category it was in, if you had
a mind to optimize behaviour for type #1 calls.
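
Purely as a sketch of that idea (the field and macro names below are
invented, and the real sysent[] layout differs between FreeBSD
versions), the flag might look like this:

/*
 * Hypothetical blocking-category flag for each sysent[] entry.
 * SYF_* and sy_blockflag are illustrative names only.
 */
#define SYF_NEVER_BLOCKS	0x01	/* type 1: e.g. getpid() */
#define SYF_ALWAYS_BLOCKS	0x02	/* type 2: e.g. wait4() on live children */
#define SYF_MAY_BLOCK		0x04	/* type 3: e.g. read() */

typedef int	sy_call_t(void *args);	/* simplified call signature */

struct sysent_sketch {
	int		 sy_narg;	/* number of arguments */
	sy_call_t	*sy_call;	/* implementing function */
	int		 sy_blockflag;	/* one of the SYF_* values above */
};

/*
 * A dispatcher that wanted the type #1 fast path could then complete
 * "never blocks" calls inline instead of queueing them:
 *
 *	if (table[code].sy_blockflag & SYF_NEVER_BLOCKS)
 *		return ((*table[code].sy_call)(args));
 */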


Now for each system call, there is some context:

1)	The calling process' address space
2)	Certain aspects of the calling process, unrelated to the
	address space, per se.  This is a subset of the contents
	of the proc structure: credentials, PID, etc.
3)	A kernel stack
4)	A list of system resources, accumulated as they are held


This arrangement is symmetric between the kernel and user space,
as we will see below.
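
As a rough sketch of what that per-call context could look like as a
structure (the structure and field names are invented; struct vmspace
and struct ucred are the existing kernel types that would cover items
1 and 2):

#include <sys/types.h>			/* pid_t */

struct vmspace;				/* existing kernel type: item 1 */
struct ucred;				/* existing kernel type: item 2 */
struct held_resource;			/* hypothetical record: item 4 */

struct async_call_ctx {
	struct vmspace		*acc_vm;	/* 1) caller's address space */
	struct ucred		*acc_cred;	/* 2) credentials ... */
	pid_t			 acc_pid;	/*    ... PID, etc. */
	void			*acc_kstack;	/* 3) a kernel stack for this call */
	struct held_resource	*acc_resources;	/* 4) resources held so far */
};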


Now for the process model in user space.

Let's assume that all existing system calls which are of type "will
always block" or of type "may or may not block" can now be executed
asynchronously, and that this is not an attribute of a particular
system call, but is an attribute of the call mechanism itself.  We
might also place in this category some of the "will never block"
calls, if they take "a long time" to complete.

This, in a nutshell, is the "async call gate".
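
To make that concrete, here is one possible shape for the user-visible
interface.  None of this exists anywhere; acg_call(), acg_wait() and
struct acg_token are names invented purely for illustration:

#include <sys/types.h>
#include <sys/syscall.h>		/* SYS_open, SYS_read, ... */
#include <fcntl.h>			/* O_RDONLY, for the example below */

/*
 * Hypothetical async call gate interface: any system call is pushed
 * through the gate, the kernel queues the work and returns at once,
 * and completion is reported through the token.
 */
struct acg_token {
	volatile int	done;		/* set by the kernel on completion */
	int		error;		/* errno value, valid once done */
	long		retval;		/* system call return value */
};

int	acg_call(int sysnum, struct acg_token *tok, ...);
int	acg_wait(struct acg_token **completed);	/* sleep for any token */

/*
 * Example: an asynchronous open(), one of the "takes a long time"
 * operations discussed below:
 *
 *	struct acg_token t;
 *
 *	acg_call(SYS_open, &t, "/some/path", O_RDONLY);
 *	... do other useful work ...
 *	while (!t.done)
 *		acg_wait(NULL);
 */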


This provides us with two interesting concurrency models which we
can use in user space:

1)	We can have a process with multiple outstanding
	asynchronous system calls pending simultaneously.

	This is the "aio" model, except we are not limited
	to the POSIX subset "aioread" and "aiowrite".  We
	can, instead, take "takes a long time" operations,
	such as "open", "sync", "getdents", "connect",
	"syslog", and so on, and also make *them* asynchronously.

2)	We can implement on top of the asynchronous (non-blocking)
	calls, a set of primitives that emulate blocking calls.

	This emulation is done by making an async call and then
	putting the caller to sleep until the async call returns.
	Rather than giving up our quantum, however, we change to
	another set of registers, another program counter, and
	another stack.

	This implementation is called "user space threading using
	a call conversion scheduler".
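
A sketch of that blocking emulation, again assuming the invented
acg_call()/acg_token interface above, plus a user-space threads
package whose struct uthread, current and pick_runnable() are equally
hypothetical:

/*
 * "Call conversion": emulate a blocking read() with an async call
 * plus a user-space context switch.
 */
#include <errno.h>
#include <ucontext.h>
#include <sys/syscall.h>
#include <sys/types.h>

struct acg_token {			/* as in the earlier sketch */
	volatile int	done;
	int		error;
	long		retval;
};
int	acg_call(int sysnum, struct acg_token *tok, ...);

struct uthread {
	ucontext_t	 ctx;		/* registers, PC, stack */
	struct acg_token tok;		/* outstanding async call, if any */
	int		 runnable;
};

extern struct uthread	*current;
extern struct uthread	*pick_runnable(void);	/* scheduler policy, assumed */

ssize_t
uthread_read(int fd, void *buf, size_t nbytes)
{
	struct uthread *self = current;
	struct uthread *next;

	acg_call(SYS_read, &self->tok, fd, buf, nbytes);
	self->runnable = 0;			/* "sleep" at user level only */

	next = pick_runnable();			/* rather than giving up the quantum, */
	current = next;				/* switch registers, PC and stack to   */
	swapcontext(&self->ctx, &next->ctx);	/* another ready-to-run thread        */

	/* The scheduler resumes us here once our token completes. */
	errno = self->tok.error;
	return ((ssize_t)self->tok.retval);
}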

The advantage of this threads model over a kernel threads model
is that, so long as we have work pending, we can utilize the
full measure of our CPU quantum without taking a full context
switch overhead.  This was the original promise that the threads
model made to us when it first appeared, and then reneged upon
when threads moved into the kernel.

The main problem with kernel threads is that kernel threads are
not really threads, per se; they are KSE's -- Kernel Schedulable
Entities.  We call them KSE's because the kernel makes no distinction
between the work-to-do for a thread and that of a traditional
process.  The result
is that kernel threads compete for quantum as processes, both
with other threads in the same thread group ("multithreaded
process"), and with other threads in other thread groups (a
thread group with but a single member is a traditional UNIX
process).

The main benefit of kernel threads is SMP scalability for programmer
parallelized processes: processes which have been intentionally
multithreaded (as opposed to using asynchronous calls) in order
to increase their concurrency.

There is really no benefit (and in fact, a large amount of cache
busting and context switch overhead) to using kernel threads as
opposed to asynchronous calls in order to achieve SMP scaling.
It is a lot of work to try to get the scheduler to intentionally
minimize address space changes.  You can do opportunistic state
change avoidance when you are switching between one thread in
a process group and another thread in the same process group,
but that really buys you very little.  More complex solutions
lead to starvation deadlocks and other bad behaviour.

This leaves one area where kernel threads still have better
SMP scalability: when most of the cycles are spent in user
space code.


Right now, in SMP FreeBSD, each CPU can be in user space at the
same time; given the Big Giant Lock(tm), however, multiple CPU's
can only be in the kernel simultaneously under some rigidly
controlled circumstances.  BSDI gets around the reentrancy
issue for interrupts by moving the interrupt handling to kernel
threads; that's one way to do it, but it has some real bad
behaviour compared to "top" and "bottom" drivers, like in NT
and Solaris' SMP implementations.

What kernel threads promise, and what makes people willing to put
up with their obvious drawbacks, is the ability for multiple
CPU's to be in user space _in the same process_ at the same time.

This really impresses some people, when it probably shouldn't.

This seems to really be a feat of legerdemain; in fact, it's not
that hard, and you don't need kernel threads to get there, only
the ability to queue scheduler reservations for user space code
already in a ready-to-run state.  This is best accomplished with
a user space hook into a scheduler activation mechanism.  We can
see this rather handily by asking ourselves "What is a user space
process?"; the answer is:

1)	A set of registers, including:

	o	A stack
	o	A program counter

2)	An address space

In other words, the same mechanism that is used by the call
conversion to switch between one "blocked" user space thread and
another "ready to run" user space thread can be used to dispatch
a second (or third or fourth or Nth) CPU into the user space
code as well.  All we are missing is a tiny bit of glue in the
user space call conversion scheduler.  This glue says:

1)	When you get a chance, "return" a quantum to user space
	to this address, which happens to be my user space
	scheduler dispatcher.

2)	I have an aversion to this request being enqueued on
	any of the ready-to-run lists for CPU's where I already
	have one of these requests enqueued, or which are
	currently running in my user space at my behest.

The first is trivial to code up: it's called "vfork".  One might
make certain optimizations based on the system call architecture
divorcing the kernel stack and other context from the process
structure itself; I would advise this.  The result is significantly
lighter weight than the current vfork, since the entirety of the
context is actually just the user space context.  The only real
"trick" is to ensure that your scheduler dispatcher has a stack
available for every one of these which may return to user space;
you wouldn't want to use a handy user space thread's stack for
this, since you want to eliminate quantum bias.
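
A sketch of that dispatcher side, with a dedicated stack per possible
upcall (names invented; struct uthread and pick_runnable() are the
same assumed user-space threads library as in the earlier sketch):

#include <stdlib.h>
#include <ucontext.h>

#define DISPATCH_STACK_SIZE	(32 * 1024)
#define MAXCPU			32

struct uthread;					/* threads library, assumed */
extern struct uthread	*pick_runnable(void);
extern void		 run_thread(struct uthread *);	/* setcontext() into it */

static ucontext_t	 dispatch_ctx[MAXCPU];
static char		*dispatch_stack[MAXCPU];

/*
 * The kernel "returns" a fresh quantum to user space by arranging for
 * the new CPU to begin execution here, on its own stack, never on a
 * user thread's stack.
 */
static void
dispatcher(void)
{
	run_thread(pick_runnable());		/* does not return */
}

void
dispatcher_init(int cpu)
{
	dispatch_stack[cpu] = malloc(DISPATCH_STACK_SIZE);
	getcontext(&dispatch_ctx[cpu]);
	dispatch_ctx[cpu].uc_stack.ss_sp = dispatch_stack[cpu];
	dispatch_ctx[cpu].uc_stack.ss_size = DISPATCH_STACK_SIZE;
	makecontext(&dispatch_ctx[cpu], dispatcher, 0);
}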

The second takes a bit more work.  A very rough implementation
is trivial, however: a 32 bit bitmask associated with the
process context, with one bit per available processor.
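
A rough cut of that bookkeeping (again, invented names): a per-CPU
bit is set while this process already has a request queued on, or is
running on, that CPU, and new requests avoid those CPU's:

#include <stdint.h>

struct upcall_state {
	uint32_t	us_cpumask;	/* CPUs this process should avoid */
};

static __inline int
upcall_cpu_avoid(struct upcall_state *us, int cpu)
{
	return ((us->us_cpumask & (1U << cpu)) != 0);
}

static __inline void
upcall_cpu_claim(struct upcall_state *us, int cpu)
{
	us->us_cpumask |= (1U << cpu);
}

static __inline void
upcall_cpu_release(struct upcall_state *us, int cpu)
{
	us->us_cpumask &= ~(1U << cpu);
}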


What's the result?  The result is a highly SMP scalable multiple
CPU utilizing user space threading system, which neatly sidesteps
the context switch address space thrashing and N:M kernel:user
space thread mapping problems.


The kernel reentrancy issues for fault/call/interrupt can be
handled separately, without a lot of overly complicated (by
kernel threads) effort.  Ideally, the BSDI approach would *NOT*
be used; it's dated.  Here's a reference from 1991:

	Lazy Task Creation: A Technique for Increasing the
	    Granularity of Parallel Programs
	E. Mohr, D.A. Kranz, and R.H. Halstead, Jr.
	IEEE Transactions on Parallel and Distributed Systems,
	    July 1991, pages 264-280

And there are other references (which I don't have immediately
at hand) showing this approach to be more than a decade and a
half behind the current state of the art.  Some people have
already referenced more recent work on the FreeBSD SMP list.
Late-binding resources is a useful technique, but it's not
the only technique, and for the particular implementation
(interrupts -- yes, I know that I'm suggesting lazy binding of
blocking contexts when I suggest an async call gate), it's
probably not the best technique.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

