Date:      Thu, 19 Oct 1995 18:12:35 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        leisner@sdsp.mc.xerox.com (Marty Leisner)
Cc:        julian@ref.tfs.com, cimaxp1!jb@werple.net.au, leisner@sdsp.mc.xerox.com, hackers@FreeBSD.ORG, jb@cimlogic.com.au
Subject:   Re: NetBSD/FreeBSD (pthreads)
Message-ID:  <199510200112.SAA03776@phaeton.artisoft.com>
In-Reply-To: <9510192300.AA03362@willow.mc.xerox.com> from "Marty Leisner" at Oct 19, 95 03:58:57 pm

> In message <199510192246.PAA23918@ref.tfs.com>,   you write:
> >> I'm curious about why you *need* kernel threads.
> >
> >usually it's for several blocking IO streams..
> 
> pthreads handles this...

No it doesn't.

> But it does this via (I think...) all I/O being nonblocking, and if
> it would block it then run another thread....

This is how it runs, period: It changes blocking operations into
non-blocking operations plus a context switch.

If you get a real blocking operation (ie: one outside the model), then
all threads are blocked, because the context switcher can't run unless
it can convert the call.  fstatfs/statfs on a remote but down server is
one example (it's not a select()'able operation).
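
To make the conversion concrete, here's a minimal sketch (in C) of what
a user space threads library does to a read().  The
yield_until_readable() scheduler hook is hypothetical -- it stands in
for the library internals that park the current thread, select() on
behalf of all parked threads, and run another thread meanwhile:

	#include <errno.h>
	#include <fcntl.h>
	#include <unistd.h>

	extern void yield_until_readable(int fd);	/* hypothetical hook */

	ssize_t
	threaded_read(int fd, void *buf, size_t len)
	{
		ssize_t n;

		/* force the fd non-blocking */
		fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
		for (;;) {
			n = read(fd, buf, len);
			if (n >= 0 || errno != EAGAIN)
				return (n);
			yield_until_readable(fd);	/* context switch */
		}
	}

A statfs() on a dead server never shows up as readable in a select()
set, so there is nothing for yield_until_readable() to wait on -- which
is why such a call stops the whole process.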

> I haven't looked at the code for a while, but kernel threads would be
> much more efficient (if a task blocks on I/O, it blocks inside
> the kernel, rather then muxing on select...)

It would be more efficient for actual blocking I/O.  The easy way to
hack this is an exec loader and an alternate call gate, with rewritten
trap code.  If one thread entered the kernel on another thread's
blocked resource (ie: two I/Os pending on the same fd), then you'd be
screwed.

> I agree you can implement user-space pthreads efficiently if 
> multiple processor are compute bound...but this is not
> why you have threads...

Right.

> Also, each kernel call will need context switches, which slow things
> down...

The context switch will occur, period.  It makes little real difference
given the current context switch overhead whether it happens in the
kernel or in user space (except if it's in the kernel, you get more
time quanta to play with, which is good if it's your process and bad
if it's someone else's process.  8-)).

There are certain things we can do in delaying full context switching,
like lazy switching of FPU state, etc., but that'd benefit regular
context switching as well.

The main thing that's avoided is the page table swap and PTE
invalidation, which can be avoided in the common case by using async
I/O to do your threading.

The ideal for a threaded app is:

1)	Never blocked waiting for processor resource while
	quanta remains (requires kernel threads and a 1:1
	correspondence between kernel and user space threads
	in case all user space threads make blocking calls).

2)	Never voluntarily context switch (requires the use
	of async operations for all blocking operations, and
	a loose binding between kernel/user threads -- in
	other words, the thread scheduler will have to live
	in user space and have kernel space controls).
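
As one way to picture (2): POSIX.1b async I/O (which postdates this
message, and is used here purely as illustration) expresses the "async
operations for all blocking operations" half, with a hypothetical
run_other_threads() standing in for the user space scheduler:

	#include <aio.h>
	#include <errno.h>
	#include <string.h>
	#include <unistd.h>

	extern int run_other_threads(void);	/* hypothetical; 0 = none runnable */

	ssize_t
	async_read(int fd, void *buf, size_t len)
	{
		struct aiocb cb;
		const struct aiocb *list[1] = { &cb };

		memset(&cb, 0, sizeof(cb));
		cb.aio_fildes = fd;
		cb.aio_buf = buf;
		cb.aio_nbytes = len;
		aio_read(&cb);		/* issue; never blocks the process */

		/* only sleep in the kernel when no thread is runnable */
		while (aio_error(&cb) == EINPROGRESS)
			if (!run_other_threads())
				aio_suspend(list, 1, NULL);
		return (aio_return(&cb));
	}

The point being that the process only voluntarily gives up the CPU
(aio_suspend) when it has nothing left to run.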

The user space threads buy you the ability to completely consume
your process quantum, as long as you convert blocking calls into
non-blocking calls plus a context switch.

The kernel space threads buy you the ability to compete for quanta
with other processes as if you were actually multiple processes, as
well as buying you the ability to not context switch all user space
threads when a blocking operation forces a voluntary context switch
of one kernel thread.
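
In later POSIX threads terms (the standard was still in draft when this
was written), that split is roughly what contention scope selects; a
sketch, assuming a worker() thread body:

	#include <pthread.h>

	static void *
	worker(void *arg)
	{
		/* ... thread body ... */
		return (NULL);
	}

	int
	start_thread(void)
	{
		pthread_attr_t attr;
		pthread_t tid;

		pthread_attr_init(&attr);
		/* SCOPE_SYSTEM: compete for quanta against everything on
		 * the system (kernel scheduled); SCOPE_PROCESS: compete
		 * only within this process's quantum (library scheduled).
		 */
		pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
		return (pthread_create(&tid, &attr, worker, NULL));
	}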

Page table entry and FPU state caching are not entirely effective
with multiple kernel threads unless you manage both the association
in the scheduling queue (to ensure one kernel thread follows the
other, causing a PTE "cache hit") and, in the SMP case, processor
preference binding (since PTE and FPU state are bound to a particular
processor).

There is very little difference, other than page table entry management,
between "the ideal situation" and multiple processes with the ability
to share heap and open descriptor tables.
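
FreeBSD later grew rfork(2) and its rfork_thread(3) wrapper (both
postdating this message) to express exactly that process model; a
sketch:

	#include <err.h>
	#include <stdlib.h>
	#include <unistd.h>

	static int
	child_main(void *arg)
	{
		/* runs with the parent's heap and open descriptor table */
		return (0);
	}

	int
	spawn_shared(void)
	{
		char *stack;
		pid_t pid;

		/* RFMEM shares the whole address space, so the child needs
		 * its own stack; rfork_thread(3) arranges that.  Omitting
		 * RFFDG leaves the descriptor table shared as well.
		 */
		if ((stack = malloc(65536)) == NULL)
			err(1, "malloc");
		pid = rfork_thread(RFPROC | RFMEM, stack + 65536,
		    child_main, NULL);
		if (pid == -1)
			err(1, "rfork_thread");
		return (pid);
	}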

This doesn't translate to "ideal" for other processes on the system,
since the average time between when they get the processor will be
increased by each threaded "process" taking as close to its full
quanta as the mapping from synchronous to asynchronous calls + context
switch will allow.

NetWare for UNIX actually uses the multiple process/shared context
model (the SVR4 model does not have the ability to context switch
in user space on async I/O with kernel thread management).


For a machine dedicated to providing a small set of specific services,
the kernel threading model is actually inferior to the user space
threading (async I/O) model.

The shared context/kernel thread model, on the other hand, scales well
to SMP, whereas user space threading will not benefit from multiple
processors.  SMP scaling is, of course, dependent on better than
low-grain parallelism... it requires kernel multiple entrancy (a
reentrant kernel) to be effectively utilized for scaling parallelizable
tasks.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


