From owner-freebsd-hackers Fri Dec 6 16:45:30 1996
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.4/8.8.4) id QAA21033 for hackers-outgoing; Fri, 6 Dec 1996 16:45:30 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.8.4/8.8.4) with SMTP id QAA21028 for ; Fri, 6 Dec 1996 16:45:28 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id RAA24721; Fri, 6 Dec 1996 17:25:20 -0700
From: Terry Lambert
Message-Id: <199612070025.RAA24721@phaeton.artisoft.com>
Subject: Re: clone()/rfork()/threads (Re: Inferno for FreeBSD)
To: bakul@plexuscom.com (Bakul Shah)
Date: Fri, 6 Dec 1996 17:25:19 -0700 (MST)
Cc: terry@lambert.org, freebsd-hackers@freebsd.org
In-Reply-To: <199612062111.QAA22343@chai.plexuscom.com> from "Bakul Shah" at Dec 6, 96 04:11:51 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> > It will, however, be unfair to threads within the group.
>
> I guess we are being `fair' to different things!
>
> If you treat all the threads/processes as equal then you are right.
> But is being fair to threads always the right thing to do?

If it's not, then I guarantee you, I will probably use processes with
shared context rather than programming with threads.  It is in my best
interest as a vendor to make my application seem as fast as I can
possibly make it appear, even if it means starving out processes which
are not my application.

> Also, if you allow different priority levels, realtime/best-effort
> scheduling, etc., then you are already abandoning the `all processes
> are equal' model (or, if you wish, `refining' it).

The RT guarantee on the timer events for process B, even though A has
execution contexts which it is prepared to run and quantum remaining,
is an issue of timer granularity vs. quantum granularity.  Certain
system calls make guarantees that they can't really live up to without
it.

> Equal fairness to all processes made sense in a '70s style
> ``multi-user'' system where each user ran one process (and to be
> fair to your users you had to have fair scheduling of processes).
> But now that is no longer the common case.  Processes running on
> your PC are all running to serve _you_.  Processes running on a
> server are perhaps closer to the multi-user case, but even then the
> one-to-one correspondence between clients and processes is not a
> given.  In fact different people may want to choose different
> policies for their FTP/Web/Database servers (if running on the same
> machine).  Even for an FTP server, chances are, you want a bigger
> slice of processing for your named clients compared to anonymous
> clients.  Within a given class all clients are treated equally, but
> none of the above schemes can be implemented with an
> equal-fairness-to-all-threads scheduler.

It can if the program can decide fairness in the context of which
threads it chooses to schedule on the next available quantum.  The
fairness of "how much quantum does the program have to play with" is a
separate issue (as it should be).  I may be allowed 5 out of 160
quanta, but they are not allocated to me contiguously.  That's the
point.

> One needs a very different concept of fairness and it even
> means different things in different contexts.

Yes.  Separate "awarded quantum" from "amount of context utilization
of a single quantum, per context".
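To make the distinction concrete, here is a minimal sketch of the kind
of separation I mean.  The names and structures are invented for
illustration (this is not code from any tree): the kernel decides how
many quanta the process is awarded, and the process decides which of
its own contexts consumes each quantum it receives.

/*
 * Invented illustration: per-process selection of which context gets
 * the quantum the kernel just awarded.
 */
#include <stddef.h>

struct uthread {
        struct uthread  *next;          /* circular run queue link */
        int              runnable;
        void            (*run)(void);   /* run until the context blocks or yields */
};

static struct uthread *runq;            /* current position in the run queue */

/* Called once per quantum the kernel awards to this process. */
void
consume_quantum(void)
{
        struct uthread *t, *start;

        if (runq == NULL)
                return;                 /* no contexts at all */

        t = start = runq;
        do {
                if (t->runnable) {
                        runq = t->next; /* rotate: round-robin fairness */
                        t->run();       /* in-process policy decides the rest */
                        return;
                }
                t = t->next;
        } while (t != start);

        /* No runnable context: give the remaining quantum back to the system. */
}

How much quantum the process gets in the first place is still the
kernel's call; nothing in the process can vote itself a bigger share.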
> An example.  If some application is implemented in two different
> ways, once as a single process (with purely user level threads) and
> once as a multi-process (using kernel level threads), then with
> traditional scheduling the second implementation will get many more
> processor time quanta than the first one.  This is not fair to the
> users of the first implementation.  You'd likely say: "Tough luck!
> That is how things work".  But IMHO an implementation choice should
> not have such a drastic side-effect.

I agree.  I would further separate the "multiple kernel thread" case
from the "multiple process" case.  The two are only apparently the
same if you are measuring "competition for real quantum".  For a
threaded process, the real measure is "competition for effective
quantum".

A "perfect" scheduler in user space would have a quantum "goal" of
"one quantum award per blockable context, with remaining quantum
returned to the system".  This means that if I have 10 threads, and I
can run two contexts until they block on a given real quantum, then in
effect I should have 5 real quanta assigned to the process (thread
group), for a net total of 10 "effective quanta": the same as I would
have if there were one unthreaded process competing for quanta
normally.

> One should be able to use threads as just another `software
> structuring' choice and have separate control over how the CPU is
> used.

Yes.  I would not go so far as to allow a user to say "I need to
compete as if I were 1000 processes".  The user thread abstraction
could be implemented in the kernel at the trap entry interface; the
scheduler would not be under direct user control unless the user had
special privileges granted (just like the timer calls would not make
RT guarantees for better than quantum resolution without RT privs).
Effectively, this means implementing prioritized round-robin in user
space, using a "Yield"-style interface to implement user space thread
context switching.

> At any rate, I don't think one can design a scheduler that will be
> fair under all circumstances.  The best one can do is provide
> mechanisms or hooks to implement a wide variety of fairness
> policies, with some help from the user.  IMHO being fair to threads
> does not serve that purpose.

The *only* point in being fair to threads is to not discourage their
use.

The USL folks (New Jersey) wanted the NetWare for UNIX server on
UnixWare to use their threading abstraction.  We (the NWU architecture
team) refused because of the fairness issue.  Writing a new scheduling
class to work around brain-damage in their architecture was not really
a viable option.  Instead, we went to shared user space context and
multiple processes.

Since the processes were anonymous (identical) work-to-do model
engines, we could better overcome the process context switch problem
by using FILO scheduling of read returns at the incoming NCP packet
mux (my design: "hot engine scheduling").  Basically, the last guy to
make a call was likely to have his mappings currently active, so the
call was not blocked, and the first entry was dequeued.  You replace
"send/recv" with an ioctl that combines the two operations: it does a
send and blocks for an incoming NCP, which could be handled by any
service engine.
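For illustration only, the shape of the thing was roughly as follows.
The NCP_SENDRECV request code, struct ncp_xfer, and struct engine are
invented names (the real interfaces were internal to NWU); the point
is the combined send-and-wait call and the last-in, first-out wakeup
of the blocked engines.

/*
 * Engine side: the work-to-do loop, one combined syscall per request.
 */
#include <stddef.h>
#include <sys/ioctl.h>

#define NCP_SENDRECV    0x4e43          /* invented request code */

struct ncp_xfer {
        void    *reply;                 /* outgoing reply for the last request */
        void    *request;               /* filled in with the next incoming NCP */
};

void
engine_main(int mux_fd)
{
        struct ncp_xfer xfer = { NULL, NULL };

        for (;;) {
                /* ... build the reply for the previous request ... */
                ioctl(mux_fd, NCP_SENDRECV, &xfer);     /* send + block for work */
                /* ... service xfer.request ... */
        }
}

/*
 * Mux side: blocked engines go on a stack, so the engine woken for the
 * next request is the last one to have blocked -- the "hot" engine
 * whose mappings are most likely still active.
 */
struct engine {
        struct engine   *next;          /* stack link while blocked */
        /* per-engine context: descriptor table, shared mappings, ... */
};

static struct engine *hot_stack;        /* most recently blocked engine on top */

void
engine_block(struct engine *e)
{
        e->next = hot_stack;            /* push */
        hot_stack = e;
}

struct engine *
engine_wakeup(void)
{
        struct engine *e = hot_stack;

        if (e != NULL)
                hot_stack = e->next;    /* pop the last engine to block */
        return (e);                     /* NULL: queue the request instead */
}

One trap instead of two per request, and the engine that just made the
call is the one most likely to run without blocking.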
The resulting server beat the performance of the Native NetWare server
on the exact same hardware (same disk even) by 10% or more.  This was
after the 10% hit from moving from UnixWare 1.x to 2.x, and the 15%
hit on top of that from moving from monolithic ethernet drivers to ODI
drivers.

This wasn't publicized much (but it should have been).  I can say with
confidence that not including the change to the kernel to support
sharing of descriptor tables and using shared memory for global state,
and instead using a threaded model, would have resulted in
significantly *worse* performance for the product.  The SVR4 threads
were (*are*) simply not up to the job of replacing processes.

> Back to thread groups.
>
> Note that in such a scheme there needs to be a syscall or something
> for a thread to become part of a "schedulable" thread group.
> Ideally only __cooperating__ threads would want to belong to the
> same group.

Yeah; this call is "create_thread".  8-).

> In the scenarios you presented, the N ftpds are *not* cooperating
> with each other so there'd be no point in making them belong to the
> same group.

Actually, if they are using mmap() on files to save the user->kernel
copy overhead, then sharing the context between multiple clients
pulling down the same files would be a valuable optimization.  If we
replace "ftp" with "tftp" and say the clients are a bunch of X
terminals getting boot and font files, then it's a much better
example.

Also, the "theory" is that even with a normal ftpd among
non-cooperating other ftpd's, you supposedly "save process context
switch overhead" (I think you agree with me that this is a special
case fallacy).

> But in general you should be able to use a larger quanta allocation
> for process groups you want (so I am not suggesting one quantum per
> thread group).  If you are worried about responsiveness, put
> interactive processes at a higher priority.

The problem is in transparently exposing this facility *without*
opening up denial-of-service attacks or similar holes.  I think the
trap-level cooperative user space thread scheduler is one promising
way of hiding the tuning behind a protection domain.  You can grant a
license to cross this domain to particular special cases, but the
default should be that the threaded server process competes on an
equal footing with the multiprocess server.

> > > The above idea can be extended to multi processors fairly easily.
> > > Though multi-processor schedulers that can also do realtime
> > > scheduling (and appropriately deal with priority inversion) are
> > > not easy.
> >
> > Heh.  "locking nodes in a directed acyclic graph representing a lock
> > hierarchy" will address the priority inversion handily -- assuming
>
> I was talking about scheduling realtime threads on an MP system,
> running at different priority levels but which may want to share
> objects.  When a high prio. thread has to wait because a low
> priority thread holds a lock it wants, we have priority inversion.
> I don't think fair scheduling is easy when you consider these
> issues.

If you detect a collision with a lower priority holder of the
resource, you can lend priority at that time.  Priority lending is the
classic mechanism for resolving priority-inversion-based deadlocks.
The problem is detecting the deadlock condition when it occurs;
without a graphical solution, it's a nearly insoluble problem (unless
you simply preempt the resource, and that's got other problems).
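Roughly, the lending looks like this; the structures are invented and
the sleep/wakeup and detection plumbing is omitted, since the
detection is the hard part.

/*
 * Invented sketch of priority lending: a waiter that outranks the
 * current lock holder temporarily gives the holder its priority, so
 * the holder can run and release the lock.
 */
#include <stddef.h>

struct kthread {
        int     base_pri;               /* assigned priority */
        int     cur_pri;                /* effective (possibly lent) priority */
};

struct plock {
        struct kthread  *owner;         /* NULL when the lock is free */
        int              locked;
};

/* Lower numeric value == higher priority, as in the traditional scheduler. */
void
plock_acquire(struct plock *lp, struct kthread *self)
{
        while (lp->locked) {
                if (lp->owner != NULL && lp->owner->cur_pri > self->cur_pri)
                        lp->owner->cur_pri = self->cur_pri;     /* lend priority */
                /* ... sleep until the owner releases the lock ... */
        }
        lp->locked = 1;
        lp->owner = self;
}

void
plock_release(struct plock *lp, struct kthread *self)
{
        lp->owner = NULL;
        lp->locked = 0;
        self->cur_pri = self->base_pri; /* drop any lent priority */
        /* ... wake up waiters ... */
}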
> Anyway, I don't see how computing lock hierarchy helps solve the
> priority inversion problem.  Priority inheritance is one known
> solution.  `Non-blocking synchronization' (NBS) is another way of
> solving the priority inversion (as well as a host of other
> problems).

If you lock resources when you grab them and unlock them when you
release them, then priority inversion becomes detectable.  What you do
about it is only meaningful after it has been detected.  Doing the
detection is the hard part.

> BTW, you may wish to browse the NBS thread (a different kind of
> thread) on comp.os.research.  NBS may be something worth using
> (instead of mutexes and locks) in the MP version of FreeBSD.  Also
> read Greenwald and Cheriton's "The Synergy Between Non-blocking
> Synchronization and Operating System Structure" (accessible via
> http://www-dsg.stanford.edu/Publications.html)

Thanks for the references; I will look them up...


                                        Regards,
                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.