From: Terry Lambert
Subject: Re: Threads goals version III
To: eischen@vigrid.com (Daniel M. Eischen)
Date: Fri, 5 Nov 1999 23:39:29 +0000 (GMT)
Cc: rcarter@chomsky.Pinyon.ORG, hasty@rah.star-gate.com, tlambert@primenet.com, julian@whistle.com, freebsd-arch@freebsd.org
In-Reply-To: <38224942.6447A8B2@vigrid.com> from "Daniel M. Eischen" at Nov 4, 99 10:04:34 pm

> > One uses pthread_*sched* routines to modify scheduling
> > attributes for individual threads.  Whether or not those threads
> > can get process or system scope, as in the spec, or just process
> > scope, that is the question.
> >
> > If I grokked Terry's first missive, then the process should first
> > set scheduling attributes via sched_setscheduler, if it needs
> > something other than SCHED_OTHER (default per-process scheduling).
> > I'm still studying whether or not this is good enough; i.e., can
> > individual threads in a process with SCHED_FIFO or SCHED_RR
> > policies meet QoS goals, and does it provide the flexibility
> > needed by an application structured as Daniel's is.
>
> I don't think it does.  I think under Terry's scheme, it wouldn't
> be possible to bind a thread to a KSE and place it in a different
> scheduler class.  It would still be possible to run [user] threads
> in different scheduler classes without being bound to a KSE, but
> this wouldn't reserve a quantum in which the threads could execute.
> So if the application on the whole was at or near its process
> limits, the threads in the real-time (or some other than the
> default) scheduler class wouldn't be able to run.
>
> You could make allowances for this situation and manage accounting
> differently for a process running under multiple scheduling classes,
> but it still seems simpler to allow (with correct privileges) system
> scope threads to be bound to KSEs with their own quantum.

OK; I think we're going too deep into implementation details at this
point in the game, but I will address the questions.
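For concreteness, the application side of what's being discussed above
is just the standard POSIX interface; a minimal sketch (assuming the
1003.1b sched_setscheduler(2) call and the pthread scheduling
attributes are available; error handling and the privilege needed for
the realtime classes are omitted) would look something like:

	#include <sched.h>
	#include <pthread.h>

	static void *
	worker(void *arg)
	{
		return (arg);
	}

	int
	main(void)
	{
		struct sched_param sp;
		pthread_attr_t attr;
		pthread_t td;

		/* Process-wide policy, via sched_setscheduler(), as above. */
		sp.sched_priority = sched_get_priority_min(SCHED_FIFO);
		sched_setscheduler(0, SCHED_FIFO, &sp);

		/* Per-thread attributes, via the pthread_*sched* routines. */
		pthread_attr_init(&attr);
		pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
		pthread_attr_setschedpolicy(&attr, SCHED_RR);
		sp.sched_priority = sched_get_priority_max(SCHED_RR);
		pthread_attr_setschedparam(&attr, &sp);

		/* Whether this can be more than process scope is the question. */
		pthread_attr_setscope(&attr, PTHREAD_SCOPE_PROCESS);

		pthread_create(&td, &attr, worker, NULL);
		pthread_join(td, NULL);
		return (0);
	}

None of this is specific to the proposal below; it is only what the
question looks like from the application's side of the fence.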
---

Ideally, one would just take the quantum one is given, and say "Oh
please sir, may I have another?" in the process's best "Oliver Twist"
imitation.  The quantum belongs to the system, and it is only through
the munificence of the system that a grovelling process should be
granted a quantum at all.

---

In the real world, I realize that some people have implemented code
that is dependent on how threads have been implemented, and that
depends, _almost_, on having kernel threads as backing objects for
user space threads, in order to achieve priority banding in what
should probably have been two applications (i.e. PTHREAD_SCOPE_SYSTEM
was a truly bad idea, and now we have to live with/through it).

Without getting into too much detail: the process's threads are
initially scheduled to run on one CPU, and the kernel scheduler code
is modified to implement gross CPU affinity, and gross load balancing
through process migration at the time a process is placed into a
per-CPU ready-to-run queue.  The sleep queues remain global, and are
not modified, while the process is given a simple bitmap and an
involuntary context switch call pointer, so that the user space
scheduler is never logically interrupted before it runs to completion.

In this model, there is very little which is actually done to the
kernel scheduler code in terms of creating specific affinity code,
and as a result, there is very little that needs to be done to address
the starvation and process group affinity issues, partial quantum
usage, etc.: all the problems that arise when you start using kernel
threads, allowing them to block on you, and trying to map N user space
threads to M kernel space threads, for random values of M and N.

One could think of running processes on this kernel, ignoring threads
for the moment, and achieving high CPU affinity and high load scaling
through decreased cache busting, as a natural result of the CPU
affinity that falls out of this model.  One could also see that things
like the SetiAtHome and RSA and DES challenge code could run much more
successfully and equitably in such an environment, and achieve a
significant improvement over the current SMP scheduling implementation.

---

Now for the SMP scalable multithreaded application...

For SMP scaling, what you are effectively doing is placing a
reservation not in a single ready-to-run queue, but in the
ready-to-run queues of multiple CPUs.  In the simplest case, this is
done incrementally, by a user space scheduler making an explicit
request to be placed in a second queue, e.g. "reserve me a quantum on
a CPU that I'm not already reserving quanta on, please".

A simple implementation of this would be:

	CREATE_NEW_USER_SCHEDULER_STACK
	CHANGE_TO_NEW_USER_SCHEDULER_STACK
	rfork(RFADDCPU)
	CHANGE_TO_PRIOR_USER_SCHEDULER_STACK

The point of this would be to:

1)	Create a new user space stack, so multiple CPUs can run in the
	user space scheduler simultaneously
2)	Set the new user space stack active
3)	Ask another CPU to return to user space on the new stack; this
	request returns immediately
4)	Continue processing on the previous user space stack, since we
	are still on the previous CPU

This would be easier with another parameter, so that we could pass the
alternate stack down with the rfork(2) call, but it's good enough for
now.
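In C, and assuming a hypothetical RFADDCPU flag (the flag, its value,
and the uts_swap_stack() helper below are invented for this sketch;
only rfork(2) itself exists today), the sequence above would come out
roughly like:

	#include <sys/types.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define	UTS_STACK_SIZE	(64 * 1024)
	#define	RFADDCPU	0x40000000	/* hypothetical flag; value invented */

	/*
	 * Hypothetical machine-dependent helper: switch the stack the
	 * user space scheduler is running on, returning the old stack
	 * pointer.  In a real implementation this is a few lines of
	 * assembly in the user space scheduler.
	 */
	extern void	*uts_swap_stack(void *new_sp);

	void
	uts_add_cpu_reservation(void)
	{
		void *stk, *old_sp;

		/* 1) Create a new user space scheduler stack. */
		stk = mmap(NULL, UTS_STACK_SIZE, PROT_READ | PROT_WRITE,
		    MAP_ANON | MAP_PRIVATE, -1, 0);

		/* 2) Set the new stack active. */
		old_sp = uts_swap_stack((char *)stk + UTS_STACK_SIZE);

		/*
		 * 3) Ask another CPU to return to user space on the new
		 *    stack; the request returns immediately on this CPU.
		 */
		rfork(RFADDCPU);

		/* 4) Continue on the previous stack, on the previous CPU. */
		uts_swap_stack(old_sp);
	}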
---

Multiple scheduler reservations, and PTHREAD_SCOPE_SYSTEM...

Can each of these scheduler reservations be at different system
scheduling priorities?  Yes.

The implementation of system scoped priorities in this situation is
now trivial: if I'm using a "realtime" system scoped priority
scheduler reservation, then I, as the user space scheduler, only
choose threads to run based on their membership in the list of
threads for which that priority is important.

I know that something is at a particular priority, because I know the
scheduler stack with which the priority was associated: it's the
currently running stack from the return from the kernel into the
scheduler code in user space.  If all of these threads are blocked on
events, then I go to sleep via an explicit yield, confident that when
the events unblock, they will do so at the high priority and on the
scheduling context on which the call was made.  So when I instantiate
the high system priority scheduler reservation, I may have to call an
explicit yield before I can start running a thread at that priority.

The binding of scheduler reservations (kernel call contexts, including
a kernel stack -- not something as heavyweight as the current kernel
threads implementation) in specific system scheduling classes to user
space threads in a given process... well, that's essentially a
process-level policy decision, to be made by any process that has a
high enough privilege to request such a high priority reservation.
This is really a very minimal spreading of the trust model, since you
have to trust the whole process to create all its threads at the right
priority, if you trust it to create one at a high (e.g. "realtime")
priority.

As far as "binding" goes, you end up with high and low priority quanta
coming back into user space on different user space scheduler stacks
-- multiple CPUs/scheduling classes into user space simultaneously --
and what user space thread the scheduler chooses to run as a result is
really up to it and the policy set by the programmer.

Clearly, you could abuse this, and just as clearly, certain scheduling
classes, like "realtime", cut across CPU migration boundaries, so I
would expect PTHREAD_SCOPE_SYSTEM to be a last resort for almost all
well-behaved programs.  Failure to treat it that way would pretty
surely cause excessive migration of the "realtime" scheduler
reservation between CPUs.

This is more detail than I wanted to get into outside the scope of a
coherent document (it's nearing the size of a white paper right now
8-(...), but since I'm going to be at an IBM UML class with Archie and
Julian all next week, and studying the book this weekend, I didn't
want to leave things hanging.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.