From: Terry Lambert
Subject: Re: Threads goals version III
To: eischen@vigrid.com (Daniel M. Eischen)
Date: Fri, 5 Nov 1999 23:39:29 +0000 (GMT)
Cc: rcarter@chomsky.Pinyon.ORG, hasty@rah.star-gate.com, tlambert@primenet.com, julian@whistle.com, freebsd-arch@freebsd.org
In-Reply-To: <38224942.6447A8B2@vigrid.com> from "Daniel M. Eischen" at Nov 4, 99 10:04:34 pm

> > One uses pthread_*sched* routines to modify scheduling
> > attributes for individual threads.  Whether or not those threads
> > can get process or system scope, as in the spec, or just process
> > scope, that is the question.
> >
> > If I grokked Terry's first missive, then the process should first
> > set scheduling attributes via sched_setscheduler, if it needs
> > something other than SCHED_OTHER (default per-process scheduling).
> > I'm still studying whether or not this is good enough; i.e., can
> > individual threads in a process with SCHED_FIFO or SCHED_RR
> > policies meet QoS goals, and does it provide the flexibility
> > needed by an application structured as Daniel's is.
>
> I don't think it does.  I think under Terry's scheme, it wouldn't
> be possible to bind a thread to a KSE and place it in a different
> scheduler class.  It would still be possible to run [user] threads
> in different scheduler classes without being bound to a KSE, but
> this wouldn't reserve a quantum in which the threads could execute.
> So if the application on the whole was at or near its process
> limits, the threads in the real-time (or some other than the
> default) scheduler class wouldn't be able to run.
>
> You could make allowances for this situation and manage accounting
> differently for a process running under multiple scheduling classes,
> but it still seems simpler to allow (with correct privileges) system
> scope threads to be bound to KSEs with their own quantum.

OK; I think we're going too deep into implementation details at this
point in the game, but I will address the questions.
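For concreteness, the application side of what's being discussed above
is just the standard POSIX interface; a minimal sketch (assuming the
1003.1b sched_setscheduler(2) call and the pthread scheduling
attributes are available; error handling and the privilege needed for
the realtime classes are omitted) would look something like:

	#include <sched.h>
	#include <pthread.h>

	static void *
	worker(void *arg)
	{
		return (arg);
	}

	int
	main(void)
	{
		struct sched_param sp;
		pthread_attr_t attr;
		pthread_t td;

		/* Process-wide policy, via sched_setscheduler(), as above. */
		sp.sched_priority = sched_get_priority_min(SCHED_FIFO);
		sched_setscheduler(0, SCHED_FIFO, &sp);

		/* Per-thread attributes, via the pthread_*sched* routines. */
		pthread_attr_init(&attr);
		pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
		pthread_attr_setschedpolicy(&attr, SCHED_RR);
		sp.sched_priority = sched_get_priority_max(SCHED_RR);
		pthread_attr_setschedparam(&attr, &sp);

		/* Whether this can be more than process scope is the question. */
		pthread_attr_setscope(&attr, PTHREAD_SCOPE_PROCESS);

		pthread_create(&td, &attr, worker, NULL);
		pthread_join(td, NULL);
		return (0);
	}

None of this is specific to the proposal below; it is only what the
question looks like from the application's side of the fence.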
---

Ideally, one would just take the quantum one is given, and say "Oh
please sir, may I have another?" in the process's best "Oliver Twist"
imitation.  The quantum belongs to the system, and it is only through
the munificence of the system that a grovelling process should be
granted a quantum at all.

---

In the real world, I realize that some people have implemented code
that is dependent on how threads have been implemented, and that
depends, _almost_, on having kernel threads as backing objects for
user space threads, in order to achieve priority banding in what
should probably have been two applications (i.e. PTHREAD_SCOPE_SYSTEM
was a truly bad idea, and now we have to live with/through it).

Without getting into too much detail: the process's threads are
initially scheduled to run on one CPU, and the kernel scheduler code
is modified to implement gross CPU affinity, and gross load balancing
through process migration at the time a process is placed into a
per-CPU ready-to-run queue.  The sleep queues remain global, and are
not modified, while the process is given a simple bitmap and an
involuntary context switch call pointer, so that the user space
scheduler is never logically interrupted before it runs to completion.

In this model, there is very little which is actually done to the
kernel scheduler code in terms of creating specific affinity code,
and as a result, there is very little that needs to be done to address
the starvation and process group affinity issues, partial quantum
usage, etc.: all the problems that arise when you start using kernel
threads, allowing them to block on you, and trying to map N user space
threads to M kernel space threads, for random values of M and N.

One could think of running processes on this kernel, ignoring threads
for the moment, and achieving high CPU affinity and high load scaling
through decreased cache busting, as a natural result of the CPU
affinity that falls out of this model.  One could also see that things
like the SetiAtHome and RSA and DES challenge code could run much more
successfully and equitably in such an environment, and achieve a
significant improvement over the current SMP scheduling implementation.

---

Now for the SMP scalable multithreaded application...

For SMP scaling, what you are effectively doing is placing a
reservation not in a single ready-to-run queue, but in the
ready-to-run queues of multiple CPUs.  In the simplest case, this is
done incrementally, by a user space scheduler making an explicit
request to be placed in a second queue, e.g. "reserve me a quantum on
a CPU that I'm not already reserving quanta on, please".

A simple implementation of this would be:

	CREATE_NEW_USER_SCHEDULER_STACK
	CHANGE_TO_NEW_USER_SCHEDULER_STACK
	rfork(RFADDCPU)
	CHANGE_TO_PRIOR_USER_SCHEDULER_STACK

The point of this would be to:

1)	Create a new user space stack, so multiple CPUs can run in the
	user space scheduler simultaneously
2)	Set the new user space stack active
3)	Ask another CPU to return to user space on the new stack; this
	request returns immediately
4)	Continue processing on the previous user space stack, since we
	are still on the previous CPU

This would be easier with another parameter, so that we could pass the
alternate stack down with the rfork(2) call, but it's good enough for
now.
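In C, and assuming a hypothetical RFADDCPU flag (the flag, its value,
and the uts_swap_stack() helper below are invented for this sketch;
only rfork(2) itself exists today), the sequence above would come out
roughly like:

	#include <sys/types.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define	UTS_STACK_SIZE	(64 * 1024)
	#define	RFADDCPU	0x40000000	/* hypothetical flag; value invented */

	/*
	 * Hypothetical machine-dependent helper: switch the stack the
	 * user space scheduler is running on, returning the old stack
	 * pointer.  In a real implementation this is a few lines of
	 * assembly in the user space scheduler.
	 */
	extern void	*uts_swap_stack(void *new_sp);

	void
	uts_add_cpu_reservation(void)
	{
		void *stk, *old_sp;

		/* 1) Create a new user space scheduler stack. */
		stk = mmap(NULL, UTS_STACK_SIZE, PROT_READ | PROT_WRITE,
		    MAP_ANON | MAP_PRIVATE, -1, 0);

		/* 2) Set the new stack active. */
		old_sp = uts_swap_stack((char *)stk + UTS_STACK_SIZE);

		/*
		 * 3) Ask another CPU to return to user space on the new
		 *    stack; the request returns immediately on this CPU.
		 */
		rfork(RFADDCPU);

		/* 4) Continue on the previous stack, on the previous CPU. */
		uts_swap_stack(old_sp);
	}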
---

Multiple scheduler reservations, and PTHREAD_SCOPE_SYSTEM...

Can each of these scheduler reservations be at different system
scheduling priorities?  Yes.

The implementation of system scoped priorities in this situation is
now trivial: if I'm using a "realtime" system scoped priority
scheduler reservation, then I, as the user space scheduler, only
choose threads to run based on their membership in the list of
threads for which that priority is important.

I know that something is at a particular priority, because I know the
scheduler stack with which the priority was associated: it's the
currently running stack from the return from the kernel into the
scheduler code in user space.  If all of these threads are blocked on
events, then I go to sleep via an explicit yield, confident that when
the events unblock, they will do so at the high priority and on the
scheduling context on which the call was made.  So when I instantiate
the high system priority scheduler reservation, I may have to call an
explicit yield before I can start running a thread at that priority.

The binding of scheduler reservations (kernel call contexts, including
a kernel stack -- not something as heavyweight as the current kernel
threads implementation) in specific system scheduling classes to user
space threads in a given process... well, that's essentially a
process-level policy decision, to be made by any process that has a
high enough privilege to request such a high priority reservation.
This is really a very minimal spreading of the trust model, since you
have to trust the whole process to create all its threads at the right
priority, if you trust it to create one at a high (e.g. "realtime")
priority.

As far as "binding" goes, you end up with high and low priority quanta
coming back into user space on different user space scheduler stacks
-- multiple CPUs/scheduling classes into user space simultaneously --
and what user space thread the scheduler chooses to run as a result is
really up to it and the policy set by the programmer.

Clearly, you could abuse this, and just as clearly, certain scheduling
classes, like "realtime", cut across CPU migration boundaries, so I
would expect PTHREAD_SCOPE_SYSTEM to be a last resort for almost all
well-behaved programs.  Failure to treat it that way would pretty
surely cause excessive migration of the "realtime" scheduler
reservation between CPUs.

This is more detail than I wanted to get into outside the scope of a
coherent document (it's nearing the size of a white paper right now
8-(...), but since I'm going to be at an IBM UML class with Archie and
Julian all next week, and studying the book this weekend, I didn't
want to leave things hanging.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.