From owner-freebsd-hackers Fri Dec 6 16:45:30 1996
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.4/8.8.4) id QAA21033 for hackers-outgoing; Fri, 6 Dec 1996 16:45:30 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.8.4/8.8.4) with SMTP id QAA21028 for ; Fri, 6 Dec 1996 16:45:28 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id RAA24721; Fri, 6 Dec 1996 17:25:20 -0700
From: Terry Lambert
Message-Id: <199612070025.RAA24721@phaeton.artisoft.com>
Subject: Re: clone()/rfork()/threads (Re: Inferno for FreeBSD)
To: bakul@plexuscom.com (Bakul Shah)
Date: Fri, 6 Dec 1996 17:25:19 -0700 (MST)
Cc: terry@lambert.org, freebsd-hackers@freebsd.org
In-Reply-To: <199612062111.QAA22343@chai.plexuscom.com> from "Bakul Shah" at Dec 6, 96 04:11:51 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> > It will, however, be unfair to threads within the group.
>
> I guess we are being `fair' to different things!
>
> If you treat all the threads/processes as equal then you are right.
> But is being fair to threads always the right thing to do?

If it's not, then I guarantee you, I will probably use processes with
shared context rather than programming with threads.  It is in my best
interest as a vendor to make my application seem as fast as I can
possibly make it appear, even if it means starving out processes which
are not my application.

> Also, if you allow different priority levels, realtime/best-effort
> scheduling, etc., then you are already abandoning the `all processes
> are equal' model (or, if you wish, `refining' it).

The RT guarantee on the timer events for process B, even though A has
execution contexts which it is prepared to run and quantum remaining,
is an issue of timer granularity vs. quantum granularity.  Certain
system calls make guarantees that they can't really live up to without
it.

> Equal fairness to all processes made sense in a '70s style
> ``multi-user'' system where each user ran one process (and to be
> fair to your users you had to have fair scheduling of processes).
> But now that is no longer the common case.  Processes running on
> your PC are all running to serve _you_.  Processes running on a
> server are perhaps closer to the multi-user case, but even then the
> one-to-one correspondence between clients and processes is not a
> given.  In fact different people may want to choose different
> policies for their FTP/Web/Database servers (if running on the same
> machine).  Even for an FTP server, chances are, you want a bigger
> slice of processing for your named clients compared to anonymous
> clients.  Within a given class all clients are treated equally, but
> none of the above schemes can be implemented with an
> equal-fairness-to-all-threads scheduler.

It can if the program can decide fairness in the context of which
threads it chooses to schedule on the next available quantum.  The
fairness of "how much quantum does the program have to play with" is a
separate issue (as it should be).  I may be allowed 5 out of 160
quanta, but they are not allocated to me contiguously.  That's the
point.

> One needs a very different concept of fairness and it even
> means different things in different contexts.

Yes.  Separate "awarded quantum" from "amount of context utilization
of a single quantum, per context".
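To make the distinction concrete, here is a minimal sketch of the kind
of separation I mean.  The names and structures are invented for
illustration (this is not code from any tree): the kernel decides how
many quanta the process is awarded, and the process decides which of
its own contexts consumes each quantum it receives.

/*
 * Invented illustration: per-process selection of which context gets
 * the quantum the kernel just awarded.
 */
#include <stddef.h>

struct uthread {
        struct uthread  *next;          /* circular run queue link */
        int              runnable;
        void            (*run)(void);   /* run until the context blocks or yields */
};

static struct uthread *runq;            /* current position in the run queue */

/* Called once per quantum the kernel awards to this process. */
void
consume_quantum(void)
{
        struct uthread *t, *start;

        if (runq == NULL)
                return;                 /* no contexts at all */

        t = start = runq;
        do {
                if (t->runnable) {
                        runq = t->next; /* rotate: round-robin fairness */
                        t->run();       /* in-process policy decides the rest */
                        return;
                }
                t = t->next;
        } while (t != start);

        /* No runnable context: give the remaining quantum back to the system. */
}

How much quantum the process gets in the first place is still the
kernel's call; nothing in the process can vote itself a bigger share.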
> An example.  If some application is implemented in two different
> ways, once as a single process (with purely user level threads) and
> once as a multi-process (using kernel level threads), then with
> traditional scheduling the second implementation will get many more
> processor time quanta than the first one.  This is not fair to the
> users of the first implementation.  You'd likely say: "Tough luck!
> That is how things work".  But IMHO an implementation choice should
> not have such a drastic side-effect.

I agree.  I would further separate the "multiple kernel thread" case
from the "multiple process" case.  The two are only apparently the
same if you are measuring "competition for real quantum".  For a
threaded process, the real measure is "competition for effective
quantum".

A "perfect" scheduler in user space would have a quantum "goal" of
"one quantum award per blockable context, with remaining quantum
returned to the system".  This means that if I have 10 threads, and I
can run two contexts until they block on a given real quantum, then in
effect I should have 5 real quanta assigned to the process (thread
group), for a net total of 10 "effective quanta": the same as I would
have if there were one unthreaded process competing for quanta
normally.

> One should be able to use threads as just another `software
> structuring' choice and have separate control over how the CPU is
> used.

Yes.  I would not go so far as to allow a user to say "I need to
compete as if I were 1000 processes".  The user thread abstraction
could be implemented in the kernel at the trap entry interface; the
scheduler would not be under direct user control unless the user had
special privileges granted (just like the timer calls would not make
RT guarantees for better than quantum resolution without RT privs).
Effectively, this means implementing prioritized round-robin in user
space, using a "Yield"-style interface to implement user space thread
context switching.

> At any rate, I don't think one can design a scheduler that will be
> fair under all circumstances.  The best one can do is provide
> mechanisms or hooks to implement a wide variety of fairness
> policies, with some help from the user.  IMHO being fair to threads
> does not serve that purpose.

The *only* point in being fair to threads is to not discourage their
use.

The USL folks (New Jersey) wanted the NetWare for UNIX server on
UnixWare to use their threading abstraction.  We (the NWU architecture
team) refused because of the fairness issue.  Writing a new scheduling
class to work around brain-damage in their architecture was not really
a viable option.  Instead, we went to shared user space context and
multiple processes.

Since the processes were anonymous (identical) work-to-do model
engines, we could better overcome the process context switch problem
by using FILO scheduling of read returns at the incoming NCP packet
mux (my design: "hot engine scheduling").  Basically, the last guy to
make a call was likely to have his mappings currently active, so the
call was not blocked, and the first entry was dequeued.  You replace
"send/recv" with an ioctl that combines the two operations: it does a
send and blocks for an incoming NCP, which could be handled by any
service engine.
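For illustration only, the shape of the thing was roughly as follows.
The NCP_SENDRECV request code, struct ncp_xfer, and struct engine are
invented names (the real interfaces were internal to NWU); the point
is the combined send-and-wait call and the last-in, first-out wakeup
of the blocked engines.

/*
 * Engine side: the work-to-do loop, one combined syscall per request.
 */
#include <stddef.h>
#include <sys/ioctl.h>

#define NCP_SENDRECV    0x4e43          /* invented request code */

struct ncp_xfer {
        void    *reply;                 /* outgoing reply for the last request */
        void    *request;               /* filled in with the next incoming NCP */
};

void
engine_main(int mux_fd)
{
        struct ncp_xfer xfer = { NULL, NULL };

        for (;;) {
                /* ... build the reply for the previous request ... */
                ioctl(mux_fd, NCP_SENDRECV, &xfer);     /* send + block for work */
                /* ... service xfer.request ... */
        }
}

/*
 * Mux side: blocked engines go on a stack, so the engine woken for the
 * next request is the last one to have blocked -- the "hot" engine
 * whose mappings are most likely still active.
 */
struct engine {
        struct engine   *next;          /* stack link while blocked */
        /* per-engine context: descriptor table, shared mappings, ... */
};

static struct engine *hot_stack;        /* most recently blocked engine on top */

void
engine_block(struct engine *e)
{
        e->next = hot_stack;            /* push */
        hot_stack = e;
}

struct engine *
engine_wakeup(void)
{
        struct engine *e = hot_stack;

        if (e != NULL)
                hot_stack = e->next;    /* pop the last engine to block */
        return (e);                     /* NULL: queue the request instead */
}

One trap instead of two per request, and the engine that just made the
call is the one most likely to run without blocking.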
The resulting server beat the performance of the Native NetWare server
on the exact same hardware (same disk even) by 10% or more.  This was
after the 10% hit from moving from UnixWare 1.x to 2.x, and the 15%
hit on top of that from moving from monolithic ethernet drivers to ODI
drivers.

This wasn't publicized much (but it should have been).  I can say with
confidence that not including the change to the kernel to support
sharing of descriptor tables and using shared memory for global state,
and instead using a threaded model, would have resulted in
significantly *worse* performance for the product.  The SVR4 threads
were (*are*) simply not up to the job of replacing processes.

> Back to thread groups.
>
> Note that in such a scheme there needs to be a syscall or something
> for a thread to become part of a "schedulable" thread group.
> Ideally only __cooperating__ threads would want to belong to the
> same group.

Yeah; this call is "create_thread".  8-).

> In the scenarios you presented, the N ftpds are *not* cooperating
> with each other so there'd be no point in making them belong to the
> same group.

Actually, if they are using mmap() on files to save the user->kernel
copy overhead, then sharing the context between multiple clients
pulling down the same files would be a valuable optimization.  If we
replace "ftp" with "tftp" and say the clients are a bunch of X
terminals getting boot and font files, then it's a much better
example.

Also, the "theory" is that even with a normal ftpd among
non-cooperating other ftpd's, you supposedly "save process context
switch overhead" (I think you agree with me that this is a special
case fallacy).

> But in general you should be able to use a larger quanta allocation
> for process groups you want (so I am not suggesting one quantum per
> thread group).  If you are worried about responsiveness, put
> interactive processes at a higher priority.

The problem is in transparently exposing this facility *without*
opening up denial-of-service attacks or similar holes.  I think the
trap-level cooperative user space thread scheduler is one promising
way of hiding the tuning behind a protection domain.  You can grant a
license to cross this domain to particular special cases, but the
default should be that the threaded server process competes on an
equal footing with the multiprocess server.

> > > The above idea can be extended to multi processors fairly easily.
> > > Though multi-processor schedulers that can also do realtime
> > > scheduling (and appropriately deal with priority inversion) are
> > > not easy.
> >
> > Heh.  "locking nodes in a directed acyclic graph representing a lock
> > hierarchy" will address the priority inversion handily -- assuming
>
> I was talking about scheduling realtime threads on an MP system,
> running at different priority levels but which may want to share
> objects.  When a high prio. thread has to wait because a low
> priority thread holds a lock it wants, we have priority inversion.
> I don't think fair scheduling is easy when you consider these
> issues.

If you detect a collision with a lower priority holder of the
resource, you can lend priority at that time.  Priority lending is the
classic mechanism for resolving priority-inversion-based deadlocks.
The problem is detecting the deadlock condition when it occurs;
without a graphical solution, it's a nearly insoluble problem (unless
you simply preempt the resource, and that's got other problems).
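Roughly, the lending looks like this; the structures are invented and
the sleep/wakeup and detection plumbing is omitted, since the
detection is the hard part.

/*
 * Invented sketch of priority lending: a waiter that outranks the
 * current lock holder temporarily gives the holder its priority, so
 * the holder can run and release the lock.
 */
#include <stddef.h>

struct kthread {
        int     base_pri;               /* assigned priority */
        int     cur_pri;                /* effective (possibly lent) priority */
};

struct plock {
        struct kthread  *owner;         /* NULL when the lock is free */
        int              locked;
};

/* Lower numeric value == higher priority, as in the traditional scheduler. */
void
plock_acquire(struct plock *lp, struct kthread *self)
{
        while (lp->locked) {
                if (lp->owner != NULL && lp->owner->cur_pri > self->cur_pri)
                        lp->owner->cur_pri = self->cur_pri;     /* lend priority */
                /* ... sleep until the owner releases the lock ... */
        }
        lp->locked = 1;
        lp->owner = self;
}

void
plock_release(struct plock *lp, struct kthread *self)
{
        lp->owner = NULL;
        lp->locked = 0;
        self->cur_pri = self->base_pri; /* drop any lent priority */
        /* ... wake up waiters ... */
}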
> Anyway, I don't see how computing lock hierarchy helps solve the
> priority inversion problem.  Priority inheritance is one known
> solution.  `Non-blocking synchronization' (NBS) is another way of
> solving the priority inversion (as well as a host of other
> problems).

If you lock resources when you grab them and unlock them when you
release them, then priority inversion becomes detectable.  What you do
about it is only meaningful after it has been detected.  Doing the
detection is the hard part.

> BTW, you may wish to browse the NBS thread (a different kind of
> thread) on comp.os.research.  NBS may be something worth using
> (instead of mutexes and locks) in the MP version of FreeBSD.  Also
> read Greenwald and Cheriton's "The Synergy Between Non-blocking
> Synchronization and Operating System Structure" (accessible via
> http://www-dsg.stanford.edu/Publications.html)

Thanks for the references; I will look them up...


                                        Regards,
                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.