Date:      Thu, 15 Feb 2001 02:56:07 -0500
From:      Seth Leigh <seth@pengar.com>
To:        Terry Lambert <tlambert@primenet.com>
Cc:        freebsd-smp@FreeBSD.ORG
Subject:   Re: possible problem with SMP?
Message-ID:  <5.0.2.1.0.20010215025043.027a3008@hobbiton.shire.net>
In-Reply-To: <200102150708.AAA19435@usr08.primenet.com>
References:  <5.0.2.1.0.20010214234050.026959f8@hobbiton.shire.net>

I just read pages 407 to 410 of the Jim Mauro/Richard McDougall "Solaris 
Internals" book and it talks specifically about scheduler 
activations.  They mention how they are used by the kernel scheduler to 
keep LWPs from a given process running throughout the rest of their quantum 
even when one of them blocks, and some other things.  They specifically 
talk about Solaris Doors as being the mechanism by which the kernel calls 
up into the threads library.  I will have to read a lot more in the next 
day or so about Solaris scheduler activations until I know exactly what 
they are and what they do.  I am still only on page 150 of the book.  It's 
not easy reading.  ;-)  But I will finish reading it, and then I will read 
it all the way through again, since I really do want to understand this stuff.

Terry, have you looked into the Solaris Doors mechanism, and do you think 
that implementing a facility like that in FreeBSD would clean up and 
improve the way FreeBSD plans to implement the threads library upcalls, and 
other possible places for upcalls as well?  Despite my as-yet limited 
understanding, it sounds promising to me.
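
For anyone else trying to picture it, here is roughly what using a door 
looks like, going only by my reading of the door_create(3DOOR) and 
door_call(3DOOR) man pages (a sketch, so take the details with a grain 
of salt; the file name is made up):

    #include <sys/types.h>
    #include <door.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stropts.h>    /* fattach() */
    #include <unistd.h>

    /* Runs on a thread in the server process each time a client
     * invokes the door. */
    static void
    serv(void *cookie, char *argp, size_t arg_size, door_desc_t *dp,
        uint_t n_desc)
    {
        static char reply[] = "pong";

        /* door_return() sends the reply and parks this thread to wait
         * for the next invocation; it does not return on success. */
        (void) door_return(reply, sizeof (reply), NULL, 0);
    }

    int
    main(void)
    {
        int did, fd;

        if ((did = door_create(serv, NULL, 0)) < 0) {
            perror("door_create");
            return (1);
        }

        /* Publish the door in the file system so a client can open()
         * it and then invoke it with door_call(). */
        (void) unlink("/tmp/example_door");
        if ((fd = open("/tmp/example_door", O_CREAT | O_RDWR, 0644)) >= 0)
            (void) close(fd);
        if (fattach(did, "/tmp/example_door") < 0) {
            perror("fattach");
            return (1);
        }

        (void) pause();     /* serve until killed */
        return (0);
    }

A client just open()s /tmp/example_door and hands the descriptor to 
door_call() with a door_arg_t describing its arguments and reply buffer; 
the call runs directly on a thread in the server process, which is what 
makes it look attractive for the kernel-to-library upcalls the book 
describes.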

Seth


At 07:08 AM 2/15/2001 +0000, Terry Lambert wrote:
>I'm certainly not a spokesperson for this, but I'm very well
>versed on threads, and participated in the design discussions
>for the FreeBSD and (while at USL) SVR4 threading, as well as
>the Usenet discussions on the Solaris 2.5 threading and beyond.
>I guess that entitles me to express myself as much as the next
>guy...
>
>
> > OK, having read this, I am still not sure of how this scheme is any better
> > than the way it is done, for example, in Solaris, with the exception that
> > this scheme will prevent a single process from competing unfairly with
> > other processes simply by creating a large number of threads.
>
>You need to consider that threads were originally designed to
>reduce context switch overhead.  SMP scaling is merely a side
>effect of some particular implementations.
>
>The benefit is that it in fact delivers on the context switch
>overhead promise.  In a design where the threads are independently
>scheduled, the probability that the next thread to run belongs to
>the same process is roughly (N-1):(M), where N is the number of
>threads in the process and M is the total number of runnable
>threads on the system.  That means that as a system becomes heavily
>loaded, M grows until M >> (N-1), and the probability (N-1)/(M)
>approaches zero.  On the other hand, where quantum remains, an
>activations (or async system call) implementation keeps that
>probability at 1:1, i.e. 100%.
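>
>As a back-of-the-envelope illustration of that ratio (the numbers
>here are mine, pick your own):
>
>    /* Chance that the next thread a global scheduler picks belongs
>     * to the same process: (N-1)/M, where N is the threads in the
>     * process and M the total runnable threads on the system. */
>    #include <stdio.h>
>
>    int
>    main(void)
>    {
>        int n = 8;                        /* threads in our process */
>        int m[] = { 16, 64, 256, 1024 };  /* total runnable threads */
>        int i, nm = sizeof (m) / sizeof (m[0]);
>
>        for (i = 0; i < nm; i++)
>            printf("M = %4d  P(stay in-process) = %.3f\n",
>                m[i], (double)(n - 1) / m[i]);
>        /* Prints roughly 0.438, 0.109, 0.027, 0.007: the heavier the
>         * load, the less often you stay in-process.  With activations
>         * the UTS keeps the quantum, so the chance stays at 1.0. */
>        return (0);
>    }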
>
>The beauty of this increases when you realize that the method
>of implementation has the same SMP scalability as that of the
>Solaris approach.
>
>Solaris 2.8 comes much closer to achieving the same ends, but
>still has the same overhead scaling issues for N:M (N threads on
>M CPUs, in this case, even if you optimally limit a program to
>M scheduling objects regardless of how big N gets).
>
>
> > To be honest, if assuring that individual processes compete with
> > each other fairly for cpu time irrespective of the number of
> > threads created is very important, then running processes under
> > the Solaris Resource Manager scheduling class under different
> > Lnodes sounds like a perfectly reasonable way to do it.  In fact,
> > I would imagine that it would be possible simply to implement a
> > new scheduling class that eliminates the unfair competition
> > between processes but avoids most of the complexity of the Solaris
> > Resource Manager simply by calculating the priorities of all of
> > the runnable lwps on the system such that no single process' lwps
> > get more time on the cpus in general than is fair.
>
>Frankly, I think that SVR4 style scheduling classes are a very
>useful feature for an OS to have.  However, the primary goal of
>the design is not to enforce fairness, though permitting it to
>be enforced was certainly a consideration in the discussions I
>attended.
>
>Historically, SVR4 has had a tendency to use scheduling classes
>in the classical "if the only tool you have is a hammer..." way;
>in particular, the idea of a fixed scheduling class was implemented
>because a badly behaved process could thrash the buffer cache,
>forcing the X server pages out of core, to the point of the X
>server being unusable as an interface.  This broke the cognitive
>tactile/visual association that made the interface work (what
>Ed Lane, a friend of mine and one of the people who complained
>loudly until it was at least worked around, called "move mouse,
>wiggle cursor").  The fixed scheduling class addressed this by
>providing the X server with sufficient quanta in any set of K
>quanta to thrash its own pages back in: a workaround, since the
>OS wastes much time page thrashing, which it would not waste if
>the problem were addressed directly (e.g. with a per vnode working
>set quota).  Unfortunately, non-kernel methods of addressing the
>problem are unlikely to be successful; the linker was one of the
>main offending programs: it mapped object files, and seeked around
>in them as if they were memory.  Changing the linker behaviour (and
>the behaviour of all other offenders) was not really a viable
>option (a working set quota on the vnode would have forced it to
>steal pages from itself, instead of from the LRU lists of swapped
>out processes, once it hit the working set size limit).
>
>As a side note to this tangent, it's pretty amusing to note that
>the most recent Solaris release "reinvents" the buffer cache,
>and deunifies it once again, to get around heavy usage starvation
>problems, which could just as easily have been fixed with those
>same working set quotas, or, as FreeBSD has recently started doing,
>by maintaining purpose-dedicated minimum reserves.
>
>In any case, fairness is a benefit, and permitting enforcement of
>fairness was a design goal, but I don't look at it as a primary
>design criterion, and you probably shouldn't, either, when trying
>to compare implementations.
>
>
> > Anyhow, perhaps I just haven't seen the light.  I should probably go back
> > and re-read the paper on the web site, since I just don't really see how
> > this is "The Right Way" over how Solaris does this, for example.
>
>I really recommend that you do.  Read it in the context of the
>idea that the CPU should be spending as much time as possible
>in user space, and that time spent in kernel space on context
>switches or I/O waits is nothing but overhead.
>
>At the same time, there are some subtleties that are easy to
>overlook that you should also keep in mind.  I think the number
>one of these is that a threaded program that should be a threaded
>program, philosophically speaking, will not have a lot of
>inter-thread contention.  In other words, separate threads have
>separate locality of reference for data, and most resources
>are not contended.
>
>Again, the ability to communicate in a single address space among
>several threads is a side effect of implementation, just as SMP
>scalability turned out to be a side effect (even if a desirable
>one).  If you look at the NT and 95/98 threads implementations,
>you'll see that they make heavy use of thread local storage to
>instance COM objects, and that objects instanced after the
>creation of a new thread do not exist in (are not mapped into)
>the address space of the creating thread, without explicitly
>marshalling the instance data (anyone who has tried to use the
>LDAP API in a multithreaded environment under the Microsoft OS
>and toolchain, can attest that the connection created in one
>thread is not automatically marshalled; the technical reason is
>that the DLL of the created library doesn't have thread_attach
>and thread_detach entry points which force the data marshalling
>to occur transparently).  The address space sharing, when and if
>it exists at all, is an artifact of the implementation.
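>
>To make the "entry points" comment concrete: I mean the thread
>attach/detach notifications a Win32 DLL receives in its DllMain().
>A minimal sketch (illustrative only, not the actual LDAP library
>code):
>
>    #include <windows.h>
>
>    /* The loader calls this for each DLL on process and thread
>     * creation/exit.  A DLL that wants per-thread instance data to
>     * follow newly created threads has to set it up here; without
>     * it, state created in one thread is simply not instanced for
>     * another. */
>    BOOL WINAPI
>    DllMain(HINSTANCE hinst, DWORD reason, LPVOID reserved)
>    {
>        (void) hinst;
>        (void) reserved;
>
>        switch (reason) {
>        case DLL_THREAD_ATTACH:
>            /* allocate/copy per-thread state, e.g. via TlsAlloc()
>             * and TlsSetValue() */
>            break;
>        case DLL_THREAD_DETACH:
>            /* release that per-thread state */
>            break;
>        }
>        return TRUE;
>    }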
>
>To put this into perspective, note that most SMP scaling for
>MESI architecture shared memory multiprocessing is anecdotally
>held to "fall off" and become "below the point of diminishing
>returns" at only 4 processors.  Some implementations will claim
>8 processors, and many papers will claim that NUMA and similar
>compromise data-flow architectures, which can't support true
>symmetry, are required to exceed 4 processors.  Yet in the early
>90s, Sequent manufactured both the Symmetry and Balance series
>of servers (680x0 and 80[4|5]86), and were able to support
>MESI cache coherency, with continuing returns for 32 processors
>in a single box.
>
>What this says to me is that the assumption that a certain amount
>of IPI overhead "is inevitable" is really based on assumptions
>that may be valid about the architecture of SVR4 ES/MP or SMP
>SVR4.2, but which were not valid for Dynix and other SMP and
>MP OSs of the time.  In other words, contention for shared
>resources is not so much an artifact of SMP as it is of a
>particular SMP implementation.
>
>Part of that has to be the threads library, which is intended,
>at least these days, to provide SMP scalability for applications,
>instead of merely increasing utilized quantum and I/O concurrency
>(the original reasons for threads, and for "liblwp" on SunOS 4.x:
>a call conversion based user space threads library, which was not
>SMP scalable, and which was the basis of the University of
>Washington "User Space Threads and SPARC Register Windows" paper).
>
>
> > One question I have is this.  Since I assume you aren't going to be
> > messing with the system clock interrupts, but you seem to be adding your
> > own timers to the UTS to handle preemption at the user level, aren't you
> > guys adding more wasted overhead dealing with timer interrupts?
>
>Scheduler activations are not timer interrupts.  And they are only
>created in the context of blocking calls.  While my personal
>preference would be to put all blocking call POSIX semantics
>into a library, and make all system calls non-blocking, I do
>understand the compatibility issue that led to the use of the
>more complicated activations framework.
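>
>To give that some shape, a purely illustrative sketch (the names
>are invented; this is not the actual FreeBSD or Solaris interface)
>of the sort of upcall the kernel delivers to the user-level thread
>scheduler (UTS) when one of its threads blocks:
>
>    #include <stddef.h>
>
>    enum upcall_reason {
>        UPCALL_BLOCKED,     /* thread entered a blocking call */
>        UPCALL_UNBLOCKED    /* a blocked thread may run again */
>    };
>
>    struct upcall {
>        enum upcall_reason reason;
>        void *thread_ctx;   /* saved user context of that thread */
>    };
>
>    /* UTS internals, stubbed out for the sketch. */
>    static void *uts_pick_next(void)          { return NULL; }
>    static void  uts_switch_to(void *ctx)     { (void) ctx; }
>    static void  uts_make_runnable(void *ctx) { (void) ctx; }
>
>    /* The routine the kernel calls up into on the remaining quantum. */
>    void
>    uts_upcall_handler(struct upcall *up)
>    {
>        switch (up->reason) {
>        case UPCALL_BLOCKED:
>            /* Keep the quantum: switch to another ready thread in
>             * this same process, in user space, rather than letting
>             * the kernel context switch to a different process. */
>            uts_switch_to(uts_pick_next());
>            break;
>        case UPCALL_UNBLOCKED:
>            /* The blocking call completed; queue the thread so the
>             * UTS can resume it on a later quantum. */
>            uts_make_runnable(up->thread_ctx);
>            break;
>        }
>    }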
>
>Really, you should read the "Scheduler Activations" paper, which
>discusses these issues in detail.  If you look under the covers
>of the cooperative kernel and user space scheduler in Solaris 2.8,
>I think you'll find a number of similarities.
>
>There is some additional protection domain crossing -- actually,
>two calls per activation -- that would not be there with a fully
>async system call interface, with call contexts dual-mapped in
>both user and kernel space, but when you compare that overhead
>to a high probability of a heavy weight context switch, the
>benefits should be obvious.
>
>Note: The only place this will not be true will be on a system
>with an intentionally homogeneous load, to ensure that all of the
>contention will be between threads in a single process.  It is my
>experience, and, I think, the general consensus, that this does
>not represent a real world load that one could expect on any
>system capable of doing useful work.  To be CPU bound enough to
>require multiple processors in the first place, in this I/O bound
>world we live in today, really means non-homogeneity of the code
>competing for quanta.
>
>
> > I don't see the logic in the argument "letting the kernel scheduler handle
> > the scheduling of all the threads (via lwps) is bad, because it overloads
> > the kernel scheduler, so we are going to do all of this thread scheduling
> > using a user-level thread scheduler".
>
>The issue is one of who is responsible for non-default policy
>implementation.  If you put this in the kernel, then you make
>the code arbitrarily complex.  The problem you are then faced
>with is how to institute CPU affinity for threads in a process,
>and how to implement thread group affinity to prefer threads
>in one process, when a kernel sleep (voluntary context switch)
>takes place.  Almost any way you cut it, it is nearly impossible
>to implement thread group affinity in a scheduler, once something
>is already on a scheduling queue, without leading to starvation
>of threads not in the thread group.  The closest approach (as
>described in "UNIX Systems for Modern Architectures") is a two
>queue, two-handed clock algorithm (effectively, a three-handed
>clock, with preferential queue insertion of ready-to-run "LWPs",
>to use Sun
>terminology).  This makes the scheduler incredibly bursty with
>regard to your threaded process, and at the same time turns it
>into a fair-share scheduler, meaning that it ignores priorities
>when trying to decide what to run next.
>
>
> > If you can write a user-level thread scheduler that won't get
> > "overloaded" by a large number of threads, why can't you just
> > write the kernel-level scheduler the same way?  I suppose if
> > the answer is that the existing FreeBSD scheduler is just too
> > clumsy to provide good scalability over multiple cpus and a
> > large number of runnable lwps, then wouldn't the first job be to
> > improve the scheduler, rather than simply try to shift all the
> > load into a UTS?
>
>The answer is that it's not possible to build a scheduler that
>implements priority based fairness, when user processes are
>permitted to request an unbounded amount of quantum.
>
>The best you can do is implement quantum allocation fairness,
>and then implement a policy based decision on which thread in
>a group of threads will get a quantum, once it is assigned to
>a process.
>
>Technically, it's possible to implement this all in the kernel,
>but it significantly and unnecessarily increases the complexity.
>In the case of calls which can be run to completion without
>blocking in the kernel, it also increases the amount of protection
>domain crossing.
>
>Again, you want to think about how you can spend the most time
>possible actually running user code, instead of operating system
>code.  In terms of "fairness", accounting scheduler overhead for
>implementation of policy decisions against the process which
>initiated the need for such overhead is eminently fair.
>
>
>If you want to look at the simplest case, a multithreaded
>application running on a uniprocessor box, the lowest possible
>overhead is obtained by staying in user space -- not crossing
>the protection domain at all -- and context switching between
>threads within the same process, in user space, for as long as
>the quantum holds out.  Obviously, aio calls beat non-blocking
>calls, since a non-blocking call that would block results in a
>fault or other kernel operation occurring, and returning to the
>caller with EWOULDBLOCK, completing successfully the next time it
>is attempted (an extra protection domain crossing, compared to
>AIO), but any approach beats context switching to a different
>competing process, and losing the remainder of your quantum.
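>
>To make the comparison concrete, a rough sketch using the ordinary
>POSIX interfaces (nothing activation-specific; the choice of
>descriptor is mine, and which descriptor types a given AIO
>implementation accepts varies):
>
>    #include <aio.h>
>    #include <errno.h>
>    #include <fcntl.h>
>    #include <stdio.h>
>    #include <string.h>
>    #include <unistd.h>
>
>    int
>    main(void)
>    {
>        char buf[512];
>        struct aiocb cb;
>        int fd;
>
>        /* The tty is just a handy descriptor that can actually block. */
>        if ((fd = open("/dev/tty", O_RDONLY | O_NONBLOCK)) < 0)
>            return (1);
>
>        /* Non-blocking: if nothing is ready, we still paid for a
>         * protection domain crossing, got EWOULDBLOCK/EAGAIN back,
>         * and must cross again later to retry. */
>        if (read(fd, buf, sizeof (buf)) < 0 &&
>            (errno == EWOULDBLOCK || errno == EAGAIN))
>            printf("read would block; retry later\n");
>
>        /* AIO: one call queues the request, the I/O proceeds while
>         * we keep running user code, and the result is collected
>         * once aio_error() stops reporting EINPROGRESS. */
>        memset(&cb, 0, sizeof (cb));
>        cb.aio_fildes = fd;
>        cb.aio_buf = buf;
>        cb.aio_nbytes = sizeof (buf);
>        if (aio_read(&cb) == 0) {
>            while (aio_error(&cb) == EINPROGRESS)
>                ;   /* do useful user-space work here instead */
>            printf("aio read returned %ld bytes\n",
>                (long) aio_return(&cb));
>        }
>
>        (void) close(fd);
>        return (0);
>    }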
>
>Ask yourself how to make this work with the SMP side effect
>of some implementations becoming a design goal instead of an
>unintended side effect.
>
>It's only when you add other processors, and then insist that
>they be additive in a single multithreaded program, instead of
>in two separate programs, that things start to get complicated.
>And then you have to remember
>your first principles: why you had threads in the first place,
>and not how you are now able to abuse them to achieve primitive
>and generally grossly inefficient SMP scaling.
>
>If you want to achieve SMP scalability, at least be honest; and
>design to achieve it as efficiently as possible, not as a desirable
>side effect of the abuse of a technology that _happens_ to result
>in _some_ scalability.
>
>Another good reference in this regard, by the way, is the Linus
>response to more recent Microsoft vs. Linux benchmarks, in which
>he admits scalability problems in Linux (which uses the original
>SVR4 kernel threading model, pretty much unaltered), and states
>with regard to early claims of benchmark bias "...we were
>arrogant..." (referring to the fact that MS OSs can in fact beat
>Linux fairly, in some benchmarks).
>
>
>
>                                         Terry Lambert
>                                         terry@lambert.org
>---
>Any opinions in this posting are my own and not those of my present
>or previous employers.




To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-smp" in the body of the message



