Date:      Thu, 15 Feb 2001 07:08:32 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        seth@pengar.com (Seth Leigh)
Cc:        freebsd-smp@FreeBSD.ORG
Subject:   Re: possible problem with SMP?
Message-ID:  <200102150708.AAA19435@usr08.primenet.com>
In-Reply-To: <5.0.2.1.0.20010214234050.026959f8@hobbiton.shire.net> from "Seth Leigh" at Feb 15, 2001 12:01:30 AM

I'm certainly not a spokesperson for this, but I'm very well
versed on threads, and participated in the design discussions
for the FreeBSD and (while at USL) SVR4 threading, as well as
the Usenet discussions on the Solaris 2.5 threading and beyond.
I guess that entitles me to express myself as much as the next
guy...


> OK, having read this, I am still not sure of how this scheme is any better 
> than the way it is done, for example, in Solaris, with the exception that 
> this scheme will prevent a single process from competing unfairly with 
> other processes simply by creating a large number of threads.

You need to consider that threads were originally designed to
reduce context switch overhead.  SMP scaling is merely a side
effect of some particular implementations.

The benefit is that it in fact delivers on the context switch
overhead promise.  In a design where the threads are independently
scheduled, the probability that the next thread scheduled belongs
to the same process depends on the number of threads in the
process (N) vs. the total number of runnable threads (M): it is
(N-1)/(M).  That means that as a system becomes heavily loaded,
(M) grows until (M) >> (N-1), and the probability (N-1)/(M)
approaches zero.  On the other hand, in an activations (or async
system call) implementation, as long as quantum remains, the
probability stays at 100%.

The beauty of this increases when you realize that this method
of implementation has the same SMP scalability as the Solaris
approach.

Solaris 2.8 comes much closer to achieving the same ends, but
still has the same overhead scaling issues for N:M (N threads on
M CPUs, in this case, even if you optimally limit a program to
M scheduling objects regardless of how big N gets).


> To be honest, if assuring that individual processes compete with
> each other fairly for cpu time irrespective of the number of
> threads created is very important, then running processes under
> the Solaris Resource Manager scheduling class under different
> Lnodes sounds like a perfectly reasonable way to do it.  In fact,
> I would imagine that it would be possible simply to implement a
> new scheduling class that eliminates the unfair competition 
> between processes but avoids most of the complexity of the Solaris
> Resource Manager simply by calculating the priorities of all of
> the runnable lwps on the system such that no single process' lwps
> get more time on the cpus in general than is fair.

Frankly, I think that SVR4 style scheduling classes are a very
useful feature for an OS to have.  However, the primary goals of
the design are not to enforce fairness, though permitting it to
be enforced was certainly a consideration, in the discussions I
attended.

Historically, SVR4 has had a tendency to use scheduling classes
in the classical "if the only tool you have is a hammer..." way;
in particular, the idea of a fixed scheduling class was implemented
because a badly behaved process could thrash the buffer cache,
forcing the X server pages out of core, to the point of the X
server being unusable as an interface.  This broke the cognitive
tactile/visual association that made the interface work (what
Ed Lane, a friend of mine, and one of the people who complained
loudly until it was at least worked around, called "move mouse,
wiggle cursor").  The fixed scheduling class addressed this by
providing the X server with sufficient quanta in any set of K
quanta to thrash its own pages back in: a workaround, since the
OS wastes much time page thrashing, which it would not waste, if
the problem were addressed directly (e.g. with a per vnode working
set quota).  Unfortunately, non-kernel methods of addressing the
problem are unlikely to be successful; the linker was one of the
main offending programs: it mapped object files, and seeked around
in them as if they were memory.  Changing the linker behaviour (and
the behaviour of all other offenders) was not really a viable
option (a working set quota on the vnode would have forced it to
steal pages from itself, instead of from the LRU lists of swapped
out processes, once it hit the working set size limit).

As a side note to this tangent, it's pretty amusing to note that
the most recent Solaris release "reinvents" the buffer cache,
and deunifies it once again, to get around heavy usage starvation
problems, which could just as easily have been fixed with those
same working set quotas, or, as FreeBSD has recently started doing,
by maintaining purpose-dedicated minimum reserves.

In any case, fairness is a benefit, and permitting enforcement of
fairness was a design goal, but I don't look at it as a primary
design criterion, and you probably shouldn't, either, when trying
to compare implementations.


> Anyhow, perhaps I just haven't seen the light.  I should probably go back 
> and re-read the paper on the web site, since I just don't really see how 
> this is "The Right Way" over how Solaris does this, for example.

I really recommend that you do.  Read it in the context of the
idea that the CPU should be spending as much time as possible
in user space, and that time spent in kernel space on context
switches or I/O waits is nothing but overhead.

At the same time, there are some subtleties that are easy to
overlook that you should also keep in mind.  I think the foremost
of these is that a program which genuinely deserves to be a
threaded program, philosophically speaking, will not have a lot
of inter-thread contention.  In other words, separate threads
have separate locality of reference for data, and most resources
are not contended.

Again, the ability to communicate in a single address space among
several threads is a side effect of implementation, just as SMP
scalability turned out to be a side effect (even if a desirable
one).  If you look at the NT and 95/98 threads implementations,
you'll see that they make heavy use of thread local storage to
instance COM objects, and that objects instanced after the
creation of a new thread do not exist in (are not mapped into)
the address space of the creating thread, without explicitly
marshalling the instance data (anyone who has tried to use the
LDAP API in a multithreaded environment under the Microsoft OS
and toolchain, can attest that the connection created in one
thread is not automatically marshalled; the technical reason is
that the DLL of the created library doesn't have thread_attach
and thread_detach entry points which force the data marshalling
to occur transparently).  The address space sharing, when and if
it exists at all, is an artifact of the implementation.

To put this into perspective, note that most SMP scaling for
MESI architecture shared memory multiprocessing is anecdotally
held to "fall off" and become "below the point of diminishing
returns" at only 4 processors.  Some implementations will claim
8 processors, and many papers will claim that NUMA and similar
compromise data-flow architectures, which can't support true
symmetry, are required to exceed 4 processors.  Yet in the early
90s, Sequent manufactured both the Symmetry and Balance series
of servers (680x0 and 80[4|5]86), and were able to support
MESI cache coherency, with continuing returns for 32 processors
in a single box.

What this says to me is that the assumption that a certain amount
of IPI overhead "is inevitable" is really based on assumptions
that may be valid about the architecture of SVR4 ES/MP or SMP
SVR4.2, but which were not valid for Dynix and other SMP and
MP OSs of the time.  In other words, contention for shared
resources is not so much an artifact of SMP, as it is of a
particular SMP implementation.

Part of that has to be the threads library, which is intended,
at least these days, to provide SMP scalability for applications,
instead of merely increasing utilized quantum and I/O concurrency
(the original reasons for threads, and for "liblwp" on SunOS 4.x,
a call-conversion-based user space threads library, not SMP
scalable, which was the basis of the University of Washington
"User Space Threads and SPARC Register Windows" paper).


> One question I have is this.  Since I assume you aren't going to be messing 
> with the system clock interupts, but you seem to be adding your own timers 
> to the UTS to handle preemption at the user level, aren't you guys adding 
> more wasted overhead dealing with timer interupts?

Scheduler activations are not timer interrupts.  And they are only
created in the context of blocking calls.  While my personal
preference would be to put all blocking call POSIX semantics
into a library, and make all system calls non-blocking, I do
understand the compatibility issues that led to the use of the
more complicated activations framework.

Really, you should read the "Scheduler Activations" paper, which
discusses these issues in detail.  If you look under the covers
of the cooperative kernel and user space scheduler in Solaris 2.8,
I think you'll find a number of similarities.

There is some additional protection domain crossing -- actually,
two calls per activation -- that would not be there with a fully
async system call interface, with call contexts dual-mapped in
both user and kernel space, but when you compare that overhead
to a high probability of a heavy weight context switch, the
benefits should be obvious.

Note: The only place this will not be true will be on a system
with an intentionally homogeneous load, to ensure that all of the
contention will be between threads in a single process.  It is my
experience, and, I think, the general consensus, that this does
not represent a real world load that one could expect on any
system capable of doing useful work.  To be CPU bound enough to
require multiple processors in the first place, in this I/O bound
world we live in today, really means non-homogeneity of the code
competing for quanta.


> I don't see the logic in the argument "letting the kernel scheduler handle 
> the scheduling of all the threads (via lwps) is bad, because it overloads 
> the kernel scheduler, so we are going to do all of this thread scheduling 
> using a user-level thread scheduler".

The issue is one of who is responsible for non-default policy
implementation.  If you put this in the kernel, then you make
the code arbitrarily complex.  The problem you are then faced
with is how to institute CPU affinity for threads in a process,
and how to implement threads group affinity to prefer threads
in one process, when a kernel sleep (voluntary context switch)
takes place.  Almost any way you cut it, it is nearly impossible
to implement thread group affinity in a scheduler, once something
is already on a scheduling queue, without leading to starvation
of threads not in the threads group.  The closest approach (as
described in "UNIX for Modern Architectures") is a two queue, two
handed clock algorithm (effectively, a three handed clock, with
preferential queue insertion of ready-to-run "LWPs", to use Sun
terminology).  This makes the scheduler incredibly bursty with
regard to your threaded process, and at the same time turns it
into a fair-share scheduler, meaning that it ignores priorities
when trying to decide what to run next.


> If you can write a user-level thread scheduler that won't get
> "overloaded" by a large number of threads, why can't you just
> write the kernel-level scheduler the same way?  I suppose if 
> the answer is that the existing FreeBSD scheduler is just too
> clumsy to provide good scalability over multiple cpus and a
> large number of runnable lwps, then wouldn't the first job be to
> improve the scheduler, rather than simply try to shift all the
> load into a UTS?

The answer is that it's not possible to build a scheduler that
implements priority based fairness, when user processes are
permitted to request an unbounded amount of quantum.

The best you can do is implement quantum allocation fairness,
and then implement a policy based decision on which thread in
a group of threads will get a quantum, once it is assigned to
a process.

Technically, it's possible to implement this all in the kernel,
but it significantly and unnecessarily increases the complexity.
In the case of calls which can be run to completion without
blocking in the kernel, it also increases the amount of protection
domain crossing.

Again, you want to think about how you can spend the most time
you can possibly spend actually running user code, instead of in
the operating system.  In terms of "fairness", accounting scheduler
overhead for implementation of policy decisions against the
process which initiated the need for such overhead, is eminently
fair.


If you want to look at the simplest case, a multithreaded
application running on a uniprocessor box, the lowest possible
overhead is obtained by staying in user space -- not crossing
the protection domain at all -- and context switching between
threads within the same process, in user space, for as long as
the quantum holds out.  Obviously, aio calls beat non-blocking
calls, since a non-blocking call that would block results in a
fault or other kernel operation occurring, and returning to the
caller with EWOULDBLOCK, completing successfully the next time it
is attempted (an extra protection domain crossing, compared to
AIO), but any approach beats context switching to a different
competing process, and losing the remainder of your quantum.

Ask yourself how to make this work with the SMP side effect
of some implementations becoming a design goal instead of an
unintended side effect.

It's only when you add other processors, and then insist that
they be additive in a single multithreaded program, instead of
in two programs, that things start to get complicated.  And
then you have to remember
your first principles: why you had threads in the first place,
and not how you are now able to abuse them to achieve primitive
and generally grossly inefficient SMP scaling.

If you want to achieve SMP scalability, at least be honest; and
design to achieve it as efficiently as possible, not as a desirable
side effect of the abuse of a technology that _happens_ to result
in _some_ scalability.

Another good reference in this regard, by the way, is the Linus
response to more recent Microsoft vs. Linux benchmarks, in which
he admits scalability problems in Linux (which uses the original
SVR4 kernel threading model, pretty much unaltered), and states
with regard to early claims of benchmark bias "...we were
arrogant..." (referring to the fact that MS OSs can in fact beat
Linux fairly, in some benchmarks).



					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

