Date:      Fri, 20 Sep 2002 23:51:58 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Rik van Riel <riel@conectiva.com.br>
Cc:        "Bill Huey (Hui)" <billh@gnuppy.monkey.org>, Julian Elischer <julian@elischer.org>, freebsd-arch@freebsd.org
Subject:   Re: New Linux threading model
Message-ID:  <3D8C170E.A84E922B@mindspring.com>
References:  <Pine.LNX.4.44L.0209202254280.1857-100000@imladris.surriel.com>

Rik van Riel wrote:
> On Fri, 20 Sep 2002, Terry Lambert wrote:
> > In the design I'm talking about, there is only a single lock,
> > and that lock is on a migration queue, into which you *push*
> > entries.  This is different from both the Linux scheduler, and the
> > current FreeBSD scheduler, and the current FreeBSD scheduler with
> > Alfred's old (last year) affinity patches applied.
> 
> Very nice design, IF the cost of contention is higher than
> the cost of having CPUs sit idle until the next rescheduling
> happens on one of the busier CPUs...
> 
> Say that you have a moderately loaded box with, on average,
> about as many runnable tasks as you have CPUs. These tasks
> will, due to statistics, not always be evenly distributed
> across CPUs.

This will never happen in the real world, where the steady state
of system load (as a measure of the number of processes, on average,
in the ready-to-run state) is > 1.  8-).  It also makes a really
serious assumption about the number of CPUs, or a seriously invalid
one about where the average load of a desktop or Internet server
sits.  But say we accept that, for the sake of argument, even though
we know that threads will drastically inflate the number of things
in "ready to run" state, even if we decide to call those things
something other than "processes".


> Say that distribution is near-perfect and this results in
> only 1% CPU idle time.
> 
> Your scheme would win ONLY if the lock contention in the
> pull model is responsible for more than that 1% CPU time.

Not true.

There is a common myth in the x86 world: it's that shared memory
multiprocessors based on the x86 architecture stop scaling after
4 CPUs.

In point of fact, Sequent built 32-processor 486 and, later,
Pentium systems, which were shared memory multiprocessors.
They scaled significantly beyond that point, no problem.  I
used to program applications on these beasts, and on the
68K-based ones as well (i.e. both Sequent Symmetry and Sequent
Balance machines).

So what's the barrier in the SVR4 and Solaris -- and Linux,
and FreeBSD -- OS architectures that makes everyone believe
and propagate this myth?

It all comes down to contention for shared resources.  And
these days, that boils down to, in increasing order of
contention:

o	TLB/L1 cache coherency
o	L2 cache memory
o	System memory
o	The I/O bus(ses)

...it's really, really important to remember that the "I" in
"MESI" stands for "Invalidate".  8-).


> If we can keep the CPUs busier by pulling tasks onto idle
> CPUs and the locking overhead is less than 1%, it's a win.


With respect, most modern systems are never CPU-bound these
days; when they *are* CPU-bound, it's because there is a
stall barrier introduced by an I/O or main memory access.

Consider a very fast Intel system these days; it's common to
find machines that run at ~2.1GHz.  The fastest front-side
memory bus in common use is 433MHz -- one fifth of that speed.

Now take a simple lock -- a compare-and-exchange; you are
talking a factor of either 10 or 15 on the access, vs. the
instructions used in the access, because there is a barrier
that requires that MESI cache coherency is maintained on
the lock contents: 5 in, 5 out, and (maybe) another 5 in.

*Maybe*, if you limit yourself to non-shared-memory systems,
you can get the speed win you wanted; that basically
means "no hyperthreading", or it means "NUMA".


Now let's "solve" the general SMP shared memory multiprocessor
scaling problem, in its entirety.

The normal way to solve this is: don't share the memory.  A
different way of putting this is "A resource which is not
shared, is a resource which is not contended".

How do you achieve this virtual state, on real hardware, when
the real hardware *does* share resources?

The easy way to do this is to break up ownership of the system
resources so that the contention is minimized.  This means:

o	Eliminate locking from common non-failure code paths.

o	Eliminate resource contention by assigning resources
	to per processor pools, which can be refilled from,
	and flushed to, contended system-wide pools -- *only*
	when necessary (a sketch of this follows below).

o	Eliminate locking from failure code paths

o	Eliminate cache discard (TLB shootdown, L1 and L2
	cache invalidation, reloading of data from devices,
	if the data is already in memory)

o	Assign data channels to specific CPUs, rather than
	running in virtual wire mode for interrupt processing,
	or transiently assigning interrupts to a single CPU
	simply because it has acquired a global lock

o	Enforce cache locality (common code path coherency,
	cache coloring, etc.; Ingo Molnar, one of the authors
	of the paper cited earlier, has done this for the Tux
	in-kernel HTTP server, so that the common code path
	all fits in cache: TCP stack, web server, and all)
 
...in short, pretty much everything Sequent did to Dynix, reaching
a peak between 1992 and 1994 or so, and what SGI has been arguing
for inclusion in Linux, based on the patch sets they've provided.
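
Purely to illustrate the per-processor pool item in the list above
(userland, invented names; in a kernel the index would come from
curcpu and the arrays would live in per-CPU data), here is the shape
of the thing.  The common allocate/free path never touches a lock;
the mutex on the system-wide pool is only taken when a per-CPU pool
runs dry or overflows its high watermark:

#include <pthread.h>
#include <stdlib.h>

#define NCPU		4
#define PCPU_HIGH_WAT	64		/* per-CPU pool high watermark */

struct obj {
	struct obj *next;
};

static struct obj	*global_pool;	/* contended, locked */
static pthread_mutex_t	 global_lock = PTHREAD_MUTEX_INITIALIZER;

static struct obj	*pcpu_pool[NCPU];	/* owned per CPU, no lock */
static int		 pcpu_count[NCPU];

struct obj *
obj_alloc(int cpu)
{
	struct obj *o = pcpu_pool[cpu];

	if (o != NULL) {			/* common path: no lock */
		pcpu_pool[cpu] = o->next;
		pcpu_count[cpu]--;
		return (o);
	}
	pthread_mutex_lock(&global_lock);	/* exceptional path: refill */
	o = global_pool;
	if (o != NULL)
		global_pool = o->next;
	pthread_mutex_unlock(&global_lock);
	return (o != NULL ? o : malloc(sizeof(*o)));
}

void
obj_free(int cpu, struct obj *o)
{
	if (pcpu_count[cpu] < PCPU_HIGH_WAT) {	/* common path: keep local */
		o->next = pcpu_pool[cpu];
		pcpu_pool[cpu] = o;
		pcpu_count[cpu]++;
		return;
	}
	pthread_mutex_lock(&global_lock);	/* exceptional path: flush */
	o->next = global_pool;
	global_pool = o;
	pthread_mutex_unlock(&global_lock);
}

A real allocator would refill and flush in batches rather than one
object at a time, but the point is where the lock is -- and isn't.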

It's all well and good to argue CPU time when there isn't a clock
multiplier, or when CPU time is your most heavily contended resource.

But this is only true, even on CPU intensive applications, if those
applications are inherently parallelizable -- the kind of applications
where you can hook a bunch of toy computers together, and get the same
results that you would get from a real supercomputer.  Even then, it
only really works if there is isolated locality of reference for the
problem subsets between the CPUs: if everyone is writing data into
the same page in memory, then it's going to be a contended resource,
and it's going to bottleneck the computations in contention for the
resource.

In the common case, main memory is generally a full order of
magnitude slower than L1 cache.  And disk is slower than that by
several more orders of magnitude.  And for any network application,
any disk I/O you do is going to contend with the network for I/O
bus time.


So, back on the "Subject:" line...

The benefits of separating the scheduler queues, without the
introduction of locks except in the exceptional code paths, and
of delaying that, if possible, to act as a smoothing function
for transient load spikes (e.g. make the figure of merit for
CPU load be a weighted moving average of the last N snapshots
of the instantaneous load) ...are not the be-all, end-all of
optimizations.  On the other hand, the approach doesn't suck,
and it's not something you can so easily dismiss by claiming
that the CPU load is "only" 99%.  In other words, we are
preventing objects in one ownership domain from moving to
another one, except under extraordinary circumstances, because
we know that that's where contention arises.
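
For the sake of having something concrete to argue about, here is
roughly the shape of that, as a userland sketch with invented names
(whether the migration queue is global or per-target-CPU is a detail
I'm glossing over by making it per-CPU here):

#include <pthread.h>
#include <stddef.h>

struct task {
	struct task *next;
};

struct cpu_sched {
	struct task	*runq;		/* owned by this CPU; never locked */
	struct task	*migq;		/* other CPUs push work here */
	pthread_mutex_t	 migq_lock;	/* the only lock in the design */
	unsigned	 load_ewma;	/* weighted moving average of load */
};

/*
 * Figure of merit for "how busy is this CPU": a weighted moving
 * average of the instantaneous load, so a single transient spike
 * doesn't trigger migration by itself.
 */
void
cpu_update_load(struct cpu_sched *c, unsigned instantaneous)
{
	c->load_ewma = (7 * c->load_ewma + instantaneous) / 8;
}

/*
 * Exceptional path: another CPU decides this one is underloaded and
 * *pushes* a task at it.  This is the only place a foreign CPU takes
 * the lock.
 */
void
cpu_push_migration(struct cpu_sched *c, struct task *t)
{
	pthread_mutex_lock(&c->migq_lock);
	t->next = c->migq;
	c->migq = t;
	pthread_mutex_unlock(&c->migq_lock);
}

/*
 * Common path: the owning CPU picks its next task with no lock at
 * all, and only looks at the migration queue when its own queue is
 * empty.
 */
struct task *
cpu_next_task(struct cpu_sched *c)
{
	struct task *t = c->runq;

	if (t != NULL) {
		c->runq = t->next;
		return (t);
	}
	pthread_mutex_lock(&c->migq_lock);
	t = c->migq;
	c->migq = NULL;
	pthread_mutex_unlock(&c->migq_lock);
	if (t != NULL)
		c->runq = t->next;	/* queue the rest, run the first */
	return (t);
}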

Another part of the answer is going to a memory allocator that
uses per-CPU pools of memory, which are allocated only rarely
from the system common pool, and which only rarely span the
same locality as that of another CPU... so that locks are not
needed, TLB shootdown doesn't occur, and the L2 and L1 cache
contents are rarely invalidated by the coherency hardware, simply
because they have been accessed by a different CPU.  Again: making
changes in ownership extraordinary.

Yeah, there are some obvious optimizations we can get on this, too.
One is to allocate the memory and other resources, as necessary,
to each CPU (real or "hyperreal"), and don't give it back when
the free list hits a high watermark.  Instead, leave it there,
until the system pool hits a low watermark, and then signal the
CPUs that there is an insufficiency.  This basically eliminates
all the thrashing until there is a shortage -- if there ever is --
of a given contended resource.
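
Continuing the pool sketch from above (same invented declarations;
a real kernel would kick the CPUs with an IPI rather than have them
poll a flag, but the flow is the same): nothing gets handed back at
a per-CPU high watermark; instead the system pool raises a shortage
flag when it drops below its low watermark, and each CPU donates a
batch the next time it notices.

#define GLOBAL_LOW_WAT		128	/* system pool low watermark */
#define PCPU_RETURN_BATCH	 32	/* how much a CPU donates at once */

static volatile int	 global_shortage;	/* the "insufficiency" signal */
static int		 global_count;

/* Called with global_lock held whenever the system pool shrinks. */
void
global_check_watermark(void)
{
	if (global_count < GLOBAL_LOW_WAT)
		global_shortage = 1;
}

/* Each CPU calls this now and then, e.g. on its way through obj_free(). */
void
pcpu_maybe_donate(int cpu)
{
	int i;

	if (!global_shortage)		/* common case: no shortage, no lock */
		return;

	pthread_mutex_lock(&global_lock);
	for (i = 0; i < PCPU_RETURN_BATCH && pcpu_pool[cpu] != NULL; i++) {
		struct obj *o = pcpu_pool[cpu];

		pcpu_pool[cpu] = o->next;
		pcpu_count[cpu]--;
		o->next = global_pool;
		global_pool = o;
		global_count++;
	}
	if (global_count >= GLOBAL_LOW_WAT)
		global_shortage = 0;
	pthread_mutex_unlock(&global_lock);
}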

If you want to argue this, the thing to do is to write the code
to allow tunable contention levels, and to gather statistics
on the real-world performance of the algorithm *under contention*,
because the chance of memory contention is more than 5 times
that of CPU contention -- the ratio over the last 10 years has
been pretty steady at 10 times, actually, and has fluctuated as
high as 20 times.  And the memory bus isn't even the slowest point
in the data flow through a computer; that's I/O.

So force the contention issue, and measure how the algorithm you
are defending degrades.

If you want to look at L2 contention directly, then a post-processor
that counts uses of the LOCK prefix would be useful -- it's most
useful if all your mutex code gets inlined, so you can count real
occurrences, rather than referenced occurrences.

In most cases, if you are grabbing a lock, it's because you have
failed to nail the problem, or you are working around having to
rewrite all of the code.  Neither one of these is actually damning,
but it's good to know when you're doing it.

-- Terry
