From owner-freebsd-arch Mon Dec 13 16:33:51 1999
From: Terry Lambert
Message-Id: <199912140033.RAA27290@usr08.primenet.com>
Subject: Re: Thread scheduling
To: nate@mt.sri.com
Date: Tue, 14 Dec 1999 00:33:12 +0000 (GMT)
Cc: chuckr@picnic.mat.net, adsharma@sharmas.dhs.org, freebsd-arch@freebsd.org
In-Reply-To: <199912110455.VAA24095@mt.sri.com> from "Nate Williams" at Dec 10, 99 09:55:29 pm
Sender: owner-freebsd-arch@FreeBSD.ORG

> > OK, let me state it again.  I wasn't asking if it was a good thing to
> > share out threads among multiple processors, because the advantages of
> > using a multiple CPU system *as* a multiple CPU seem obvious enough
> > not to need asking.
> Except that it's possible that you may want to limit multiple CPUs to
> multiple processes for cache reasons.  One could argue that for certain
> classes of threaded applications, you'd get a better hit rate by
> sticking with a single CPU for *all* of the threads, although this
> wouldn't be true for all threaded applications.

The MESI cache coherency protocol solves most of this; the reload would
only be from L2 to L1 cache (slow enough on most systems, I grant you).

I don't really know what cache coherency protocol is used by the 4 CPU
Alpha box Mike Smith has.  I do know that the dual PPC-603 "BeBox" and
several similar boxes didn't have anything like APIC support, so they
removed the L2 cache entirely and used the cache signalling lines to
implement MEI coherency.

> It depends on the application.

Right.  I think, though, that for most threaded applications what you'd
really want is negative affinity, where the quantum reservations for
different threads are most likely to land on different CPUs.  This has
two effects:

o	It allows multiple CPUs to be executing in user space in the
	same program simultaneously.

o	With a per-CPU scheduler, you get rid of The Big Giant Lock(tm)
	in favor of a per-CPU scheduler lock that is only held by a
	different CPU when migrating processes for load-balancing
	reasons.  Even so, only the two CPUs involved in the migration
	get stalled, and, frankly, they most likely do not get stalled
	at all, so long as you stagger your quantum clock between the
	CPUs.

This assumes that each thread is not pounding heavily on state shared
globally with other threads (i.e. each thread has limited locality, and
most locality is not in a contention domain).  For the rare application
that doesn't meet this (most likely as a result of a bad design; very
few problems would require this as part of their solution), it might
make sense to lockstep all of the threads, or more likely just the
badly behaved ones, onto the same processor.
> > I was asking to see if it would be a good thing to add a strong
> > bias to the system, in such a way as to make the co-scheduling of
> > threads among the different processors so that all processors that
> > are made available to the program's threads are executing in that
> > address space at the same moment in time.

> I wouldn't think it would help for cache hit rate and/or CPU usage,
> but that's just a gut feeling and not based on anything else.  Each
> CPU in an SMP system has its own cache, so what happens on another
> CPU shouldn't affect how the one CPU performs.

> Adding this bias wouldn't help, and may in fact make things worse
> (see above).

I would go further: I believe that it would make the scheduler
significantly more complicated than it has to be, as well as setting
The Big Giant Lock(tm) deeper into cement.

It would also tend to result in starvation of other processes, unless
you delayed scheduling, since accelerating the process each time
another of its threads became ready to run would have the effect of
double-booking quantum for the process.

Finally, it would also tend to leave the processors partially idle
(stalled) as a result of voluntary context switches not occurring at
exactly the same place in all threads (i.e. a cache miss on a read
from disk putting the caller to sleep).

> > Not a guarantee, but would it be a good thing to have them
> > "co-scheduled" (or a bias towards that likelihood)?

> But I can't see any advantage to having them co-scheduled.

Gang scheduling is more appropriate to non-NUMA multiprocessors
working on data flow problems.  8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.