From owner-cvs-src@FreeBSD.ORG  Tue Oct 18 15:31:39 2005
Return-Path: <owner-cvs-src@FreeBSD.ORG>
X-Original-To: cvs-src@FreeBSD.org
Delivered-To: cvs-src@FreeBSD.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 236BC16A41F;
	Tue, 18 Oct 2005 15:31:39 +0000 (GMT)
	(envelope-from phk@critter.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 961F143D45;
	Tue, 18 Oct 2005 15:31:38 +0000 (GMT)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (unknown [192.168.48.2])
	by phk.freebsd.dk (Postfix) with ESMTP id D8B4EBC84;
	Tue, 18 Oct 2005 15:31:31 +0000 (UTC)
To: Scott Long <scottl@samsco.org>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-Reply-To: Your message of "Tue, 18 Oct 2005 08:34:52 MDT."
	<4355080C.302@samsco.org> 
Date: Tue, 18 Oct 2005 17:31:31 +0200
Message-ID: <69026.1129649491@critter.freebsd.dk>
Sender: phk@critter.freebsd.dk
Cc: cvs-src@FreeBSD.org, src-committers@FreeBSD.org,
	Andrew Gallatin <gallatin@cs.duke.edu>, cvs-all@FreeBSD.org,
	David Xu <davidxu@FreeBSD.org>
Subject: Re: cvs commit: src/sys/amd64/amd64 cpu_switch.S machdep.c 

In message <4355080C.302@samsco.org>, Scott Long writes:

[At the risk of repeating myself once more...]

>Steering multiple TSC's together isn't that hard and there are plenty of
>examples, as you point out.  Accounting for the changes due to thermal
>and power management (note that this isn't the same problem as suspend
>and resume) is what worries me.

It all depends on what you mean by "hard" and what benefit you expect
to arrive at.

One of the things you have to realize is that once you go down this
road you need a lot of code for all the conditionals.

For instance, you need to make sure that every new timestamp you
hand out is not prior to another one, no matter what is happening to
the clocks.

Imagine one CPU throttling because of heat: that CPU will be handing
out timestamps in the past until the TSC slowdown has been corrected,
while the other CPU in the system churns on at full speed.

To solve this, you need to pessimize every timestamp with an inter-CPU
lock to compare against the previous timestamp and, if it is less, you
have to do the Lamport-trick and return the "previous timestamp + epsilon".
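
As a rough illustration of what that costs, here is a minimal userland
sketch, not kernel code.  The names ts_lock, ts_last, TS_EPSILON and
ts_monotonic() are made up for this example, and the C11 atomic_flag
spinlock merely stands in for whatever inter-CPU lock a real
implementation would use.  The caller passes in the raw reading from
its own CPU's counter.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical names for illustration; this is not FreeBSD kernel code. */
static atomic_flag ts_lock = ATOMIC_FLAG_INIT;	/* stand-in inter-CPU lock */
static uint64_t ts_last;			/* last timestamp handed out */

#define	TS_EPSILON	1	/* assumed smallest step, in counter units */

/*
 * Turn a raw per-CPU counter reading into a timestamp that is never
 * prior to one already handed out on any CPU.
 */
uint64_t
ts_monotonic(uint64_t raw)
{
	uint64_t ts;

	/* Every caller on every CPU serializes here: this is the cost. */
	while (atomic_flag_test_and_set_explicit(&ts_lock,
	    memory_order_acquire))
		;	/* spin */

	if (raw < ts_last)
		ts = ts_last + TS_EPSILON;	/* the Lamport trick */
	else
		ts = raw;
	ts_last = ts;

	atomic_flag_clear_explicit(&ts_lock, memory_order_release);
	return (ts);
}
```

Note that a throttled CPU which keeps reporting stale readings will
keep pushing ts_last forward one epsilon at a time, so the result is
ordered but no longer tracks real time very well.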

Then there is the question of how you adapt: a stepwise adaptation
is hard to get right without overshoot, and stability is far from
a given.

Dave Mills implemented a scheme on Alpha to have per-CPU PLLs which
were clocked by a common interrupt from the RTC.  The results were
interesting, but hardly revolutionary, and performance-wise it sucked.

So, yes, it may not be "hard" in the "write an OS from scratch" sense
of "hard", but it is certainly far from trivial, comes with a heavy
penalty in complexity and a notable shortage of successful prior art.


One of the things we pride ourselves on in FreeBSD is stability,
and the current code (finally!) provides that:  It has been a long
time since we last had timecounter issues with broken hardware.

But if people are certain their TSC's are good and sound, they can
override the default safe selection of ACPI with a sysctl, and in
doing so, they can take a calculated risk.

That, IMO, is the correct "FreeBSD way" to handle this:

   "Safe out of the box. Informed tweaking may be profitable."

I would hate to have to go to the other side where some fraction
of users who happen to use hardware with problems in this space
will have to disable something to get stable operation or to
avoid unexplained undesirable transient phenomena.

>> It seems like reading ACPI-fast is "only" 3us or so, but when the ctx
>> switch is otherwise 4us, it adds up. i8254 is much worse on this
>> system (6.5us).

i8254 is always bad, and about as bad as it can be.  Mostly because
of the need to disable interrupts (Actually, that's a critical
section today, isn't it ?) and also hobbled by the three 8-bit
ISA-bus(-like) accesses needed.

>>  > I wonder if moving to HZ=1000 on amd64 and i386 was really all that good
>>  > of an idea.

The main benefit was getting more precise timeouts, something we have
at various times thought about implementing with deadline counters
on platforms that have them.  Nobody has done it though.


So, instead of looking for "quick fixes", let's look at this with a
designer's or architect's view:

On a busy system the scheduler runs a hundred thousand times per
second, but on most systems nobody ever looks at the times(2) data.

The smart solution is therefore to postpone the heavy stuff into
times(2) and make the scheduler work as fast as it can.

So the scheduler should read the TSC and schedule in TSC-ticks [1].

times(2) will then have to convert this to clock_t compatible
numbers.

According to The Open Group, clock_t is in microseconds by means
of historical standards mistakes.

However, I can see nowhere that would collide with an interpretation
that said "clock_t is microseconds PROVIDED the cpu had run at full
speed", so a simple one-second routine to latch the highest number
of TSC-ticks we've seen in a second would be sufficient to generate
the conversion factor.
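
Something along these lines would do, as a sketch only.  The names
tsc_max_per_sec, tsc_latch() and tsc_to_usec() are invented for the
example, and I assume some once-a-second routine already knows how
many TSC-ticks went by during the last second.  The scheduler only
accumulates raw ticks; the division happens when somebody actually
asks.

```c
#include <stdint.h>

/* Hypothetical names for illustration; this is not FreeBSD kernel code. */
static uint64_t tsc_max_per_sec;	/* highest tick count seen in one second */

/*
 * Called once per second with the number of TSC-ticks counted during
 * that second.  Only the maximum is kept, so a throttled second never
 * lowers the conversion factor.
 */
void
tsc_latch(uint64_t ticks_this_sec)
{
	if (ticks_this_sec > tsc_max_per_sec)
		tsc_max_per_sec = ticks_this_sec;
}

/*
 * Convert accumulated TSC-ticks to microseconds "as if the cpu had run
 * at full speed".  Returns 0 until a rate has been latched.
 */
uint64_t
tsc_to_usec(uint64_t ticks)
{
	uint64_t sec, rem;

	if (tsc_max_per_sec == 0)
		return (0);

	/* Split into whole seconds and remainder to avoid overflow. */
	sec = ticks / tsc_max_per_sec;
	rem = ticks % tsc_max_per_sec;
	return (sec * 1000000 + rem * 1000000 / tsc_max_per_sec);
}
```

With a latched maximum of 2,000,000,000 ticks per second, a process
which has accumulated 1,000,000,000 ticks is charged 500,000
microseconds, regardless of how long the throttled CPU took to
deliver them.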

And in many ways this would be a much more useful metric to offer
(in top(1)) than the current rubber-band-cpu-seconds.

Poul-Henning

[1] A problem with this plan of course is that some CPU's don't
have TSCs, but a fallback mechanism could use whatever timecounter
is active in place of the TSC.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.