From owner-cvs-src@FreeBSD.ORG  Thu Oct 20 12:55:54 2005
Return-Path: <owner-cvs-src@FreeBSD.ORG>
X-Original-To: cvs-src@freebsd.org
Delivered-To: cvs-src@freebsd.org
Date: Thu, 20 Oct 2005 22:55:23 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
In-Reply-To: <23346.1129796829@critter.freebsd.dk>
Message-ID: <20051020215101.Y874@delplex.bde.org>
References: <23346.1129796829@critter.freebsd.dk>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Scott Long <scottl@samsco.org>, src-committers@freebsd.org,
	Andrew Gallatin <gallatin@cs.duke.edu>, cvs-src@freebsd.org,
	cvs-all@freebsd.org, David Xu <davidxu@freebsd.org>
Subject: Re: cvs commit: src/sys/amd64/amd64 cpu_switch.S machdep.c 

On Thu, 20 Oct 2005, Poul-Henning Kamp wrote:

> In message <20051020155911.C99720@delplex.bde.org>, Bruce Evans writes:
>
>>> One of the things you have to realize is that once you go down this
>>> road you need a lot of code for all the conditionals.
>>>
>>> For instance you need to make sure that every new timestamp you
>>> hand out is not prior to another one, no matter what is happening to
>>> the clocks.
>>
>> Clocks are already incoherent in many ways:
>> - the times returned by the get*() functions are incoherent with the ones
>>   returned by the functions that read the hardware, because the latter
>>   are always in advance of the former and the difference is sometimes
>>   visible at the active resolution.
>
> Sorry Bruce, but this is just FUD:  The entire point of the get*
> family of functions is to provide "good enough" timestamps, very
> fast, for code that knows it doesn't need better than roughly 1/hz
> precision.

This bug shows that the get* functions don't actually provide "good
enough" timestamps, even for what is probably their primary use, ffs
file times.  These only need a resolution of 1 second; however, they
need to be accurate relative to other clocks, and a precision of ~1/hz
doesn't provide enough accuracy due to implementation details.
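
The effect can be illustrated with a small model.  This is only a sketch
under stated assumptions: the cached (get*-style) time is taken to be the
time of the most recent tick, ticks are assumed to be periodic with an
arbitrary phase, and the names cached_ns() and whole_seconds() are
hypothetical, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_SEC 1000000000LL

/*
 * Model of a cached clock: returns the time of the most recent tick at
 * or before now_ns.  Ticks are assumed to occur at
 * phase_ns + k * (NS_PER_SEC / hz) for k = 0, 1, 2, ...
 */
int64_t cached_ns(int64_t now_ns, int hz, int64_t phase_ns)
{
    int64_t period;

    assert(hz > 0);
    period = NS_PER_SEC / hz;
    assert(phase_ns >= 0 && phase_ns < period && now_ns >= phase_ns);
    return phase_ns + ((now_ns - phase_ns) / period) * period;
}

/* Truncate a nanosecond count to whole seconds, as a 1-second stamp does. */
int64_t whole_seconds(int64_t ns)
{
    return ns / NS_PER_SEC;
}
```

With hz = 100 and a tick phase of 3 ms, a precise reading taken at
1.0005 s truncates to 1.  One second later, at 2.0005 s, the last tick
was at 1.993 s, so the cached reading also truncates to 1.  A full
second has elapsed but the two 1-second timestamps are equal.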

>>   visible at the active resolution.  POSIX tests of file times have
>>   been reporting this incoherency since timecounters were implemented.
>>   The tests use time() to determine the current time and stat() to
>>   determine file times.  In the sequence:
>>
>>         t1 = time(...);
>>         sleep(1);
>>         touch(file);
>>         stat(file);
>>         t2 = mtime(file);
>>
>>   t2 should be > t1, but the bug lets t2 == t1 happen.
>
> t2 == t1 is not illegal.

It is just invalid and of low quality for file systems that provide a
resolution of 1 second in their timestamps.  The sleep of 1 second in
there is specific to such file systems; it is to ensure that at least
1 second has elapsed between the time() and the touch().

> The morons who defined a non-extensible timestamp format obviously
> didn't believe in Andy Moore, but given a sufficiently fast computer
> the resolution of the standardized timestamps prevents t2 > t1 in
> the above test code.

POSIX specifies the resolution for file times but doesn't specify their
accuracy AFAIK (not far).  Quality of implementation specifies their
accuracy.  The above is a simple test for strict monotonicity of file
times that happens to test for accuracy and coherency too.  This
monotonicity is very easy to get right.  sleep(3) is required to sleep
for at least 1 second.  nanosleep(2) is sloppy about this -- it uses
a get* function so it risks similar bugs, but I think none here since
the extra tick in the timeout provides a sufficient margin for error.
After sleeping for at least 1 second, the time has surely advanced by
1 second and timestamps taken by a coherent clock will see this.  With
a sleep(2) in it, the test would just not see incoherencies of 1 second.
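
For reference, the test sequence can be written out in C roughly as
follows.  This is a minimal sketch, not the code of any actual POSIX
test suite: the function name file_time_test() and the temporary file
location are assumptions, and the "touch" is done by writing one byte.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Take t1 from time(), sleep for at least 1 second, modify a file and
 * take t2 from its mtime.  Returns 0 on success, -1 on error.  With
 * coherent clocks the caller should see *t2 > *t1.
 */
int file_time_test(time_t *t1, time_t *t2)
{
    char path[] = "/tmp/ftimeXXXXXX";   /* hypothetical location */
    struct stat sb;
    unsigned int left;
    int fd, rc = -1;

    assert(t1 != NULL && t2 != NULL);
    fd = mkstemp(path);
    if (fd == -1)
        return -1;

    *t1 = time(NULL);

    /* sleep() may return early on a signal; finish the interval. */
    for (left = 1; left > 0; )
        left = sleep(left);

    /* Writing updates st_mtime from the clock used for file times. */
    if (write(fd, "x", 1) == 1 && fstat(fd, &sb) == 0) {
        *t2 = sb.st_mtime;
        rc = 0;
    }
    close(fd);
    unlink(path);
    return rc;
}
```

A caller would report a failure when the function succeeds but
*t2 <= *t1.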

>> - times are incoherent between threads unless the threads use their
>>   own expensive locking to prevent this.  This is not very different
>>   from timestamps being incoherent between CPUs unless the system uses
>>   expensive locking to prevent it.
>
> Only if the get* family of functions is used in places where they
> shouldn't be.  I believe there is a sysctl which determines if it
> is used for vfs timestamp.  The default can be changed if necessary.

This point is for all the functions.  A timestamp taken by 1 thread
might not be used until after many timestamps are taken and used by
other threads.  Naive comparison of these timestamps would then give
apparent incoherencies.  It is up to the threads to provide synchronization
points if they want to compare times.  More interestingly, there is
no need to keep the timecounters seen by different threads perfectly
in sync except at synchronization points, since any differences would
be indistinguishable from ones caused by unsynchronized preemption.
(Strict real time to ~nanoseconds accuracy wouldn't work for either.)

I use the sysctl in POSIX tests so as not to keep seeing the file
times bugs, but I sometimes forget to use it so I get reminded of the
bugs anyway.  IIRC, I got jdp to change the sysctl a bit to handle
more cases.  He wanted an option for more resolution and I wanted one
to unbreak seconds resolution.  The implementation actually uses
the get* functions for seconds and 1/hz resolution and the non-get*
functions for microseconds and nanoseconds resolution.  So I use an
unnecessarily high resolution to avoid the bug.
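
The mapping described above can be sketched as a userland model.  The
enum names and the function file_timestamp() are illustrative
assumptions, not the kernel's code; "cached" stands for what a get*
function would return and "precise" for what a function that reads the
hardware would return.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical precision settings, coarsest first. */
enum ts_precision { TSP_SEC, TSP_HZ, TSP_USEC, TSP_NSEC };

/*
 * Pick a file timestamp: the two coarse settings use the cached time,
 * the two fine settings use the precise time.
 */
struct timespec file_timestamp(enum ts_precision prec,
    struct timespec cached, struct timespec precise)
{
    struct timespec ts;

    assert(cached.tv_nsec >= 0 && cached.tv_nsec < 1000000000L);
    assert(precise.tv_nsec >= 0 && precise.tv_nsec < 1000000000L);

    switch (prec) {
    case TSP_SEC:               /* seconds, from the cached time */
        ts.tv_sec = cached.tv_sec;
        ts.tv_nsec = 0;
        break;
    case TSP_HZ:                /* 1/hz, the cached time as is */
        ts = cached;
        break;
    case TSP_USEC:              /* precise time, cut to microseconds */
        ts = precise;
        ts.tv_nsec -= ts.tv_nsec % 1000;
        break;
    default:                    /* TSP_NSEC: precise time as is */
        ts = precise;
        break;
    }
    return ts;
}
```

In this model, only the two fine settings can give a seconds field that
agrees with a precise clock just after a second boundary, which is why
selecting a higher resolution hides the bug.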

>>> On a busy system the scheduler works hundred thousand times per
>>> second, but on most systems nobody ever looks at the times(2) data.
>>
>> More like 1000 times a second.  Even stathz = 128 gives too many decisions
>> per second for the 4BSD scheduler, so it is divided down to 16 per second.
>> Processes blocking on i/o may cause many more than 128/sec calls to the
>> scheduler, but there should be nothing much to decide then.
>
> I'm regularly running into 5 digits in the Csw field in systat -vm.
> I don't know what events you talk about, but they are clearly not
> the same as the ones I'm talking about.

I just looked at csw values on machines in the freebsd cluster.  They
may be underpowered and not heavily used, but they are more active
than any machine that I run and may be representative of general server
machines.  On hub a few hours ago, csw was a transient 100-500 and the
average since boot time was 1010.  The count since boot time may have
overflowed but the average is reasonable.  hub has been up for 236
days and an average of 1010/sec gives a count of just below INT_MAX.

The 128/16 events is for timekeeping for scheduling.  4BSD does little
more than increment a tick count here.  ULE does a bit more.  Then
there is the rescheduling every second for 4BSD, and more distributed
rescheduling for ULE.  On context switches, the scheduler has (or
should have) little to do.  It is context switching itself that makes
the timestamps that become too expensive when csw is high.

> The problem here is context-switch time, and while we can argue if
> this is really scheduler related or not, the fact that the scheduler
> decides which thread to context-switch to should be enough to
> avoid a silly discussion of semantics.

The problem is still unrelated to (non-broken) schedulers.  Most context
switches happen because something blocks on i/o or is preempted by
an interrupt handler (it's a very low level of scheduling -- just
interrupt priority -- that allows the preemption, so I don't count
it as part of scheduling).  So unavoidable context switches can happen
a lot on busy machines and the scheduler can't/shouldn't affect their
count except possibly to reduce it a bit.  Given that they happen a lot
on some systems, they should be as efficient as possible.  I think the
timecounter part of their inefficiency is not very important except in
the usual case of a slow timecounter.  Losses from busted caches may
dominate.
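
A rough way to size the timecounter part is to multiply the context
switch rate by the cost of reading the timecounter.  The helper below is
only a sketch: the function name is hypothetical, and the per-read cost
and the number of reads per switch are parameters to be measured, not
known values.

```c
#include <assert.h>

/*
 * Fraction of one CPU spent reading the timecounter on context
 * switches.  All three inputs are assumptions supplied by the caller.
 */
double timecounter_cpu_fraction(double csw_per_sec,
    double reads_per_switch, double ns_per_read)
{
    assert(csw_per_sec >= 0.0);
    assert(reads_per_switch >= 0.0);
    assert(ns_per_read >= 0.0);
    /* 1e9 nanoseconds of CPU time are available per second. */
    return csw_per_sec * reads_per_switch * ns_per_read / 1e9;
}
```

For example, 10000 switches per second with one read per switch and an
assumed 1000 ns per read comes to 10 ms of every second, or 1% of a
CPU.  The same rate with an assumed 50 ns per read comes to 0.05%.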

>> So the current pessimizations from timecounter calls in mi_switch()
>> are an end result of general pessimizations of swtch() starting in
>> 4.4BSD.  I rather like this part of the pessimizations...
>
> It's so nice to have you back in action Bruce :-)

I don't plan to stay very active.

Bruce