From: Bruce Evans <bde@zeta.org.au>
To: Robert Watson
Cc: cvs-src@FreeBSD.org, Poul-Henning Kamp, src-committers@FreeBSD.org, cvs-all@FreeBSD.org
Date: Sun, 27 Nov 2005 23:02:26 +1100 (EST)
Subject: Re: cvs commit: src/sys/sys time.h src/sys/kern kern_time.c

On Sun, 27 Nov 2005, Robert Watson wrote:

> On Sun, 27 Nov 2005, Poul-Henning Kamp wrote:
>
>> In message <200511270055.jAR0tIkF032480@repoman.freebsd.org>, Robert Watson
>> writes:
>>
>>> This offers a minimum update rate of 1/HZ,
>>> but in practice will often be more frequent due to the frequency of
>>> time stamping in the kernel:
>>
>> Not quite...
>>
>> The precision is guaranteed to be no worse than 1msec, and is unlikely to
>> be significantly better.

Actually, it is guaranteed to be no better than max(2, tc_tick + 1) / HZ,
and is unlikely to be significantly different from tc_tick / HZ.  The
defaults are HZ = 1000 and tc_tick = max(1, <approx. number of hardclock
ticks in a millisecond>), but HZ = 1000 is bogusly large, and configuring
HZ to a smaller value gives a worse precision; OTOH, if HZ is even larger
than 1500 or so, then the precision can be made better by configuring
tc_tick to 1 using sysctl.

> Sadly, for some workloads it will be -- we update the cached time stamp
> for every kernel context switch, so workloads that trigger lots of
> context switches will also trigger time stamp updates.  I.e., loopback
> network traffic due to the netisr and user space context switches, high
> in-bound network traffic due to ithread and netisr context switches,
> etc.

No, that's not how the update of the cached timestamp works.  The update
is very (*) expensive, since it requires synchronization (**) with other
CPUs, so it is only done once every tc_tick hardclock interrupts.

(*) Perhaps no more expensive than mutex locking generally.

(**) The synchronization is essentially provided by the hardclock
interrupt handler being non-preemptible by itself, plus the generation
count to indicate that resynchronization is needed.

Both of these seem to be buggy.
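(For reference, the read side of the generation-count protocol looks
roughly like the following sketch, modeled on binuptime() in
sys/kern/kern_tc.c but simplified, with the timecounter state and the
counter-delta arithmetic omitted; the bugs below are in the write side,
tc_windup():)

#include <sys/types.h>
#include <sys/time.h>		/* struct bintime (kernel-style sketch) */

struct timehands {
	/* The real struct also carries the timecounter state. */
	volatile u_int	th_generation;	/* 0 while tc_windup() updates */
	struct bintime	th_offset;	/* time precalculated at windup */
};

static struct timehands *volatile timehands;

void
binuptime_sketch(struct bintime *bt)
{
	struct timehands *th;
	u_int gen;

	do {
		th = timehands;
		gen = th->th_generation;
		*bt = th->th_offset;
		/*
		 * If tc_windup() ran meanwhile, th_generation has
		 * changed (or is 0, mid-update) and the copy must be
		 * retried.  This is only safe if the writer's stores
		 * become visible in program order -- exactly what
		 * bug (2) below says is not guaranteed.
		 */
	} while (gen == 0 || gen != th->th_generation);
}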
(1) tc_windup() has no explicit locking, so it can run concurrently on
any number of CPUs, with N-1 of the CPUs calling it via clock_settime()
and 1 calling it via hardclock() (this one may also be the same as one
already in it).  I doubt that the generation stuff is enough to prevent
problems here, especially with bug (2).

(2) The generation count stuff depends on writes being ordered as seen
by other CPUs.  This is not true on many current CPUs.  E.g., on amd64,
writes are ordered as seen by readers on the current CPU, but they may
be held in a buffer, and I think the buffer can be written in any order
to main memory.  I think this only gives a tiny race window.  There is
a mutex lock in all (?) execution paths soon after tc_windup() returns,
and this serves to synchronize the writes.

BTW, I have been working on optimizing libm and noticed that
out-of-order execution makes timestamps very slippery even on a single
CPU.  Causality is broken unless there is a synchronization point.  On
i386's, rdtsc is not a serializing instruction, so timestamps made
using it, while correct relative to each other (assuming that the TSC
clock doesn't jump), may be made many cycles before or after the
previous instruction in the instruction stream completes.  Even
accesses to an off-CPU hardware clock might not be serializing, though
they take so long that the CPU probably has time to complete all
previously issued instructions, and thus they may give a timestamp that
is sure to be after the completion of previous instructions.

This shows that on many machines, for many purposes, it is
fundamentally impossible to take timestamps that are more than 2 of:
efficient, synchronized and useful (they aren't useful if they have too
large an impact on what is being measured).

Perhaps this incoherency within a single CPU can be turned into a
feature: timestamps are inherently incomparable unless there is a
synchronization point between them.  Code that needs perfect coherency
needs to issue a heavyweight synchronization call before making
timestamps; code that just wants monotonic timestamps like rdtsc gives
needs to do less: if a timecounter call is on the same CPU, then
nothing needs to be done if the timecounter is the TSC, but if it is on
a different CPU, then much more needs to be done (transparently to the
caller).

The timestamps made in mi_switch() are always on the same CPU
(`switchtime' is per-CPU), so they don't need any synchronization to
use a per-CPU timecounter like a TSC; they just need a timecounter with
a stable clock.  It also shouldn't matter that they may be wrong by
several cycles due to out-of-order execution and no serialization.
Hopefully the wrongness averages out, and it would be wronger to use a
serializing instruction to "fix" the counts -- this would just waste
time by preventing overlap of instruction execution between the old and
new threads.

Bruce
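PS: To make the rdtsc point concrete, here is a rough user-level sketch
(gcc-style inline asm for i386/amd64; the helper names are invented for
this illustration).  Plain rdtsc may execute before earlier
instructions have completed; prepending a serializing instruction such
as cpuid is the "heavyweight synchronization call" above, and it
restores causality at the cost of killing instruction overlap.

#include <stdint.h>

/* Cheap, but may be reordered around neighbouring instructions. */
static inline uint64_t
rdtsc_relaxed(void)
{
	uint32_t lo, hi;

	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}

/*
 * cpuid drains the pipeline, so the timestamp cannot be taken before
 * previously issued instructions complete.  It also costs hundreds of
 * cycles, which is why it would be a pessimization in mi_switch().
 */
static inline uint64_t
rdtsc_serialized(void)
{
	uint32_t lo, hi;

	__asm __volatile("cpuid; rdtsc"
	    : "=a" (lo), "=d" (hi) : "a" (0) : "ebx", "ecx");
	return ((uint64_t)hi << 32 | lo);
}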