From owner-svn-src-all@FreeBSD.ORG  Thu Mar 24 16:56:07 2011
Return-Path: <owner-svn-src-all@FreeBSD.ORG>
Delivered-To: svn-src-all@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C14BE1065670;
	Thu, 24 Mar 2011 16:56:07 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au
	[211.29.132.184])
	by mx1.freebsd.org (Postfix) with ESMTP id 484C48FC08;
	Thu, 24 Mar 2011 16:56:06 +0000 (UTC)
Received: from c122-107-125-80.carlnfd1.nsw.optusnet.com.au
	(c122-107-125-80.carlnfd1.nsw.optusnet.com.au [122.107.125.80])
	by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	p2OGu29Y021511
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Fri, 25 Mar 2011 03:56:04 +1100
Date: Fri, 25 Mar 2011 03:56:02 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: David Malone <dwmalone@maths.tcd.ie>
In-Reply-To: <201103241503.aa30875@walton.maths.tcd.ie>
Message-ID: <20110325020734.P1958@besplex.bde.org>
References: <201103241503.aa30875@walton.maths.tcd.ie>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: src-committers@FreeBSD.org, John Baldwin <jhb@FreeBSD.org>,
	svn-src-all@FreeBSD.org, Bruce Evans <brde@optusnet.com.au>,
	svn-src-head@FreeBSD.org, Jung-uk Kim <jkim@FreeBSD.org>
Subject: Re: svn commit: r219676 - head/sys/x86/x86 
X-BeenThere: svn-src-all@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "SVN commit messages for the entire src tree \(except for &quot;
	user&quot; and &quot; projects&quot; \)" <svn-src-all.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
	<mailto:svn-src-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-all>
List-Post: <mailto:svn-src-all@freebsd.org>
List-Help: <mailto:svn-src-all-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
	<mailto:svn-src-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 24 Mar 2011 16:56:07 -0000

On Thu, 24 Mar 2011, David Malone wrote:

>> If atomic 64-bit read/write is possible and it is absolutely
>> necessary, it has to be set via machdep.foo_freq, not
>> kern.timecounter.foo.frequency.

Not really.  The i386 TSC frequency variable can be 32 bits.  That
works in almost all cases now, and you can easily scale the variable
so that a 32-bit frequency works for CPUs up to 8GHz, which should be
enough until i386 mode goes away.  amd64 already has sufficiently
atomic 64-bit read/write for this, perhaps even without an explicit
atomic op.  OTOH, kern.timecounter.foo.frequency is also 64 bits,
and needs this and atomic accesses more that the i386 TSC variable,
since it is MI and there are more divisions by it.  Explicit atomic
ops are never used for timecounter variables, but I think some is
needed in the following places:
- switching of timehands.  Atomic ops are supposed to be avoided by
   writing to an inactive timehands and than switching to it using
   a single atomic write, but the write for this is not explicitly
   atomic.  Without an atomic write with certain acquire/release
   semantics whose details I can never remember, the writes before
   the switch may go out after the one for the switch.
- switching of more global things.  These include the switch from
   one timecounter to another, and changing the frequency within a
   timecounter, and stepping the time.  Something like timecounter
   generations could be used to reduced to reduce the locking
   requirements for this, but currently the locking for this is null.
   For stepping the time, the missing locking is even documented by
   an XXX comment:

% /*
%  * Step our concept of UTC.  This is done by modifying our estimate of
%  * when we booted.
%  * XXX: not locked.
%  */
% void
% tc_setclock(struct timespec *ts)
% {
% 	struct timespec tbef, taft;
% 	struct bintime bt, bt2;
% 
% 	cpu_tick_calibrate(1);

This only has a tiny race.  cpu_tick_calibrate() is normally called only
every 16 seconds on 1 designated CPU.  This spacing out of calls to it
gives it significant "time-domain locking".  And passing 1 to it mostly
preserves this locking, by telling it to not do any calibration on this
call, but to reset on this call and wait for another call or 2 to let
things stabilize.  But it must write a global variable for the reset,
and the access for this is not atomic.  I think it should do slightly
more, and arrange for recalibration as soon as possible, after >= 1
but <= 2 seconds.  Now the recalibration might not be delayed at all
(due to the race), or might be delayed by only 1 cycle, or may be 16
seconds away.

% 	nanotime(&tbef);

Safe to call at any time, modulo other timecounter bugs.  This depends
on the update of the generation count being sufficiently atomic, and
on read accesses to boottime being atomic, which they clearly aren't.

% 	timespec2bintime(ts, &bt);

Utility function operating on local variables, so safe.

% 	binuptime(&bt2);
% 	bintime_sub(&bt, &bt2);

Like the previous 2 calls, but safer since boottime is not accessed.

% 	bintime_add(&bt2, &boottimebin);

Unsafe read access to boottimebin.

% 	boottimebin = bt;

Unsafer write access to boottimebin.  This is the main write access to it.
Changing it at all is what makes the read accesses to it unsafe.  Changing
it at all after setting it initially is also wrong.  The boot time is a
time so it cannot actually be changed by any later event, but it is
changed here as a hack to make the current time come out right, even when
this makes the boot time come out wrong.  The boot time needs to be fixed
if it is initially wrong, but should not change after that.  This wouldn't
simplify the locking here, since an adjustment would still be needed.

When this is called from settime(), Giant is held.  This prevents concurrent
calls to here from userland.  Concurrent calls may still be possible, since
this is also called from resume methods.  I don't know if resume methods
are locked by Giant or anything.

% 	bintime2timeval(&bt, &boottime);

Unsafe write access to boottime.  Like the one to boottimebin.

% 
% 	/* XXX fiddle all the little crinkly bits around the fiords... */
% 	tc_windup();

tc_windup() is normally called only every tc_tick ticks on a not-so-
designated CPU, and we break the time-domain locking from that as
above.  tc_windup() begins by operating on the "next" generation of
timehands, so it has obvious races if it is called concurrently.  It
depends on the time-domain locking to prevent this, but here we don't
even try to wait for things to stablize like we do for the much less
important cpu_tock_calibrate() call.

Calling tc_windup() a lot from here also risks cycling through the
timehands too rapidly even if the calls don't overlap.  Userland can
exercise this bug easily at securelevel <= 1 by calling settimeofday()
in a loop.  securelevel > 1 gives (IMO bogus) restrictions and a kernel
printf service rate-limited to only once per second.

OTOH, if tc_windup() were safe to call at any time, calling it here
would be the right thing to do.  Note that updates of the TSC timecounter
frequency variable are propagated from the TSC machdep sysctl to the
timecounter layer in a less racy manner.  The machdep sysctl just
copies the machdep variable to the timecounter variable and depends
on the tc_windup() call propagating it later.  This avoids concurrent
calls and only has a tiny race window which is neverthless easy to hit
by calling the machdep sysctl in a loop.

tc_windup() used to be called (via tc_ticktock()) only from hardclock()
on the BSP.  Now tc_ticktock() is also called from hardclock_anycpu().
tc_ticktock() is now passed an arg which should prevent too-rapid
cycling of the timehands, but I don't see anything to limit the calls
to the BSP or to prevent calls very close together.  This is unlike the
new call to cpu_tick_calibrate() from hardclock_anycpu().  That is
limited to the designated CPU.

% 	nanotime(&taft);

Back to only races for only read accesses.

% 	if (timestepwarnings) {
% 		log(LOG_INFO,
% 		    "Time stepped from %jd.%09ld to %jd.%09ld (%jd.%09ld)\n",
% 		    (intmax_t)tbef.tv_sec, tbef.tv_nsec,
% 		    (intmax_t)taft.tv_sec, taft.tv_nsec,
% 		    (intmax_t)ts->tv_sec, ts->tv_nsec);
% 	}
% 	cpu_tick_calibrate(1);

Like the call at the start.

% }

> A number of people have asked me about adjusting the frequency of
> other time counters - it would seem to make sense to allow this at
> the timecounter layer. It also seems possible that people might
> want the timecounters view of the frequency to be different from
> other uses of a counter.

I can't see any reason for the frequencies to be different, except
transiently to keep changes to them smooth.  Anyway, it should be
possible to change them all to the same frequency, and having a single
operation for this is good for minimizing transients, races and
complexity.

Bruce