From owner-svn-src-all@FreeBSD.ORG  Fri Apr  8 14:54:34 2011
Return-Path: <owner-svn-src-all@FreeBSD.ORG>
Delivered-To: svn-src-all@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3CBBF106566C;
	Fri,  8 Apr 2011 14:54:34 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail08.syd.optusnet.com.au (mail08.syd.optusnet.com.au
	[211.29.132.189])
	by mx1.freebsd.org (Postfix) with ESMTP id CC3488FC12;
	Fri,  8 Apr 2011 14:54:33 +0000 (UTC)
Received: from c122-106-155-58.carlnfd1.nsw.optusnet.com.au
	(c122-106-155-58.carlnfd1.nsw.optusnet.com.au [122.106.155.58])
	by mail08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	p38EsVSi016581
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sat, 9 Apr 2011 00:54:31 +1000
Date: Sat, 9 Apr 2011 00:54:31 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Jung-uk Kim <jkim@FreeBSD.org>
In-Reply-To: <201104072328.p37NSSsO055101@svn.freebsd.org>
Message-ID: <20110408224855.Y1390@besplex.bde.org>
References: <201104072328.p37NSSsO055101@svn.freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org,
	src-committers@FreeBSD.org
Subject: Re: svn commit: r220433 - in head/sys: amd64/amd64
 cddl/dev/dtrace/amd64
 cddl/dev/dtrace/i386 compat/linprocfs contrib/altq/altq dev/acpica i386/i386
 i386/isa pc98/pc98 x86/cpufreq x86/isa x86/x86
X-BeenThere: svn-src-all@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "SVN commit messages for the entire src tree \(except for &quot;
	user&quot; and &quot; projects&quot; \)" <svn-src-all.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
	<mailto:svn-src-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-all>
List-Post: <mailto:svn-src-all@freebsd.org>
List-Help: <mailto:svn-src-all-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
	<mailto:svn-src-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 08 Apr 2011 14:54:34 -0000

On Thu, 7 Apr 2011, Jung-uk Kim wrote:

> Log:
>  Use atomic load & store for TSC frequency.  It may be overkill for amd64 but
>  safer for i386 because it can be easily over 4 GHz now.  More worse, it can
>  be easily changed by user with 'machdep.tsc_freq' tunable (directly) or
>  cpufreq(4) (indirectly).  Note it is intentionally not used in performance
>  critical paths to avoid performance regression (but we should, in theory).
>  Alternatively, we may add "virtual TSC" with lower frequency if maximum
>  frequency overflows 32 bits (and ignore possible incoherency as we do now).

I disagree with this complexity and churn, and tried to avoid it -- see
previous mails for more details.

There are probably many 64-bit variables that need to be accessed more
atomically, and tsc_freq is one of the least in need of atomic accesses,
since the race windows are especially tiny for it.  Even when the race
is lost, this is harmless except for frequencies that are almost
exactly 2^32 Hz, since only sysctls that change the frequency from
above this to below, or vice versa, can give an inconsistent value.
Thus there is a race window of a few nsec which only matters with a
probability of about 1 in 2^32, only for user[s] of the sysctl to
change the TSC frequency.

The complexity of this shows that you should try hard not to use 64-bit
variables on 32-bit systems if accesses to them are not locked by a
mutex.  64-bit systems like amd64 don't need to do anything special,
any more than 32-bit systems accessing 32-bit variables.  We assume that
accesses of <= the bus width are atomic in a certain weak sense which
is enough for variables like tsc_freq.

I thought of the following KPI-transparent method that can be used for
any struct if updates need not be seen immediately.  It is a refined version
of timehands method used for timecounters:

 	#define	tsc_freq	_tsc_freq[_tsc_gen]

where the everything is updated sufficiently atomically and the object
data is guaranteed not to change for a suitably long time after the
generation that indexes it is read:

 	if ((new_tsc_freq >> 32) == (tsc_freq >> 32)) {
 		/*
 		 * Usual case.  Below 2**32, or upper bits just not
 		 * changing.  No need for complications.
 		 */
 		tsc_freq = new_tsc_freq;
 	else if ((new_tsc_freq & 0xFFFFFFFF) == (tsc_freq & 0xFFFFFFFF)) {
 		/*
 		 * Upper bits changing but lower bits not.  Very unusual,
 		 * and optimized here to give a place for this comment.
 		 * Again, no need for complications.
 		 */
 		tsc_freq = new_tsc_freq;
 	} else {
 		/*
 		 * Need a new generation.  Need locking here (but not for
 		 * readers).  Locking not shown.
 		 */
 		nextgen = (_tsc_gen + 1) % NGEN;
 		_tsc_freq[nextgen] = new_tsc_freq;
 		sfence();
 		_tsc_gen = nextgen;
 		sfence();
 		/*
 		 * Any new reader starting from a bit before now will use
 		 * the new generation.  (I think.  Can the memory ordering
 		 * be so weak that other CPUs don't see the fenced writes
 		 * here until they do some fenced operation?  Even if they
 		 * don't, then waiting a bit here will give them time to
 		 * catch up.)  We must keep the data for the old generation
 		 * stable until any readers of it (via the old generation
 		 * number) have completed.  We should wait a very long time,
 		 * in theory possibly forever, since the reader may be
 		 * preempted after it read the old generation number and
 		 * never run again.  In practice, we should be able to
 		 * bound this time to a few ticks, like timecounter code
 		 * does, by limiting uses to kernel threads in critical
 		 * regions or similar.  Even timecounter code cannot
 		 * guarantee any particular timing, since its running
 		 * may be delayed arbirarily long by sitting at the ddb
 		 * prompt.  The delay can be implemented as in timecounter
 		 * code, by using many generations spaced a tick or 2 apart,
 		 * or more simply by sleeping for a tick or 2 between each
 		 * generation change here.  It would even be OK to sleep a
 		 * minute or 2 here, since we only get here for changing
 		 * the 4/8/16... GHz bits.
 		 */
 		DELAY(1);
 	}

The necessary sfence()s are missing in timecounter code.  Timecounter
code also has explicit generation checks, which I think are not necessary
if enough care is taken keeping the old generation's data stable, but
which help if the generations cycle too fast or the timing is messed
up by ddb.

Another method that would work here (up to 8GHz; expand as necessary)
without changing the KPI is:

 	#define	tsc_freq	((uint64_t)_tsc_freq1 + _tsc_freq2)

where _tsc_freq[12] are 32 bits.  The components are kept below 2**32 so
that they never wrap.  Races may give components from different generations,
but this doesn't matter since adjustments only make small changes to
each component, so the result is just an average of the values of the old
and new generations.

Bruce