Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 8 Apr 2011 13:17:39 -0400
From:      Jung-uk Kim <jkim@FreeBSD.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r220433 - in head/sys: amd64/amd64 cddl/dev/dtrace/amd64 cddl/dev/dtrace/i386 compat/linprocfs contrib/altq/altq dev/acpica i386/i386 i386/isa pc98/pc98 x86/cpufreq x86/isa x86/x86
Message-ID:  <201104081317.41254.jkim@FreeBSD.org>
In-Reply-To: <20110408224855.Y1390@besplex.bde.org>
References:  <201104072328.p37NSSsO055101@svn.freebsd.org> <20110408224855.Y1390@besplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Friday 08 April 2011 10:54 am, Bruce Evans wrote:
> On Thu, 7 Apr 2011, Jung-uk Kim wrote:
> > Log:
> >  Use atomic load & store for TSC frequency.  It may be overkill
> > for amd64 but safer for i386 because it can be easily over 4 GHz
> > now.  More worse, it can be easily changed by user with
> > 'machdep.tsc_freq' tunable (directly) or cpufreq(4) (indirectly).
> >  Note it is intentionally not used in performance critical paths
> > to avoid performance regression (but we should, in theory).
> > Alternatively, we may add "virtual TSC" with lower frequency if
> > maximum frequency overflows 32 bits (and ignore possible
> > incoherency as we do now).
>
> I disagree with this complexity and churn, and tried to avoid it --
> see previous mails for more details.

I don't like it myself.  It's sort of my way of expressing "do not use 
64-bit global variables on 32-bit platforms unless the architecture 
guarantees coherency". ;-)

Seriously, using variant TSC is strongly discouraged for timekeeping.  
Messing up with it from user land is worse.  If it is really 
necessary, it means kernel is broken and we need to fix it.

> There are probably many 64-bit variables that need to be accessed
> more atomically, and tsc_freq is one of the least in need of atomic
> accesses, since the race windows are especially tiny for it.  Even
> when the race is lost, this is harmless except for frequencies that
> are almost exactly 2^32 Hz, since only sysctls that change the
> frequency from above this to below, or vice versa, can give an
> inconsistent value. Thus there is a race window of a few nsec which
> only matters with a probability of about 1 in 2^32, only for
> user[s] of the sysctl to change the TSC frequency.

powerd(8) via cpufreq(4) can change tsc_freq frequently if the CPU is 
not P-state invariant.  Well, the chances of getting ~4.3GHz CPU 
without P-state invariant TSC is little low, I guess.

> The complexity of this shows that you should try hard not to use
> 64-bit variables on 32-bit systems if accesses to them are not
> locked by a mutex.  64-bit systems like amd64 don't need to do
> anything special, any more than 32-bit systems accessing 32-bit
> variables.  We assume that accesses of <= the bus width are atomic
> in a certain weak sense which is enough for variables like
> tsc_freq.

Actually, Pentium and above guarantees the "weak" atomic memory 
read/write when it is aligned to quad-word boundary.  CMPXCHG8B is 
the only one instruction (except for FPU) useful for that feature, 
though.  I considered implementing weaker version initially but I am 
not sure if it has any benefit.

> I thought of the following KPI-transparent method that can be used
> for any struct if updates need not be seen immediately.  It is a
> refined version of timehands method used for timecounters:
>
>  	#define	tsc_freq	_tsc_freq[_tsc_gen]

It just hides the implementation.  How about this, then?

#define	tsc_freq	atomic_load_acq_64(&_tsc_freq)

:-P

> where the everything is updated sufficiently atomically and the
> object data is guaranteed not to change for a suitably long time
> after the generation that indexes it is read:
>
>  	if ((new_tsc_freq >> 32) == (tsc_freq >> 32)) {
>  		/*
>  		 * Usual case.  Below 2**32, or upper bits just not
>  		 * changing.  No need for complications.
>  		 */
>  		tsc_freq = new_tsc_freq;
>  	else if ((new_tsc_freq & 0xFFFFFFFF) == (tsc_freq & 0xFFFFFFFF))
> { /*
>  		 * Upper bits changing but lower bits not.  Very unusual,
>  		 * and optimized here to give a place for this comment.
>  		 * Again, no need for complications.
>  		 */
>  		tsc_freq = new_tsc_freq;
>  	} else {
>  		/*
>  		 * Need a new generation.  Need locking here (but not for
>  		 * readers).  Locking not shown.
>  		 */
>  		nextgen = (_tsc_gen + 1) % NGEN;
>  		_tsc_freq[nextgen] = new_tsc_freq;
>  		sfence();
>  		_tsc_gen = nextgen;
>  		sfence();
>  		/*
>  		 * Any new reader starting from a bit before now will use
>  		 * the new generation.  (I think.  Can the memory ordering
>  		 * be so weak that other CPUs don't see the fenced writes
>  		 * here until they do some fenced operation?  Even if they
>  		 * don't, then waiting a bit here will give them time to
>  		 * catch up.)  We must keep the data for the old generation
>  		 * stable until any readers of it (via the old generation
>  		 * number) have completed.  We should wait a very long time,
>  		 * in theory possibly forever, since the reader may be
>  		 * preempted after it read the old generation number and
>  		 * never run again.  In practice, we should be able to
>  		 * bound this time to a few ticks, like timecounter code
>  		 * does, by limiting uses to kernel threads in critical
>  		 * regions or similar.  Even timecounter code cannot
>  		 * guarantee any particular timing, since its running
>  		 * may be delayed arbirarily long by sitting at the ddb
>  		 * prompt.  The delay can be implemented as in timecounter
>  		 * code, by using many generations spaced a tick or 2 apart,
>  		 * or more simply by sleeping for a tick or 2 between each
>  		 * generation change here.  It would even be OK to sleep a
>  		 * minute or 2 here, since we only get here for changing
>  		 * the 4/8/16... GHz bits.
>  		 */
>  		DELAY(1);
>  	}
>
> The necessary sfence()s are missing in timecounter code. 
> Timecounter code also has explicit generation checks, which I think
> are not necessary if enough care is taken keeping the old
> generation's data stable, but which help if the generations cycle
> too fast or the timing is messed up by ddb.
>
> Another method that would work here (up to 8GHz; expand as
> necessary) without changing the KPI is:
>
>  	#define	tsc_freq	((uint64_t)_tsc_freq1 + _tsc_freq2)
>
> where _tsc_freq[12] are 32 bits.  The components are kept below
> 2**32 so that they never wrap.  Races may give components from
> different generations, but this doesn't matter since adjustments
> only make small changes to each component, so the result is just an
> average of the values of the old and new generations.

You see the complexity is still there and I don't like it.

Introducing "virtual TSC" with fixed and lower frequency can be better 
solution as I noted in the commit log.

Jung-uk Kim



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201104081317.41254.jkim>