From owner-svn-src-all@FreeBSD.ORG Fri Apr 8 14:54:34 2011 Return-Path: Delivered-To: svn-src-all@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3CBBF106566C; Fri, 8 Apr 2011 14:54:34 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail08.syd.optusnet.com.au (mail08.syd.optusnet.com.au [211.29.132.189]) by mx1.freebsd.org (Postfix) with ESMTP id CC3488FC12; Fri, 8 Apr 2011 14:54:33 +0000 (UTC) Received: from c122-106-155-58.carlnfd1.nsw.optusnet.com.au (c122-106-155-58.carlnfd1.nsw.optusnet.com.au [122.106.155.58]) by mail08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id p38EsVSi016581 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 9 Apr 2011 00:54:31 +1000 Date: Sat, 9 Apr 2011 00:54:31 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Jung-uk Kim In-Reply-To: <201104072328.p37NSSsO055101@svn.freebsd.org> Message-ID: <20110408224855.Y1390@besplex.bde.org> References: <201104072328.p37NSSsO055101@svn.freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org Subject: Re: svn commit: r220433 - in head/sys: amd64/amd64 cddl/dev/dtrace/amd64 cddl/dev/dtrace/i386 compat/linprocfs contrib/altq/altq dev/acpica i386/i386 i386/isa pc98/pc98 x86/cpufreq x86/isa x86/x86 X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 08 Apr 2011 14:54:34 -0000 On Thu, 7 Apr 2011, Jung-uk Kim wrote: > Log: > Use atomic load & store for TSC frequency. It may be overkill for amd64 but > safer for i386 because it can be easily over 4 GHz now. More worse, it can > be easily changed by user with 'machdep.tsc_freq' tunable (directly) or > cpufreq(4) (indirectly). Note it is intentionally not used in performance > critical paths to avoid performance regression (but we should, in theory). > Alternatively, we may add "virtual TSC" with lower frequency if maximum > frequency overflows 32 bits (and ignore possible incoherency as we do now). I disagree with this complexity and churn, and tried to avoid it -- see previous mails for more details. There are probably many 64-bit variables that need to be accessed more atomically, and tsc_freq is one of the least in need of atomic accesses, since the race windows are especially tiny for it. Even when the race is lost, this is harmless except for frequencies that are almost exactly 2^32 Hz, since only sysctls that change the frequency from above this to below, or vice versa, can give an inconsistent value. Thus there is a race window of a few nsec which only matters with a probability of about 1 in 2^32, only for user[s] of the sysctl to change the TSC frequency. The complexity of this shows that you should try hard not to use 64-bit variables on 32-bit systems if accesses to them are not locked by a mutex. 64-bit systems like amd64 don't need to do anything special, any more than 32-bit systems accessing 32-bit variables. We assume that accesses of <= the bus width are atomic in a certain weak sense which is enough for variables like tsc_freq. I thought of the following KPI-transparent method that can be used for any struct if updates need not be seen immediately. It is a refined version of timehands method used for timecounters: #define tsc_freq _tsc_freq[_tsc_gen] where the everything is updated sufficiently atomically and the object data is guaranteed not to change for a suitably long time after the generation that indexes it is read: if ((new_tsc_freq >> 32) == (tsc_freq >> 32)) { /* * Usual case. Below 2**32, or upper bits just not * changing. No need for complications. */ tsc_freq = new_tsc_freq; else if ((new_tsc_freq & 0xFFFFFFFF) == (tsc_freq & 0xFFFFFFFF)) { /* * Upper bits changing but lower bits not. Very unusual, * and optimized here to give a place for this comment. * Again, no need for complications. */ tsc_freq = new_tsc_freq; } else { /* * Need a new generation. Need locking here (but not for * readers). Locking not shown. */ nextgen = (_tsc_gen + 1) % NGEN; _tsc_freq[nextgen] = new_tsc_freq; sfence(); _tsc_gen = nextgen; sfence(); /* * Any new reader starting from a bit before now will use * the new generation. (I think. Can the memory ordering * be so weak that other CPUs don't see the fenced writes * here until they do some fenced operation? Even if they * don't, then waiting a bit here will give them time to * catch up.) We must keep the data for the old generation * stable until any readers of it (via the old generation * number) have completed. We should wait a very long time, * in theory possibly forever, since the reader may be * preempted after it read the old generation number and * never run again. In practice, we should be able to * bound this time to a few ticks, like timecounter code * does, by limiting uses to kernel threads in critical * regions or similar. Even timecounter code cannot * guarantee any particular timing, since its running * may be delayed arbirarily long by sitting at the ddb * prompt. The delay can be implemented as in timecounter * code, by using many generations spaced a tick or 2 apart, * or more simply by sleeping for a tick or 2 between each * generation change here. It would even be OK to sleep a * minute or 2 here, since we only get here for changing * the 4/8/16... GHz bits. */ DELAY(1); } The necessary sfence()s are missing in timecounter code. Timecounter code also has explicit generation checks, which I think are not necessary if enough care is taken keeping the old generation's data stable, but which help if the generations cycle too fast or the timing is messed up by ddb. Another method that would work here (up to 8GHz; expand as necessary) without changing the KPI is: #define tsc_freq ((uint64_t)_tsc_freq1 + _tsc_freq2) where _tsc_freq[12] are 32 bits. The components are kept below 2**32 so that they never wrap. Races may give components from different generations, but this doesn't matter since adjustments only make small changes to each component, so the result is just an average of the values of the old and new generations. Bruce