From owner-svn-src-all@FreeBSD.ORG Thu Mar 24 16:56:07 2011 Return-Path: Delivered-To: svn-src-all@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C14BE1065670; Thu, 24 Mar 2011 16:56:07 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au [211.29.132.184]) by mx1.freebsd.org (Postfix) with ESMTP id 484C48FC08; Thu, 24 Mar 2011 16:56:06 +0000 (UTC) Received: from c122-107-125-80.carlnfd1.nsw.optusnet.com.au (c122-107-125-80.carlnfd1.nsw.optusnet.com.au [122.107.125.80]) by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id p2OGu29Y021511 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 25 Mar 2011 03:56:04 +1100 Date: Fri, 25 Mar 2011 03:56:02 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: David Malone In-Reply-To: <201103241503.aa30875@walton.maths.tcd.ie> Message-ID: <20110325020734.P1958@besplex.bde.org> References: <201103241503.aa30875@walton.maths.tcd.ie> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: src-committers@FreeBSD.org, John Baldwin , svn-src-all@FreeBSD.org, Bruce Evans , svn-src-head@FreeBSD.org, Jung-uk Kim Subject: Re: svn commit: r219676 - head/sys/x86/x86 X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Mar 2011 16:56:07 -0000 On Thu, 24 Mar 2011, David Malone wrote: >> If atomic 64-bit read/write is possible and it is absolutely >> necessary, it has to be set via machdep.foo_freq, not >> kern.timecounter.foo.frequency. Not really. The i386 TSC frequency variable can be 32 bits. That works in almost all cases now, and you can easily scale the variable so that a 32-bit frequency works for CPUs up to 8GHz, which should be enough until i386 mode goes away. amd64 already has sufficiently atomic 64-bit read/write for this, perhaps even without an explicit atomic op. OTOH, kern.timecounter.foo.frequency is also 64 bits, and needs this and atomic accesses more that the i386 TSC variable, since it is MI and there are more divisions by it. Explicit atomic ops are never used for timecounter variables, but I think some is needed in the following places: - switching of timehands. Atomic ops are supposed to be avoided by writing to an inactive timehands and than switching to it using a single atomic write, but the write for this is not explicitly atomic. Without an atomic write with certain acquire/release semantics whose details I can never remember, the writes before the switch may go out after the one for the switch. - switching of more global things. These include the switch from one timecounter to another, and changing the frequency within a timecounter, and stepping the time. Something like timecounter generations could be used to reduced to reduce the locking requirements for this, but currently the locking for this is null. For stepping the time, the missing locking is even documented by an XXX comment: % /* % * Step our concept of UTC. This is done by modifying our estimate of % * when we booted. % * XXX: not locked. % */ % void % tc_setclock(struct timespec *ts) % { % struct timespec tbef, taft; % struct bintime bt, bt2; % % cpu_tick_calibrate(1); This only has a tiny race. cpu_tick_calibrate() is normally called only every 16 seconds on 1 designated CPU. This spacing out of calls to it gives it significant "time-domain locking". And passing 1 to it mostly preserves this locking, by telling it to not do any calibration on this call, but to reset on this call and wait for another call or 2 to let things stabilize. But it must write a global variable for the reset, and the access for this is not atomic. I think it should do slightly more, and arrange for recalibration as soon as possible, after >= 1 but <= 2 seconds. Now the recalibration might not be delayed at all (due to the race), or might be delayed by only 1 cycle, or may be 16 seconds away. % nanotime(&tbef); Safe to call at any time, modulo other timecounter bugs. This depends on the update of the generation count being sufficiently atomic, and on read accesses to boottime being atomic, which they clearly aren't. % timespec2bintime(ts, &bt); Utility function operating on local variables, so safe. % binuptime(&bt2); % bintime_sub(&bt, &bt2); Like the previous 2 calls, but safer since boottime is not accessed. % bintime_add(&bt2, &boottimebin); Unsafe read access to boottimebin. % boottimebin = bt; Unsafer write access to boottimebin. This is the main write access to it. Changing it at all is what makes the read accesses to it unsafe. Changing it at all after setting it initially is also wrong. The boot time is a time so it cannot actually be changed by any later event, but it is changed here as a hack to make the current time come out right, even when this makes the boot time come out wrong. The boot time needs to be fixed if it is initially wrong, but should not change after that. This wouldn't simplify the locking here, since an adjustment would still be needed. When this is called from settime(), Giant is held. This prevents concurrent calls to here from userland. Concurrent calls may still be possible, since this is also called from resume methods. I don't know if resume methods are locked by Giant or anything. % bintime2timeval(&bt, &boottime); Unsafe write access to boottime. Like the one to boottimebin. % % /* XXX fiddle all the little crinkly bits around the fiords... */ % tc_windup(); tc_windup() is normally called only every tc_tick ticks on a not-so- designated CPU, and we break the time-domain locking from that as above. tc_windup() begins by operating on the "next" generation of timehands, so it has obvious races if it is called concurrently. It depends on the time-domain locking to prevent this, but here we don't even try to wait for things to stablize like we do for the much less important cpu_tock_calibrate() call. Calling tc_windup() a lot from here also risks cycling through the timehands too rapidly even if the calls don't overlap. Userland can exercise this bug easily at securelevel <= 1 by calling settimeofday() in a loop. securelevel > 1 gives (IMO bogus) restrictions and a kernel printf service rate-limited to only once per second. OTOH, if tc_windup() were safe to call at any time, calling it here would be the right thing to do. Note that updates of the TSC timecounter frequency variable are propagated from the TSC machdep sysctl to the timecounter layer in a less racy manner. The machdep sysctl just copies the machdep variable to the timecounter variable and depends on the tc_windup() call propagating it later. This avoids concurrent calls and only has a tiny race window which is neverthless easy to hit by calling the machdep sysctl in a loop. tc_windup() used to be called (via tc_ticktock()) only from hardclock() on the BSP. Now tc_ticktock() is also called from hardclock_anycpu(). tc_ticktock() is now passed an arg which should prevent too-rapid cycling of the timehands, but I don't see anything to limit the calls to the BSP or to prevent calls very close together. This is unlike the new call to cpu_tick_calibrate() from hardclock_anycpu(). That is limited to the designated CPU. % nanotime(&taft); Back to only races for only read accesses. % if (timestepwarnings) { % log(LOG_INFO, % "Time stepped from %jd.%09ld to %jd.%09ld (%jd.%09ld)\n", % (intmax_t)tbef.tv_sec, tbef.tv_nsec, % (intmax_t)taft.tv_sec, taft.tv_nsec, % (intmax_t)ts->tv_sec, ts->tv_nsec); % } % cpu_tick_calibrate(1); Like the call at the start. % } > A number of people have asked me about adjusting the frequency of > other time counters - it would seem to make sense to allow this at > the timecounter layer. It also seems possible that people might > want the timecounters view of the frequency to be different from > other uses of a counter. I can't see any reason for the frequencies to be different, except transiently to keep changes to them smooth. Anyway, it should be possible to change them all to the same frequency, and having a single operation for this is good for minimizing transients, races and complexity. Bruce