From: Bruce Evans <bde@zeta.org.au>
To: Robert Watson
Cc: cvs-src@FreeBSD.org, Poul-Henning Kamp, src-committers@FreeBSD.org, cvs-all@FreeBSD.org
Date: Sun, 27 Nov 2005 23:02:26 +1100 (EST)
Subject: Re: cvs commit: src/sys/sys time.h src/sys/kern kern_time.c

On Sun, 27 Nov 2005, Robert Watson wrote:

> On Sun, 27 Nov 2005, Poul-Henning Kamp wrote:
>
>> In message <200511270055.jAR0tIkF032480@repoman.freebsd.org>, Robert Watson
>> writes:
>>
>>> This offers a minimum update rate of 1/HZ,
>>> but in practice will often be more frequent due to the frequency of
>>> time stamping in the kernel:
>>
>> Not quite...
>>
>> The precision is guaranteed to be no worse than 1msec, and is unlikely to
>> be significantly better.

Actually, it is guaranteed to be no better than max(2, tc_tick + 1) / HZ,
and is unlikely to be significantly different from tc_tick / HZ.  The
defaults are HZ = 1000 and tc_tick = max(1, <approx. number of hardclock
ticks in a millisecond>), but HZ = 1000 is bogusly large, and configuring
HZ to a smaller value gives a worse precision; OTOH, if HZ is even larger
than 1500 or so, then the precision can be made better by configuring
tc_tick to 1 using sysctl.

> Sadly, for some workloads it will be -- we update the cached time stamp
> for every kernel context switch, so workloads that trigger lots of
> context switches will also trigger time stamp updates.  I.e., loopback
> network traffic due to the netisr and user space context switches, high
> in-bound network traffic due to ithread and netisr context switches,
> etc.

No, that's not how the update of the cached timestamp works.  The update
is very (*) expensive, since it requires synchronization (**) with other
CPUs, so it is only done once every tc_tick hardclock interrupts.

(*) Perhaps no more expensive than mutex locking generally.

(**) The synchronization is essentially provided by the hardclock
interrupt handler being non-preemptible by itself, plus the generation
count to indicate that resynchronization is needed.

Both of these seem to be buggy.
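(For reference, the read side of the generation-count protocol looks
roughly like the following sketch, modeled on binuptime() in
sys/kern/kern_tc.c but simplified, with the timecounter state and the
counter-delta arithmetic omitted; the bugs below are in the write side,
tc_windup():)

#include <sys/types.h>
#include <sys/time.h>		/* struct bintime (kernel-style sketch) */

struct timehands {
	/* The real struct also carries the timecounter state. */
	volatile u_int	th_generation;	/* 0 while tc_windup() updates */
	struct bintime	th_offset;	/* time precalculated at windup */
};

static struct timehands *volatile timehands;

void
binuptime_sketch(struct bintime *bt)
{
	struct timehands *th;
	u_int gen;

	do {
		th = timehands;
		gen = th->th_generation;
		*bt = th->th_offset;
		/*
		 * If tc_windup() ran meanwhile, th_generation has
		 * changed (or is 0, mid-update) and the copy must be
		 * retried.  This is only safe if the writer's stores
		 * become visible in program order -- exactly what
		 * bug (2) below says is not guaranteed.
		 */
	} while (gen == 0 || gen != th->th_generation);
}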
(1) tc_windup() has no explicit locking, so it can run concurrently on
any number of CPUs, with N-1 of the CPUs calling it via clock_settime()
and 1 calling it via hardclock() (this one may also be the same as one
already in it).  I doubt that the generation stuff is enough to prevent
problems here, especially with bug (2).

(2) The generation count stuff depends on writes being ordered as seen
by other CPUs.  This is not true on many current CPUs.  E.g., on amd64,
writes are ordered as seen by readers on the current CPU, but they may
be held in a buffer, and I think the buffer can be written in any order
to main memory.  I think this only gives a tiny race window.  There is
a mutex lock in all (?) execution paths soon after tc_windup() returns,
and this serves to synchronize the writes.

BTW, I have been working on optimizing libm and noticed that
out-of-order execution makes timestamps very slippery even on a single
CPU.  Causality is broken unless there is a synchronization point.  On
i386's, rdtsc is not a serializing instruction, so timestamps made
using it, while correct relative to each other (assuming that the TSC
clock doesn't jump), may be made many cycles before or after the
previous instruction in the instruction stream completes.  Even
accesses to an off-CPU hardware clock might not be serializing, though
they take so long that the CPU probably has time to complete all
previously issued instructions, and thus they may give a timestamp that
is sure to be after the completion of previous instructions.

This shows that on many machines, for many purposes, it is
fundamentally impossible to take timestamps that are more than 2 of:
efficient, synchronized and useful (they aren't useful if they have too
large an impact on what is being measured).

Perhaps this incoherency within a single CPU can be turned into a
feature: timestamps are inherently incomparable unless there is a
synchronization point between them.  Code that needs perfect coherency
needs to issue a heavyweight synchronization call before making
timestamps; code that just wants monotonic timestamps like rdtsc gives
needs to do less: if a timecounter call is on the same CPU, then
nothing needs to be done if the timecounter is the TSC, but if it is on
a different CPU, then much more needs to be done (transparently to the
caller).

The timestamps made in mi_switch() are always on the same CPU
(`switchtime' is per-CPU), so they don't need any synchronization to
use a per-CPU timecounter like a TSC; they just need a timecounter with
a stable clock.  It also shouldn't matter that they may be wrong by
several cycles due to out-of-order execution and no serialization.
Hopefully the wrongness averages out, and it would be wronger to use a
serializing instruction to "fix" the counts -- this would just waste
time by preventing overlap of instruction execution between the old and
new threads.

Bruce
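PS: To make the rdtsc point concrete, here is a rough user-level sketch
(gcc-style inline asm for i386/amd64; the helper names are invented for
this illustration).  Plain rdtsc may execute before earlier
instructions have completed; prepending a serializing instruction such
as cpuid is the "heavyweight synchronization call" above, and it
restores causality at the cost of killing instruction overlap.

#include <stdint.h>

/* Cheap, but may be reordered around neighbouring instructions. */
static inline uint64_t
rdtsc_relaxed(void)
{
	uint32_t lo, hi;

	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}

/*
 * cpuid drains the pipeline, so the timestamp cannot be taken before
 * previously issued instructions complete.  It also costs hundreds of
 * cycles, which is why it would be a pessimization in mi_switch().
 */
static inline uint64_t
rdtsc_serialized(void)
{
	uint32_t lo, hi;

	__asm __volatile("cpuid; rdtsc"
	    : "=a" (lo), "=d" (hi) : "a" (0) : "ebx", "ecx");
	return ((uint64_t)hi << 32 | lo);
}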