Date: Thu, 20 Oct 2005 18:01:38 +1000 (EST)
From: Bruce Evans
To: Poul-Henning Kamp
Cc: Scott Long, src-committers@FreeBSD.org, Andrew Gallatin,
    cvs-src@FreeBSD.org, cvs-all@FreeBSD.org, David Xu
Subject: Re: cvs commit: src/sys/amd64/amd64 cpu_switch.S machdep.c
In-Reply-To: <69026.1129649491@critter.freebsd.dk>
References: <69026.1129649491@critter.freebsd.dk>
Message-ID: <20051020155911.C99720@delplex.bde.org>

On Tue, 18 Oct 2005, Poul-Henning Kamp wrote:

> [At the risk of repeating myself once more...]
> ...
> One of the things you have to realize is that once you go down this
> road you need a lot of code for all the conditionals.
>
> For instance you need to make sure that every new timestamp you
> hand out not prior to another one, no matter what is happening to
> the clocks.

Clocks are already incoherent in many ways:
- the times returned by the get*() functions are incoherent with the
  ones returned by the functions that read the hardware, because the
  latter are always in advance of the former and the difference is
  sometimes visible at the active resolution. POSIX tests of file
  times have been reporting this incoherency since timecounters were
  implemented. The tests use time() to determine the current time and
  stat() to determine file times. In the sequence:

      t1 = time(...);
      sleep(1);
      touch(file);
      stat(file);
      t2 = mtime(file);

  t2 should be > t1, but the bug lets t2 == t1 happen.
- times are incoherent between threads unless the threads use their
  own expensive locking to prevent this. This is not very different
  from timestamps being incoherent between CPUs unless the system uses
  expensive locking to prevent it.

> ...
>>> It seems like reading ACPI-fast is "only" 3us or so, but when the ctx
>>> switch is otherwise 4us, it adds up. i8254 is much worse on this
>>> system (6.5us).
>
> i8254 is always bad, and about as bad as it can.

The i8254 is not that bad, and far from as bad as can be.

> Mostly because
> of the need to disable interrupts (Actually, that's a critical
> section today, isn't it ?) and also hobbled by the three 8 bit
> ISA-bus(-like) accesses needed.

Mostly not:
- disabling interrupts is not necessary; it was done mainly because it
  is most efficient except (apparently) on P4's. It is only necessary
  to repeat the read if the conditions were changed underneath us by
  an interrupt. Whether there was an interrupt can easily be
  determined by looking at the interrupt count.

  Disabling of interrupts is still always used, at least on i386's.
  This is essential in the non-lapic case and good in the lapic case:
  - In the non-lapic case, the code hasn't changed significantly
    lately and still has an explicit hard-disablement. There is a
    magic number of 20 i8254 cycles (spelled TIMER0_LATCH_COUNT in
    axed code) that gives a real-time requirement on the maximum time
    between the i8254 timer read and the check for rollover. Disabling
    interrupts is not sufficient to meet this requirement since bus
    activity may lengthen the time for the combined i/o to many more
    than 20 cycles (I've measured about 200 for similar code in
    getit()), but it mostly works. If interrupts were not
    hard-disabled, then almost any interrupt would break this
    requirement.
  - In the lapic case, there is now only a spin mutex on the clock
    lock. The lock is essential, and it gives a critical section which
    is almost as essential (since without the critical section a low
    priority thread reading the i8254 might be preempted while holding
    the lock). Spin mutexes still hard-disable interrupts, so
    interrupts are still hard-disabled as a side effect.
    Hard-disabling interrupts for spinlocks is a bug, but here it is
    good though not essential. It prevents fast interrupt handlers and
    low-level non-context-switching interrupt code from running. There
    is no longer a requirement for completing the function in 20 i8254
    cycles, but doing so is safest.

  The simplification in the lapic case has very little to do with
  interrupts, clock or otherwise. The real-time requirement is now
  that i8254_get_timecount() be called significantly more often than
  the i8254 rolls over. This is now easily satisfied by increasing the
  rollover period to ~55 msec and depending on users not configuring
  HZ to permitted values of <= 18 Hz. Even HZ = 100 provides a safety
  margin.
  This method could also be used for the non-lapic case, either by
  using another source of periodic interrupts to keep calling
  i8254_get_timecount() significantly more often than every 1/HZ
  seconds, or by using another source for hardclock interrupts. On
  i386's, the RTC would work perfectly for clock interrupts too except
  for minor problems in schedulers and maybe applications wanting
  timeouts of exactly 10 msec.
- only 1 or 2 accesses are needed:
  - 2 with only the LSB of the count used. This requires HZ to be
    larger than about 5000. Large HZ are undesirable in general but
    are sometimes good for dumb hardware like the i8254.
  - 1 with unlatched reads. I could never get this to work.

>>> > I wonder if moving to HZ=1000 on amd64 and i386 was really all that good
>>> > of an idea.
>
> The main benefit was getting more precise timeouts, something we have
> at various times thought about implementing with deadline counters
> on platforms that have it. Nobody has done it though.

Dragonfly did it.

> So, instead of looking for "quick fixes", lets look at this with a
> designers or architects view:
>
> On a busy system the scheduler works hundred thousand times per
> second, but on most systems nobody ever looks at the times(2) data.

More like 1000 times a second. Even stathz = 128 gives too many
decisions per second for the 4BSD scheduler, so it is divided down to
16 per second. Processes blocking on i/o may cause many more than
128/sec calls to the scheduler, but there should be nothing much to
decide then.

> The smart solution is therefore to postpone the heavy stuff into
> times(2) and make the scheduler work as fast as it can.

Once more: schedulers haven't used anything related to times(2) since
the ancient version of 3BSD or 4BSD where times() was superseded by
gettimeofday(), and have never used timecounters. (Even times(2)
doesn't use anything related to scheduling except to fake 4BSD
scheduler clock ticks in its API.)

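
As a rough check on the i8254 figures quoted above (rollover period of
~55 msec, HZ <= 18, and HZ larger than about 5000 for LSB-only reads),
here is a minimal Python sketch of the arithmetic. It assumes the
nominal PC i8254 input clock of 1193182 Hz and a free-running counter
of the stated width; the constant and function names are made up for
illustration and are not kernel code.

```python
# Sketch only: names are illustrative, not taken from the FreeBSD sources.
I8254_FREQ_HZ = 1_193_182  # assumed nominal i8254 input clock on PC hardware


def rollover_period_s(counter_bits: int) -> float:
    """Seconds until a free-running counter of this width wraps around."""
    return (1 << counter_bits) / I8254_FREQ_HZ


def min_poll_hz(counter_bits: int) -> float:
    """Polling rate at which exactly one wrap fits between two reads.

    The counter must be read more often than this, or a wrap is missed.
    """
    return 1.0 / rollover_period_s(counter_bits)


full = rollover_period_s(16)  # full 16-bit count
lsb = rollover_period_s(8)    # only the LSB of the count is read

print(f"16-bit: {full * 1e3:.1f} ms per rollover, "
      f"read more often than {min_poll_hz(16):.1f} Hz")
print(f" 8-bit: {lsb * 1e6:.1f} us per rollover, "
      f"read more often than {min_poll_hz(8):.0f} Hz")
```

Under these assumptions the full 16-bit count wraps about every 54.9
msec (about 18.2 Hz), and the LSB alone wraps at about 4661 Hz, which
is where the "about 5000" above comes from once a margin is added.
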
> So the scheduler should read the TSC and schedule in TSC-ticks.

Schedulers never read the TSC. They schedule in statclock ticks.

> times(2) will then have to convert this to clock_t compatible
> numbers.

It has converted from real times to clock_t's since before FreeBSD-1.
The real times happen to be implemented using timecounters and the
timecounter may be the TSC. times() doesn't really care.

OTOH, getrusage() reports process times in real times (with only some
resolution lost by converting MD times to bintimes and then bintimes
to timevals).

> According the The Open Group, clock_t is in microseconds by means
> of historical standards mistakes.

clock_t in microseconds is required for historical mistakes in OS's
supported by The Open Group. FreeBSD never had these particular
mistakes. It has different ones, and has sysconf(_SC_CLK_TCK) fixed at
128 to support them. (Note that the units for clock_t are not the same
for all uses of clock_t, but for the historical times() mistake they
are 1/sysconf(_SC_CLK_TCK) seconds. As an implementation detail,
FreeBSD uses 1/128 for all clock_t's even in cases where the
historical mistakes have less inertia.)

> However, I can see nowhere that would collide with an interpretation
> that said "clock_t is microseconds PROVIDED the cpu had run at full
> speed", so a simple one second routine to latch the highest number
> of TSC-tics we've seen in a second would be sufficient to generate
> the conversion factor.
>
> And in many ways this would be a much more useful metric to offer
> (in top(1)) than the current rubber-band-cpu-seconds.

You seem to have left out a "not" here. Users mostly only care about
the real time taken by their processes. If the conversion factor is
constant then it is possible for even users to apply it to convert
from the units displayed by top and friends to their favourite units,
but with variable conversion factors it would be difficult for even
applications to do the conversion.
Syscalls would have to return a table giving their best idea of the
conversion factors at different times in the process's lifetime, and
applications would have to integrate over time to convert to a single
number to display to the user, according to user-specified weights.
Better yet, put the integration in the kernel and use syscalls to tell
the kernel the weights ;-).

Anyway, getrusage() has fewer historical mistakes than times(), and
maintaining non-broken support for it requires using timecounters in
mi_switch() almost like we already do.

Hmm. Checking the history shows some anachronisms in what I said in
the above. It is only necessary to go back as far as FreeBSD-1 to find
a BSD where ticks are used for getrusage() too. In FreeBSD-1, there
wasn't even an mi_switch(). Context switches went directly to MD code
in swtch(), and swtch() was missing calls to microtime()/bintime() and
many other expenses. The bogusness in times() and getrusage() was sort
of reversed -- getrusage() (actually hardclock()) converted from
low-resolution tick counts to high resolution timevals and times()
just returned the tick counts; now getrusage() only uses the tick
counts for dividing up the total time, and times() converts from the
high-res units back to low-res ones and ends up with less accuracy
than it started with due to double rounding.

So the current pessimizations from timecounter calls in mi_switch()
are an end result of general pessimizations of swtch() starting in
4.4BSD. I rather like this part of the pessimizations...

Bruce