Date: Thu, 20 Oct 2005 22:55:23 +1000 (EST)
From: Bruce Evans
To: Poul-Henning Kamp
Cc: Scott Long, src-committers@freebsd.org, Andrew Gallatin,
    cvs-src@freebsd.org, cvs-all@freebsd.org, David Xu
Subject: Re: cvs commit: src/sys/amd64/amd64 cpu_switch.S machdep.c
Message-ID: <20051020215101.Y874@delplex.bde.org>
In-Reply-To: <23346.1129796829@critter.freebsd.dk>
References: <23346.1129796829@critter.freebsd.dk>

On Thu, 20 Oct 2005, Poul-Henning Kamp wrote:

> In message <20051020155911.C99720@delplex.bde.org>, Bruce Evans writes:
>
>>> One of the things you have to realize is that once you go down this
>>> road you need a lot of code for all the conditionals.
>>>
>>> For instance you need to make sure that every new timestamp you
>>> hand out is not prior to another one, no matter what is happening to
>>> the clocks.
>>
>> Clocks are already incoherent in many ways:
>> - the times returned by the get*() functions are incoherent with the
>>   ones returned by the functions that read the hardware, because the
>>   latter are always in advance of the former and the difference is
>>   sometimes visible at the active resolution.
>
> Sorry Bruce, but this is just FUD: The entire point of the get*
> family of functions is to provide "good enough" timestamps, very
> fast, for code that knows it doesn't need better than roughly 1/hz
> precision.

This bug shows that the get* functions don't actually provide "good
enough" timestamps, even for what is probably their primary use: ffs
file times.  These only need a resolution of 1 second; however, they
need to be accurate relative to other clocks, and a precision of ~1/hz
doesn't provide enough accuracy due to implementation details.

>>   visible at the active resolution.  POSIX tests of file times have
>>   been reporting this incoherency since timecounters were implemented.
>>   The tests use time() to determine the current time and stat() to
>>   determine file times.  In the sequence:
>>
>>       t1 = time(...);
>>       sleep(1);
>>       touch(file);
>>       stat(file);
>>       t2 = mtime(file);
>>
>>   t2 should be > t1, but the bug lets t2 == t1 happen.
>
> t2 == t1 is not illegal.

It is just invalid and of low quality for file systems that provide a
resolution of 1 second in their timestamps.  The sleep of 1 second in
there is specific to such file systems; it is to ensure that at least
1 second has elapsed between the time() and the touch().

> The morons who defined a non-extensible timestamp format obviously
> didn't believe in Andy Moore, but given a sufficiently fast computer
> the resolution of the standardized timestamps prevents t2 > t1 in
> the above test code.

POSIX specifies the resolution for file times but doesn't specify their
accuracy AFAIK (not far).  Quality of implementation specifies their
accuracy.  The above is a simple test for strict monotonicity of file
times that happens to test for accuracy and coherency too.  This
monotonicity is very easy to get right.  sleep(3) is required to sleep
for at least 1 second.  nanosleep(2) is sloppy about this -- it uses a
get* function so it risks similar bugs, but I think none here since the
extra tick in the timeout provides a sufficient margin for error.  After
sleeping for at least 1 second, the time has surely advanced by 1 second
and timestamps taken by a coherent clock will see this.  With a time(2)
in it, the test would just not see incoherencies of 1 second.

>> - times are incoherent between threads unless the threads use their
>>   own expensive locking to prevent this.  This is not very different
>>   from timestamps being incoherent between CPUs unless the system uses
>>   expensive locking to prevent it.
>
> Only if the get* family of functions is used in places where they
> shouldn't be.  I believe there is a sysctl which determines if it
> is used for vfs timestamp.  The default can be changed if necessary.

This point is for all the functions.  A timestamp taken by 1 thread
might not be used until after many timestamps are taken and used by
other threads.  Naive comparison of these timestamps would then give
apparent incoherencies.  It is up to the threads to provide
synchronization points if they want to compare times.  More
interestingly, there is no need to keep the timecounters seen by
different threads perfectly in sync except at synchronization points,
since any differences would be indistinguishable from ones caused by
unsynchronized preemption.  (Strict real time to ~nanoseconds accuracy
wouldn't work for either.)

I use the sysctl in POSIX tests so as not to keep seeing the file times
bugs, but I sometimes forget to use it so I get reminded of the bugs
anyway.

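The file-time test sequence quoted earlier can be written out in C
roughly as follows.  This is only a minimal sketch of that sequence, not
the actual POSIX test suite code: the function name
mtime_strictly_advances() and its path argument are made up for
illustration, and utimes(path, NULL) stands in for touch(file).

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (name and interface are illustrative only).
 * Returns 1 if the file's mtime is strictly later than a time() value
 * taken at least 1 second before the touch, 0 if it is not (the
 * t2 == t1 case discussed above), and -1 on error.
 */
int
mtime_strictly_advances(const char *path)
{
	struct stat sb;
	time_t t1;
	unsigned int left;
	int fd;

	/* Create the file up front so the later touch only updates times. */
	fd = open(path, O_WRONLY | O_CREAT, 0644);
	if (fd == -1)
		return (-1);
	close(fd);

	/* t1 = time(...) */
	t1 = time(NULL);
	if (t1 == (time_t)-1)
		return (-1);

	/* sleep(3) returns the unslept remainder if a signal interrupts it. */
	for (left = 1; left != 0; )
		left = sleep(left);

	/* touch(file): a NULL times argument means "set to the current time". */
	if (utimes(path, NULL) == -1)
		return (-1);

	/* stat(file); t2 = mtime(file) */
	if (stat(path, &sb) == -1)
		return (-1);

	return (sb.st_mtime > t1);
}
```

A caller would pass a path on the file system under test and treat a
return value of 0 as the incoherency described above; whether it shows
up depends on which clock the file system uses for its timestamps.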
IIRC, I got jdp to change the sysctl a bit to handle more cases.  He
wanted an option for more resolution and I wanted one to unbreak seconds
resolution.  The implementation actually uses the get* functions for
seconds and 1/hz resolution and the non-get* functions for microseconds
and nanoseconds resolution.  So I use an unnecessarily high resolution
to avoid the bug.

>>> On a busy system the scheduler works hundred thousand times per
>>> second, but on most systems nobody ever looks at the times(2) data.
>>
>> More like 1000 times a second.  Even stathz = 128 gives too many
>> decisions per second for the 4BSD scheduler, so it is divided down to
>> 16 per second.  Processes blocking on i/o may cause many more than
>> 128/sec calls to the scheduler, but there should be nothing much to
>> decide then.
>
> I'm regularly running into 5 digits in the Csw field in systat -vm.
> I don't know what events you talk about, but they are clearly not
> the same as the ones I'm talking about.

I just looked at csw values on machines in the freebsd cluster.  They
may be underpowered and not heavily used, but they are more active than
any machine that I run and may be representative of general server
machines.  On hub a few hours ago, csw was a transient 100-500 and the
average since boot time was 1010.  The count since boot time may have
overflowed but the average is reasonable.  hub has been up for 236 days
and an average of 1010/sec gives a count of just below INT_MAX.

The 128/16 events are for timekeeping for scheduling.  4BSD does little
more than increment a tick count here.  ULE does a bit more.  Then there
is the rescheduling every second for 4BSD, and more distributed
rescheduling for ULE.

On context switches, the scheduler has (or should have) little to do.
It is context switching itself that makes the timestamps that become
too expensive when csw is high.

> The problem here is context-switch time, and while we can argue if
> this is really scheduler related or not, the fact that the scheduler
> decides which thread to context-switch to should be enough to
> avoid a silly discussion of semantics.

The problem is still unrelated to (non-broken) schedulers.  Most context
switches happen because something blocks on i/o or is preempted by an
interrupt handler (it's a very low level of scheduling -- just interrupt
priority -- that allows the preemption, so I don't count it as part of
scheduling).  So unavoidable context switches can happen a lot on busy
machines and the scheduler can't/shouldn't affect their count except
possibly to reduce it a bit.  Given that they happen a lot on some
systems, they should be as efficient as possible.  I think the
timecounter part of their inefficiency is not very important except in
the usual case of a slow timecounter.  Losses from busted caches may
dominate.

>> So the current pessimizations from timecounter calls in mi_switch()
>> are an end result of general pessimizations of swtch() starting in
>> 4.4BSD.  I rather like this part of the pessimizations...
>
> It's so nice to have you back in action Bruce :-)

I don't plan to stay very active.

Bruce