From: "Poul-Henning Kamp"
To: Bruce Evans
Cc: cvs-src@FreeBSD.org, src-committers@FreeBSD.org, Andre Oppermann, cvs-all@FreeBSD.org
Subject: Timekeeping [Was: Re: cvs commit: src/usr.bin/vmstat vmstat.c src/usr.bin/w w.c]
Date: Thu, 20 Oct 2005 23:04:16 +0200
In-Reply-To: Your message of "Fri, 21 Oct 2005 02:03:09 +1000." <20051021011035.T1945@delplex.bde.org>
Message-ID: <27345.1129842256@critter.freebsd.dk>

I can see that Warner has already handled some of the necessary rebuttals, so I will not repeat his arguments apart from noting my agreement that leapseconds are evil and should be abandoned as soon as possible.

But let me step back a bit and explain the rationale for the way we keep time in FreeBSD, as a means of clearing up some of the confusion which the discussion between Bruce and me has caused.

The first thing to remember is that a clock consists of a frequency source and a counter.
The counter is trivial [1]: you can do it with any technology and get it right. It is the frequency source which is the tricky bit.

So our hardest task is to decide how long we think seconds are. Initially we trust the timecount hardware to know this (some of them autocalibrate), but we take corrections from NTPD and other programs via a specialized group of syscalls, because unless the computer has timecounting hardware driven by a primary frequency standard (Cesium or a steered oscillator), corrections are necessary to get the length of seconds right.

But we also need to get the counter synchronized with UTC. If the length of our seconds is perfect, we need to do this only once.

If the length of our seconds is not perfect, the phase error will become non-zero, and we can either fix this with a correction to the phase (a time step), or do it by overcorrecting the frequency (the length of our seconds) for a period of time until we have regained or lost the phase synchronization.

If we are able to estimate the frequency error, we can of course apply the correction predictively.

Hardware or software, like NTPD, which does all three of the above is called a second order Phase Locked Loop ("a PLL"), and has a lot of mathematical theory hidden in dusty textbooks.

If people do stupid things like use hard steps (*settime*()) to correct rate problems, then they get what they deserve, including potentially backwards jumps in time, but the integral over time of all steps apart from the first one amounts to a rate correction.

When NTPD is running, the kernel gets a rate correction which is really a mix of a corrective phase adjustment, a corrective rate adjustment and a predictive rate adjustment. The math works out the same however: leaving out the first phase adjustment (which is usually handled by a step anyway), the integral over time of the sum of the phase and rate adjustments is the true rate correction [2].
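
The claim that repeated steps integrate to a rate correction can be checked with a toy model. The Python sketch below is not FreeBSD kernel code: the name simulate_steps and its parameters are invented for illustration, and it assumes a constant frequency error and a clock that has already had its initial phase set.

```python
def simulate_steps(freq_error_ppm, step_interval_s, total_s):
    """Toy model: a clock with a constant frequency error that is only ever
    corrected by periodic hard phase steps back to true time.

    Returns the sum of all steps divided by the elapsed true time, in PPM,
    which is the rate correction the steps amount to.
    """
    rate = 1.0 + freq_error_ppm * 1e-6  # local seconds per true second
    true_time = 0.0
    local_time = 0.0                    # starts in phase (first step done)
    step_sum = 0.0
    while true_time < total_s:
        true_time += step_interval_s
        local_time += step_interval_s * rate
        step = true_time - local_time   # phase error to remove
        local_time += step              # hard step back to true time
        step_sum += step
    return step_sum / true_time * 1e6


if __name__ == "__main__":
    # A clock running 50 PPM fast, stepped every 1000 seconds.
    print(simulate_steps(50.0, 1000.0, 100000.0))
```

Under these assumptions the printed value comes out close to -50, that is, the steps taken together are the rate correction that would have cancelled the 50 PPM error in the first place.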

Adjtime() is a middle case: it implements a phase step but spreads it out over time (by doing frequency corrections) to avoid large gaps or backwards steps in the CLOCK_REALTIME timescale.

Adjtime() is used by various time synchronization tools which don't do rate estimation at all but rather implement occasional phase synchronization using these "soft steps". Again, repeated phase synchronization amounts to crude frequency steering, and therefore again, the integral over time is our best estimate of SI second duration.

But as I said: timekeeping in all forms consists of getting the phase right the first time, and keeping the frequency right (on average) afterwards, and there is no escaping this basic mathematical fact because you can't go back and remeasure the past.

FreeBSD incorporates everything but the hard steps into the CLOCK_MONOTONIC timescale, because over time, the integral of those corrections is our best estimate of the correct length of SI seconds.

It can be argued that any hard steps after the second should be factored in as well, but in practice subsequent hard steps are either to correct mistakes in the initial hard step or so infrequent that averaging out the corrections doesn't make sense, so we treat all hard steps as phase only corrections.

In summary: CLOCK_MONOTONIC is our best estimate of how many SI seconds the system has been running [3].

Given that CLOCK_MONOTONIC is our best guess at how long the kernel has been running, it follows that CLOCK_REALTIME - CLOCK_MONOTONIC must be our best estimate of what time the kernel booted.

CLOCK_REALTIME aka. UTC is therefore maintained in FreeBSD by keeping around our best estimate of when the system booted in UTC time and adding CLOCK_MONOTONIC to it. Hard phase steps are implemented by changing our boottime estimate according to the desired step.

The only snag in this is that leapseconds do not exist in CLOCK_REALTIME, but they very much exist in the real world.
We deal with (ie: ignore) leap seconds by either replaying or skipping a second on the CLOCK_REALTIME timescale [5], and in order to make the math come out right, we do that by adjusting boottime one second either way.

This is technically wrong, and will mean that the boottime estimate is wrong by the number of leapseconds the system has experienced while running. Considering that leapseconds happen once every 500 days or so and that POSIX found them so insignificant that they just defined them out of existence as far as computers go, I have no problem with this approximation.

Conclusion:

Provided root doesn't go out of his way to muck it up, timekeeping in FreeBSD will Do The Right Thing, and do it a fair bit better and with higher precision than any other operating system.

If you want to know how long the system has been running, CLOCK_MONOTONIC is the best number you will get.

Footnotes:

[1] Actually, as leapseconds have proven, it is possible for a highly skilled group of scientists to get the counting part wrong also.

[2] Because NTPD implements a 2nd order PLL, the integral over time of the phase adjustment alone is the frequency drift divided by the PLL timeconstant, a number which is lost in the noise unless you have a high-quality OCXO or better timebase.

[3] As Bruce has correctly pointed out, if root plays silly buggers with the time management systemcalls, he can muck it up [4]. One way would be to apply a 500PPM frequency correction and step one second in the other direction every 2000 seconds. On average the clock would be right, but CLOCK_MONOTONIC would be 500PPM wrong.

[4] Root could also do "killall -9 sh" or "rm -rf /", either of which would be both faster and more spectacular.

[5] Warner is right: I got the actual sequence wrong in my previous email.
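
The arithmetic in footnote [3] can be worked through with a small model of the scheme described above, where CLOCK_REALTIME is boottime plus CLOCK_MONOTONIC, rate corrections go into CLOCK_MONOTONIC, and hard steps only move boottime. This is a Python sketch under stated assumptions, not the kernel implementation: the function name and parameters are hypothetical, the oscillator is assumed perfect, and boottime is taken as the origin.

```python
def simulate_root_mischief(true_elapsed_s, ppm=500.0,
                           step_every_s=2000.0, step_s=-1.0):
    """Toy model of footnote [3]: a bogus frequency correction of `ppm`
    is cancelled by a hard step of `step_s` seconds every `step_every_s`.

    Returns (true_time, monotonic, realtime), all in seconds.
    """
    boottime = 0.0    # UTC estimate of boot, taken as the origin here
    monotonic = 0.0   # model of CLOCK_MONOTONIC
    true_time = 0.0   # what a perfect reference clock would read
    while true_time < true_elapsed_s:
        true_time += step_every_s
        # The rate correction is folded into CLOCK_MONOTONIC ...
        monotonic += step_every_s * (1.0 + ppm * 1e-6)
        # ... while the hard step only adjusts the boottime estimate.
        boottime += step_s
    realtime = boottime + monotonic   # model of CLOCK_REALTIME
    return true_time, monotonic, realtime


if __name__ == "__main__":
    t, mono, real = simulate_root_mischief(20000.0)
    print("realtime error (s):", real - t)
    print("monotonic error (PPM):", (mono - t) / t * 1e6)
```

With the default values, 500 PPM over 2000 seconds is exactly one second, so in this model each step cancels the accumulated drift in CLOCK_REALTIME while CLOCK_MONOTONIC keeps the full 500 PPM error.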

References:

http://phk.freebsd.dk/pubs/timecounter.pdf

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.