Date:      Thu, 31 Mar 2005 11:00:16 GMT
From:      Bruce Evans <bde@zeta.org.au>
To:        freebsd-bugs@FreeBSD.org
Subject:   Re: kern/79339: [patch] Kernel time code sync with improvements  from DragonFly
Message-ID:  <200503311100.j2VB0Gf1074983@freefall.freebsd.org>

The following reply was made to PR kern/79339; it has been noted by GNATS.

From: Bruce Evans <bde@zeta.org.au>
To: Uwe Doering <gemini@geminix.org>
Cc: Joshua Coombs <jcoombs@gwi.net>, freebsd-bugs@FreeBSD.org,
	freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/79339: [patch] Kernel time code sync with improvements 
 from DragonFly
Date: Thu, 31 Mar 2005 20:50:50 +1000 (EST)

 On Thu, 31 Mar 2005, Uwe Doering wrote:
 
 > Joshua Coombs wrote:
 >>  Testing with wakeup_latency.c on a 5.3-Rel box shows the same symptom set. 
 >> I've not yet tested the proposed fix on 5-x.  I will try dupilcating this 
 >> issue on 6-current as well to nail down the problem scope. 
 >
 > Please also look at what's actually in DragonFly's CVS repository.  Your PR 
 > is based on the original patch, while the code in DragonFly is more 
 > sophisticated.  Namely, tvtohz() was split into two functions, tvtohz_low() 
 > and tvtohz_high(), which replace the original function depending on the 
 > context tvtohz() appears in.
 >
 > From this I conclude that the original patch is insufficient (likely to break 
 > parts of the kernel), and that integrating this improvement into FreeBSD 
 > might not be as easy and straightforward as it appears to be at first glance. 
 > On the other hand, with some effort it ought to be doable.
 
 Indeed.
 
 Here is a discussion of some of the bugs in the patch:
 
 % >Fix:
 % /usr/src/sys/kern/kern_clock.c
 % 325c325
 % <                       / tick + 1;
 % ---
 % >                       / tick;
 % 328c328
 % <                       + ((unsigned long)usec + (tick - 1)) / tick + 1;
 % ---
 % >                       + ((unsigned long)usec + (tick - 1)) / tick;
 
 This breaks all callers of tvtohz() except the one that is changed in
 the patch to expect this API change.  The comment before tvtohz() still
 says that tvtohz() adds 1.
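 
 As a rough illustration of the API change (a user-space sketch, not the
 kernel code; the function names are invented here, and overflow and
 negative-time handling are left out), the two behaviours differ by
 exactly one tick for every input:

```c
/*
 * Hypothetical model of the tick conversion.  `usec' is the requested
 * interval and `tick' is the length of one clock tick, both in
 * microseconds.
 */

/* Round the interval up to a whole number of ticks (patched behaviour). */
long
ticks_rounded_up(long usec, long tick)
{
	return ((usec + (tick - 1)) / tick);
}

/* Round up and then add 1, as the comment before tvtohz() describes. */
long
ticks_plus_one(long usec, long tick)
{
	return (ticks_rounded_up(usec, tick) + 1);
}
```

 With tick = 10000 (assuming hz = 100), a 25000 usec request converts to
 4 ticks with the extra tick and to 3 without it, so any caller that is
 not adjusted silently asks for one tick less than before.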
 
 % /usr/src/sys/kern/kern_time.c
 % 232c232
 % <       int error;
 % ---
 % >       int error, sleepticks;
 % 241a242
 % >                 sleepticks = tvtohz(&tv);
 % 243c244
 % <                   tvtohz(&tv));
 % ---
 % >                     (sleepticks < 1)? 1 : sleepticks);
 
 This is more or less correct.  1 should be subtracted from tvtohz() in
 callers that do a careful comparison of the times before and after
 the sleep so that they can tell if the sleep time has completely
 expired.
 
 The function here (nanosleep1()) is not quite such a caller.  It does
 a sloppy comparison of times, using getnanouptime() instead of
 nanouptime().  getnanouptime() has a resolution of 1/ticktock_hz, where
 ticktock_hz is approximately min(hz, 1000) (normally just hz), so there
 is a possible error of 2/ticktock_hz in the comparison.  I think all
 the errors go the same way, so the maximum error is 1/ticktock_hz.
 The extra tick added by tvtohz() accidentally compensates for this
 error.  Synchronization effects may reduce (or increase?) the error.
 The first getnanouptime() is unsynchronized, but ones done just after
 timeout returns are synced with clock interrupts, so they give a
 fairly accurate time every hz/ticktock_hz hardclock interrupts.
 Anyway, if 1 is subtracted from tvtohz(), then nanouptime() should
 be used to avoid these errors.
 
 There are many other callers like nanosleep1(): the ones for select(2),
 poll(2) and setitimer(2).  These all depend on tvtohz() adding 1 to
 ensure that they sleep for the specified interval, and they all do
 sloppy comparisons like nanosleep1(), so they all need similar changes
 if you want timeouts to be synchronized with 1/HZ second boundaries as
 perfectly as possible.
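 
 Why these callers lean on the extra tick can be seen from a simple
 model (an assumption-laden sketch: wakeups land exactly on tick
 boundaries, there is no latency, and the function name is made up for
 the example).  A timeout of nticks armed `phase' microseconds after the
 last tick boundary fires on the nticks-th boundary after that, so the
 elapsed real time falls short of nticks * tick by `phase':

```c
/*
 * Hypothetical model: real time (in usec) that passes before a timeout
 * of `nticks' fires, when it is armed `phase' usec after the previous
 * tick boundary (0 <= phase < tick).
 */
long
elapsed_usec(long nticks, long tick, long phase)
{
	return (nticks * tick - phase);
}
```

 For a 25000 usec request with tick = 10000 and phase = 9000, 3 ticks
 give only 21000 usec, while 4 ticks give 31000 usec.  Since phase is
 always less than one tick, the extra tick is enough to cover the
 shortfall in this model.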
 
 % 252c253,254
 % <                               *rmt = ts;
 % ---
 % >                                 rmt->tv_sec = ts.tv_sec;
 % >                                 rmt->tv_nsec = ts.tv_nsec;
 % 258c260,261
 % <               ts3 = ts;
 % ---
 % >                 ts3.tv_sec = ts.tv_sec;
 % >                 ts3.tv_nsec = ts.tv_nsec;
 
 These changes just introduce style bugs.
 
 % 260a264,265
 % >                 if (tv.tv_sec == 0 && tv.tv_usec < tick)
 % >                         return (0);
 
 This can't be right.  We have just not-so-carefully checked whether
 the time has expired, and only get here when it hasn't.
 (tv.tv_sec == 0 && tv.tv_usec < tick) means that we would have preferred
 the sleep time to be less than 1 tick.  We had to request a sleep of
 exactly 1 tick because less than 1 is impossible (this is with 1
 subtracted from tvtohz()).  Sleeping for exactly 1 tick is also
 impossible, so we have woken up after an interval of anywhere between
 0+epsilon and (1-epsilon+latency) seconds.  The interval may be
 significantly smaller or larger than `tv' and we must go back to
 sleep if it is smaller.  The above change breaks this.
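 
 The effect of the early return can be sketched with a small model of
 the sleep loop.  This is not nanosleep1() itself: times are plain
 microsecond counts, wakeups are assumed to land exactly on tick
 boundaries, the request is assumed to be positive, and the function
 name and the `early_return' flag are invented for the example.

```c
/*
 * Hypothetical model of the loop with 1 subtracted from tvtohz().
 * Returns the time (in usec) at which the caller finally returns.
 * With early_return != 0 it mimics the patch: return as soon as less
 * than one tick remains, even though the deadline has not been reached.
 */
long
final_wake_usec(long start, long request, long tick, int early_return)
{
	long deadline = start + request;
	long now = start;
	long remaining, nticks;

	for (;;) {
		remaining = deadline - now;
		nticks = (remaining + tick - 1) / tick;	/* no extra tick */
		if (nticks < 1)
			nticks = 1;
		/* The timeout fires on the nticks-th tick boundary after now. */
		now = (now / tick + nticks) * tick;
		if (now >= deadline)
			return (now);		/* time has expired */
		if (early_return && deadline - now < tick)
			return (now);		/* patched: gives up early */
	}
}
```

 With tick = 10000, a 2000 usec request made at time 9000 wakes at 10000
 with 1000 usec still to go.  The unpatched loop sleeps again and
 returns at 20000; the patched one returns at 10000, before the deadline
 of 11000.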
 
 I think the problem that this change is supposed to fix is related to
 the tick frequency not being an exact multiple of 1/HZ.  Also, to avoid
 sleeping longer than necessary, we should try to wake up 1 tick early
 and then decide whether to sleep another tick or 2 to finish.  Note
 that although tvtohz() always rounds up, physical sleep intervals are
 always shorter than the specified timeout, so waking up 1 tick early
 is very common for unsynchronized sleeps.  Thus if we subtract 1 from
 tvtohz(), we often wake up 1 tick early as a side effect, which is what
 we want, but there is a problem: suppose that everything is in
 perfect sync, but the hardclock interrupt period is slightly less
 than 1/HZ seconds.  Then we may wake up 5 usec or so early and decide
 to go back to sleep, giving a large error.  Changes later in the patch
 are related to this.  I think we shouldn't do anything special here
 except possibly return early if `tv' is very small.
 
 Going around the loop in nanosleep1() an extra time is a small
 pessimization.  Using nanouptime() to get the decision of whether to
 loop right is a pessimization too, but it is relatively small.
 
 % /usr/src/sys/i386/isa/clock.c
 % 113c113,114
 % < #define       TIMER_DIV(x) ((timer_freq + (x) / 2) / (x))
 % ---
 % > #define TIMER_DIV(x) (timer_freq / (x))
 % > #define FRAC_ADJUST(x) (timer_freq - ((timer freq / (x)) * (x)))
 
 Reducing TIMER_DIV() unconditionally would be harmless under FreeBSD.
 Its rounding to nearest dates from when there was little more than
 hardclock ticks for timekeeping.  Now HZ and the hardclock interrupt
 frequency are almost unrelated to timekeeping.
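 
 For a sense of the sizes involved, here is a sketch of the old and new
 divisor calculations as plain functions.  The function names are made
 up, and the 1193182 Hz value used below for timer_freq is an assumption
 (the nominal i8254 input clock); real hardware may be calibrated to a
 slightly different value.

```c
/* Old TIMER_DIV(): divide, rounding to the nearest integer. */
unsigned
timer_div_nearest(unsigned freq, unsigned x)
{
	return ((freq + x / 2) / x);
}

/* New TIMER_DIV() from the patch: divide, truncating. */
unsigned
timer_div_trunc(unsigned freq, unsigned x)
{
	return (freq / x);
}

/* FRAC_ADJUST() from the patch: the remainder lost by truncation. */
unsigned
frac_adjust(unsigned freq, unsigned x)
{
	return (freq - (freq / x) * x);
}
```

 With freq = 1193182 and x = 100, rounding to nearest gives 11932,
 truncation gives 11931, and the remainder is 82 counts per second.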
 
 % 141a143
 % > u_int   timer0_frac_freq;
 % 204a207,209
 % >         int phase;
 % >         int delta;
 % >
 % 215a221,236
 % >
 % >         phase = 1000000 / timer0_frac_freq;
 % >         delta = timecounter->tc_microtime.tv_usec % phase;
 
 tc_microtime.tv_usec is not quite the right thing to use here.  It is
 updated every tick or two so it might be up to date, but it has
 unnecessary jitter.  microtime() would give a more accurate timestamp.
 I think microtime() and not microuptime() is the correct function to
 use here, since we want to sync with the real time.  OTOH, nanosleep1()
 and friends use the uptime, so they must be looked at some more to
 determine the effects of using different time scales on syncing.  I
 think the synchronization done here is honored by nanosleep1() despite
 the different scales, and sync is only lost when the clock is changed
 using settimeofday() (then everything gets out of sync).
 
 % > #if 1
 % >       disable_intr();
 
 The clock should be read inside this critical section.
 
 % >         if (delta < (phase >> 1)) {
 % >                 outb(TIMER_CNTR0, timer0_max_count & 0xff);
 % >                 outb(TIMER_CNTR0, timer0_max_count >> 8);
 % >         } else {
 % >                 outb(TIMER_CNTR0, (timer0_max_count +1) & 0xff);
 % >                 outb(TIMER_CNTR0, (timer0_max_count +1) >> 8);
 % >                 ++i8254_offset;
 % >         }
 
 I think i8254_offset needs to be reinitialized every time the maximum
 count is reprogrammed.  This is not done in set_timer_freq(); however,
 most callers of set_timer_freq() initialize or update the i8254
 timecounter immediately after, and testing shows that this reduces
 lost ticks to an acceptable value (usually, and hopefully always < 10).
 Correctly reprogramming the i8254 on every interrupt is harder.  Losing
 even 1 tick per interrupt is too much, but I think the above can
 sometimes lose 100 (if clkintr() is delayed for that long, which can
 easily happen especially in RELENG_4 since clkintr() is not a fast
 interrupt handler there).  See nearby code that calls
 i8254_get_timecount() inside a critical section for a way to reduce
 the error to at most 5 ticks.  It takes about 5 ticks just to read the
 counter.  This is still far too large to do on every clock tick.  All
 of this only matters if the i8254 is used for timekeeping.
 
 % >       enable_intr();
 % > #endif
 % >
 % 236a258
 % >                 timer0_frac_freq = new_rate;
 % 247,248c269,270
 % <               if ((timer0_prescaler_count += timer0_max_count)
 % <                   >= hardclock_max_count) {
 % ---
 % >                 timer0_prescaler_count += timer0_max_count;
 % >                 if (timer0_prescaler_count >= hardclock_max_count) {
 
 This change is just to style.
 
 % 689a712
 % >         timer0_frac_freq = intr_freq;
 
 The changes seem to be too simple to give a PLL.  I didn't check the details
 for this.
 
 % 1221c1244
 % <       count = timer0_max_count - ((high << 8) | low);
 % ---
 % >         count = timer0_max_count + 1 - ((high << 8) | low);
 
 Always adding 1 here seems to be wrong.  Shouldn't you only add 1 if
 timer0_max_count isn't actually the max count, i.e., when the max count
 has been programmed to be 1 more than usual?  All references to
 timer0_max_count are potentially wrong when timer0_max_count isn't
 actually the max count.  You add 1 to i8254_offset in the above; this
 seems to be to adjust for 1 of the references being wrong, but it doesn't
 seem to adjust for `count' being 1 too large.
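 
 One way to read this objection is sketched below.  It is only a sketch:
 the function name and the `extended' flag are hypothetical, and it
 assumes the counter counts down from the value last programmed into it.
 The elapsed count is then taken relative to that programmed value, so 1
 is added only for periods in which the patch programmed
 timer0_max_count + 1.

```c
/*
 * Hypothetical helper: counts elapsed since the last reload.
 * `extended' is nonzero if this period was programmed with
 * max_count + 1 instead of max_count; `high' and `low' are the two
 * bytes latched from the counter.
 */
unsigned
elapsed_count(unsigned max_count, int extended, unsigned high, unsigned low)
{
	unsigned reload = max_count + (extended ? 1 : 0);

	return (reload - ((high << 8) | low));
}
```

 With max_count = 11931, a counter latched immediately after a normal
 reload (high = 46, low = 155) yields 0 here, whereas always adding 1
 would yield 1.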
 
 % A sawtooth is still present, but the accuracy is MUCH better.  I suspect my hack application of the PLL function isn't correct or my P133 is slow enough that I'm observing some other latencies.  I have observed occasional negative offsets, which according to the article are strictly forbidden by RFCs, so please check my work.  I believe they were the result of my playing with a hz value too high for the machine to reasonably handle, and are not occuring with saner values for hz.
 
 I only agree with the non-hardware changes (don't sleep for an extra
 tick in nanosleep1() and friends if this is easy to avoid).  All that
 perfect sync of real time with the hardclock() clock gives is the
 possibility of waking up on precisely 1/HZ boundaries relative to real
 time (with whole seconds being boundaries).  System activity lengthens
 sleeps by indeterminate amounts except on unloaded systems.  The average
 error for a random sleep on an unloaded system would still be 0.5/HZ
 (or 1.5/HZ without the nanosleep1() change).
 
 Bruce


