Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 18 Aug 2013 00:49:53 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Hiroki Sato <hrs@FreeBSD.org>
Cc:        arch@FreeBSD.org
Subject:   Re: suspend/resume time-gap and expiration timers in network stack
Message-ID:  <20130818004948.L4326@besplex.bde.org>
In-Reply-To: <20130817.173019.1478850854128616078.hrs@allbsd.org>
References:  <20130817.173019.1478850854128616078.hrs@allbsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 17 Aug 2013, Hiroki Sato wrote:

> I recently committed s/time_second/time_uptime/ replacement in
> sys/netinet6 which was done in sys/netinet some years ago.  It works
> well and should be more robust against time_second deviation (by NTP,
> for example).  However, IPv6 default routers and prefixes do not
> expire as before if the machine is suspended and then resumed after a
> long time, say, 3 days.
>
> I think the impact is limited because suspend/resume causes to reset
> all of the physical network interfaces in some way, but in some cases
> the decayed entries can persist.  So I would like your comments about
> the following idea to solve this:

The monotonic clock has always been broken across suspend/resume inxi
FreeBSD.  Long ipv6 timeouts are only one of the things broken.  If
you fux the monotonic clock, then the ipv6 code shouldn't need any
changes.

POSIX.1 (2001-draft7.txt) specifies the monotonic clock as follows:

%%%
6679 MON          If the Monotonic Clock option is supported, all implementations shall support a clock_id of
6680              CLOCK_MONOTONIC defined in <time.h>. This clock represents the monotonic clock for the
6681              system. For this clock, the value returned by clock_gettime( ) represents the amount of time (in
6682              seconds and nanoseconds) since an unspecified point in the past (for example, system start-up
6683              time, or the Epoch). This point does not change after system start-up time. The value of the
6684              CLOCK_MONOTONIC clock cannot be set via clock_settime( ). This function shall fail if it is
6685              invoked with a clock_id argument of CLOCK_MONOTONIC.
%%%

Note that the "unspecified point in the past" "does not change after
system start-up [bogus word time]".  This is very broken in FreeBSD.
In FreeBSD, at least without the FFCLOCK complications, the point in
the past is nominally the boot time, which is given by the variable
boottimebin (or boottime), but the real time is misimplemented as the
sum of boottimebin and the monotonic time and since the monotonic time
never jumps, boottimebin must jump instead whenever the real time
jumps.  The jump occurs in tc_setclock(), so almost everything that
calls there gives a bug.  Some jumps are more real than others, but
just 3 types of jump should change the boot time.

Types of jumps:
1. when the system is initialized.  Times are set by eventually calling
     tc_setclock.  The real time and the boot time are statically initialized
     by 0.  The first dynamic initialization is thus seen as a large jump
     forward of the real time.  The boot time is then set to the real time,
     to maintain the (buggy) invariant real_time = monotonic_time + boottime.
     The monotonic time remains at 0, or at perhaps a small value of a few
     seconds that the boot has taken to reach this point.  This jump of the
     boot time is correct, though it can only give a fuzzy setting for the
     boot time, since even if the monotonic time is nonzero at this point,
     it can only give the time since some earlier indeterminate point in the
     boot (several seconds after the boot started, when the hardware timer
     started being used).
2. Many systems have the RTC on wall clock time.  Both of the times set by
     (1) are then wrong by the same large amount (10-11 hours here).  The
     real time is normally fixed up by adjkerntz using settimeofday(2).  This
     gives a large jump in the real time and a large jump in the boot time.
     Both jumps are correct.
3. Many systems set the real time more precisely using ntp.  This is best
     done by ntpdate stepping the clock.  If it is done by ntpd stepping the
     clock, then the jump will occur later and you don't have so much control
     over when it occurs.  If it is done by ntpd slewing the clock, then the
     boot time doesn't get set more precisely.  I think ntpd just uses
     settimeofday() or clock_settime() when it steps the clock.  This calls
     tc_setclock(), and as usual jumps the boot time to match the jump in the
     real time.  Both jumps are correct.  When ntpd slews the clock, it uses
     ntp_adjtime() on FreeBSD.  Plain adjtime() would do much the same.  This
     makes micro-adjustments to the monotonic time which become
     micro-adjustments to the real time via the invariant real_time =
     monotonic_time + boottime.  Strictly, adjusting the monotonic time is
     a bug when it is to fix an error in the current real time, but with
     sufficiently small micro-adjustments to slew ntpd's maximum amount of
     1/128 (?) seconds, this doesn't cause any more problems than not doing
     the much smaller adjustments for relativistic effects.
All other types of jumps of the boot time are bugs.  After initialization,
the boot time is whatever it is and can't change.
4. Leap seconds adjustments.  boottimebin is jumps to maintain the
     invariant.  This happens in tc_windup(), not in tc_setclock().  The
     jumps are rare and not very large, so no one usually notices the
     boot time being clobbered.
5. Any call to setimeofday() to do further adjustments of the real time.
     Perhaps you aren't using ntpd, or stopped ntpd or suspended and
     resumed and are fixing up times manually.  This breaks the value of
     boottimebin (assuming that it was set correctly by steps 1-3) by jumping
     it by the same amount as the real time.
6. Calls to tc_setclock() on resume to fix up the real time.  These clobber
     the boot time.  But more seriously, the invariant and the clobbering of
     the boot time to maintai it break the monotonic time perfectly by
     allowing to not advance across long suspend/resume operations.

The implementation should be something like:

      real_time = monotonic_time + base_time

where base_time != boot_time but is updated similarly to the current
boot_time, and boot_time which is updated similarly to the current
boot_time only during initialization.  Also, monotonic_time needs to
be updated for (6) and possibly for (5) (to fix it up after anything
stops seeing it advancing at very nearly a constant rate).

To limit adjustments of boot_time to initialization, something like
the machdep.disable_rtc_set sysctl could be used.  It is not easy for
tc_setclock() to determine if its context is for initialization.

> 1. Record RTC value upon suspend.

This should already be done to support fixing up the real time on resume.
Both the real and monotonic times should be advanced on resume by

      <real time on suspend> - <RTC time on suspend>

The acpica code that fixes up the real time on resume is hard to find.
I can only find acpi_resync_clock().  This is under an __amd64__ ifdef
and a boolean sysctl that defaults to on.  It is very broken:

% static void
% acpi_resync_clock(struct acpi_softc *sc)
% {
% #ifdef __amd64__
%     if (!acpi_reset_clock)
% 	return;
% 
%     /*
%      * Warm up timecounter again and reset system clock.
%      */
%     (void)timecounter->tc_get_timecount(timecounter);
%     (void)timecounter->tc_get_timecount(timecounter);
%     inittodr(time_second + sc->acpi_sleep_delay);
% #endif
% }

The time_second + ... arg for inittodr() is probably out of date, but
it is only used if reading the RTC hardware fails.  This function is
very broken since it doesn't use the RTC delta.  It is basically missing
all of the initialization steps (1)-(3), plus any later tracking of the
real time done by ntp.  The largest error is when the RTC is on wall
clock time.  Then the above gives an error of 10 hours or so.

Old i386-isa only apm code tries to do the right thing, but is buggy in
the details:

% static struct timeval suspend_time;
% static struct timeval diff_time;
% 
% static int
% pmtimer_suspend(device_t dev)
% {
% 
% 	microtime(&diff_time);
% 	inittodr(0);
% 	microtime(&suspend_time);
% 	timevalsub(&diff_time, &suspend_time);
% 	return (0);
% }

This abuses inittodr() to read the RTC time.  It clobbers the current
real time (and collateral state like boottimebin) by setting it to the
current time.  It then reads the RTC time with a second microtime().
There are races in this that may give errors of a second or 2, but
these are unimportant compared with races from clobbering the current
time.  Otherwise, this caclulates the time adjustment correctly.  There
are now APIs for reading RTC times, but this old code hasn't been changed
to use them.  The only recent changes have been to remove spl()'s that
gave hints about the necessary locking.

% 
% static int
% pmtimer_resume(device_t dev)
% {
% 	u_int second, minute, hour;
% 	struct timeval resume_time, tmp_time;
% 
% 	/* modified for adjkerntz */
% 	timer_restore();		/* restore the all timers */

Done better in resume methods.

% 	inittodr(0);			/* adjust time to RTC */

Same clobbering of real time as on suspend, so that we have a time that
the adjustment applies to.

% 	microtime(&resume_time);
% 	getmicrotime(&tmp_time);
% 	timevaladd(&tmp_time, &diff_time);

Caclulate new real time to adjust to.

But everything after this point is garbage.

% 
% #ifdef FIXME
% 	/* XXX THIS DOESN'T WORK!!! */
% 	time = tmp_time;

This might have worked before timecounters.  Now it just needs a
tc_setclock().  But no timer-setting code is called except
inittodr(0), so the calculated adjustment is never used and the
result is the same as for acpi_resync_clock().

% #endif
% 
% #ifdef PMTIMER_FIXUP_CALLTODO

This became garbage before it was committed.

But note that complete fixes for monotonic times also requires fixing
timeouts.  Most timeouts sleep on monotonic time, even when they should
sleep on real time.  You want the ones that expired during
suspend/resume to time out soon after resume, but not all at once.  I
think PMTIMER_FIXUP_CALLTODO worked in a version of FreeBSD before it
was committed by making them all time out at once.

% 	/* Calculate the delta time suspended */
% 	timevalsub(&resume_time, &suspend_time);
% 	/* Fixup the calltodo list with the delta time. */
% 	adjust_timeout_calltodo(&resume_time);
% 	/* 
% 	 * We've already calculated resume_time to be the delta between 
% 	 * the suspend and the resume. 
% 	 */
% 	second = resume_time.tv_sec; 
% #else /* !PMTIMER_FIXUP_CALLTODO */
% 	second = resume_time.tv_sec - suspend_time.tv_sec; 
% #endif /* PMTIMER_FIXUP_CALLTODO */
% 	hour = second / 3600;
% 	second %= 3600;
% 	minute = second / 60;
% 	second %= 60;
% 	log(LOG_NOTICE, "wakeup from sleeping state (slept %02d:%02d:%02d)\n",
% 		hour, minute, second);
% 	return (0);
% }

> 2. Calculate the time-gap by using the recorded value and the current
>    one upon resume.
>
> 3. Store the calculated time-gap in time_uptime_gap variable and use
>    (time_uptime + time_uptime_gap) for expiration timers.

The gap needs to be added to the monotonic time.  Then it will fix more
than you want to know about, but nothing will need to add it later.
There are some races in doing this addition, but tc_setclock() already
has many races (it modifies the active multi-word boottime* variables
with null locking), and then may reenter tc_windup().  Adding to the
timehands variables is a bit trickier).

The gaps are actually subtly different:

      RTC_gap = resume_RTC_time - susp_RTC_time
      resume_real_time = resume_RTC_time + (adj = susp_real_time - susp_RTC_time)
  		     = susp_real_time + RTC_gap.

I think you should use:

      resume_mono_time = susp_mono_time + RTC_gap,

but you actually use:

      resume_mono_time = orig_resume_mono_time + RTC_gap.

The mono time will advance a bit after the RTC is read at suspend time,
and it is important not to let the mono time go backwards, but the last
formula counts a lot of mono time twice.

> If there is no suspend/resume event, nothing happens.  If any, the
> time-gap is added to the expiration time.  The time_uptime_gap
> variable is always positive, so (time_uptime + time_uptime_gap) never
> goes backwards.

Locking is no simpler with addition of the gap when it is used.  Even
adding seconds only has a change of being atomic on 32-bit arches if
time_t remains in seconds.  For bintimes, there are much the same
races adding the gap to the incomplete monotonic time as for adding
frobbed boottimebin to the incomplete monotonic time to get the
real time.  Reads of boottimebin are locked by the generation count,
modulo bugs in the locking.  Writes of boottimebine are unlocked.
Both boottimebin and the gap are rarely changed so races writing them
are rarely lost.

> A experimental patch is attached.  Changes include the following:
>
> ====
> - Add clock_suspend_stamp() and clock_resume_fixup() functions to
>  calculate suspend/resume time-gap.  clock_suspend_stamp() record
>  an RTC value just before disabling timer, and clock_resume_fixup()
>  store the time-gap into time_uptime_gap variable.  Ideally,
>  time_uptime + time_uptime_gap becomes equal to the elapsed time
>  since boottime.

I haven't fixed this since none of my machines do suspend right.

I use very accurate (~1 usec) related fixups in the RTC interrupt
handler.  It basically calculates the gap every 64 seconds for later
use.  I only use this to fix up the real time after stopping in ddb.
The fixup is done by tc_setclock().  Stopping clock updates in ddb is
very like stopping in suspend/resume.  tc_setlock() clobbers boottimebin
by the amount of time spent stopped in ddb.  The ~1 usec accuracy is
obtained by synchronizing RTC reads with seconds rollovers and by
adjusting for the current drift rate (drift is typically 5 usec/sec,
so if ddb stops clock updates for 60 seconds, then the gap is 5 * 60
extra.  After hours-long suspensions, you can't hope for microseconds
accuracy, but should try to avoid seconds inaccuracies).

> - Add time-gap calculation in ACPI suspend/resume.
>
> - Add kern.clock_suspend_stamp and kern.clock_resume_fixup sysctls to
>  allow time-gap calculation from userland.  These are write-only, and
>  setting them to non-zero calls the functions.  Typically,
>  kern.clock_suspend_stamp should be set to 1 just before suspend,
>  and kern.clock_resume_fixup should be set to 1 just after resume.
>  kern.clock_resume_fixup is idempotent.

It's reasonable to do it all from userland.  I delay the fixups for ddb
by 3 seconds, and reject all backwards fixups and forwards fixups of
more than 1 day.

The patch is too large for me.  This shouldn't take much more than twice
the 15 lines taken by acpi_resync_clock() to do a buggy real time fixup
(another 15 lines to do a non-buggy real time fixup, and the same to
apply apply the same fixup to the monotonic time.  Unfortunately,
probably many more to untangle boottimebin).

subr_rtc.c needs a clock-reading function to begin with, so that the
pmtimer code can find the active clock without hacking on inittodr().

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130818004948.L4326>