Date: Thu, 21 Aug 2014 13:34:47 +1000 (EST)
From: Bruce Evans
To: Bruce Evans
Cc: Mateusz Guzik, Robert Watson, Johan Schuijt,
    freebsd-arch@freebsd.org, Konstantin Belousov
Subject: Re: [PATCH 0/2] plug capability races
In-Reply-To: <20140821044234.H11472@besplex.bde.org>
Message-ID: <20140821113753.D933@besplex.bde.org>
References: <1408064112-573-1-git-send-email-mjguzik@gmail.com>
    <201408151031.45967.jhb@freebsd.org>
    <20140816102840.V1007@besplex.bde.org>
    <201408201111.47601.jhb@freebsd.org>
    <20140821044234.H11472@besplex.bde.org>

On Thu, 21 Aug 2014, Bruce Evans wrote:

> ...
> I now remember a bit more about the algorithm.  There are several
> generations of timehands.  Each generation remains stable for several
> clock ticks.  That should be several clock ticks at 100 Hz.  Normally
> there is no problem with just using the old pointer read from timehands
> (except there is no serialization for updating timehands itself (*)).
> ...
> (*):
>
> % binuptime(struct bintime *bt)
> % {
> %         struct timehands *th;
> %         u_int gen;
> %
> %         do {
> %                 th = timehands;
>
> Since tc_windup() also doesn't dream of memory ordering, timehands here
> may be in the future of what it points to.  That is much worse than it
> being in the past.  Barriers would be cheap in tc_windup() but useless
> if they require barriers in binuptime() to work.
>
> tc_windup() is normally called from the clock interrupt handler.  There
> are several mutexes (or at least atomic ops that give synchronization on
> at least x86 SMP) before and after it.  These give serialization very
> soon after the changes.
>
> The fix (without adding any barrier instructions) is easy.  Simply
> run the timehands update 1 or 2 generations behind the update of what
> it points to.  This gives even more than time-domain locking, since
> the accidental synchronization from the interrupt handler gives ordering
> between the update of the pointed-to data and the timehands pointer.
> ...
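To make the missing ordering concrete, here is a rough userland model
of the generation loop.  The names are mine, not the kernel's, and all
the per-generation data is collapsed into one field, so this is only a
sketch of the idea; C11 atomics spell out the ordering that the kernel
code leaves implicit:

/*
 * Userland model of the timehands generation algorithm (names are
 * mine; all of the th_* data is collapsed into one field).
 */
#include <stdatomic.h>
#include <stdint.h>

struct model_th {
	atomic_uint gen;		/* 0 while a windup rewrites the slot */
	_Atomic uint64_t offset;	/* stands in for all the th_* data */
};

/* Slot 0 starts valid so readers don't spin before the first windup. */
static struct model_th ths[10] = {{ .gen = 1 }};
static struct model_th *_Atomic model_timehands = &ths[0];

static uint64_t
model_binuptime(void)
{
	struct model_th *th;
	uint64_t off;
	unsigned gen;

	do {
		/* acquire: see at least what was published with this gen */
		th = atomic_load_explicit(&model_timehands,
		    memory_order_acquire);
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		off = atomic_load_explicit(&th->offset,
		    memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		/* retry if a windup was in progress or overtook us */
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
	return (off);
}

static void
model_windup(uint64_t newoff)
{
	static unsigned widx, wgen = 1;	/* wgen wraparound ignored here */
	struct model_th *th;

	th = &ths[++widx % 10];		/* next slot; never the live one */
	/* mark in progress so a reader holding a stale pointer retries */
	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->offset, newoff, memory_order_relaxed);
	atomic_store_explicit(&th->gen, ++wgen, memory_order_release);
	/*
	 * This release is the ordering that tc_windup() doesn't dream
	 * of: drop it and a reader may see the new pointer while the
	 * slot it points to is still in the future for that CPU.
	 */
	atomic_store_explicit(&model_timehands, th, memory_order_release);
}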
More details:

- lock tc_windup() and tc_ticktock() using a spinlock (see the sketch
  after this list)

- add hard real-time rate limiting and error recovery so that the
  timehands are not cycled through too fast or too slow.  tc_ticktock()
  already does this for calls from the clock interrupt handler except
  when clock interrupts are non-hard.  tc_ticktock() can use
  mtx_trylock() and do nothing if the mutex is contested.

  tc_setclock() and possibly inittimecounter() should wait to
  synchronize with the next clock interrupt that would call
  tc_windup(), and advance the time that they set by the wait delay
  plus previous delays, and even more, since their changes shouldn't go
  live for several generations.  tc_setclock() sort of does this now,
  in a broken way.  It corrupts the boot time using racy accesses.
  This limits problems from large adjustments to realtime clock ids
  (the ones that add the boot time).  There are no further delays, just
  races accessing the boot time in critical places like boottime().
  Delays are now also limited by calling tc_windup(), with the updated
  timehands going live almost immediately (as soon as the call
  completes).  The immediate tc_windup() call is commented on as being
  to fiddle with all the crinkly bits around the fiords, but the only
  critical thing it does is update the generation count in a fairly
  non-racy way -- this tells bintime() to loop, so it has a chance of
  picking up the changed boot time with a coherent value.

  sysctl_kern_timecounter_hardware() should call tc_windup() to do a
  staged update much like the one for tc_setclock().  It refrains from
  doing this because of the races, but it hacks on the timehands
  pointer in a different and even more fragile racy way.  It now calls
  timekeep_push_vdso() to do the userland part of tc_windup().

  The timehands may be recycled too slowly.  This happens mainly on
  suspension.  The system depends on frequent windups to work, so it
  can't run really tickless.  After suspension, all old generations are
  garbage, but their generation counts might not have been updated to
  indicate this.  The system should at least try to detect this.  I
  don't understand what happens for timecounters on resume now.

- in tc_windup(), bump the generation count for the second-oldest
  generation instead of setting it to 0 for the current generation, and
  update the timehands for the oldest generation instead of changing
  them for the current generation.  This also fixes busy-waiting and
  contention on the timehands for the current generation during the
  windup.

  Using the special generation count of 0 essentially reduces the
  "queue" of timehands from length 10 to length 0 during the windup, at
  a cost of complications and bugs.  It also makes the other 9
  generations of the timehands almost never used, and not very useful.
  1 generation together with a generation count that is set to 0 during
  windups suffices, at the cost of spinning while the generation count
  is 0 and of complications and bugs in accesses to the generation
  count.  But the current version already has all these costs in the
  usual case where the generation changes.

  tc_windup() is supposed to run with interrupts disabled, so that it
  cannot be preempted and the length of the spinning is bounded.
  (Having only Giant locking for the call in settime() is even worse
  than first appeared.  It doesn't prevent preemption at all, so the
  length of the spinning is unbounded.)
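The locking in the first item could look something like this (a sketch
against kern_tc.c, not compile-tested; I use an ordinary mutex with
mtx_trylock() here and leave the choice of lock type as a detail):

/*
 * Sketch for kern_tc.c: serialize windups, and let tc_ticktock()
 * skip a tick instead of spinning when the windup is contested.
 * tc_setclock() etc. would take the same mutex, blocking instead
 * of using trylock.
 */
static struct mtx tc_windup_mtx;
MTX_SYSINIT(tc_windup, &tc_windup_mtx, "tc_windup", MTX_DEF);

void
tc_ticktock(int cnt)
{
	static int count;

	count += cnt;
	if (count < tc_tick)
		return;
	if (!mtx_trylock(&tc_windup_mtx))
		return;		/* contested: another CPU is winding up */
	count = 0;		/* otherwise keep count and retry next tick */
	tc_windup();
	mtx_unlock(&tc_windup_mtx);
}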
In unusual cases, binuptime() is preempted and the generation count
changes many times before the original timehands is used.  Then the
pointer to it is invalid.  But the generation count in it has increased
by more than usual, so the change is detected and the pointer is
updated.  So old generations are not used for storing anything
important except for the generation count, and having 10 generations
just reduces the rate of increase of generation counts by a factor of
10, so it takes preemption by 10 * 2^32 windups instead of only 2^32
for the algorithm to be broken by wraparound of the generation count
(with HZ = 1000, that is 490 days of preemption instead of only 49).

The delayed updates might cause different complications.  I think ntp
seconds updates strictly should be done in advance so as to go live on
seconds rollover.  The details can't be too critical, since with HZ =
100 tc_windup() calls are out of sync with seconds rollovers by an
average of 5 milliseconds (+-5) and no one seemed to notice problems
from that.  Isn't there an error of 1 second for the duration of the
sync time around leap second adjustments?  With HZ = 1000, the update
"queue" with intentionally delayed updates could have length 5 and
give much the same behaviour except for missing races (the average
delay would still be 5 milliseconds, but now +-0.5).
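In the same userland model as the first sketch, the intentionally
delayed update could look like this (reusing those declarations;
MODEL_DELAY is an arbitrary name and value, and this is only a model
of the idea, not a patch):

/*
 * Delayed publication: each windup fills the next slot but publishes
 * the one filled MODEL_DELAY windups ago.  On real hardware the locks
 * in the interrupt path make the old stores visible long before then,
 * so the pointer can no longer run ahead of its data; the release on
 * the pointer store is only what strict C11 still wants.
 */
#define	MODEL_DELAY	2

static void
model_windup_behind(uint64_t newoff)
{
	static unsigned widx, wgen = 1;
	struct model_th *th;

	/* fill the next slot, exactly as in model_windup() */
	th = &ths[++widx % 10];
	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->offset, newoff, memory_order_relaxed);
	atomic_store_explicit(&th->gen, ++wgen, memory_order_release);

	/*
	 * Publish the slot filled MODEL_DELAY windups ago.  In the
	 * real algorithm an old generation still gives the current
	 * time, since readers add the hardware counter delta taken
	 * since that generation's snapshot.
	 */
	if (widx >= MODEL_DELAY)
		atomic_store_explicit(&model_timehands,
		    &ths[(widx - MODEL_DELAY) % 10], memory_order_release);
}

Bruce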