Date: Thu, 21 Aug 2014 13:34:47 +1000 (EST)
From: Bruce Evans
To: Bruce Evans
Cc: Mateusz Guzik, Robert Watson, Johan Schuijt,
    freebsd-arch@freebsd.org, Konstantin Belousov
Subject: Re: [PATCH 0/2] plug capability races
In-Reply-To: <20140821044234.H11472@besplex.bde.org>
Message-ID: <20140821113753.D933@besplex.bde.org>
References: <1408064112-573-1-git-send-email-mjguzik@gmail.com>
    <201408151031.45967.jhb@freebsd.org>
    <20140816102840.V1007@besplex.bde.org>
    <201408201111.47601.jhb@freebsd.org>
    <20140821044234.H11472@besplex.bde.org>

On Thu, 21 Aug 2014, Bruce Evans wrote:

> ...
> I now remember a bit more about the algorithm.  There are several
> generations of timehands.  Each generation remains stable for several
> clock ticks.  That should be several clock ticks at 100 Hz.  Normally
> there is no problem with just using the old pointer read from timehands
> (except there is no serialization for updating timehands itself (*)).
> ...
> (*):
>
> % binuptime(struct bintime *bt)
> % {
> %         struct timehands *th;
> %         u_int gen;
> %
> %         do {
> %                 th = timehands;
>
> Since tc_windup() also doesn't dream of memory ordering, timehands here
> may be in the future of what it points to.  That is much worse than it
> being in the past.  Barriers would be cheap in tc_windup() but useless
> if they require barriers in binuptime() to work.
>
> tc_windup() is normally called from the clock interrupt handler.  There
> are several mutexes (or at least atomic ops that give synchronization on
> at least x86 SMP) before and after it.  These give serialization very
> soon after the changes.
>
> The fix (without adding any barrier instructions) is easy.  Simply
> run the timehands update 1 or 2 generations behind the update of what
> it points to.  This gives even more than time-domain locking, since
> the accidental synchronization from the interrupt handler gives ordering
> between the update of the pointed-to data and the timehands pointer.
> ...
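To make the missing ordering concrete, here is a rough userland model
of the generation loop.  The names are mine, not the kernel's, and all
the per-generation data is collapsed into one field, so this is only a
sketch of the idea; C11 atomics spell out the ordering that the kernel
code leaves implicit:

/*
 * Userland model of the timehands generation algorithm (names are
 * mine; all of the th_* data is collapsed into one field).
 */
#include <stdatomic.h>
#include <stdint.h>

struct model_th {
	atomic_uint gen;		/* 0 while a windup rewrites the slot */
	_Atomic uint64_t offset;	/* stands in for all the th_* data */
};

/* Slot 0 starts valid so readers don't spin before the first windup. */
static struct model_th ths[10] = {{ .gen = 1 }};
static struct model_th *_Atomic model_timehands = &ths[0];

static uint64_t
model_binuptime(void)
{
	struct model_th *th;
	uint64_t off;
	unsigned gen;

	do {
		/* acquire: see at least what was published with this gen */
		th = atomic_load_explicit(&model_timehands,
		    memory_order_acquire);
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		off = atomic_load_explicit(&th->offset,
		    memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		/* retry if a windup was in progress or overtook us */
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
	return (off);
}

static void
model_windup(uint64_t newoff)
{
	static unsigned widx, wgen = 1;	/* wgen wraparound ignored here */
	struct model_th *th;

	th = &ths[++widx % 10];		/* next slot; never the live one */
	/* mark in progress so a reader holding a stale pointer retries */
	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->offset, newoff, memory_order_relaxed);
	atomic_store_explicit(&th->gen, ++wgen, memory_order_release);
	/*
	 * This release is the ordering that tc_windup() doesn't dream
	 * of: drop it and a reader may see the new pointer while the
	 * slot it points to is still in the future for that CPU.
	 */
	atomic_store_explicit(&model_timehands, th, memory_order_release);
}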
More details:

- lock tc_windup() and tc_ticktock() using a spinlock (see the sketch
  after this list)

- add hard real-time rate limiting and error recovery so that the
  timehands are not cycled through too fast or too slow.  tc_ticktock()
  already does this for calls from the clock interrupt handler except
  when clock interrupts are non-hard.  tc_ticktock() can use
  mtx_trylock() and do nothing if the mutex is contested.

  tc_setclock() and possibly inittimecounter() should wait to
  synchronize with the next clock interrupt that would call
  tc_windup(), and advance the time that they set by the wait delay
  plus previous delays, and even more, since their changes shouldn't go
  live for several generations.  tc_setclock() sort of does this now,
  in a broken way.  It corrupts the boot time using racy accesses.
  This limits problems from large adjustments to realtime clock ids
  (the ones that add the boot time).  There are no further delays, just
  races accessing the boot time in critical places like boottime().
  Delays are now also limited by calling tc_windup(), with the updated
  timehands going live almost immediately (as soon as the call
  completes).  The immediate tc_windup() call is commented on as being
  to fiddle with all the crinkly bits around the fiords, but the only
  critical thing it does is update the generation count in a fairly
  non-racy way -- this tells bintime() to loop, so it has a chance of
  picking up the changed boot time with a coherent value.

  sysctl_kern_timecounter_hardware() should call tc_windup() to do a
  staged update much like the one for tc_setclock().  It refrains from
  doing this because of the races, but it hacks on the timehands
  pointer in a different and even more fragile racy way.  It now calls
  timekeep_push_vdso() to do the userland part of tc_windup().

  The timehands may be recycled too slowly.  This happens mainly on
  suspension.  The system depends on frequent windups to work, so it
  can't run really tickless.  After suspension, all old generations are
  garbage, but their generation counts might not have been updated to
  indicate this.  The system should at least try to detect this.  I
  don't understand what happens for timecounters on resume now.

- in tc_windup(), bump the generation count for the second-oldest
  generation instead of setting it to 0 for the current generation, and
  update the timehands for the oldest generation instead of changing
  them for the current generation.  This also fixes busy-waiting and
  contention on the timehands for the current generation during the
  windup.

  Using the special generation count of 0 essentially reduces the
  "queue" of timehands from length 10 to length 0 during the windup, at
  a cost of complications and bugs.  It also makes the other 9
  generations of the timehands almost never used, and not very useful.
  1 generation together with a generation count that is set to 0 during
  windups suffices, at the cost of spinning while the generation count
  is 0 and of complications and bugs in accesses to the generation
  count.  But the current version already has all these costs in the
  usual case where the generation changes.

  tc_windup() is supposed to run with interrupts disabled, so that it
  cannot be preempted and the length of the spinning is bounded.
  (Having only Giant locking for the call in settime() is even worse
  than first appeared.  It doesn't prevent preemption at all, so the
  length of the spinning is unbounded.)
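The locking in the first item could look something like this (a sketch
against kern_tc.c, not compile-tested; I use an ordinary mutex with
mtx_trylock() here and leave the choice of lock type as a detail):

/*
 * Sketch for kern_tc.c: serialize windups, and let tc_ticktock()
 * skip a tick instead of spinning when the windup is contested.
 * tc_setclock() etc. would take the same mutex, blocking instead
 * of using trylock.
 */
static struct mtx tc_windup_mtx;
MTX_SYSINIT(tc_windup, &tc_windup_mtx, "tc_windup", MTX_DEF);

void
tc_ticktock(int cnt)
{
	static int count;

	count += cnt;
	if (count < tc_tick)
		return;
	if (!mtx_trylock(&tc_windup_mtx))
		return;		/* contested: another CPU is winding up */
	count = 0;		/* otherwise keep count and retry next tick */
	tc_windup();
	mtx_unlock(&tc_windup_mtx);
}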
In unusual cases, binuptime() is preempted and the generation count
changes many times before the original timehands is used.  Then the
pointer to it is invalid.  But the generation count in it has increased
by more than usual, so the change is detected and the pointer is
updated.  So old generations are not used for storing anything
important except for the generation count, and having 10 generations
just reduces the rate of increase of generation counts by a factor of
10, so it takes preemption by 10 * 2^32 windups instead of only 2^32
for the algorithm to be broken by wraparound of the generation count
(with HZ = 1000, that is 490 days of preemption instead of only 49).

The delayed updates might cause different complications.  I think ntp
seconds updates strictly should be done in advance so as to go live on
seconds rollover.  The details can't be too critical, since with HZ =
100 tc_windup() calls are out of sync with seconds rollovers by an
average of 5 milliseconds (+-5) and no one seemed to notice problems
from that.  Isn't there an error of 1 second for the duration of the
sync time around leap second adjustments?  With HZ = 1000, the update
"queue" with intentionally delayed updates could have length 5 and
give much the same behaviour except for missing races (the average
delay would still be 5 milliseconds, but now +-0.5).
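In the same userland model as the first sketch, the intentionally
delayed update could look like this (reusing those declarations;
MODEL_DELAY is an arbitrary name and value, and this is only a model
of the idea, not a patch):

/*
 * Delayed publication: each windup fills the next slot but publishes
 * the one filled MODEL_DELAY windups ago.  On real hardware the locks
 * in the interrupt path make the old stores visible long before then,
 * so the pointer can no longer run ahead of its data; the release on
 * the pointer store is only what strict C11 still wants.
 */
#define	MODEL_DELAY	2

static void
model_windup_behind(uint64_t newoff)
{
	static unsigned widx, wgen = 1;
	struct model_th *th;

	/* fill the next slot, exactly as in model_windup() */
	th = &ths[++widx % 10];
	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->offset, newoff, memory_order_relaxed);
	atomic_store_explicit(&th->gen, ++wgen, memory_order_release);

	/*
	 * Publish the slot filled MODEL_DELAY windups ago.  In the
	 * real algorithm an old generation still gives the current
	 * time, since readers add the hardware counter delta taken
	 * since that generation's snapshot.
	 */
	if (widx >= MODEL_DELAY)
		atomic_store_explicit(&model_timehands,
		    &ths[(widx - MODEL_DELAY) % 10], memory_order_release);
}

Bruce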