Date:      Sun, 7 Apr 2019 02:20:49 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Bruce Evans <brde@optusnet.com.au>, Michael Tuexen <tuexen@fh-muenster.de>, freebsd-hackers Hackers <freebsd-hackers@freebsd.org>,  FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Message-ID:  <20190407015735.W2994@besplex.bde.org>
In-Reply-To: <20190406121650.GJ1923@kib.kiev.ua>
References:  <20190324110138.GR1923@kib.kiev.ua> <E0785613-2B6E-4BB3-95CD-03DD96902CD8@fh-muenster.de> <20190403070045.GW1923@kib.kiev.ua> <20190404011802.E2390@besplex.bde.org> <20190405113912.GB1923@kib.kiev.ua> <20190405230717.D3383@besplex.bde.org> <20190405132128.GD1923@kib.kiev.ua> <20190406003907.C3872@besplex.bde.org> <20190405143619.GF1923@kib.kiev.ua> <20190406014724.X4174@besplex.bde.org> <20190406121650.GJ1923@kib.kiev.ua>

On Sat, 6 Apr 2019, Konstantin Belousov wrote:

> On Sat, Apr 06, 2019 at 02:26:11AM +1100, Bruce Evans wrote:
>> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
>>
>>> On Sat, Apr 06, 2019 at 01:01:19AM +1100, Bruce Evans wrote:
>>>> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
>>>>
>>>>> On Fri, Apr 05, 2019 at 11:52:27PM +1100, Bruce Evans wrote:
>>>>>> On Fri, 5 Apr 2019, Konstantin Belousov wrote:
>>>>>>
>>>>>>> On Thu, Apr 04, 2019 at 02:47:34AM +1100, Bruce Evans wrote:
>>>>>>>> I noticed (or better realized) a general problem with multiple
>>>>>>>> timehands.  ntpd can slew the clock at up to 500 ppm, and at least an
>>>>>>>> old version of it uses a rate of 50 ppm to fix up fairly small drifts
>>>>>>>> in the milliseconds range.  500 ppm is enormous in CPU cycles -- it is
>>>>>>>> 500 thousand nsec or 2 million cycles at 4GHz.  Winding up the timecounter
>>>>>>>> every 1 msec reduces this to only 2000 cycles.
>>>>>>>> ...
>>>>>>>> The main point of having multiple timehands (after introducing the per-
>>>>>>>> timehands generation count) is to avoid blocking thread N during the
>>>>>>>> update, but this doesn't actually work, even for only 2 timehands and
>>>>>>>> a global generation count.
>>>>>>>
>>>>>>> You are describing the generic race between reader and writer. The same
>>>>>>> race would exist even with one timehand (and/or one global generation
>>>>>>> counter), where the ntp adjustment might come earlier or later than some
>>>>>>> consumer's access to the timehands. If the timehands instance was read
>>>>>>> before tc_windup() ran but the code consumed the result after the windup,
>>>>>>> it might appear as if time went backward, and this cannot be fixed without
>>>>>>> either re-reading the time after the time-dependent calculations were done
>>>>>>> and restarting, or some global lock ensuring serialization.
>>>>>>
>>>>>> With 1 timehand, its generation count would be global.  I think its ordering
>>>>>> is strong enough to ensure serialization.
>>>>> Yes, a single timehands results in a global generation.  But it would not
>>>>> solve the same race appearing in a slightly different manner, as I described
>>>>> above.  If the reader finished while the generation number in th was not yet
>>>>> reset, but the caller uses the result after tc_windup(), the effect is the
>>>>> same as if we had two th's and the reader used the outdated one.
>>>>
>>>> You described it too concisely for me to understand :-).
>>>>
>>>> I now see that a single generation count doesn't give serialization.  I
>>>> thought that setting the generation to 0 at the start of tc_windup()
>>>> serialized the reader and writer.  But all it does is prevent use of the
>>>> results of the windup while only some of them are visible.  If the
>>>> setting of the generation count to 0 doesn't happen before tc_windup() reads
>>>> the hardware timecounter, then this read may be before other reads using
>>>> the old timehands, but it needs to be after.
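For concreteness, the reader side in -current's kern_tc.c, e.g.,
binuptime(), is roughly this (a simplified sketch, not the exact code):

	void
	binuptime(struct bintime *bt)
	{
		struct timehands *th;
		u_int gen;

		do {
			th = timehands;
			gen = atomic_load_acq_int(&th->th_generation);
			*bt = th->th_offset;
			bintime_addx(bt, th->th_scale * tc_delta(th));
			atomic_thread_fence_acq();
		} while (gen == 0 || gen != th->th_generation);
	}

tc_windup() stores 0 to th_generation (with release semantics) before
touching the rest of the timehands, and stores a new nonzero value
afterwards, so a reader that races with a windup sees gen == 0 or a
changed generation and retries.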
>>> If we have either a single th or a global gen counter, the current code
>>> would become serialized, but this is not my point.  Let's assume, for
>>
>> No, generation counts used like they are now (or in any way that I can
>> see) can't give serialization.
>>
>>> the sake of the discussion only, that all readers take the same spinlock
>>> as tc_windup (i.e. tc_setclock_mtx).
>>
>> Spinlocks are far too heavyweight.  Most of the complications in timecounter
>> locking are to avoid using them.  But OK for the discussion.
>>
>>> It is still possible that the reader unlocked the mutex, tc_windup() kicked
>>> in, locked the mutex and moved the timehands (as you noted, this might
>>> even happen on the same CPU), and only then the reader continues.  For a
>>> consumer of bintime()'s or any other function's result, it looks exactly
>>> the same as if we did not serialize with the writer but used outdated
>>> timehands.
>>
>> Not with full serialization.  The writer tc_windup() is also a reader, and
>> serializing its read gives the necessary monotonicity (for a single thread):
>> - normal reader locks the mutex, reads the timecounter and unlocks.  The
>>    mutex makes visible all previous writes, so the reader doesn't use a
>>    stale timehands.  Consumers of bintime(), etc., use this time in the past.
>> - tc_windup() locks the mutex, reads the timecounter hardware and writes the
>>    timecounter soft state.  The new offset is after all previous times read,
>>    since this is serialized.
>> - normal reader as above sees the new state, so it reads times after the
>>    time of the windup, so also after the time of previous normal reads.
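In code, the fully-serialized scheme under discussion is roughly this
(a sketch only, reusing the existing tc_setclock_mtx spin mutex; not a
proposal):

	struct bintime bt;

	/* Normal reader: serialize the whole clock read. */
	mtx_lock_spin(&tc_setclock_mtx);
	binuptime(&bt);
	mtx_unlock_spin(&tc_setclock_mtx);

	/*
	 * Writer: tc_windup() is also a reader, and its hardware read
	 * and soft-state update happen under the same mutex.
	 */
	mtx_lock_spin(&tc_setclock_mtx);
	tc_windup(NULL);
	mtx_unlock_spin(&tc_setclock_mtx);

The mutex's acquire/release semantics then make all of the windup's
writes visible to later readers, giving the ordering described above.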
>>
>>> Let me formulate this differently: as long as the consumer of the bintime()
>>> result does not serialize itself with tc_windup(), serializing bintime()
>>> itself against tc_windup() does not close the race, but it is not
>>> obvious that the race matters.
>>
>> Readers can easily see times far in the past, but the times increase in
>> program order.
> This is true for the current implementation (single-thread monotonic).

Provided there is no skew across CPUs.  I think this is a requirement for
any implementation.
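A userland reader can defend against such skew by binding to 1 CPU, as
noted further below.  A minimal sketch using cpuset(2); the helper name
is hypothetical and error handling is omitted:

	#include <sys/param.h>
	#include <sys/cpuset.h>
	#include <time.h>

	static void
	read_clock_pinned(struct timespec *ts)
	{
		cpuset_t set;

		CPU_ZERO(&set);
		CPU_SET(0, &set);	/* pin the current thread to CPU 0 */
		(void)cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID,
		    -1, sizeof(set), &set);
		(void)clock_gettime(CLOCK_MONOTONIC, ts);
	}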

>>
>>> Either we should just accept the race as
>>> we currently do, or readers must take the spinlock where the exact value
>>> of the current time is important,
>>
>> Disabling interrupts should be enough.  In my version of 5.2, spinlocks
>> don't disable hardware interrupts and may be preempted by fast interrupt
>> handlers which may not be so fast and can take hundreds of usec.  Actually,
>> even disabling interrupts might not be enough.  A single isa bus read
>> can take at least 138 usec (when it is behind a DMA queue or something).
>> There are also NMI's and SMI's.
>>
>>> or readers must re-read the time after
>>> doing something important, and redo if the newly measured time is outside
>>> the acceptable range.
>>
>> This method seems to be essential for robustness.
>>
>> But I don't see any race (for a single thread and no timecounter skew
>> across CPUs).  Sloppy readers just see times an unknown but usually small
>> time in the past.  Non-sloppy readers can also defend against timecounter
>> skew by binding to 1 CPU.
> Yes, this is true.
>
>>
>> Mutex locking of the timecounter doesn't give monotonic times across threads.
>> It gives some order, but you don't know which.  Another mutex or rendezvous
>> is needed to control the order.
> Define monotonic between threads ?

The threads establish an order of events using _any_ reasonable method,
e.g., memory ordering.  Then if the events are a monotonic clock-reading
event E0, any event E1, and a monotonic clock-reading event E2, in the
order E0 <= E1 <= E2 according to the reasonable method, then E0 must
be <= E2 according to the timespec value of the clock.

One thread can arrange for E0 <= E1 very easily using only program order.
E1 might consist of setting a flag.  Another thread watching this flag
can't tell when it was changed, but can tell that it was changed after E0,
so E2 in the other thread must be after E0, by program order E1 <= E2 in
the other thread.  More precisely, there are really 2 different events E1.
One (E1a) is setting the flag in the first thread and the other (E1b) is
observing the change in the other thread.  The full order is E0 <= E1a <=
E1b <= E2, where E0 <= E1a and E1b <= E2 by program order in each thread
and E1a <= E1b by the arrow of time (this doesn't take any memory ordering
magic, except in practice you would have to do more to not miss changes
when reusing the flag).
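
A minimal sketch of the E0/E1a/E1b/E2 ordering in C11 atomics (the
thread bodies and names are hypothetical; the release/acquire pair
supplies the visibility ordering needed in practice):

	#include <stdatomic.h>
	#include <time.h>

	static atomic_int flag;
	static struct timespec e0;

	static void
	thread_a(void)
	{
		clock_gettime(CLOCK_MONOTONIC, &e0);		/* E0 */
		atomic_store_explicit(&flag, 1,
		    memory_order_release);			/* E1a */
	}

	static void
	thread_b(void)
	{
		struct timespec e2;

		while (atomic_load_explicit(&flag,
		    memory_order_acquire) == 0)
			;					/* E1b */
		clock_gettime(CLOCK_MONOTONIC, &e2);		/* E2 */
		/* Cross-thread monotonicity requires e2 >= e0 here. */
	}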

Establishing an order using external events doesn't work so well, since the
events become visible in different threads at different times.  If you have
a clock good enough for measuring and compensating for the different times,
then you should be using it for the timecounter and can't use it for
implementing the timecounter :-).  Ticks on a system-wide timer is an
example of an external event.  Another is arrival of packets at an
interface.  If the packets are read from the interface by separate threads
and put into a buffer using any fast method, then they are likely to become
out of order.  Hardware timestamps on them would allow fixing up the order,
but software timestamps made by separate readers don't work at all without
using a necessarily slow method to communicate the order.

I think the multiple timehands method works perfectly if the frequency
or phase is never adjusted.  It is just an optimization of the
calculation 'result = base + timecounter_value * inverse_frequency',
except for negligible rounding errors.  We want to do the multiplication
using only 64-bit integers, so we replace the RHS by '(base + offset
* inverse_frequency) + (timecounter_value - offset) * inverse_frequency'.
When inverse_frequency is constant, it doesn't matter when the (base,
offset) pair is changed, provided the change is atomic and not so far
in the past that overflow occurs with 64-bit integers.  The method
gives atomicity, and a bug apparently gives overflow.
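
The overflow in the Subject line falls out of this directly: th_scale
is approximately 2^64 / tc_frequency, so the 64-bit product th_scale *
tc_delta(th) wraps once the delta exceeds about one second's worth of
ticks (e.g., when windups stop happening for too long).  A userland
sketch of the arithmetic, assuming a hypothetical 4 GHz counter:

	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		uint64_t freq = 4000000000;	/* hypothetical 4 GHz */
		/* th_scale is about 2^64 / freq (as computed in
		 * tc_windup(), ignoring the ntp adjustment). */
		uint64_t scale = ((UINT64_C(1) << 63) / freq) * 2;
		uint64_t max_delta = UINT64_MAX / scale;

		printf("overflow after delta > %ju ticks (~%.3f s)\n",
		    (uintmax_t)max_delta, (double)max_delta / freq);
		return (0);
	}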

Races from decreases in the frequency are harmless (for monotonicity),
since they increase the inverse frequency, which increases time
differences, so the differences never go negative.

So only races from increases in the frequency are a problem.

I think changes in the frequency are rare (ntpd normally only makes them
every 64 seconds), so it wouldn't hurt to wait for them.  The problem is
knowing when to wait.

Bruce


