Date:      Sat, 2 Mar 2019 05:40:58 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kib@freebsd.org>
Cc:        Mark Millard <marklmi@yahoo.com>,  freebsd-hackers Hackers <freebsd-hackers@freebsd.org>,  FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Message-ID:  <20190302043936.A4444@besplex.bde.org>
In-Reply-To: <20190301112717.GW2420@kib.kiev.ua>
References:  <D3D7E9F4-9A5E-4320-B3C8-EC5CEF4A2764@yahoo.com> <20190228145542.GT2420@kib.kiev.ua> <20190228150811.GU2420@kib.kiev.ua> <962D78C3-65BE-40C1-BB50-A0088223C17B@yahoo.com> <28C2BB0A-3DAA-4D18-A317-49A8DD52778F@yahoo.com> <20190301112717.GW2420@kib.kiev.ua>

On Fri, 1 Mar 2019, Konstantin Belousov wrote:

> On Thu, Feb 28, 2019 at 11:39:06PM -0800, Mark Millard wrote:
>> [The new, trial code also has truncation occurring.]
> In fact no, I do not think it is.
>
>> An example observation of diff_scaled having an overflowed
>> value was:
>>
>> scale_factor            == 0x80da2067ac
>> scale_factor*freq overflows unsigned, 64 bit representation.
>> tim_offset              ==   0x3da0eaeb
>> tim_cnt                 ==   0x42dea3c4
>> tim_diff                ==    0x53db8d9
>> For reference:                0x1fc9d43 == 0xffffffffffffffffull/scale_factor
>> scaled_diff       == 0xA353A5BF3FF780CC (truncated to 64 bits)
>>
>> But for the new, trial code:
>>
>> 0x80da2067ac is 40 bits
>>    0x53db8d9 is 27 bits
>> So 67 bits, more than 63. Then:
>>
>>    x
>> == (0x80da2067ac>>32) * 0x53db8d9
>> == 0x80 * 0x53db8d9
>> == 0x29EDC6C80
>>
>>    x>>32
>> == 0x2
>>
>>    x<<32
>> == 0x9EDC6C8000000000 (limited to 64 bits)
>> Note the truncation of: 0x29EDC6C8000000000.
> Right, this is how the patch is supposed to work.  Note that the overflow
> bits 'lost' due to overflow of the left shift are the same bits that are
> used to increment bt->sec:
> 	bt->sec += x >> 32;
> So the 2 seconds are accounted for.
>
>>
>> Thus the "bintime_addx(bt, x << 32)" is still
>> based on a truncated value.
>
> I must admit that a 2 second interval where the timehands were
> not updated is too much.  This might be the real cause of all ppc
> troubles.  I tried to see if the overflow case is possible on amd64,
> and did not get a single case of the '> 63' branch executed during the
> /usr/tests/lib/libc run.

The algorithm requires the update interval to be less than 1 second.
th_scale is 2**64 / tc_frequency, so whatever tc_frequency is, the
product th_scale * tc_delta(th) reaches approximately 2**64 after
1 second, so it overflows at about that point (depending on rounding).

The most useful timecounters are TSC's, and these give another overflow
in tc_delta() after 1 second when their frequency is 4 GHz (except the
bogus TSC-low timecounter reduces the frequency to below 2 binary GHz,
so the usual case is overflow after 2 seconds).

> Actually, the same overflow-prone code exists in libc, so below is the
> updated patch:
> - I added __predict_false()
> - libc multiplication is also done separately for high-order bits.
> (ffclock counterpart is still pending).
>
> diff --git a/lib/libc/sys/__vdso_gettimeofday.c b/lib/libc/sys/__vdso_gettimeofday.c
> index 3749e0473af..a14576988ff 100644
> --- a/lib/libc/sys/__vdso_gettimeofday.c
> +++ b/lib/libc/sys/__vdso_gettimeofday.c
> @@ -32,6 +32,8 @@ __FBSDID("$FreeBSD$");
> #include <sys/time.h>
> #include <sys/vdso.h>
> #include <errno.h>
> +#include <limits.h>
> +#include <strings.h>
> #include <time.h>
> #include <machine/atomic.h>
> #include "libc_private.h"
> @@ -62,6 +64,7 @@ binuptime(struct bintime *bt, struct vdso_timekeep *tk, int abs)
> {
> 	struct vdso_timehands *th;
> 	uint32_t curr, gen;
> +	uint64_t scale, x;
> 	u_int delta;
> 	int error;
>
> @@ -78,7 +81,14 @@ binuptime(struct bintime *bt, struct vdso_timekeep *tk, int abs)
> 			continue;
> 		if (error != 0)
> 			return (error);
> -		bintime_addx(bt, th->th_scale * delta);
> +		scale = th->th_scale;
> +		if (__predict_false(fls(scale) + fls(delta) > 63)) {

This is unnecessarily pessimal.  Updates must be frequent enough to prevent
tc_delta() overflowing, and it is even easier to arrange that this
multiplication doesn't overflow (since the necessary update interval for
the latter is constant).

`scale' is 64 bits, so fls(scale) is broken on 32-bit arches, and
flls(scale) is an especially large pessimization.  I saw this in my
calcru1() fixes -- the flls()s take almost as long as long long
divisions when they use the pessimal C versions.

The algorithm requires tc_delta() to be only 32 bits, since otherwise the
multiplication would be 64 x 64 bits so would be much slower and harder
to write.

If tc_frequency is far above 4GHz, then th_scale is far below 4G, so the
scaling is not so accurate.  But 0.25 parts per billion is much more than
enough.  Even 1 part per million is enough for a TSC, since TSC instability
is more than 1ppm.  The overflows could be pushed off to 1024 seconds by
dividing by 1024 at suitable places.  A 32-bit scale times a 64-bit delta
would be simple compared with a full 64 x 64-bit multiplication (much like
the code here).

The code here can be optimized using values calculated at initialization
time instead of fls*().  The overflow threshold for delta is approximately
2**64 / tc_frequency.

> +			x = (scale >> 32) * delta;
> +			scale &= UINT_MAX;
> +			bt->sec += x >> 32;
> +			bintime_addx(bt, x << 32);
> +		}
> +		bintime_addx(bt, scale * delta);
> 		if (abs)
> 			bintime_add(bt, &th->th_boottime);

When the timecounter is the i8254, as it often was when timecounters
were new, tc_windup() had to be called more often than every i8254 rollover
(in practice once every hardclock tick), partly to keep tc_delta() small
(since rollover gives a form of overflow).  This was not so easy to
arrange: it requires not losing any hardclock ticks, not having any ticks
with high latency, and extra complications to detect the rollover when the
latency is small.  Most hardware is easier to handle now.  With tickless
kernels, hardclock() is often not called for about 1 second, but it must
be called at least that often to prevent the overflow here.

Bruce


