Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 3 Mar 2019 04:43:20 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Mark Millard <marklmi@yahoo.com>,  freebsd-hackers Hackers <freebsd-hackers@freebsd.org>,  FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Message-ID:  <20190303041441.V4781@besplex.bde.org>
In-Reply-To: <20190302142521.GE68879@kib.kiev.ua>
References:  <20190228145542.GT2420@kib.kiev.ua> <20190228150811.GU2420@kib.kiev.ua> <962D78C3-65BE-40C1-BB50-A0088223C17B@yahoo.com> <28C2BB0A-3DAA-4D18-A317-49A8DD52778F@yahoo.com> <20190301112717.GW2420@kib.kiev.ua> <20190302043936.A4444@besplex.bde.org> <20190301194217.GB68879@kib.kiev.ua> <20190302071425.G5025@besplex.bde.org> <20190302105140.GC68879@kib.kiev.ua> <20190302225513.W3408@besplex.bde.org> <20190302142521.GE68879@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 2 Mar 2019, Konstantin Belousov wrote:

> On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
>>> ...
>>> So I am able to reproduce it with some surprising ease on HPET running
>>> on Haswell.
>>
>> So what is the cause of it?  Maybe the tickless code doesn't generate
>> fake clock ticks right.  Or it is just a library bug.  The kernel has
>> to be slightly real-time to satisfy the requirement of 1 update per.
>> Applications are further from being real-time.  But isn't it enough
>> for the kernel to ensure that the timehands cycle more than once per
>> second?
> No, I entered ddb as you suggested.

But using ddb is not normal.  It is convenient that this fixes HPET and
ACPI timecounters after using ddb, but this method doesn't help for
timecounters that wrap fast.  TSC-low at 2GHz wraps in 2 seconds, and
i8254 wraps in a few milliseconds.

>> I don't changing this at all this.  binuptime() was carefully written
>> to not need so much 64-bit arithmetic.
>>
>> If this pessimization is allowed, then it can also handle a 64-bit
>> deltas.  Using the better kernel method:
>>
>>  		if (__predict_false(delta >= th->th_large_delta)) {
>>  			bt->sec += (scale >> 32) * (delta >> 32);
>>  			x = (scale >> 32) * (delta & 0xffffffff);
>>  			bt->sec += x >> 32;
>>  			bintime_addx(bt, x << 32);
>>  			x = (scale & 0xffffffff) * (delta >> 32);
>>  			bt->sec += x >> 32;
>>  			bintime_addx(bt, x << 32);
>>  			bintime_addx(bt, (scale & 0xffffffff) *
>>  			    (delta & 0xffffffff));
>>  		} else
>>  			bintime_addx(bt, scale * (delta & 0xffffffff));
> This only makes sense if delta is extended to uint64_t, which requires
> the pass over timecounters.

Yes, that was its point.  It is a bit annoying to have a hardware
timecounter like the TSC that doesn't wrap naturally, but then make it
wrap by masking high bits.

The masking step is also a bit wasteful.  For the TSC, it is 1 step to
discard high bids at the register level, then another step to apply the
nask to discard th high bits again.

>> I just noticed that there is a 64 x 32 -> 64 bit multiplication in the
>> current method.  This can be changed to do expicit 32 x 32 -> 64 bit
>> multiplications and fix the overflow problem at small extra cost on
>> 32-bit arches:
>>
>>  		/* 32-bit arches did the next multiplication implicitly. */
>>  		x = (scale >> 32) * delta;
>>  		/*
>>  		 * And they did the following shifts and most of the adds
>>  		 * implicitly too.  Except shifting x left by 32 lost the
>>  		 * seconds part that the next line handles.  The next line
>>  		 * is the only extra cost for them.
>>  		 */
>>  		bt->sec += x >> 32;
>>  		bintime_addx(bt, (x << 32) + (scale & 0xffffffff) * delta);
>
> Ok, what about the following.

I'm not sure that I really want this, even if the pessimization is done.
But it avoids using fls*(), so is especially good for 32-bit systems and
OK for 64-bit systems too, especially in userland where fls*() is in the
fast path.

>
> diff --git a/lib/libc/sys/__vdso_gettimeofday.c b/lib/libc/sys/__vdso_gettimeofday.c
> index 3749e0473af..cfe3d96d001 100644
> --- a/lib/libc/sys/__vdso_gettimeofday.c
> +++ b/lib/libc/sys/__vdso_gettimeofday.c
> @@ -32,6 +32,8 @@ __FBSDID("$FreeBSD$");
> #include <sys/time.h>
> #include <sys/vdso.h>
> #include <errno.h>
> +#include <limits.h>

Not needed with 0xffffffff instead of UINT_MAX.

The userland part is otherwise little changed.

> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> index 2656fb4d22f..2e28f872229 100644
> --- a/sys/kern/kern_tc.c
> +++ b/sys/kern/kern_tc.c
> ...
> @@ -351,17 +352,44 @@ fbclock_getmicrotime(struct timeval *tvp)
> 	} while (gen == 0 || gen != th->th_generation);
> }
> #else /* !FFCLOCK */
> +
> +static void
> +bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
> +{
> +	uint64_t x;
> +
> +	x = (*scale >> 32) * delta;
> +	*scale &= 0xffffffff;
> +	bt->sec += x >> 32;
> +	bintime_addx(bt, x << 32);
> +}

It is probably best to not inline the slow path, but clang tends to
inline everything anyway.

I prefer my way of writing this in 3 lines.  Modifying 'scale' for
the next step is especially ugly and pessimal when the next step is
in the caller and this function is not inlined.

> +
> void
> binuptime(struct bintime *bt)
> {
> 	struct timehands *th;
> -	u_int gen;
> +	uint64_t scale;
> +	u_int delta, gen;
>
> 	do {
> 		th = timehands;
> 		gen = atomic_load_acq_int(&th->th_generation);
> 		*bt = th->th_offset;
> -		bintime_addx(bt, th->th_scale * tc_delta(th));
> +		scale = th->th_scale;
> +		delta = tc_delta(th);
> +#ifdef _LP64
> +		/* Avoid overflow for scale * delta. */
> +		if (__predict_false(th->th_large_delta <= delta))
> +			bintime_helper(bt, &scale, delta);
> +		bintime_addx(bt, scale * delta);
> +#else
> +		/*
> +		 * Also avoid (uint64_t, uint32_t) -> uint64_t
> +		 * multiplication on 32bit arches.
> +		 */

"Also avoid overflow for ..."

> +		bintime_helper(bt, &scale, delta);
> +		bintime_addx(bt, (u_int)scale * delta);

The cast should be to uint32_t, but better write it as & 0xffffffff as
elsewhere.

bintime_helper() already reduced 'scale' to 32 bits.  The cast might be
needed to tell the compiler this, especially when the function is not
inlined.  Better not do it in the function.  The function doesn't even
use the reduced value.

bintime_helper() is in the fast path in this case, so should be inlined.

> +#endif
> 		atomic_thread_fence_acq();
> 	} while (gen == 0 || gen != th->th_generation);
> }

This needs lots of testing of course.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20190303041441.V4781>