Date:      Tue, 5 Mar 2019 05:17:14 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Bruce Evans <brde@optusnet.com.au>, Mark Millard <marklmi@yahoo.com>,  freebsd-hackers Hackers <freebsd-hackers@freebsd.org>,  FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Message-ID:  <20190305031010.I4610@besplex.bde.org>
In-Reply-To: <20190304114150.GM68879@kib.kiev.ua>
References:  <20190301194217.GB68879@kib.kiev.ua> <20190302071425.G5025@besplex.bde.org> <20190302105140.GC68879@kib.kiev.ua> <20190302225513.W3408@besplex.bde.org> <20190302142521.GE68879@kib.kiev.ua> <20190303041441.V4781@besplex.bde.org> <20190303111931.GI68879@kib.kiev.ua> <20190303223100.B3572@besplex.bde.org> <20190303161635.GJ68879@kib.kiev.ua> <20190304043416.V5640@besplex.bde.org> <20190304114150.GM68879@kib.kiev.ua>

On Mon, 4 Mar 2019, Konstantin Belousov wrote:

> On Mon, Mar 04, 2019 at 05:29:48AM +1100, Bruce Evans wrote:
>> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
>>
>>> On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
>>>> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
>>>>
>>>>> On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
>>>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
>>>>>>
>>>>>>> On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
>>>>>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
>>> * ...
>>>> However, shrd in rdtsc-low (tsc_get_timecount_low()) does a slow combining
>>>> step.  i386 used to be faster here -- the first masking step of discarding
>>>> %edx doesn't take any code.  amd64 has to mask out the top bits in %rax.
>>>> Now for the tsc-low pessimization, i386 has to do a slow shrd, and amd64
>>>> has to do a not so slow shr.
>>> i386 cannot discard %edx after RDTSC since some bits from %edx come into
>>> the timecounter value.
>>
>> These bits are part of the tsc-low pessimization.  The shift count should
>> always be 1, giving a TSC frequency of > INT32_MAX (usually) and > UINT32_MAX
>> sometimes.
>>
>> When tsc-low was new, the shift count was often larger (as much as 8),
>> and it is still changeable by a read-only tunable, but now it is 1 in
>> almost all cases.  The code only limits the timecounter frequency
>> to UINT_MAX, except the tunable defaults to 1 so average CPUs running
>> at nearly 4 GHz are usually limited to about 2 GHz.  The comment about
>> this UINT_MAX doesn't match the code.  The comment says int, but the
>> code says UINT.
>>
>> All that a shift count of 1 does is waste time to lose 1 bit of accuracy.
>> This much accuracy is noise for most purposes.
>>
>> The tunable is fairly undocumented.  Its description is "Shift to apply
>> for the maximum TSC frequency".  Of course, it has no effect on the TSC
>> frequency.  It only affects the TSC timecounter frequency.
> I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
> Otherwise, I think, some multi-socket machines would start showing the
> detectable backward-counting bintime().  At the frequencies at 4GHz and
> above (Intel has 5Ghz part numbers) I do not think that stability of
> 100MHz crystall and on-board traces is enough to avoid that.

I think it is just a kludge that reduced the problem before it was fixed
properly using fences.

Cross-socket latency is over 100 cycles according to jhb's tscskew
benchmark.  On Haswell 4x2:

CPU | TSC skew (min/avg/max/stddev)
----+------------------------------
   0 |     0     0     0    0.000
   1 |    24    49    84   14.353
   2 |   164   243   308   47.811
   3 |   164   238   312   47.242
   4 |   168   242   332   49.593
   5 |   168   243   324   48.722
   6 |   172   242   320   52.596
   7 |   172   240   316   53.014

freefall is similar.  Latency is apparently measured relative to CPU 0.
It is much lower to CPU 1 since that is on the same core.

I played with this program a lot 3 and a half years ago, but forgot
most of what I learned :-(.  I tried different fencing in it.  This
seems to make little difference when the program is rerun.  With the
default TESTS = 1024, the min skew sometimes goes negative on freefall,
but with TESTS = 1024000 that doesn't happen.  This is the opposite
of what I would expect.  freefall has load average about 1.

Removing the only fencing in it reduces average latency by 10-20 cycles
and minimum latency by over 100 cycles, except on freefall, where the
minimum is only reduced from 33 to 6.  On Haswell the minimum is 24 with
fencing; I didn't test it with no fencing.

I think tscskew doesn't really measure tsc skew.  What it measures is
the time taken for a locking protocol, using the TSCs on different
CPUs to make the start and end timestamps.  If the TSCs have a lot of
skew or jitter, then this will show up indirectly as inconsistent and
possibly negative differences.

A shift of just 1 can't hide latencies of hundreds of cycles on single-
socket machines.  Even a shift of 8 only works sometimes, by reducing
the chance of observing the TSC going backwards by a factor of 256.
E.g., assume for simplicity that all instructions and IPCs take 0-1
cycles, and that unfenced rdtsc's differ by at most +-5 cycles (with
the 11 values between -5 and 5 uniformly distributed).  Then with a
shift of 0 and no fences, a CPU that updates the timehands is ahead of
another CPU that spins reading the timehands about 5/11 of the time.
With a shift of 8, the CPUs are close enough whenever the first one reads
a value at least 5 away from the nearest 256-boundary.  The chance of
seeing a negative difference is reduced to at most about 10/256 of what
it was.
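
Here is a throwaway userland program that models this (the error
distribution is the made-up one above, not a measurement), to show how
much the shift shrinks the window for observing a backwards step:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Toy model: the "writer" CPU reads r1 and the "reader" CPU reads
 * r2 = r1 + err, with err uniform in [-5, 5].  Count how often the
 * reader sees the shifted value go backwards.
 */
int
main(void)
{
    uint64_t r1, r2;
    int shift, err, i, neg;
    const int trials = 10000000;

    for (shift = 0; shift <= 8; shift += 8) {
        neg = 0;
        for (i = 0; i < trials; i++) {
            r1 = ((uint64_t)arc4random() << 32) | arc4random();
            err = (int)arc4random_uniform(11) - 5;
            r2 = r1 + err;
            if ((r2 >> shift) < (r1 >> shift))
                neg++;
        }
        printf("shift %d: backwards step seen %.4f%% of the time\n",
            shift, neg * 100.0 / trials);
    }
    return (0);
}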

> I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
> Otherwise, I think, some multi-socket machines would start showing the
> detectable backward-counting bintime().  At the frequencies at 4GHz and
> above (Intel has 5Ghz part numbers) I do not think that stability of
> 100MHz crystall and on-board traces is enough to avoid that.

Why would losing just 1 bit fix that?

Fences for rdtsc of course only serialize it for the CPU that runs it.
The locking (ordering) protocol (for the generation count) orders the
CPUs too.  It takes longer than we would like, much more than the 1-
cycle error that might be hidden by ignoring the low bit.  Surely the
ordering protocol must work across sockets?  It then gives ordering of
rdtsc's.
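
For concreteness, the fenced read has roughly this shape (a sketch of
the idea, assuming the usual kernel headers; not the actual tsc.c
source, and the name is mine):

/*
 * The lfence keeps this CPU from executing rdtsc before its earlier
 * loads have completed.  It says nothing about how this CPU's TSC
 * relates to another CPU's TSC; only the generation-count protocol
 * orders the CPUs against each other.
 */
static u_int
tsc_get_timecount_fenced(struct timecounter *tc __unused)
{
    uint32_t lo;

    __asm __volatile("lfence; rdtsc" : "=a" (lo) : : "edx");
    return (lo);    /* timecounter hardware values are 32 bits */
}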

TSC-low was added in 2011.  That was long before the ordering was fixed.
You added fences in 2012 and memory ordering for the generation count in
2016.  Fences slowed everything down by 10-20+ cycles and probably hide
bugs in the memory ordering better than TSC-low.  Memory ordering plus
fences slow down the cross-core case by more than 100 cycles according
to tscskew.  That is enough to hide large hardware bugs.

> We can try to set the tsc-low shift count to 0 (but keep lfence) and see
> what is going on in HEAD, but I am afraid that the HEAD users population
> is not representative enough to catch the issue with the certainity.
> More, it is unclear to me how to diagnose the cause, e.g. I would expect
> the sleeps to hang on timeouts, as was reported from the very beginning
> of this thread. How would we root-cause it ?

Negative time differences cause lots of overflows and so break the timecounter.
The fix under discussion actually gives larger overflows in the positive
direction.  E.g., a delta of -1 first overflows to 0xffffffff.  The fix
prevents overflow on multiplication by that.  When the timecounter
frequency is small, say 1 MHz, 0xffffffff means 4294 seconds, so the
timecounter advances by that.
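
Spelled out with toy numbers (illustrative values only, compiled as a
normal userland program):

#include <stdint.h>
#include <stdio.h>

/*
 * tc_delta() does an unsigned, masked subtraction, so a counter that
 * appears 1 count behind the latched offset produces a delta of
 * 0xffffffff, not -1.
 */
int
main(void)
{
    uint32_t counter_mask = 0xffffffff; /* 32-bit timecounter */
    uint32_t offset_count = 1000;       /* latched by tc_windup() */
    uint32_t now = 999;                 /* hardware appears 1 behind */
    uint32_t delta = (now - offset_count) & counter_mask;

    /* Prints 0xffffffff, about 4294 seconds at a 1 MHz frequency. */
    printf("delta = %#x = %.2f s at 1 MHz\n", (unsigned)delta, delta / 1e6);
    return (0);
}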

>>> amd64 cannot either, but amd64 does not need to mask out top bits in %rax,
>>> since the whole shrdl calculation occurs in 32bit registers, and the result
>>> is in %rax where top word is cleared by shrdl instruction automatically.
>>> But the clearing is not required since result is unsigned int anyway.
>>>
>>> Dissassemble of tsc_get_timecount_low() is very clear:
>>>   0xffffffff806767e4 <+4>:     mov    0x30(%rdi),%ecx
>>>   0xffffffff806767e7 <+7>:     rdtsc
>>>   0xffffffff806767e9 <+9>:     shrd   %cl,%edx,%eax
>>> ...
>>>   0xffffffff806767ed <+13>:    retq
>>> (I removed frame manipulations).

I checked that all compilers still produce horrible code for the better
source code 'return (rdtsc() >> (intptr_t)tc->tc_priv);'.  64-bit shifts
are apparently pessimal for compatibility.  The above is written mostly
in asm to avoid 2-5 extra instructions.
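
For reference, the hand-written version has roughly this shape (a
sketch of the kind of wrapper being discussed, assuming <sys/timetc.h>
and friends; not a verbatim copy of tsc.c):

/*
 * Shift the 64-bit TSC right by the per-timecounter shift count with
 * a single shrd, so the compiler cannot expand the 64-bit shift into
 * several extra instructions.
 */
static u_int
tsc_get_timecount_low_sketch(struct timecounter *tc)
{
    uint32_t rv;

    __asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
        : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
    return (rv);
}

This is the sort of thing that produces the mov/rdtsc/shrd sequence in
the disassembly above.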

>>>> ...
>>>> Similarly in bintime().
>>> I merged two functions, finally.  Having to copy the same code is too
>>> annoying for this change.

I strongly dislike the merge.

>>> So I verified that:
>>> - there is no 64bit multiplication in the generated code, for i386 both
>>>  for clang 7.0 and gcc 8.3;
>>> - that everything is inlined, the only call from bintime/binuptime is
>>>  the indirect call to get the timecounter value.
>>
>> I will have to fix it for compilers that I use.
> Ok, I will add __inline.

That will make it fast enough, but still hard to read.

>>> +		*bt = *bts;
>>> +		scale = th->th_scale;
>>> +		delta = tc_delta(th);
>>> +#ifdef _LP64
>>> +		if (__predict_false(th->th_large_delta <= delta)) {
>>> +			/* Avoid overflow for scale * delta. */
>>> +			bintime_helper(bt, scale, delta);
>>> +			bintime_addx(bt, (scale & 0xffffffff) * delta);
>>> +		} else {
>>> +			bintime_addx(bt, scale * delta);
>>> +		}
>>> +#else
>>> +		/*
>>> +		 * Use bintime_helper() unconditionally, since the fast
>>> +		 * path in the above method is not so fast here, since
>>> +		 * the 64 x 32 -> 64 bit multiplication is usually not
>>> +		 * available in hardware and emulating it using 2
>>> +		 * 32 x 32 -> 64 bit multiplications uses code much
>>> +		 * like that in bintime_helper().
>>> +		 */
>>> +		bintime_helper(bt, scale, delta);
>>> +		bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
>>> +#endif
>>
>> Check that this method is really better.  Without this, the complicated
>> part is about half as large and duplicating it is smaller than this
>> version.
> Better in what sence ?  I am fine with the C code, and asm code looks
> good.

Better in terms of actually running significantly faster.  I fear the
32-bit method is actually slightly slower for the fast path.
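
For readers without the patch in front of them: the helper splits the
96-bit product scale * delta into (scale >> 32) * delta, shifted up 32
bits, plus (scale & 0xffffffff) * delta.  A sketch of its likely shape
(my reconstruction, not a quote of the patch):

/*
 * Handle the high 32 bits of the scale.  (scale >> 32) * delta fits
 * in 64 bits; its high word is whole seconds and its low word is the
 * top of the fraction.  The caller adds (scale & 0xffffffff) * delta
 * separately.
 */
static __inline void
bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)
{
    uint64_t x;

    x = (scale >> 32) * delta;
    bt->sec += x >> 32;          /* whole seconds */
    bintime_addx(bt, x << 32);   /* fraction, with carry into sec */
}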

>>> -	do {
>>> -		th = timehands;
>>> -		gen = atomic_load_acq_int(&th->th_generation);
>>> -		*bt = th->th_bintime;
>>> -		bintime_addx(bt, th->th_scale * tc_delta(th));
>>> -		atomic_thread_fence_acq();
>>> -	} while (gen == 0 || gen != th->th_generation);
>>
>> Duplicating this loop is much better than obfuscating it using inline
>> functions.  This loop was almost duplicated (except for the delta
>> calculation) in no less than 17 functions in kern_tc.c (9 tc ones and
>> 8 fflock ones).  Now it is only duplicated 16 times.
> How did you counted the 16 ?  I can see only 4 instances in the unpatched
> kern_tc.c, and 3 in patched, but it is 3 and not 1 only because I do not
> touch ffclock until the patch is finalized.  After that, it would be
> 1 instance for kernel and 1 for userspace.

Grep for the end condition in this loop.  There are actually 20 of these.
I'm counting the loops and not the previously-simple scaling operation in
it.  The scaling is indeed only done for 4 cases.  I prefer the 20 
duplications (except I only want about 6 of the functions).  Duplication
works even better for only 4 cases.

This should be written as a function call to 1 new function to replace
the line with the overflowing multiplication.  The line is always the
same, so the new function call can look like bintime_xxx(bt, th).
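
Something like this, say (the name bintime_xxx and the exact guard are
illustrative; th_large_delta and bintime_helper() are from the patch
under discussion):

static __inline void
bintime_xxx(struct bintime *bt, struct timehands *th)
{
    uint64_t scale;
    u_int delta;

    scale = th->th_scale;
    delta = tc_delta(th);
    if (__predict_false(delta >= th->th_large_delta)) {
        /* Avoid overflow for scale * delta. */
        bintime_helper(bt, scale, delta);
        bintime_addx(bt, (scale & 0xffffffff) * delta);
    } else
        bintime_addx(bt, scale * delta);
}

Then each of the loops keeps its own shape and only the multiplication
line changes to a call to this.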

Bruce


