Date:      Wed, 21 Oct 2015 21:48:50 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Andriy Gapon <avg@FreeBSD.org>
Cc:        freebsd-hackers <freebsd-hackers@FreeBSD.org>, Poul-Henning Kamp <phk@phk.freebsd.dk>, Jung-uk Kim <jkim@FreeBSD.org>
Subject:   Re: instability of timekeeping
Message-ID:  <20151021184850.GX2257@kib.kiev.ua>
In-Reply-To: <56274FFC.2000608@FreeBSD.org>
References:  <56261398.60102@FreeBSD.org> <56261FE6.90302@FreeBSD.org> <56274FFC.2000608@FreeBSD.org>

On Wed, Oct 21, 2015 at 11:42:36AM +0300, Andriy Gapon wrote:
> On 20/10/2015 14:05, Andriy Gapon wrote:
> > I made a small observation.  With ntpd disabled, I ran `ntpdate -d` in a loop
> > at 10 second intervals (done via `sleep 10`).  For about 25 minutes the time
> > offset between a reference server and my machine was quite stable, but then it
> > jumped by about 2.5 seconds between two consecutive ntpdate invocations.
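(For concreteness, such a loop might look like the one-liner below; the server
name is only a placeholder, the original message does not say which one was used.)

	while true; do ntpdate -d 0.freebsd.pool.ntp.org | tail -2; sleep 10; done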
> [snip]
> > On 20/10/2015 13:12, Andriy Gapon wrote:
> [snip]
> >> kern.timecounter.tc.TSC-low.frequency: 1607357461
> >> kern.timecounter.tc.TSC-low.counter: 2457319922
> >> kern.timecounter.tc.TSC-low.mask: 4294967295
> [snip]
> 
> Another observation and a hypothesis.
> 
> I tried time counters other than TSC and I couldn't reproduce the issue with them.
> 
> Another thing that occurred to me is that TSC-low.mask / TSC-low.frequency is
> about 2.7 seconds.
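(Spelling the arithmetic out with the sysctl values quoted above:
4294967295 / 1607357461 Hz ~= 2.67 s, the same order as the ~2.5 s jump seen
between consecutive ntpdate runs.)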
> 
> From these observations and reading some code comes my hypothesis.
> First, I assume that the visible values of the TSCs on different cores are not
> perfectly synchronized.  Second, I think that there can be circumstances where
> tc_ticktock() -> tc_windup() gets called on different cores sufficiently close
> in time that the later call sees a TSC value which is "before" (numerically
> smaller than) the TSC value read earlier on the other core.  In that case the
> delta between the readings would be close to TSC-low.mask.
> 
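A rough sketch of the masked subtraction that tc_delta() in sys/kern/kern_tc.c
performs (simplified from the real code), to show why a reading taken on a core
whose TSC is slightly behind turns into a delta close to the mask:

	static __inline u_int
	tc_delta(struct timehands *th)
	{
		struct timecounter *tc = th->th_counter;

		/* Unsigned subtraction, truncated to the counter width. */
		return ((tc->tc_get_timecount(tc) - th->th_offset_count) &
		    tc->tc_counter_mask);
	}

With hypothetical numbers: if the previous windup stored th_offset_count = 1000
and the next windup, running on another core, reads 998, then
(998 - 1000) & 0xffffffff = 4294967294, i.e. almost the full TSC-low.mask, which
at ~1.6 GHz is the ~2.7 second step discussed above.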
> Right now I do not have any proof that what the hypothesis says is what actually
> happens.  On the other hand, I do not see anything that would prevent the
> hypothesized situation from occurring.
> 
> To add more weight to the hypothesis:
> cpu1:invltlb                      674384          4
> cpu1:invlrng                      453020          3
> cpu1:invlpg                    108772578        652
> cpu1:preempt                    37435000        224
> cpu1:ast                           36757          0
> cpu1:rendezvous                     9473          0
> cpu1:hardclock                  22267434        133
> As you can see I am currently running workloads that result in a very
> significant number of IPIs, especially the page invalidation IPIs.  Given that
> all (x86) IPIs have the same priority (based on their vector numbers) I think
> it's plausible that the hardclock IPI could get arbitrarily delayed.
> 
> I guess I could add a tracepoint to record deltas that are close to the current
> timecounter's mask.
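One way to do that (a rough, untested sketch; the 1/256-of-the-mask threshold
and the printf are made up for illustration) would be a check next to the
subtraction in tc_delta():

	delta = (tc->tc_get_timecount(tc) - th->th_offset_count) &
	    tc->tc_counter_mask;
	/* Flag deltas within ~1/256 of the mask as suspected backwards reads. */
	if (delta > tc->tc_counter_mask - (tc->tc_counter_mask >> 8))
		printf("tc_delta: near-mask delta %u on %s (cpu %d)\n",
		    delta, tc->tc_name, curcpu);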
> 
> Assuming the hypothesis is correct I see three possible ways to work around the
> problem:
> 
> 1. Increase tsc_shift, so that the cross-CPU TSC differences are smaller (at the
> cost of lower resolution).  That should reduce the risk of seeing "backwards"
> TSC values.  Judging from the numbers that tools/tools/tscdrift produces, I can
> set tsc_shift to 7.  The resulting resolution should be no worse than that of
> the HPET or ACPI-fast counters, with the benefit that the TSC is much faster to
> read.
> 
> 2. Change the code, so that tc_windup() is always called on the same CPU.  E.g.
> it could be the BSP or a CPU that receives the actual timer interrupts (as
> opposed to the hardclock IPIs).  This should help with the timekeeping, but
> won't help with the "jitter" in binuptime() and friends.
> 
> 3. In tc_delta() somehow detect and filter out "slightly behind" timecounter
> readings.  Not sure if this is possible at all; a sketch of the idea follows
> after this list.
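On the third option, a minimal and untested sketch of what such filtering could
look like (TC_SKEW_GUARD is an invented threshold, not an existing kernel
constant):

	#define	TC_SKEW_GUARD	1024	/* max tolerated backwards step, in ticks */

	static __inline u_int
	tc_delta_filtered(struct timehands *th)
	{
		struct timecounter *tc = th->th_counter;
		u_int delta;

		delta = (tc->tc_get_timecount(tc) - th->th_offset_count) &
		    tc->tc_counter_mask;
		/*
		 * A delta just below the mask is most likely a small negative
		 * cross-CPU skew; clamp it to zero instead of advancing time
		 * by almost a full counter period.
		 */
		if (delta > tc->tc_counter_mask - TC_SKEW_GUARD)
			delta = 0;
		return (delta);
	}

The obvious risk is mistaking a genuine wrap of a very slow counter for skew, so
the guard would have to stay much smaller than the counter period.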

Am I right that the TSC synchronization test passes on your machine?  If yes,
you probably cannot read a 'slightly behind' timecounter after an IPI on
another core.  You might try changing the CPUID instruction in the test to
MFENCE and see whether the test is still able to pass.
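The SMP TSC test lives in sys/x86/x86/tsc.c; the suggestion above is to weaken
the serialization around RDTSC from CPUID to MFENCE.  A standalone sketch of the
two read variants (assuming x86_64 and GCC/Clang inline asm; this is not the
kernel's exact code):

	#include <stdint.h>

	static inline uint64_t
	rdtsc_cpuid(void)
	{
		uint32_t eax = 0, lo, hi;

		/* CPUID is fully serializing: nothing issued earlier can still
		 * be in flight when RDTSC executes. */
		__asm __volatile("cpuid"
		    : "+a" (eax) : : "%rbx", "%rcx", "%rdx", "memory");
		__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
		return ((uint64_t)hi << 32 | lo);
	}

	static inline uint64_t
	rdtsc_mfence(void)
	{
		uint32_t lo, hi;

		/* MFENCE orders memory operations only; it is cheaper than
		 * CPUID but provides weaker ordering around RDTSC. */
		__asm __volatile("mfence; rdtsc"
		    : "=a" (lo), "=d" (hi) : : "memory");
		return ((uint64_t)hi << 32 | lo);
	}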

Does the symptom disappear if you switch the eventtimer to LAPIC?
What happens if you turn off usermode gettimeofday()
by setting kern.timecounter.fast_gettime to 0?
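For reference, both knobs can be changed at runtime (sysctl names as on a stock
FreeBSD kernel; adjust if your tree differs):

	# use the local APIC timer as the event timer
	sysctl kern.eventtimer.timer=LAPIC

	# make gettimeofday()/clock_gettime() always take the syscall path
	sysctl kern.timecounter.fast_gettime=0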


