Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 21 Mar 2019 15:32:37 -0400
From:      Mark Johnston <markj@freebsd.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        rgrimes@freebsd.org, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r345359 - in head/sys/cddl/dev/dtrace: amd64 i386
Message-ID:  <20190321193237.GF8186@raichu>
In-Reply-To: <20190321180738.P1024@besplex.bde.org>
References:  <201903210252.x2L2qMSP022374@repo.freebsd.org> <201903210320.x2L3KIjc058579@gndrsh.dnsmgr.net> <20190321034917.GA8186@raichu> <20190321180738.P1024@besplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, Mar 21, 2019 at 10:44:49PM +1100, Bruce Evans wrote:
> On Wed, 20 Mar 2019, Mark Johnston wrote:
> 
> > On Wed, Mar 20, 2019 at 08:20:18PM -0700, Rodney W. Grimes wrote:
> >>> Author: markj
> >>> Date: Thu Mar 21 02:52:22 2019
> >>> New Revision: 345359
> >>> URL: https://svnweb.freebsd.org/changeset/base/345359
> >>>
> >>> Log:
> >>>   Don't attempt to measure TSC skew when running as a VM guest.
> >>>
> >>>   It simply doesn't work in general since VCPUs may migrate between
> >>>   physical cores.  The approach used to measure skew also doesn't
> >>>   make much sense in a VM.
> 
> I think it simply doesn't work.
> 
> jhb's src/tools/tools/tscdrift/tscdrift.c also simply doesn't work for
> measuring drift.  It mostly measures IPC latency and jitter from
> preemption in attempts to measure skew.  It might be possible to
> determine the current skew using a statistical analysis of the latency
> and a few more measurements (measure latency between each CPU in a
> pair instead of only from CPU0 to other CPUs), but it would be better
> to do more direct measurements.
> 
> >> "May" is the important aspect here, and to my knowledge both
> >> bhyve and Vmware provide a way to pin cpus there should be
> >> a way for us to turn this back on if it is desired.  Sticking
> >> it under the big knob vm_guest is probably not the most flexiable
> >> solution.
> >>
> >> Could we please have someway to force this back on?
> >
> > Even with pinning the skew computation is bogus in a VM.  On an idle
> > host running bhyve with pinned vCPUs it gives offsets that are several
> > orders of magnitude larger than on the host.  I would prefer to see a
> > specific reason for wanting the old behaviour before adding a knob.
> 
> The method is bogus (buggy) out of a VM.
> 
> XX static void
> XX dtrace_gethrtime_init(void *arg)
> XX {
> XX 	struct pcpu *pc;
> XX 	cpuset_t map;
> XX 	int i;
> XX #endif
> XX 
> XX 	/* The current CPU is the reference one. */
> XX 	sched_pin();
> 
> It is also necessary to disable interrupts in hardware.  Or maybe better
> and necessary, detect when they occur and discard related samples.  It
> is impossible to disable NMIs.

Not quite, just carry out the measurement from an NMI handler. :)

> Then there are SMIs.  Not to mention
> virtualization.  At best, it gives delays a bit like for SMIs, but longer.
> 
> Since this is init code, perhaps interrupts are already masked in hardware,
> but to detect drift in the skews, it is necessary to run a routine like this
> quite often.

It is worse: on amd64 and i386 dtrace is never compiled into the kernel
(as you discovered), so the measurement is done whenever the KLDs
happen to be loaded.

> XX 	tsc_skew[curcpu] = 0;
> XX 	CPU_FOREACH(i) {
> XX 		if (i == curcpu)
> XX 			continue;
> XX 
> XX 		pc = pcpu_find(i);
> XX 		CPU_SETOF(PCPU_GET(cpuid), &map);
> XX 		CPU_SET(pc->pc_cpuid, &map);
> XX 
> XX 		smp_rendezvous_cpus(map, NULL,
> XX 		    dtrace_gethrtime_init_cpu,
> XX 		    smp_no_rendezvous_barrier, (void *)(uintptr_t) i);
> 
> Some synchronization is necessary, but this is very heavyweight.  Even
> the lightweight synchronization in tscdrift.c has an average latency of
> about 150-200 cycles on freefall's Xeon: afer fixing bugs in it:
> 
> CPU | TSC skew (min/avg/max/stddev)
> ----+------------------------------
>    0 |     0     0     0    0.000
>    1 |    36    65 47913   101.227	# lower latency on same core
>    2 |   144   180 52977   139.862
>    3 |   147   180 34077   113.543
>    4 |   135   181 49335   228.104
>    5 |   123   175 33300   91.703
>    6 |   165   188 129825   166.999
>    7 |   165   189 248010   278.321
> 
> tscdrift measures from CPU0 to CPU_n for each n.  After fixing it to
> read the TSC on CPU0 before doing an IPC to allow reading the TSC on
> CPU_n, this measures mainly the IPC latency.  Actually reading the TSC
> takes about 24 cycles.  The skew appears to be 0 for all CPUs.  Anyway,
> it is so much smaller than the IPC latency that it can be ignored, provided
> it doesn't drift to become non-small, and applications don't try to be
> too smart about correcting for the latency.
> 
> The unfixed tscdrift allows the other CPU to read the TSC before reading
> it itself.  This is more like the above.  It still ends up measuring mainly
> IPC latency and other bugs in the program, since the read of the TSC on
> the same CPU occurs relatively instantly without the other bugs, and out
> of order (this possibly earlier than instantly) with the other bugs.  The
> IPC delay is about the same as above, but it takes more complicated
> statistics to prove this.
> 
> XX 
> XX 		tsc_skew[i] = tgt_cpu_tsc - hst_cpu_tsc;
> 
> This has even more complicated ordering.  The IPC delay depends on the
> internals of smp_rendezvous_cpus().
> 
> It should work better to schedule comparisons of 2 CPUs using a third CPU
> (this is possible except with precisely 2 CPUs).  2 CPUs sit waiting for
> permission to read their TSC.  Another CPU sets a variable to grant
> permission.
> 
> I plan to try a variation of this: all CPUs spin reading the TSC and
> appending the result atomically to a log buffer.  Then analyze the results.
> With no skew or interrupts, it is expected that the results are almost
> monotonic.  Contention causes non-monotonicity only occasionally.  The
> average non-mononicity should be 0.  If there is significant skew, then
> the order for the sameples form each pair of CPUs will reverse on almost
> every second reading.  The skew is the adjustment needed to stop this
> reversal.
> 
> Logs of the TSC for a single CPU show that the granularity is about 18
> on freefall's Xeon.  I don't know yet if this is more closely related to
> the clock rate or the rdtsc latency.  The TSC frequency is 3.3GHz.  I
> recently learned that a common implementation uses a shared clock of
> frequency only 100 MHz.  With no interpolation of this, the granularity
> is 33 cycles.  rdtsc itself takes about 18 cycles, so not much is gained
> from any interpolation, but some seems to be done.
> 
> On Haswell i4790K at 4 GHz, IIRC rdtsc seemed to take 24 cycles.  I haven't
> done the test to see if non-multiples of 24 are returned.  A 100 MHz clock
> would have a granularity of 40 cycles, so there is apparently some
> interpolation.
> 
> A commit log message by mav@ says that some Skylakes never return odd values
> for rdtsc, and some appear to have a granularity of about 186 (?) cycles.
> freefall's Xeon does return odd values for rdtsc.
> 
> XX 	}
> XX 	sched_unpin();
> XX }
> XX ...
> XX /*
> XX  * DTrace needs a high resolution time function which can
> XX  * be called from a probe context and guaranteed not to have
> XX  * instrumented with probes itself.
> XX  *
> XX  * Returns nanoseconds since boot.
> XX  */
> 
> dtrace does too much hand-rolling here, but it has to do something
> to handle the 2 problems mentioned in the comment.  Other subsystems
> handle this worse (*).
> 
> But it shouldn't try harder than the TSC timecounters to handle skew.
> The TSC timecounter ignores the problem of skew, but does other things
> more carefully.

It comes from Solaris/illumos, which uses the following technique to
measure skew during boot:

AP i:
    while (!flag)
        tsc[i] = rdtsc();

BSP:
    flag = 1;
    mfence();
    tsc = rdtsc();
    skew[i] = tsc - tsc[i];
    if (skew[i] < 0)
    	skew[i] = -skew[i];

There are some heuristics that attempt to correct for the latency
added by the mfence().

I don't think it makes much sense to measure skew on systems with an
invariant TSC.  If the TSC is not invariant, measuring it once at load
time is likely not enough.  Based on TSC-invariant systems I looked at,
the measured skew is small enough that the erroneous deltas probably
don't cause much harm.  In VMs, the deltas are large enough that they
create noticeable problems.



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20190321193237.GF8186>