Date:      Sun, 26 Apr 2009 20:40:21 +0200
From:      Juergen Lock <nox@jelal.kn-bremen.de>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        kalinoj1@iem.pw.edu.pl, freebsd-emulation@FreeBSD.org
Subject:   Re: Recent qemu and timers issue
Message-ID:  <20090426184021.GA9545@triton.kn-bremen.de>
In-Reply-To: <20090424201623.N887@besplex.bde.org>
References:  <200904032223.n33MNTiq019599@triton.kn-bremen.de> <200904072137.n37LbbdC071227@triton.kn-bremen.de> <20090423214701.GA83621@triton.kn-bremen.de> <20090424201623.N887@besplex.bde.org>

On Fri, Apr 24, 2009 at 10:20:33PM +1000, Bruce Evans wrote:
> On Thu, 23 Apr 2009, Juergen Lock wrote:
> 
> > On Tue, Apr 07, 2009 at 11:37:37PM +0200, Juergen Lock wrote:
> >> In article <200904062254.37824.kalinoj1@iem.pw.edu.pl> you write:
> >>> On Saturday, 04 April 2009 at 00:23:29, Juergen Lock wrote:
> >>>> In article <c948bb4de85d1b2a340ac63a7c46f6d9@iem.pw.edu.pl> you write:
> >>> ...
> >> OK, I did some tests in the meantime on my newest box that has 4 cpus
> >> and is running 7-stable (the host):
> >>>>
> >>>>> My logs and console are full of messages like this:
> >>>>> calcru: runtime went backwards from 11248014678717988 usec to 119242 usec
> >>>>> for pid 60 (ps)
> >>>>>
> >>>>> calcru: runtime went backwards from 11248014678921829 usec to 2516605 usec
> >>>>> for pid 60 (ps)
> >>>>> calcru: runtime went backwards from 7014844 usec to 6270851 usec for pid 0
> >> ...
> >>>>> I tried to use all possible timers using sysctl, where I have:
> >>>>> TSC(800) HPET(900) ACPI-safe(850) i8254(0) dummy(-1000000)
> >>>>> None of these helped.
> 
> None of these are normally used for calculating runtimes.  Normally
> on i386, the TSC is used.

Aaah-haa, this I didn't know.

>  The only way to configure this is to edit
> the source code.  Try removing the calls to set_cputicker() in the MD
> code.  Then the MI and timecounter-based cputicker tc_cpu_ticks() will
> be used.

 Yup, that seemed to help indeed. (patch below.)
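
 For reference, tc_cpu_ticks() is a small wrapper that widens the
current timecounter's masked count to 64 bits -- roughly like the
following, quoted from memory, so don't take it as the exact
kern_tc.c source:

	static uint64_t
	tc_cpu_ticks(void)
	{
		static uint64_t base;	/* accumulated full wraps */
		static unsigned last;	/* previous raw reading */
		unsigned u;
		struct timecounter *tc;

		tc = timehands->th_counter;
		u = tc->tc_get_timecount(tc) & tc->tc_counter_mask;
		if (u < last)		/* counter wrapped: carry */
			base += (uint64_t)tc->tc_counter_mask + 1;
		last = u;
		return (u + base);
	}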

>  A better implementation would use a user-selectable
> timecounter-based cputicker in all cases, but usually not the system
> timecounter since that is likely to be very slow so as to be more
> accurate.
> 
 This was using qemu's emulated hpet...  I guess you mean slow to read
the counter value?  How often is the cputicker read, at every context
switch?  More often?
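
 (From a quick look at mi_switch() in kern_synch.c it's read at
least at every context switch, to account the outgoing thread's
runtime -- paraphrasing from memory:

	new_switchtime = cpu_ticks();
	runtime = new_switchtime - PCPU_GET(switchtime);
	td->td_runtime += runtime;	/* raw ticks; calcru() converts */
	td->td_incruntime += runtime;
	PCPU_SET(switchtime, new_switchtime);

but maybe it's read in more places too, hence the question.)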

> >>> At this moment it's very hard for me to judge if it's qemu or kernel issue..
> >>> Any thoughts?
> >>
> >>  Yeah, could be one or the other.
> >
> > I just noticed running any FreeBSD guest with bootverbose (boot -v,
> > or menu item 5 in the default beastie menu) gives me lots of these:
> > 	t_delta 16.0109662fcd2712a0 too long
> > 	t_delta 16.014a1099ee815ca0 too long
> > 	t_delta 16.0108d9c2ecd404c0 too long
> > 	t_delta 16.0108d032cb334b60 too long
> > 	t_delta 15.f58306afb53419a0 too short
> > 	t_delta 16.010a762538a38680 too long
> > etc.  My guess is this could be related...
> 
> As you probably know, virtualizing timers is hard to do correctly (see
> vmware docs), so the problem is most probably partly in qemu.  The
> cputicker code is much more fragile and/or broken than the timecounter
> code, so imperfect virtualization is likely to cause more problems
> with it.  The above t_delta printfs are a sign of problems.
> 
> It is not surprising that the recalibration code (cpu_tick_calibrate()
> (sic)) has timing problems with virtual timers.  When it detects a
> problem, it prints the above; then it continues using the old (possibly
> wrong) frequency.  When it doesn't detect a problem, and the recalibrated
> frequency would be larger, then it continues using the new (possibly
> wrong) frequency.
> 
> When it doesn't detect a problem, and the recalibrated frequency would
> be smaller, then it is just broken -- it continues using the old
> (certainly wrong) frequency.  When miscalibrations are normal as appears
> to be the case with qemu, this bug probably makes their effects less
> obvious:
>      It makes the miscalibrated frequencies monotonically increasing
>      so they will soon converge to a limit and the frequency used will
>      then be invariant until the next full calibration.  When the limit
>      is infinity, as is caused by messing up the timers completely (e.g.,
>      by sitting at a debugger prompt for more than a couple of seconds)
>      then other effects are fairly visible -- all user and sys times
>      are broken (nearly 0). When the limit is only a few percent larger
>      than the average, the user and sys times will be only a few percent
>      lower, provided the difference between the limit and the actual
>      (transient) frequency is not so large that calcru() sees problems
>      and tries to fix them.  Fixing there is fundamentally impossible
>      but in some cases the damage is small enough to hide.
> 
> In all cases, continuing with a wrong frequency will cause problems with
> the runtime calculations.  Continuing with a right (but changed) frequency
> is little better.  calcru() is fundamentally broken if the frequency
> actually changes, since it applies the current frequency to ticks
> accumulated at old frequencies.
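
 (Concrete example: suppose a thread has accumulated 2 * 10^9 ticks
while the cputicker really ran at 2 GHz, i.e. 1 second of runtime.
If the calibrated frequency later jumps to 4 GHz, calcru() divides
the same 2 * 10^9 ticks by the new frequency and suddenly reports
only 0.5 seconds -- exactly the `runtime went backwards' symptom.)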
> 
> The following patch fixes a couple of the problems in recalibration:
> 
> % diff -c2 ./kern/kern_tc.c~ ./kern/kern_tc.c
> % *** ./kern/kern_tc.c~	Thu Mar 20 01:05:27 2008
> % --- ./kern/kern_tc.c	Thu Mar 20 01:05:29 2008
> % ***************
> % *** 884,888 ****
> %   		return;
> % 
> % ! 	getbinuptime(&t_this);
> %   	c_this = cpu_ticks();
> %   	if (t_last.sec != 0) {
> % --- 884,888 ----
> %   		return;
> % 
> % ! 	binuptime(&t_this);
> %   	c_this = cpu_ticks();
> %   	if (t_last.sec != 0) {
> 
 I tried this too... (with tsc as cputicker)

> The existence of getbinuptime() is a bug, and its use here is especially
> wrong.  We should use the current time, and binuptime() gives this as
> well as possible.  The caller has just set the time cached for
> getbinuptime() to the current time, but this is not quite the same.
> There is no locking here, except the implicit locking given by being
> in a fast interrupt handler, so the time between the caching and the
> use of the value is unbounded.  We may be interrupted even if we are
> in a fast interrupt handler, depending on the implementation of fast
> interrupt handlers.  Emulation is quite likely to result in fast
> interrupt handlers being interrupted -- they are probably as "fast"
> (really low-latency) as possible on the target, but they may be
> interrupted on the host and then long delayed on the target in some
> cases (hopefully not often).

 (Yup, especially when the host is running other things too besides this
one vm guest...)

>  So using binuptime() is only slightly
> better here -- a delay may occur after the timecounter is read, no
> matter where it is read.  Something like the following is needed:
> 
> %%%
>  	/* Declarations and init added for completeness -- still a sketch. */
>  	struct bintime bt, best_bt, best_sample_bt, sample_bt, start_bt;
>  	uint64_t best_tsc, start_tsc, tsc;
>  	int nsamples = 0;
> 
>  	best_bt.sec = 1;	/* larger than any real sandwich error */
>  	best_bt.frac = 0;
>  	binuptime(&start_bt);
>  	start_tsc = rdtsc();
>  	do {
>  		binuptime(&sample_bt);
>  		tsc = rdtsc();
>  		binuptime(&bt);
>  		bintime_sub(&bt, &sample_bt);	/* cost of this sample */
>  		if (bintimecmp(&bt, &best_bt, <)) {
>  			best_bt = bt;
>  			best_sample_bt = sample_bt;
>  			best_tsc = tsc;
>  		}
>  		bt = sample_bt;
>  		bintime_sub(&bt, &start_bt);	/* elapsed since start */
>  	} while (++nsamples < min_nsamples ||
>  	    bintimecmp(&best_bt, min_error_btp, >));
> %%%
> 
> Reads of the timer being calibrated (TSC here) are sandwiched between
> reads of the reference timer and ones that can't be measured to within
> a few parts per billion are discarded.
> 
> I never saw any effects from the above change on non-virtual systems.
> Also, the sanity checks (which print the t_delta message) never fired
> for the extremely messed up timers given by running ddb.  When running
> ddb, normal timing stops working after a few seconds or milliseconds since
> timecounters require interrupts to work.  However, the cputicker keeps
> working precisely if it is the TSC.  So the cputicker : delta-binuptime
> ratio soon becomes effectively infinite.  The sanity checking code seems
> to always fail to detect this problem.
> 
> % ***************
> % *** 922,931 ****
> %   			c_delta <<= 20;
> %   			c_delta /= divi;
> % ! 			if (c_delta  > cpu_tick_frequency) {
> % ! 				if (0 && bootverbose)
> % ! 					printf("cpu_tick increased to %ju Hz\n",
> % ! 					    c_delta);
> % ! 				cpu_tick_frequency = c_delta;
> % ! 			}
> %   		}
> %   	}
> % --- 922,930 ----
> %   			c_delta <<= 20;
> %   			c_delta /= divi;
> % ! 			if ((0 && bootverbose))
> % ! 				printf(
> % ! 			    "cpu_tick_frequency changed from %ju to %ju Hz\n",
> % ! 				    cpu_tick_frequency, c_delta);
> % ! 			cpu_tick_frequency = c_delta;
> %   		}
> %   	}
> 
> This fixes the bug which makes cpu_tick_frequency non-decreasing.
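
 (E.g. with successive calibrations of 1.00, 1.03, and 0.97 GHz the
old code keeps 1.03 GHz forever -- it only ever ratchets upward,
even when a later, smaller value is closer to the truth.)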
> 
 ...and I tried this; unfortunately, neither change fixed the problem.

> Another thing you can try here is to edit the source code to change
> the set_cputicker() calls to say that the frequency is not variable.

 That probably won't help here because I noticed the initial tsc
`calibration' in the guest (in init_TSC()) is way off too: it got
less than half the actual frequency here, even though dmesg on this
host reports `TSC: P-state invariant'.
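
 (If I read tsc.c right, that initial calibration is basically just

	tsc1 = rdtsc();
	DELAY(1000000);		/* "one second", timed by the i8254 */
	tsc2 = rdtsc();
	tsc_freq = tsc2 - tsc1;

so whenever the emulated i8254 and the passed-through host tsc
disagree about how long that second took -- easy when the host is
busy -- the result ends up way off like the one I saw.)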

 OK, _maybe_ if we got the proper frequency into the guest from the
beginning and then declared it non-variable this could work, but that
still leaves the case of hosts whose tsc is not P-state invariant,
because...

> I used this temporarily to work around the non-decreasing calibration.
> This should be the default for emulators for most cputickers -- emulators
> should emulate a constant frequency and not emulate the complexities
> for power saving.

 Hmm, I guess that's easier said than done. :)  At least qemu
basically just passes the host tsc through when a guest reads it.

>  This should probably be the default when the cputicker
> is the i386 TSC and the TSC is P-state invariant.  Currently, P-state
> invariance only causes turning off of cpufreq's adjustment of the TSC
> as a timecounter.

 Oh, and thanx for this enlightening post, very much appreciated!

 And for anyone who would like to test a patch: (I think vm_guest is
only defined on head though, so this patch won't work on stable.)

Index: sys/i386/i386/tsc.c
@@ -79,6 +79,8 @@
 	800,			/* quality (adjusted in code) */
 };
 
+extern int	vm_guest;	/* Running as virtual machine guest? */
+
 void
 init_TSC(void)
 {
@@ -109,6 +111,16 @@
 	 * via tsc_freq_max().  This also will be updated if someone loads
 	 * a cpufreq driver after boot that discovers a new max frequency.
 	 */
+#if 1
+	/* Most(?) VMs' tsc is unreliable as a cputicker, causing
+	 * `calcru: runtime went backwards ...' messages.  (Returning here
+	 * falls back to the timecounter-based tc_cpu_ticks() instead of
+	 * rdtsc, which hopefully uses something better in the vm, like
+	 * acpi-fast or hpet.)  XXX is this also true for xen?
+	 */
+	if (vm_guest)
+		return;
+#endif
 	set_cputicker(rdtsc, tsc_freq, 1);
 
 	/* Register to find out about changes in CPU frequency. */
Index: sys/amd64/amd64/tsc.c
@@ -78,6 +78,8 @@
 	800,			/* quality (adjusted in code) */
 };
 
+extern int	vm_guest;	/* Running as virtual machine guest? */
+
 void
 init_TSC(void)
 {
@@ -100,6 +102,16 @@
 	 * via tsc_freq_max().  This also will be updated if someone loads
 	 * a cpufreq driver after boot that discovers a new max frequency.
 	 */
+#if 1
+	/* Most(?) VMs' tsc is unreliable as a cputicker, causing
+	 * `calcru: runtime went backwards ...' messages.  (Returning here
+	 * falls back to the timecounter-based tc_cpu_ticks() instead of
+	 * rdtsc, which hopefully uses something better in the vm, like
+	 * acpi-fast or hpet.)  XXX is this also true for xen?
+	 */
+	if (vm_guest)
+		return;
+#endif
 	set_cputicker(rdtsc, tsc_freq, 1);
 
 	/* Register to find out about changes in CPU frequency. */

 (I say most(?) VMs' tsc there because I found this issue mentioned
in the list archives for vmware and ms virtual server too, at least...
Also, the comments in sys/i386/i386/tsc.c:init_TSC_tc() say the tsc
is unreliable with apm because of cpu time spent in smm code, and
with VMs I'd say this is even worse because of cpu time spent in the
host kernel, other host processes, etc.)
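
 For anyone on stable who wants to experiment anyway, one possible
substitute for vm_guest -- untested, and plain qemu may well not set
this bit at all, so treat the little helper below as a hypothetical
sketch only -- would be checking the cpuid `hypervisor present' bit:

	#include <machine/cpufunc.h>	/* for do_cpuid() */

	static int
	maybe_vm_guest(void)
	{
		u_int regs[4];

		do_cpuid(1, regs);			/* cpuid leaf 1 */
		return ((regs[2] & 0x80000000) != 0);	/* ECX bit 31 */
	}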

 Thanx,
	Juergen


