Date:      Sun, 11 Mar 2012 11:01:49 -0600
From:      Ian Lepore <freebsd@damnhippie.dyndns.org>
To:        Adam Strohl <adams-freebsd@ateamsystems.com>
Cc:        FreeBSD-Stable ML <freebsd-stable@freebsd.org>
Subject:   Re: Time Clock Stops in FreeBSD 9.0 guest running under ESXi 5.0
Message-ID:  <1331485309.32194.63.camel@revolution.hippie.lan>
In-Reply-To: <4F5B0BB5.5010406@ateamsystems.com>
References:  <4F5B0BB5.5010406@ateamsystems.com>

On Sat, 2012-03-10 at 15:07 +0700, Adam Strohl wrote:
> I've now seen this on two different VMs on two different ESXi servers 
> (Xeon based hosts but different hardware otherwise and at different 
> facilities):
> 
> Everything runs fine for weeks then (seemingly) suddenly/randomly the 
> clock STOPS.  In the first case I saw a jump backwards of about 15 
> minutes (and then a 'freeze' of the clock).  The second time just 'time 
> standing still' with no backwards jump.  Logging accuracy is of course 
> questionable given the nature of the issue, but nothing really jumps out 
> (i.e., I don't see NTPd adjusting the time just before this happens or
> anything like that).
> 
> Naturally the clock stopping causes major issues, but the machine does 
> technically stay running.  My open sessions respond, but anything that 
> relies on time moving forward hangs.  I can't even gracefully reboot it 
> because shutdown/etc all rely on time moving forward (heh).
> 
> So I'm not sure if this is a VMware/ESXi issue or a FreeBSD issue, or
> some kind of interaction between the two.  I manage lots of VMware
> based FreeBSD VMs, but these are the only ESXi 5.0 servers and the only
> FreeBSD 9.0 VMs.  I have never seen anything quite like this before, and
> last night as I mentioned above I had it happen for the second time on a
> different VM + ESXi server combo so I'm not thinking it's a fluke
> anymore.  I've looked for other reports of this both in VMware and
> FreeBSD contexts and am not seeing anything.
> 
> What is interesting is that the 2 servers that have shown this issue 
> perform similar tasks, which are different from the other VMs which have 
> not shown this issue (yet).  This is 2 VMs out of a dozen VMs spread 
> over two ESXi servers on different coasts.  This might be a coincidence 
> but seems suspicious.  These two VMs run these services (whereas the
> other VMs don't):
> 
> - BIND
> - CouchDB
> - MySQL
> - NFS server
> - Dovecot 2.x
> 
> I would also say that these two VMs probably are the most active, have 
> the most RAM and consume the most CPU because of what they do (vs. the 
> others).
> 
> I have disabled NTPd since I am running the OpenVM Tools (which I
> believe should be keeping the time in sync with the ESXi host, which
> itself uses NTP).  My only guess is that maybe there was some kind of
> collision where NTPd and OpenVMTools were adjusting the time at the same
> time.  I'm playing the waiting game now to see what this brings (again
> though, I am running NTPd and OpenVMTools on all the other VMs which
> have yet to show this issue).
> 
> Anyone seen anything like this?  Ring any bells?
> 

I've run into the "time standing still" problem, but only on bringing up
FreeBSD on new hardware (usually industrial single-board computers).  In
those cases time never advances beyond the time obtained from the RTC
hardware at boot.  I've never seen it happen that time runs normally for
a while then stops advancing, but I have almost no experience with
FreeBSD as a VM guest OS.

When I have seen the problem, it's always been due to interrupt
problems, such as the timer tick handler getting hung or the selected
timer hardware not generating interrupts.  

It seems unlikely to me that ntpd and the vm tools would be fighting in
a way that caused this symptom.  The way ntpd affects timing is to step
the clock (which gets logged), or to numerically steer the kernel's
timekeeping routines.  The steering is clamped at 500 ppm; to make the
clock appear to stop it would have to steer at 1e6 ppm.  I've always
assumed that VM guest services daemons that handle timekeeping use the
same ntp_adjtime() interface to the kernel timekeeping that ntpd itself
uses, so the same steering limits would apply.
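
To put rough numbers on that, here is a small sketch in sh with awk.  The
function name slew_seconds is made up for illustration; it just converts
a steering rate in ppm into seconds of clock change over an interval.

```sh
#!/bin/sh
# slew_seconds PPM INTERVAL
# Print how many seconds of clock change a steering rate of PPM
# (parts per million) produces over INTERVAL seconds of real time.
# The function name is hypothetical, used only for this illustration.
slew_seconds() {
    awk -v ppm="$1" -v interval="$2" \
        'BEGIN { printf "%.1f\n", ppm * interval / 1000000 }'
}

# At the 500 ppm clamp, steering over one full day (86400 s):
slew_seconds 500 86400        # prints 43.2

# To hold the clock still for a day, steering would have to cancel
# every second that elapses, which is 1e6 ppm:
slew_seconds 1000000 86400    # prints 86400.0
```

So at the clamp, steering can move the clock by well under a minute per
day, which is nowhere near enough to make it look stopped.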

If it happens again, interesting data might be found in the output of:

  sysctl kern.timecounter
  sysctl kern.eventtimer
  vmstat -i
  ntpdc -c kerninfo
  <anything unusual in dmesg output>
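
Since the machine may be hard to work with once the clock has stopped, it
might help to have something like the following ready ahead of time.
This is only a sketch: the run_cmd and capture_clock_state names are made
up, the 50-line dmesg tail is an arbitrary choice, and it deliberately
avoids sleep or anything else that waits for time to advance.

```sh
#!/bin/sh
# run_cmd COMMAND [ARGS...]
# Print a header, run the command with stderr folded into stdout, and
# keep going even if the command fails or is not installed.
# (Helper name is hypothetical.)
run_cmd() {
    echo "=== $* ==="
    "$@" 2>&1 || echo "(command failed or is unavailable: $*)"
    echo
}

# Collect the timekeeping state listed above in a single pass.
capture_clock_state() {
    run_cmd sysctl kern.timecounter
    run_cmd sysctl kern.eventtimer
    run_cmd vmstat -i
    run_cmd ntpdc -c kerninfo
    # Only the tail of dmesg; recent messages are the interesting ones.
    run_cmd sh -c 'dmesg | tail -n 50'
}

capture_clock_state
```

Redirect the output to a file of your choosing so it survives the reboot.
Running it once while the system is healthy would also give a baseline to
compare the interrupt counts in vmstat -i against.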

-- Ian




