From: Jeremy Chadwick
To: Jurgen Weber
Cc: freebsd-stable@freebsd.org
Date: Tue, 28 Sep 2010 02:30:51 -0700
Subject: Re: cpu timer issues
Message-ID: <20100928093051.GA59282@icarus.home.lan>
In-Reply-To: <4CA19F27.6050903@ish.com.au>

On Tue, Sep 28, 2010 at 05:54:15PM +1000, Jurgen Weber wrote:
> Hello List
>
> We have been having issues with some firewall machines of ours using
> pfSense.
>
> FreeBSD smash01.ish.com.au 7.2-RELEASE-p5 FreeBSD 7.2-RELEASE-p5 #0:
> Sun Dec 6 23:20:31 EST 2009 sullrich@FreeBSD_7.2_pfSense_1.2.3_snaps.pfsense.org:/usr/obj.pfSense/usr/pfSensesrc/src/sys/pfSense_SMP.7
> i386
>
> MotherBoard: http://www.supermicro.com/products/motherboard/Xeon3000/3200/X7SBi-LN4.cfm
>
> Originally the systems started out by showing a lot of packet loss,
> the system time would fall behind, and the value of "#vmstat -i |
> grep timer" was dropping below 2000. I was led to believe by the
> guys at pfSense that this is where the value should sit. I would
> also receive errors in messages that looked like " kernel: calcru:
> runtime went backwards from 244314 usec to 236341".
>
> We tried a variety of things, disabling USB, turning off the Intel
> Speed Step in the BIOS, disabling ACPI, etc, etc. All having little
> to no effect. The only thing that would right it is restarting the
> box but over time it would degrade again. I talked to SuperMicro
> and they said that this is a FreeBSD issue and pretty much washed
> their hands of it.
>
> After a couple of months of dealing with this and just rebooting the
> systems regularly, the symptoms slowly but surely disappeared. eg.
> The kernel messages went away, the system time was not falling
> behind and I was experiencing no packet loss but the "#vmstat -i |
> grep timer" value would continue to decrease over time. Eventually I
> think, when it finally got to 0 the machine restarted (I am only
> guessing here).
>
> After this restart it worked again for a couple of hours and then it
> restarted again.
>
> After the second time the system has not missed a beat, it has been
> fine and the "#vmstat -i | grep timer" value remained near the 2000
> mark... We set up some zabbix monitoring to watch it. As mentioned it
> was fine for about a month. Until today. Today the value has dropped
> to 0, but the system has not restarted and over the last couple of
> hours the value has increased to 47.
>
> This machine is mission critical, we have two in a failover
> scenario (using pfSense's CARP features) and it seems unfortunate
> that we have an issue with two brand new SuperMicro boxes that
> affects both machines. While at the moment everything seems fine I
> want to ensure that I have no further issues. Does anyone have any
> suggestions?
>
> Lastly I have double-checked both of the below:
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/troubleshoot.html#CALCRU-NEGATIVE-RUNTIME
> We disabled EIST.
>
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/troubleshoot.html#COMPUTER-CLOCK-SKEW
>
> # dmesg | grep Timecounter
> Timecounter "i8254" frequency 1193182 Hz quality 0
> Timecounters tick every 1.000 msec
> # sysctl kern.timecounter.hardware
> kern.timecounter.hardware: i8254
>
> Only have one timer to choose from.

I have a subrevision of this motherboard in use in production, which ran
RELENG_7 and now runs RELENG_8, without any of the problems you
describe.  I don't have any experience with the -LN4 submodel, although
I do have experience with the X7SBA-LN4.

Our hardware in question:
http://www.supermicro.com/products/system/1U/5015/SYS-5015B-MT.cfm

The machine in question consists of 4 disks (1 OS, 3 ZFS raidz1), uses
both NICs (two separate networks) at gigE rates, handles nightly backups
for all other servers, acts as an NFS server, a time source (ntpd) for
other servers on the network, and a serial console head.  Oh, it also
has EIST enabled, and runs powerd with some minor (well-known) tunings
in loader.conf for it.

Secondly, here's our sysctl kern.timecounter tree on our system, in
addition to our SMBIOS details (proving the system is what I say it is).
Note that we have multiple timecounter choices, and ACPI-fast is chosen.
I would expect problems if i8254 were chosen, but the question is why
this is being chosen on your systems and why alternate timecounter
choices aren't available.
You said you tried booting with ACPI disabled, which might explain why
ACPI-fast or ACPI-safe are missing.

$ sysctl kern.timecounter
kern.timecounter.tick: 1
kern.timecounter.choice: TSC(-100) ACPI-fast(1000) i8254(0) dummy(-1000000)
kern.timecounter.hardware: ACPI-fast
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.i8254.counter: 47135
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.ACPI-fast.counter: 188736
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.quality: 1000
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 2830682562
kern.timecounter.tc.TSC.frequency: 2333508681
kern.timecounter.tc.TSC.quality: -100
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 1

$ kenv | grep smbios
smbios.bios.reldate="07/24/2009"
smbios.bios.vendor="Phoenix Technologies LTD"
smbios.bios.version="1.30 "
smbios.chassis.maker="Supermicro"
smbios.chassis.serial="0123456789"
smbios.chassis.tag=" "
smbios.chassis.version="0123456789"
smbios.memory.enabled="8388608"
smbios.planar.maker="Supermicro"
smbios.planar.product="X7SBi"
smbios.planar.serial="0123456789"
smbios.planar.version="PCB Version"
smbios.socket.enabled="1"
smbios.socket.populated="1"
smbios.system.maker="Supermicro"
smbios.system.product="X7SBi"
smbios.system.serial="0123456789"
smbios.system.uuid="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
smbios.system.version="0123456789"
smbios.version="2.5"

Thirdly, here are our BIOS settings (using BIOS 1.30, which is referred
to as "R 1.3a" on Supermicro's site):

--------------------
Supermicro SuperServer 5015B-MT BIOS Settings

=============================================
Current BIOS: 1.30
=============================================

Reset to Factory Defaults, then change:

* Main
  * Date --> Set to GMT, not local time!
  * Serial ATA
    --> Native Mode Operation --> Serial ATA
    --> SATA AHCI Enable --> Enabled
* Advanced
  * Boot Features
    --> Quiet Boot --> Disabled
  * I/O Device Configuration
    --> Serial port B --> Disabled
    --> Parallel port --> Disabled
  * Console Redirection
    --> Com Port Address --> On-board COM A
    --> Baud Rate --> 115.2K
    --> Console Type --> VT100+
    --> Continue C.R. after POST --> On (SEE NOTE #2)

NOTE #2: CR after POST
========================
If the system is running RELENG_7, ***do not*** enable this option.
The bootloader and thus kernel appear to get confused by who controls
the interrupt, and you end up without *any* serial console output,
period.  RELENG_8 has addressed this problem, and you *should* enable
this feature when using that OS.  This will allow you to see LAN option
ROM messages during PXE booting, or boot0 (if you use it; usually we
don't).
--------------------

Since you have two systems with the same problem, I really don't know
what to tell you.  What I can tell you is that we've run RELENG_7 and
RELENG_8 on all of the following hardware without any problems:

* Supermicro SuperServer 5015B-MTB
  http://www.supermicro.com/products/system/1U/5015/SYS-5015B-MT.cfm
* Supermicro SuperServer 5015M-T+B
  http://www.supermicro.com/products/system/1U/5015/SYS-5015M-T_.cfm
* Supermicro X7SBA
  http://www.supermicro.com/products/motherboard/Xeon3000/3210/X7SBA.cfm
* Supermicro X7SBL-LN2
  http://www.supermicro.com/products/motherboard/Xeon3000/3200/X7SBL-LN2.cfm

Can you provide any tuning you do in loader.conf or sysctl.conf, as well
as your kernel configuration?

Otherwise, if you continue to have problems of this nature, I would
strongly recommend replacing the hardware.  Clock skew of this nature,
at least based on what I've seen at my day/night job, is usually the
sign of a crystal going bad on the motherboard.
Yes, I realise you have two systems which are exhibiting the same
behaviour, but for all I know a manufacturer (not Supermicro) released a
batch of bad crystals into the market.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
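
The timecounter comparison above (ACPI-fast selected on the working
system, only i8254 available on the affected ones) can be turned into a
small check.  What follows is a minimal sketch, not part of the original
message: the function name best_timecounter is made up for illustration,
and the sample strings are copied from the sysctl output quoted earlier
rather than read from a live machine.  It relies on the kernel
preferring the timecounter with the highest quality value, and warns
when the active one is not the best listed.

```shell
#!/bin/sh
# best_timecounter (hypothetical helper): print the name with the highest
# quality from a kern.timecounter.choice-style string such as
# "TSC(-100) ACPI-fast(1000) i8254(0) dummy(-1000000)".
best_timecounter() {
    printf '%s\n' "$1" | tr ' ' '\n' | awk -F'[()]' '
        # Each record looks like NAME(QUALITY); keep the largest quality.
        NF >= 2 && (best == "" || $2 + 0 > q) { best = $1; q = $2 + 0 }
        END { if (best != "") print best }'
}

# On a FreeBSD host the live values would come from (not run here):
#   choice=$(sysctl -n kern.timecounter.choice)
#   current=$(sysctl -n kern.timecounter.hardware)
# Sample values taken from the output shown in this message:
choice='TSC(-100) ACPI-fast(1000) i8254(0) dummy(-1000000)'
current='ACPI-fast'

best=$(best_timecounter "$choice")
if [ "$best" = "$current" ]; then
    echo "OK: $current is the highest-quality timecounter"
else
    echo "WARN: using $current, but $best has higher quality"
fi
```

On a system where i8254 is the only entry in kern.timecounter.choice,
this check passes trivially, so it is worth also looking at how many
choices are listed at all.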