From: alex@schnarff.com
To: freebsd-questions@freebsd.org
Cc: Jean Lagarde
Date: Wed, 28 Feb 2007 12:21:08 -0500
Subject: Stability Issues on 5.4-RELEASE Box
Message-ID: <20070228122108.bhd56o5wn4ss8c4g@mail.schnarff.com>

Hello All,

I've recently fallen into the task of administering a FreeBSD 5.4-RELEASE box that acts as the web server for a small
non-profit that I volunteer for. Unfortunately, the system has been having some extremely vexing stability issues over the last month or so, which even my 6+ years of experience as an OpenBSD admin have not helped me track down.

First things first, let me say explicitly that I'm not trying to say "FreeBSD sucks, it's not stable" or anything like that. It's a fine OS, and I'm sure that it's either faulty hardware or a misconfiguration of some sort causing these problems. :-)

That said, here are some of the symptoms the box has been experiencing:

* Occasional random reboots. I've only personally witnessed one, and they don't happen often, but any time a *NIX box just reboots for no apparent reason (there was no indication of a problem in any of the logs, at least that I could see), something really bad is going on.
* Random extreme slowness when logging in via SSH, with the time to get a shell ranging from a second or two all the way up to 80 seconds. The box isn't busy enough that it's just slow due to load (especially since, once you're in, things fly), and it's not just a reverse DNS issue like I've seen on OpenBSD (this occurs even when logging in from locations listed in /etc/hosts that resolve properly out of that file). Until I upgraded to the current version of OpenSSL/OpenSSH, the box would occasionally just become unresponsive altogether over SSH, not allowing logins for 15+ minutes at a time.
* Files that are sometimes not found on startup, but are found at other times. Prime example: the Zope CMS system that's been installed failed to find libmysqlclient.so after a planned soft reboot, but found it with no trouble on a subsequent boot a few minutes later, with no config changes in between.
* A warning in /var/log/messages that the root filesystem was full, when it was at 60% capacity (and something like 2% inode capacity); the problem has yet to repeat, though no files have been cleared off of that filesystem.
* Random crashes of the Zope/Plone system that's running the main part of the web site. While I realize that, in and of itself, this means nothing about the stability of the underlying OS, in the context of all of the other things going on (as well as the fact that the Zope list has been unable to help figure out why it's crashing), it seems like it might be further evidence of a larger problem.

Thus far, besides simply scanning log files, constantly watching "top" and "ps", etc., I've not been able to do much with the box. As I said, I upgraded OpenSSL/OpenSSH to current versions, and I installed pf as the firewall (there was none before I arrived...don't even get me started on that). This weekend the guy who was the previous admin will be running a Memtest for me and disabling hyperthreading (which there's no performance justification for, and which has caused me stability issues at least on Linux in the past), since the server is in Oregon and I'm in the DC area. That's about the extent of what I've been able to do to date, since this is a production box.

What I'd like to know from you guys is:

* Am I justified in suspecting hyperthreading as a potential cause of instability?
* Does 5.4-RELEASE have any known bugs that might cause stability issues like the ones I've described here? More importantly, would an upgrade to 6.2-RELEASE be worthwhile (as is my instinct), in terms of being generally more stable and/or having better hardware support? Would such an upgrade be possible/relatively painless to perform without being physically at a console, as has been the case with OpenBSD over the years?
* Given my dmesg below, do you see any specific problems?
* Do you have any other suggestions for debugging this problem?

Thanks in advance for any help you can provide. :-)

Alex Kirk

dmesg:

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.4-RELEASE #0: Sun May 8 10:21:06 UTC 2005
root@harlow.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC
ACPI APIC Table:
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 3.20GHz (3200.01-MHz 686-class CPU)
  Origin = "GenuineIntel" Id = 0xf43 Stepping = 3
  Features=0xbfebfbff
  Hyperthreading: 2 logical CPUs
real memory = 2137509888 (2038 MB)
avail memory = 2086207488 (1989 MB)
ioapic0: Changing APIC ID to 2
ioapic0 irqs 0-23 on motherboard
npx0: on motherboard
npx0: INT 16 interface
acpi0: on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
cpu0: on acpi0
acpi_button0: on acpi0
pcib0: port 0xcf8-0xcff on acpi0
pci0: on pcib0
pci0: at device 2.0 (no driver attached)
pcib1: at device 28.0 on pci0
pci1: on pcib1
pcib2: at device 28.2 on pci0
pci2: on pcib2
pcib3: at device 28.3 on pci0
pci3: on pcib3
pcib4: at device 28.4 on pci0
pci4: on pcib4
pcib5: at device 28.5 on pci0
pci5: on pcib5
uhci0: port 0x2080-0x209f irq 23 at device 29.0 on pci0
usb0: on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: port 0x2060-0x207f irq 19 at device 29.1 on pci0
usb1: on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: port 0x2040-0x205f irq 18 at device 29.2 on pci0
usb2: on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
uhci3: port 0x2020-0x203f irq 16 at device 29.3 on pci0
usb3: on uhci3
usb3: USB revision 1.0
uhub3: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub3: 2 ports with 2 removable, self powered
pci0: at device 29.7 (no driver attached)
pcib6: at device 30.0 on pci0
pci6: on pcib6
fxp0: port 0x1100-0x113f mem 0x88000000-0x8801ffff,0x88021000-0x88021fff irq 21 at device 0.0 on pci6
miibus0: on fxp0
inphy0: on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: Ethernet address: 00:02:b3:d5:4d:3f
ahc0: port 0x1000-0x10ff mem 0x88020000-0x88020fff irq 22 at device 1.0 on pci6
aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
isab0: at device 31.0 on pci0
isa0: on isab0
atapci0: port 0x20b0-0x20bf,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 irq 18 at device 31.1 on pci0
ata0: channel #0 on atapci0
ata1: channel #1 on atapci0
atapci1: port 0x20a0-0x20af,0x20e8-0x20eb,0x20c0-0x20c7,0x20ec-0x20ef,0x20c8-0x20cf irq 19 at device 31.2 on pci0
ata2: channel #0 on atapci1
ata3: channel #1 on atapci1
pci0: at device 31.3 (no driver attached)
fdc0: port 0x3f0,0x3f0-0x3f5 irq 6 drq 2 on acpi0
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
orm0: at iomem 0xcc800-0xccfff,0xcb000-0xcc7ff on isa0
pmtimer0 on isa0
atkbdc0: at port 0x64,0x60 on isa0
atkbd0: irq 1 on atkbdc0
kbd0 at atkbd0
ppc0: parallel port not found.
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
RTC BIOS diagnostic error 80
Timecounter "TSC" frequency 3200012824 Hz quality 800
Timecounters tick every 10.000 msec
acd0: CDRW at ata0-slave PIO4
Interrupt storm detected on "irq19: uhci1+"; throttling interrupt source
ad4: 238475MB [484521/16/63] at ata2-master UDMA33
ad5: 238475MB [484521/16/63] at ata2-slave UDMA33
ad6: 238475MB [484521/16/63] at ata3-master UDMA33
ad7: 238475MB [484521/16/63] at ata3-slave UDMA33
Waiting 15 seconds for SCSI devices to settle
sa0 at ahc0 bus 0 target 6 lun 0
sa0: Removable Sequential Access SCSI-3 device
sa0: 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
Mounting root from ufs:/dev/ad4s1a
IP Filter: v3.4.35 initialized. Default = pass all, Logging = enabled
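
For anyone skimming a dmesg this long, here is a minimal sketch, assuming Python is available, of one way to pull out the lines that might deserve a closer look. The pattern list and the function name flag_suspicious are hypothetical and not exhaustive; the patterns were picked only because they match lines that appear in the dmesg above, and a match is not proof of a fault.

```python
import re

# Hypothetical, non-exhaustive patterns (an assumption for illustration):
# each one matches a line present in the dmesg quoted in this message.
SUSPICIOUS_PATTERNS = [
    r"interrupt storm",
    r"diagnostic error",
    r"not found",
    r"may not be enabled",
]


def flag_suspicious(dmesg_text):
    """Return the dmesg lines matching any pattern, in their original order."""
    combined = re.compile("|".join(SUSPICIOUS_PATTERNS), re.IGNORECASE)
    return [
        line.strip()
        for line in dmesg_text.splitlines()
        if combined.search(line)
    ]


# A few lines copied from the dmesg above, used as sample input.
sample = """\
kbd0 at atkbd0
ppc0: parallel port not found.
RTC BIOS diagnostic error 80
Interrupt storm detected on "irq19: uhci1+"; throttling interrupt source
Mounting root from ufs:/dev/ad4s1a
"""

for flagged in flag_suspicious(sample):
    print(flagged)
```

The same function could be given the full text of the dmesg instead of the sample; the patterns would need extending for other hardware or kernel versions.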