From owner-freebsd-questions@FreeBSD.ORG Fri Jan 7 03:21:08 2005
Return-Path:
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id EF10416A4CE
	for ; Fri, 7 Jan 2005 03:21:07 +0000 (GMT)
Received: from smtp1.wanadoo.fr (smtp1.wanadoo.fr [193.252.22.30])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 744B743D55
	for ; Fri, 7 Jan 2005 03:21:07 +0000 (GMT)
	(envelope-from atkielski.anthony@wanadoo.fr)
Received: from me-wanadoo.net (localhost [127.0.0.1])
	by mwinf0101.wanadoo.fr (SMTP Server) with ESMTP id ABCA71C00228
	for ; Fri, 7 Jan 2005 04:21:06 +0100 (CET)
Received: from pix.atkielski.com (ASt-Lambert-111-2-1-3.w81-50.abo.wanadoo.fr [81.50.80.3])
	by mwinf0101.wanadoo.fr (SMTP Server) with ESMTP id 794481C002A0
	for ; Fri, 7 Jan 2005 04:21:06 +0100 (CET)
X-ME-UUID: 20050107032106496.794481C002A0@mwinf0101.wanadoo.fr
Date: Fri, 7 Jan 2005 04:21:06 +0100
From: Anthony Atkielski
X-Priority: 3 (Normal)
Message-ID: <635409907.20050107042106@wanadoo.fr>
To: freebsd-questions@freebsd.org
In-Reply-To: <20050106152001.75c79b0d@fennec.24-119-122-191.cpe.cableone.net>
References: <20050106152001.75c79b0d@fennec.24-119-122-191.cpe.cableone.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 8bit
Subject: Re: Hardware or OS problem? System Crashing...
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
Reply-To: freebsd-questions@freebsd.org
List-Id: User questions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 07 Jan 2005 03:21:08 -0000

I had a very similar problem over the holidays.
After a power failure over a month ago, I noticed some anomalies in FreeBSD, but they were very insidious and didn't seem like hardware (and the system was on a UPS plus a surge protector, so I didn't think the power failure alone could have done damage, unless the power cycled many times over a short period). I'd get strange faults in programs from time to time, usually some type of memory fault--usually in Apache (since it uses most of the processor time), but sometimes in system programs that had never given trouble before. As time passed, the system would occasionally freeze, or I would even get kernel panics.

There never seemed to be any information left behind that could help me find out why the system was crashing (fault type, processes running, etc.), and error messages in logs were scarce. (If there is a way to debug FreeBSD crashes without running a kernel specifically set up for the purpose, I'd like to know what it is.)

Anyway, I suspected a virus--I had seen a virus infection on the Web server, but it had apparently never been activated because the firewall prevented it from "calling home." FreeBSD had never faulted before, so I ruled out the OS (it would not _suddenly_ develop a bug). I reinstalled everything just to see. It wasn't until I reinstalled and upgraded to FreeBSD 5.3 and got even more frequent mystery crashes that I felt sure that hardware was causing a problem.

It turned out that (I think) something had been damaged before or during the power failures. A motherboard failure earlier on had turned off the CPU fan. The fan worked, but the motherboard had stopped powering it, so it wasn't running. The AMD processor stayed cool enough to operate most of the time because the system is very lightly loaded processor-wise. However, at some point, something got the system into a tight loop, and the processor reached something above 120° C (around 300° F at one point, I think--I could _smell_ the system when I got into the room).
Amazingly, it still ran most of the time, but I think some part of the virtual memory logic was damaged, because most of the mystery faults were segment violations. The problem very gradually got worse, with the OS faulting more and more often, until it eventually got so bad that it would fault before the boot completed.

I finally replaced the entire machine--this time with _seven_ fans, and with an Intel processor that will simply shut down if it gets too hot, instead of cooking itself to death. I also upgraded to FreeBSD 5.3, and I updated all the other system software as well. There have been no problems since ... except for a panic in sysinstall during the first installation, which I think was an honest-to-goodness OS bug (it happened only once, and reminded me vaguely of a similar problem on my first installation of 4.3, years earlier). The gigabit Ethernet on the motherboard doesn't work reliably under FreeBSD, though, so I just reinstalled the 100 Mbps card from the old server, which works perfectly.

In summary, this was a hardware problem, but so subtle in the beginning that it wasn't at all clear that hardware was at fault--for a long time I suspected traces of a virus infection or something. Obviously, running Linux would not have made any difference. I did see filesystem corruption after the panics, which was to be expected, but as far as I know I never lost any actual data; fsck corrected the structure errors each time (sometimes from single-user mode, since it wouldn't always succeed in automatic checks). No OS can guarantee against data corruption on unreliable hardware, not even all-knowing, all-seeing Linux.

Maybe you need a new sysadmin.

-- 
Anthony