From: "J. Seth Henry"
Date: Wed, 14 May 2003 16:48:14 -0400 (EDT)
To: freebsd-hardware@freebsd.org
Subject: SuperMicro P3TDL3-O locking under load with 4.8-REL
Message-id: <20030514153652.U1336@alexandria.gambrl01.md.comcast.net>
In-reply-to: <20030514190051.27E1A37B404@hub.freebsd.org>
References: <20030514190051.27E1A37B404@hub.freebsd.org>

I built a server around this board about 8 months ago. It has two 1GHz P-III processors and 1GB of RAM in 4 identical Samsung ECC registered DIMMs. There is an el-cheapo VGA card for console access, two sound cards, and a digiboard (which is about to be removed). Both the server and all network equipment were protected by a 750VA SmartUPS (which ensures the power is fairly clean).
The network gear is still protected by the 750VA UPS, and the server is now on a 650VA Back-UPS Pro (read on, the server was moved). I have a 450W power supply in the chassis.

The machine was intended to be a combination file/media server, X app server, home automation controller, and compile mule (being the fastest FreeBSD box in the house). When I first built it, I ran 4.7-REL on the system. It also ran folding@home when it was idle. This was a very stable setup for months, even during (and after) a heat problem (the A/C failed to come on). The onboard thermal alarm was going off, but the system was still running - so I manually halted the OS and powered it down. It ran non-stop for 3 months after this without so much as a hiccup. I mention this because the current symptoms are frighteningly similar to thermal problems I've seen on other systems.

Anyway, the overheat scared me enough to start reshuffling things. The closet it was stored in averages ~85degF as long as the A/C is working - it got above 95degF when the A/C failed. I didn't want to risk the machine melting down should the air go out again, so I replaced the home automation portion with a dedicated ITX-based system, and moved the server to a cooler room with much better ventilation (average temperature is about 10degF cooler). As a bonus, the closet temperature dropped as well - without the server, it runs about 80degF.

OK - so, while I was shuffling everything around, I figured it would be a good time to upgrade the OS. I did a binary upgrade to 4.8-REL, installed KDE 3.1 (to replace the icewm/mozilla combo), and upgraded a few other packages in the process.

And now it is locking up... No kernel panics, no beeps, nothing. It just stops. I've actually been typing in a remote xterm, and it's stopped in mid-word...

I've checked the temperature in the room and in the case, and both are well within tolerance.
I can't check the chip temps, because the ServerWorks LM78 setup isn't supported in FreeBSD (yet?), but the chips don't appear to be too warm to the touch. Heck, the environment is actually better than it was before!

Since the machine was physically moved, I checked the obvious. I reseated all of the DIMMs and PCI boards - even the CPUs. I checked all the fans to make sure they were still functional (they are). The machine appears to be fine physically. Although I can't check after boot, I used the BIOS to verify that the power supply voltages were OK as well (they were, though the 12V line had dipped 0.13V to 11.87V). The 5V, 3.3V, and 2.5V supplies were spot on at 5.07, 3.34, and 2.51. Apparently, the -12V and -5V lines weren't deemed important enough to monitor.

So, in summary - the differences:

1) Room temp dropped from 85degF avg to 76degF avg. The system spent 8 months at 85degF (ambient air temp).
2) Went from 4.7-REL to 4.8-REL. The system did not lock up in 4.7, despite adverse conditions - it does lock up regularly in 4.8, even with more ideal conditions.
3) Started serving up KDE 3.1 instead of icewm/mozilla/xterms (a fairly significant increase in network IO).

Although I'm looking for help, I'm going to try "downgrading" the server to 4.7-REL, and see if that improves the situation. I'm also considering pulling the drives out and loading Linux on it, so I can monitor the LM78 subsystem and put the machine under some extreme load. I'm also looking at reducing the load by stopping folding@home. It ran on the system the entire time it was in the closet, but I'm desperate to stabilize this box.

My suspicions, in order:

Power supply - even though the air from the vent isn't unusually warm, this smells (so to speak) like a power problem. I REALLY wish I could access the voltage monitoring stuff. (This board can monitor damn near everything, but FreeBSD doesn't support the monitoring hardware.) Oh well, looks like it's voltmeter time.
One or more CPUs are overheating under load, and some internal thermal protection circuit is kicking in (naturally causing the system to halt). I would imagine that this, combined with a quirk in the kernel, is causing the lockups. My guess would be that CPU0 is crapping out, and since the kernel can only run on the first CPU...
RAM - I bought the best I could afford, but it's still a possible suspect. This seems unlikely, though: the board has reported exactly 2 ECC "problems" in the BIOS log in nearly 8 months. However, the RAM hasn't been independently tested.

As an aside, I know Linux users have a tool to read the ServerWorks LM78 monitoring system. Is there anything in the works for FreeBSD support? There are monitors for every voltage on the PSU, fan speeds, temperatures, etc. - just sitting there waiting to be accessed.

As a SECOND aside, does anyone know of a reputable power supply vendor? I'm willing to spend some cash for a high quality PSU - just as soon as I find one. The current supply is an Antec 450W.

Thanks in advance,
Seth Henry