From: Matt Juszczak
Date: Mon, 27 Jun 2005 01:01:09 -0400
To: freebsd-stable@freebsd.org
Subject: FreeBSD -STABLE servers repeatedly crashing.
Message-ID: <42BF8815.6090909@atopia.net>

Hello all,

About three weeks ago, I upgraded my 5.3-RELEASE boxes to 5.4-RELEASE. I also turned on procmail globally on our mail server. Here is our current FreeBSD server setup:

URANUS - primary LDAP
CALIBAN - secondary LDAP
ORION - primary mail

Orion was the first one to crash, about three weeks ago. Orion is constantly talking to uranus, because uranus is our primary LDAP server (we have a planet naming scheme), and caliban is our secondary LDAP server. I ran an email flood test on orion to see if I could crash it again. This time, the heavy request load on uranus caused uranus to crash instead.

With two different servers on two different hardware setups crashing, I had to start thinking about what could be causing the problem. Memory tests on both servers came back OK. Orion had some ECC errors, which it was able to correct. I wasn't able to catch orion's first crash, but I was able to catch uranus's first crash:

http://paste.atopia.net/126

I have the other crashes written down in pencil at work. They all say mostly the same thing. I assume caliban would also show this behavior, but because it receives very little load (it only does anything when uranus dies), I am not able to confirm this.

The only thing the boxes have in common is that all three have two processors and are running SMP. Orion had hyperthreading turned on, but I disabled it in the BIOS, to no avail. Someone with similar experiences running SMP advised me to upgrade to -STABLE as of last week. For almost a week, orion ran fine. This evening, however, orion crashed once again, its fourth time in three weeks. Uranus has been stable for a few days, but I am expecting it to crash again any day now (it usually takes 4-6 days between crashes).

So now I am stuck. I have two -STABLE machines which continue to cause kernel traps. Tomorrow, I am going to compile a debugging kernel on orion and let it crash again to see what kind of errors it reports, but I was wondering if anyone else is experiencing these problems.

Thanks in advance,

Matt Juszczak