Date:      Mon, 27 Jun 2005 01:01:09 -0400
From:      Matt Juszczak <matt@atopia.net>
To:        freebsd-stable@freebsd.org
Subject:   FreeBSD -STABLE servers repeatedly crashing.
Message-ID:  <42BF8815.6090909@atopia.net>

Hello all,

About three weeks ago, I upgraded my 5.3-RELEASE boxes to 5.4-RELEASE.  
I also turned on procmail globally on our mail server.  Here is our 
current FreeBSD server setup:

URANUS  -  primary ldap
CALIBAN -  secondary ldap
ORION   -  primary mail

Orion was the first one to crash, about three weeks ago.  Orion is 
constantly talking to uranus, because uranus is our primary ldap server 
(we have a planet scheme), and caliban is our secondary ldap server.  I 
ran an email flood test on orion to see if I could crash it again.  This 
time, the heavy request load on Uranus caused Uranus to crash instead.  With two 
different servers on two different hardware setups crashing, I had to 
start thinking of what could be causing the problem.

Memory tests on both servers came back OK.  Orion had some ECC errors 
which it was able to fix.  I wasn't able to catch orion's first crash, 
but I was able to catch uranus's first crash:

http://paste.atopia.net/126

I have the other crashes written down in pencil at my work.  They all 
say mostly the same thing.  I assume Caliban would also experience this 
behavior, but because it does not receive much load at all (only does 
anything when uranus dies), I am not able to confirm this.

The only thing similar between the boxes is that all three have two 
processors in them, and are running SMP.  Orion had hyperthreading 
turned on, but I disabled it in the BIOS, to no avail.

Someone with similar experiences running SMP advised me to upgrade to 
-STABLE as of last week.  For almost a week, Orion ran fine.  This 
evening, however, Orion crashed once again, its fourth time in three 
weeks.  Uranus has been stable for a few days, but I am expecting it to 
crash again any day now (the machines usually take between 4 and 6 days 
to crash).

So now I am stuck.  I have two -STABLE machines which continue to hit 
kernel traps.  Tomorrow, I am going to compile a debugging kernel on 
orion and try to let it crash again to see what kind of errors it 
reports, but I was wondering if anyone else is experiencing these problems.

Thanks in advance,

Matt Juszczak


