Date:      Wed, 22 Jun 2005 10:12:04 -0600
From:      "Chad Leigh -- Shire.Net LLC" <chad@shire.net>
To:        Matt Juszczak <matt@atopia.net>
Cc:        freebsd-questions questions <freebsd-questions@freebsd.org>
Subject:   Re: FreeBSD Machines dieing, we've tried so much....
Message-ID:  <41AD7E3D-E59C-4AAF-803F-11048A005D44@shire.net>
In-Reply-To: <42B98AD0.7080508@atopia.net>
References:  <LOBBIFDAGNMAMLGJJCKNGEMKFBAA.tedm@toybox.placo.com> <42B98AD0.7080508@atopia.net>

On Jun 22, 2005, at 9:59 AM, Matt Juszczak wrote:

>
>
>> The vast majority of panics are hardware-related.  It is rare nowadays
>> for a usermode program to make the system panic.  In particular you said
>> the problem happens more under load.  That really points even more to a
>> hardware problem - bad CPU cache RAM, bad RAM, SCSI termination, that
>> sort of thing.
>>
>> Ted
>>
>>
>
> This is kind of going to be a blanket post to all the recent  
> suggestions to me.  I appreciate suggestions :)   Ted, sorry, my  
> other posts had dmesg and hardware specs, etc. I just couldn't  
> remember the subject line of that thread. I'll be more descriptive  
> here.
>
> We have two different servers crashing.  Both are SMP, but on  
> different hardware.  We have five FreeBSD servers in total, and  
> only two are affected.  That is why I do not believe this is a  
> hardware problem.
>
> In any case, the machines are in a cold room where the temperature  
> is constantly maintained.  20 other servers in there are perfectly  
> stable, with no problems.
>
> This particular machine that crashed last night while running  
> portsdb -uU is a Super Micro machine, with hyperthreading disabled  
> in the BIOS, dual 3.06 GHz CPUs, and 4 GB of memory.  We ran a memory  
> test on orion (the machine that crashed last night) a week or so  
> ago, and it found 70,000 ECC errors.  Those were fixed and that  
> machine has been stable until last night.  I've now disabled SMP  
> support; we'll see if that keeps it stable or not.  portsdb -uU ran  
> without problems after I disabled SMP.
>
> As far as uranus, the other box (we keep a planet scheme for a  
> certain set of servers), we ran memtest86 and found no errors at  
> all.  That box crashed about two days ago but has been stable  
> since.  It has not lasted more than a week without doing a kernel  
> trap and freezing.
>
> It seems that both these servers have this problem.  Out of the  
> five FreeBSD servers we have, these two are the ones with the  
> highest load.  Maybe a higher load on the other three servers would  
> cause the same problem.  I agree with you that this could be a  
> hardware problem, but seeing it on more than one server, with two  
> different architectures and our highest load, makes me reconsider.
>
> If this is truly a bug in FreeBSD 5.4-RELEASE, maybe this is  
> something that has been fixed in -stable?  I will compile a debug  
> kernel today and try to provide a trace of the problem.  I'll do it  
> on whichever server crashes next.


What do they have in common?  Disk controller?  Network controller?

Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad@shire.net




