Date:      Tue, 18 Jan 2005 10:59:52 +0000 (GMT)
From:      Robert Watson <rwatson@freebsd.org>
To:        Vivek Khera <vivek@khera.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: 5.3-RELEASE crashes during make buildworld (and other problems)
Message-ID:  <Pine.NEB.3.96L.1050118105302.17975B-100000@fledge.watson.org>
In-Reply-To: <557348B4-6906-11D9-B522-000A95D14982@khera.org>

On Mon, 17 Jan 2005, Vivek Khera wrote:

> On Jan 13, 2005, at 4:46 AM, Peter Jeremy wrote:
> 
> > That doesn't totally rule out hardware.  Pattern-sensitive memory
> > problems may not show up on different operating systems (or even
> > different kernels).  That said, based on the trap information, I'd
> > look at a software cause first.
> 
> Indeed.  I once had a box that would run Linux 100% stable under any
> load for months on end, but with BSD/OS it would crap out (random
> processes fail) after a max of 3 weeks requiring a reboot. 
> 
> Never rule out bad hardware, especially with PC crap. 

Even minor OS revisions can reveal or hide memory problems.  For example,
for quite a while one of my Pentium (1!) server boxes had a single-bit
error (a stuck-on bit) that fell into a section of memory that always held
pinned kernel pages, and in particular, ended up holding a fairly obscure
kernel code branch in a module that was loaded.  Then one day the kernel
memory layout got changed a bit, and the page ended up being mapped into
user memory, resulting in frequent application segfaults and data
corruption.  I was sure it was the OS upgrade, since backing out to the
previous kernel/modules fixed it reliably ... until I ran a memory test
and figured out what was actually happening.  It was pretty frustrating to
debug, and it reinforces the point that a bit of legwork on a badly
behaving system, to rule out the hardware faults that are easy to rule
out, can go a long way.  Which isn't to say that the problem in this
thread is hardware, but you don't want to spend two weeks tracking a
kernel bug only to find out that swapping out the memory with a seemingly
identical DIMM fixes it.
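
To illustrate why a stuck-on bit can stay hidden for so long, here is a
small simulation.  It is only a sketch: FaultyMemory and pattern_test are
hypothetical names invented for this example, and a real memory tester
works on physical RAM with far more thorough patterns.  The point is that
the fault only shows up when the data written has a 0 where the bit is
stuck at 1.

```python
class FaultyMemory:
    """Simulated RAM in which one bit of one byte is stuck on (always reads 1)."""

    def __init__(self, size, stuck_addr, stuck_bit):
        self.cells = bytearray(size)
        self.stuck_addr = stuck_addr          # hypothetical faulty location
        self.stuck_mask = 1 << stuck_bit      # the bit that always reads as 1

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        value = self.cells[addr]
        if addr == self.stuck_addr:
            value |= self.stuck_mask          # the stuck bit overrides stored data
        return value


def pattern_test(mem, size, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern to every cell, read it back, and collect mismatches."""
    faults = []
    for pattern in patterns:
        for addr in range(size):
            mem.write(addr, pattern)
        for addr in range(size):
            got = mem.read(addr)
            if got != pattern:
                # got ^ pattern isolates the bits that differ
                faults.append((addr, pattern, got, got ^ pattern))
    return faults


mem = FaultyMemory(size=1024, stuck_addr=300, stuck_bit=3)
faults = pattern_test(mem, 1024)
for addr, wrote, got, diff in faults:
    print(f"addr {addr}: wrote {wrote:#04x}, read {got:#04x}, bad bits {diff:#04x}")
```

In this simulation the 0xFF and 0xAA patterns pass, because they already
have bit 3 set, while 0x00 and 0x55 expose the fault.  That is the same
effect as a page whose contents happen to agree with the stuck bit.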

Checking ethernet cabling and link negotiation, a decent memory test run,
checking SCSI termination, checking ATA cable type, etc., make a good start
when debugging a problem that could have similar symptoms.  Oh,
and if it's your parents calling on the phone at 6:30am with a printer
problem, the first thing to ask is whether their printer is plugged in.
:-) 

Robert N M Watson




