Date:      Tue, 2 Apr 2002 02:23:04 +0200
From:      Bernd Walter <ticso@cicely8.cicely.de>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        Christian Weisgerber <naddy@mips.inka.de>, freebsd-alpha@FreeBSD.ORG
Subject:   Re: Source of "processor correctable error"?
Message-ID:  <20020402002303.GH41357@cicely8.cicely.de>
In-Reply-To: <3CA8EADE.C11C8DF7@mindspring.com>
References:  <a89rrl$2vek$1@kemoauc.mips.inka.de> <3CA8EADE.C11C8DF7@mindspring.com>

On Mon, Apr 01, 2002 at 03:18:54PM -0800, Terry Lambert wrote:
> Christian Weisgerber wrote:
> > 
> > Since the weekend my PC164 has taken to almost continuously spewing
> > gobs of
> > 
> > Warning: received processor correctable error.
> > 
> > In fact I first noticed this because writing the error messages to
> > the serial console took so much time the machine became sluggish.
> > I've switched to a graphics console now.
> > 
> > Anyway, is there a way to narrow down the source of the underlying
> > hardware problem?  What are the candidates anyway?  On-chip cache,
> > off-chip cache, main memory?
> 
> FWIW, if they are correctable, it's complaining about memory
> errors which are correctable using the ECC bits, in the use of
> ECC memory.

Right - but it doesn't have to be main memory, as the D, I and B caches
and some data paths are ECC protected too.
But I have never seen a message about cache failures, so those might
look different.

> There are generally three causes of this problem which I have
> seen in nature:
> 
> 1)	Thermal cooling of the system is insufficient, which
> 	introduces thermal related errors (fix: better cooling).

Possible, but my first guess would be a bad SIMM or a bad contact.

> 2)	The memory was being clocked faster than the speed it
> 	was rated to run at (fix: clock it slower or buy more
> 	expensive memory).

Unlikely, as overclocked memory tends to produce multi-bit errors in my
experience.
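
To illustrate why the single-bit versus multi-bit distinction matters, here
is a minimal sketch of a SECDED (single error correct, double error detect)
code: an extended Hamming code over an 8-bit word. It is a toy and only an
assumption about the general technique; the actual ECC layout used by the
PC164 chipset and CPU is not shown here, and the names `encode` and `check`
are made up for the example. A single flipped bit comes back as "corrected"
(the kind of event behind a correctable error report), while two flipped
bits can only be detected.

```python
def encode(data_bits):
    """Encode a list of 0/1 data bits into an extended Hamming codeword.

    Index 0 holds the overall parity bit; indices that are powers of two
    hold the Hamming parity bits; all other indices hold data bits.
    """
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:  # number of Hamming parity bits needed
        r += 1
    n = m + r
    code = [0] * (n + 1)
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two: data position
            code[pos] = next(it)
    # Choose parity bits so the XOR of the positions of all set bits is 0.
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    for i in range(r):
        if syndrome & (1 << i):
            code[1 << i] = 1
    code[0] = sum(code) % 2  # make the total number of set bits even
    return code


def check(code):
    """Return (status, codeword); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos
    odd = sum(code) % 2
    if syndrome == 0 and not odd:
        return "ok", code
    if odd and syndrome < len(code):
        # Odd number of flips: treat as a single-bit error at `syndrome`
        # (syndrome 0 means the overall parity bit itself was hit).
        fixed = list(code)
        fixed[syndrome] ^= 1
        return "corrected", fixed
    # Even number of flips with a nonzero syndrome: detected, not fixable.
    return "uncorrectable", code


word = [1, 0, 1, 1, 0, 0, 1, 0]
good = encode(word)

one_flip = list(good)
one_flip[5] ^= 1
two_flips = list(good)
two_flips[5] ^= 1
two_flips[9] ^= 1

print(check(good)[0])
print(check(one_flip)[0])
print(check(two_flips)[0])
```

With a code like this, marginal timing that flips several bits in one word
would show up as uncorrectable errors rather than a stream of correctable
ones.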

> 3)	The "ECC" memory was fake ECC instead of real ECC, so
> 	the correction codes were incorrect, either as a result
> 	of a cheap vendor ripping a buyer off, or a cheap buyer
> 	not jumpering the system to not use ECC... or the system
> 	not having the option to be jumpered that way (fix: use
> 	real ECC memory, and not forgeries).

Believe me - these boards won't even let you boot with that
kind of RAM; you never get a chance to get past SRM
because of all those error messages.
I can speak from experience here, as I had an unsoldered pin
on a SIMM which my stupid x86 box silently corrected for years...

> It's always possible that you have bad RAM, or that the PCI
> bus-on time is set too high in the PCI chipset for the amount
> of RAM in the system, such that the DRAM refresh is delayed
> enough under load that your memory starts losing bits, etc.

I doubt that this is a refresh problem, as the chipset has
well-designed data paths.

> But while there are other possibilities, I have never seen them
> personally in nature (with ECC; I've seen the DRAM refresh
> starvation with an improperly BIOS programmed Cyrix Media GX
> chipset [5532?]).

Phew that's bad.

-- 
B.Walter              COSMO-Project         http://www.cosmo-project.de
ticso@cicely.de         Usergroup           info@cosmo-project.de

