Date:      Thu, 7 Feb 2008 17:21:21 +0100
From:      Bernd Walter <ticso@cicely12.cicely.de>
To:        ticso@cicely.de, freebsd-alpha@freebsd.org
Subject:   Re: DS10L - "processor correctable error"
Message-ID:  <20080207162120.GG24583@cicely12.cicely.de>
In-Reply-To: <20080207154024.GA9605@mech-aslap33.men.bris.ac.uk>
References:  <20080206121738.GA91825@mech-aslap33.men.bris.ac.uk> <20080207145311.GF24583@cicely12.cicely.de> <20080207154024.GA9605@mech-aslap33.men.bris.ac.uk>

On Thu, Feb 07, 2008 at 03:40:24PM +0000, Anton Shterenlikht wrote:
> On Thu, Feb 07, 2008 at 03:53:12PM +0100, Bernd Walter wrote:
> > On Wed, Feb 06, 2008 at 12:17:38PM +0000, Anton Shterenlikht wrote:
> > >
> > > "Warning: received processor correctable error."
> > > 
> > > What is the meaning of this warning? Something wrong with hardware?
> > 
> > This is an ECC memory correction.
> > It is OK to see it once in a while, since even 100% working DRAM has
> > failures from time to time (called the soft error rate) - hence the
> > need to have ECC in important systems.
> > If, however, you see a lot of them, it is time to replace the faulty
> > memory.
> 
> Bernd, thank you.
> Can I know which DIMM (DS10L has 2 DIMMs) is faulty?

Unfortunately not.
IIRC Tru64 and VMS can report which DIMM is affected, but we never had
enough information to implement this, and it is board-specific as well.

> If I run SRM memexer I get:
> 
> >>>show_status
>  ID       Program      Device       Pass  Hard/Soft Bytes Written  Bytes Read
> -------- ------------ ------------ ------ --------- ------------- -------------
> 00000001         idle system            0    0    0             0             0
> 000003ab      memtest memory            6    0    0    5586812928    5586812928
> >>>
> Processor correctable error through vector 630.
> 
> Machine Check Logout Frame @ 0x6000 Code = 0x86
> 
> Alpha 21264 IPRs (CPU 0):
> I_STAT:         0000000000000000    DC_STAT:        000000000000000C
> C_ADDR:         00000000296287C0    DC1_SYNDROME:   0000000000000000
> DC0_SYNDROME:   000000000000008F    C_STAT:         0000000000000003
> C_STS:          000000000000000A    MM_STAT:        0000000000000000
> 
> >>>
> 
> The message appears approx. once every other pass.
> The address is always the same.

Don't worry too much about this.
Alphas use the memory in pairs and can correct multiple faulty
bits in a single data word.
However, you could try removing and reseating the modules, since a
contact can go bad after that many years.
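
If you want to confirm that the corrected errors really keep hitting the
same address (which would suggest one weak cell or contact rather than
random soft errors), you could capture the console output over the serial
line and tally the C_ADDR values. A minimal sketch in Python, assuming the
log contains logout frames in exactly the format you quoted; the names
parse_frames and address_histogram are made up for illustration and are
not part of SRM or FreeBSD, and the script does not try to map an address
to a DIMM:

```python
import re
from collections import Counter

# Matches "NAME: <16 hex digits>" pairs as printed in the logout frame.
FIELD_RE = re.compile(r"\b([A-Z][A-Z0-9_]*):\s+([0-9A-Fa-f]{16})\b")

# One frame copied from the console output quoted in this thread.
SAMPLE = """\
Machine Check Logout Frame @ 0x6000 Code = 0x86

Alpha 21264 IPRs (CPU 0):
I_STAT:         0000000000000000    DC_STAT:        000000000000000C
C_ADDR:         00000000296287C0    DC1_SYNDROME:   0000000000000000
DC0_SYNDROME:   000000000000008F    C_STAT:         0000000000000003
C_STS:          000000000000000A    MM_STAT:        0000000000000000
"""


def parse_frames(log_text):
    """Return one dict (register name -> integer value) per logout frame."""
    frames = []
    current = None
    for line in log_text.splitlines():
        if "Machine Check Logout Frame" in line:
            # A new frame starts; registers on later lines belong to it.
            current = {}
            frames.append(current)
        elif current is not None:
            for name, value in FIELD_RE.findall(line):
                current[name] = int(value, 16)
    return frames


def address_histogram(frames):
    """Count how often each C_ADDR value appears across all frames."""
    return Counter(f["C_ADDR"] for f in frames if "C_ADDR" in f)


if __name__ == "__main__":
    for addr, count in address_histogram(parse_frames(SAMPLE)).most_common():
        print(f"{addr:#018x}  seen {count} time(s)")
```

A single address dominating the tally over many memtest passes fits the
picture of one marginal location; many different addresses would look more
like ordinary soft errors.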

-- 
B.Walter                http://www.bwct.de      http://www.fizon.de
bernd@bwct.de           info@bwct.de            support@fizon.de


