Date:      Wed, 25 Aug 2010 08:25:34 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        freebsd-stable@freebsd.org
Cc:        Andriy Gapon <avg@icyb.net.ua>, Jeremy Chadwick <freebsd@jdc.parodius.com>, Dan Langille <dan@langille.org>
Subject:   Re: kernel MCA messages
Message-ID:  <201008250825.34903.jhb@freebsd.org>
In-Reply-To: <4C74F7FF.8000704@icyb.net.ua>
References:  <4C71CC62.6060803@langille.org> <4C74F36B.2060200@langille.org> <4C74F7FF.8000704@icyb.net.ua>

On Wednesday, August 25, 2010 7:01:19 am Andriy Gapon wrote:
> on 25/08/2010 13:41 Dan Langille said the following:
> > On 8/25/2010 3:11 AM, Andriy Gapon wrote:
> > 
> >> Have you read the decoded message?
> >> Please re-read it.
> >>
> >> I still recommend reading at least the summary of the RAM ECC research article
> >> to make your own judgment about the need to replace the DRAM.
> > 
> > Andriy: What is your interpretation of the decoded message?  What is your view on
> > replacing DRAM?  What do you conclude from the summary?
> 
> Most likely you have a small defect in one of your memory modules, perhaps a
> "stuck" bit.  You will be getting correctable ECC errors for that module.
> Eventually you might get an uncorrectable error.  That may happen soon or it may
> never happen during the lifetime of that module.
> 
> As that study has demonstrated, a significant percentage of systems and modules
> report at least one correctable ECC error.  Correctable ECC errors at present
> correlate with correctable ECC errors in the future.  They also correlate with
> uncorrectable errors in the future.  But the percentage of systems developing
> uncorrectable errors is significantly smaller, so the chances of false positives
> are substantial.
> 
> You should decide whether you want to replace the module (if you can pinpoint it)
> or all modules depending on your resources (money, etc.), the importance of the
> service that the server in question provides (allowable downtime and its cost),
> and the fault tolerance of a larger system of which the server may be a part
> (e.g. it may have a standby server for failover).
> 
> I think that most of what I've just said was kind of obvious from the start.
> The important bit from that study is that ECC errors are not as random and as rare
> as was previously thought, and they can be attributed to a number of factors like
> manufacturing defects, layout of memory lanes on the motherboard, etc.

A while back I found a slide deck from some Intel presentation that claimed
that a modern 4GB DIMM should average 18 corrected errors a month.  Your
rate is a bit higher than that, but corrected ECC errors are not that
unexpected.
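
For anyone who wants to compare their own count against that figure, here is
a rough sketch.  It assumes the 18-per-month number scales linearly with DIMM
capacity and with time, which the slide deck may or may not have intended;
the function names and the 30-day month are my own illustration, not anything
from the deck or from mcelog.

```python
# Baseline quoted above: roughly 18 corrected errors per month for a 4GB DIMM.
BASELINE_ERRORS = 18.0
BASELINE_DIMM_GB = 4.0
BASELINE_DAYS = 30.0  # assumption: treat "a month" as 30 days


def expected_corrected_errors(dimm_gb, days):
    """Expected corrected errors, assuming linear scaling (an assumption)."""
    return BASELINE_ERRORS * (dimm_gb / BASELINE_DIMM_GB) * (days / BASELINE_DAYS)


def rate_ratio(observed, dimm_gb, days):
    """How many times the baseline an observed count is (1.0 == on par)."""
    return observed / expected_corrected_errors(dimm_gb, days)


if __name__ == "__main__":
    # Made-up example count, not taken from this thread.
    print(rate_ratio(observed=27, dimm_gb=4, days=30))
```

A ratio somewhat above 1.0 is what "a bit higher than that" would look like;
it says nothing by itself about whether the module should be replaced.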

-- 
John Baldwin


