Date:      Wed, 25 Aug 2010 10:11:29 +0300
From:      Andriy Gapon <avg@icyb.net.ua>
To:        Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc:        freebsd-stable <freebsd-stable@freebsd.org>, Dan Langille <dan@langille.org>
Subject:   Re: kernel MCA messages
Message-ID:  <4C74C221.5020702@icyb.net.ua>
In-Reply-To: <20100824233849.GA35100@icarus.home.lan>
References:  <4C71CC62.6060803@langille.org> <4C745213.3050004@langille.org> <20100824233849.GA35100@icarus.home.lan>

on 25/08/2010 02:38 Jeremy Chadwick said the following:
> On Tue, Aug 24, 2010 at 07:13:23PM -0400, Dan Langille wrote:
>> On 8/22/2010 9:18 PM, Dan Langille wrote:
>>> What does this mean?
>>>
>>> kernel: MCA: Bank 4, Status 0x940c4001fe080813
>>> kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
>>> kernel: MCA: Vendor "AuthenticAMD", ID 0xf5a, APIC ID 0
>>> kernel: MCA: CPU 0 COR BUSLG Source RD Memory
>>> kernel: MCA: Address 0x7ff6b0
>>>
>>> FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
>>
>> FYI, these are occurring every hour, almost to the second. e.g.
>> xx:56:yy, where yy is 09, 10, or 11.
>>
>> Checking logs, I don't see anything that correlates with this point
>> in the hour (i.e 56 minutes past) that doesn't also occur at other
>> times.
>>
>> It seems very odd to occur so regularly.

I still think that everything of essence has already been said in this thread.

> 1) Why haven't you replaced the DIMM in Bank 4 -- or better yet, all

Bank 4 here is an MCA reporting bank; it has nothing to do with RAM slots.
FreeBSD currently has no standard tool for mapping a physical address to a
DRAM module, although I am sure there are ways to do it.

>    the DIMMs just to be sure?  Do this and see if the problem goes
>    away.  If not, no harm done, and you've narrowed it down.
> 
> 2) What exact manufacturer and model of motherboard is this?  If
>    you can provide a link to a User Manual that would be great.
> 
> 3) Please go into your system BIOS and find where "ECC ChipKill"
>    options are available (likely under a Memory, Chipset, or
>    Northbridge section).  Please write down and provide here all
>    of the options and what their currently selected values are.
> 
> 4) Please make sure you're running the latest system BIOS.  I've seen
>    on certain Rackable AMD-based systems where Northbridge-related
>    features don't work quite right (at least with Solaris), resulting
>    in atrocious memory performance on the system.  A BIOS upgrade
>    solved the problem.
> 
> There's a ChipKill feature called "ECC BG Scrubbing" that's vague in
> definition, given that it's a "background memory scrub" that happens at
> intervals which are unknown to me.  Maybe 60 minutes?  I don't know.
> This is why I ask question #3.
> 
> For John and other devs: I assume the decoded MCA messages indicate with
> absolute certainty that the ECC error is coming from external DRAM and
> not, say, bad L1 or L2 cache?

Have you read the decoded message?
Please re-read it.
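
For reference, the decoded line "COR BUSLG Source RD Memory" can be reproduced
by hand from the Status value. Below is a minimal sketch in Python, assuming
the architectural x86 MCi_STATUS layout (flag bits 63..57, MCA error code in
bits 15..0) and the bus/interconnect error code format 0000 1PPT RRRR IILL.
The function and table names are hypothetical, this is not the kernel's code,
and vendor-specific bits are not decoded.

```python
# Illustrative decoder for an x86 MCi_STATUS value (names are hypothetical).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}
LEVELS = ["L0", "L1", "L2", "LG"]                       # LL field
PARTICIPATION = ["Source", "Responder", "Observer", "Generic"]  # PP field
REQUESTS = ["ERR", "RD", "WR", "DRD", "DWR", "IRD",
            "PREFETCH", "EVICT", "SNOOP"]               # RRRR field
TARGETS = ["Memory", "Reserved", "I/O", "Other"]        # II field


def decode_status(status):
    """Split a 64-bit status word into flags and, if present, bus error fields."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "corrected": "UC" not in flags,
              "error_code": code, "bus": None}
    # Assumption: a bus/interconnect error has bits 15..11 equal to 00001.
    if (code & 0xF800) == 0x0800:
        req = (code >> 4) & 0xF
        result["bus"] = {
            "level": LEVELS[code & 0x3],
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "request": REQUESTS[req] if req < len(REQUESTS) else "???",
            "target": TARGETS[(code >> 2) & 0x3],
            "timed_out": bool(code & 0x0100),
        }
    return result


def describe(status):
    """Build a one-line summary in the style of the kernel message."""
    d = decode_status(status)
    parts = ["COR" if d["corrected"] else "UNCOR"]
    bus = d["bus"]
    if bus is None:
        parts.append("code 0x%04x" % d["error_code"])
    else:
        parts += ["BUS" + bus["level"], bus["participation"],
                  bus["request"], bus["target"]]
        if bus["timed_out"]:
            parts.append("timed out")
    return " ".join(parts)


print(describe(0x940c4001fe080813))  # COR BUSLG Source RD Memory
```

For the value from the log, the flags are VAL, EN and ADDRV with UC clear, so
this is a corrected error, and the low 16 bits (0x0813) decode as a bus error
at generic level for a read that this CPU sourced against memory.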

I still recommend reading at least the summary of the RAM ECC research article
to make your own judgment about the need to replace the DRAM.

-- 
Andriy Gapon


