From owner-freebsd-stable@FreeBSD.ORG Wed Aug 25 07:11:39 2010
Return-Path:
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C69731065697
	for ; Wed, 25 Aug 2010 07:11:39 +0000 (UTC)
	(envelope-from avg@icyb.net.ua)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 18EB58FC1B
	for ; Wed, 25 Aug 2010 07:11:38 +0000 (UTC)
Received: from porto.topspin.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100])
	by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id KAA23611;
	Wed, 25 Aug 2010 10:11:31 +0300 (EEST)
	(envelope-from avg@icyb.net.ua)
Received: from localhost.topspin.kiev.ua ([127.0.0.1])
	by porto.topspin.kiev.ua with esmtp (Exim 4.34 (FreeBSD))
	id 1OoA8w-000KUU-P0; Wed, 25 Aug 2010 10:11:30 +0300
Message-ID: <4C74C221.5020702@icyb.net.ua>
Date: Wed, 25 Aug 2010 10:11:29 +0300
From: Andriy Gapon
User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US; rv:1.9.2.8)
	Gecko/20100822 Lightning/1.0b2 Thunderbird/3.1.2
MIME-Version: 1.0
To: Jeremy Chadwick
References: <4C71CC62.6060803@langille.org> <4C745213.3050004@langille.org>
	<20100824233849.GA35100@icarus.home.lan>
In-Reply-To: <20100824233849.GA35100@icarus.home.lan>
X-Enigmail-Version: 1.1.2
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: freebsd-stable, Dan Langille
Subject: Re: kernel MCA messages
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 25 Aug 2010 07:11:39 -0000

on 25/08/2010 02:38 Jeremy Chadwick said the following:
> On Tue, Aug 24, 2010 at 07:13:23PM -0400, Dan Langille wrote:
>> On 8/22/2010 9:18 PM, Dan Langille wrote:
>>> What does this mean?
>>>
>>> kernel: MCA: Bank 4, Status 0x940c4001fe080813
>>> kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
>>> kernel: MCA: Vendor "AuthenticAMD", ID 0xf5a, APIC ID 0
>>> kernel: MCA: CPU 0 COR BUSLG Source RD Memory
>>> kernel: MCA: Address 0x7ff6b0
>>>
>>> FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
>>
>> FYI, these are occurring every hour, almost to the second. e.g.
>> xx:56:yy, where yy is 09, 10, or 11.
>>
>> Checking logs, I don't see anything that correlates with this point
>> in the hour (i.e 56 minutes past) that doesn't also occur at other
>> times.
>>
>> It seems very odd to occur so regularly.

I still think that everything of essence has already been said in this thread.

> 1) Why haven't you replaced the DIMM in Bank 4 -- or better yet, all

Bank 4 here is the MCA reporting bank; it has nothing to do with RAM slots.
Currently on FreeBSD we don't have a standard tool to map a physical address
to a DRAM module, but I am sure that there are some ways to do it.

> the DIMMs just to be sure? Do this and see if the problem goes
> away. If not, no harm done, and you've narrowed it down.
>
> 2) What exact manufacturer and model of motherboard is this? If
> you can provide a link to a User Manual that would be great.
>
> 3) Please go into your system BIOS and find where "ECC ChipKill"
> options are available (likely under a Memory, Chipset, or
> Northbridge section). Please write down and provide here all
> of the options and what their currently selected values are.
>
> 4) Please make sure you're running the latest system BIOS. I've seen
> on certain Rackable AMD-based systems where Northbridge-related
> features don't work quite right (at least with Solaris), resulting
> in atrocious memory performance on the system. A BIOS upgrade
> solved the problem.
>
> There's a ChipKill feature called "ECC BG Scrubbing" that's vague in
> definition, given that it's a "background memory scrub" that happens at
> intervals which are unknown to me. Maybe 60 minutes? I don't know.
> This is why I ask question #3.
>
> For John and other devs: I assume the decoded MCA messages indicate with
> absolute certainty that the ECC error is coming from external DRAM and
> not, say, bad L1 or L2 cache?

Have you read the decoded message?
Please re-read it.

I still recommend reading at least the summary of the RAM ECC research article
to make your own judgment about the need to replace DRAM.

-- 
Andriy Gapon
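
For anyone who wants to re-read the decoded message against the raw value, here is a minimal sketch of how the status word quoted above can be split up. It assumes the architectural x86 MCi_STATUS layout (flag bits 63..57, MCA error code in the low 16 bits) and the compound "bus/interconnect" error code format 0000 1PPT RRRR IILL. The function name decode_status and the table names are made up for this illustration; this is not the kernel's code, and it only handles the bus/interconnect case.

```python
# Architectural flag bits of an x86 MCi_STATUS register (bit number -> name).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Field encodings of the bus/interconnect error code 0000 1PPT RRRR IILL.
LEVEL = ["L0", "L1", "L2", "LG"]                          # LL, bits 1:0
TARGET = ["Memory", "???", "I/O", "Other"]                # II, bits 3:2
REQUEST = ["ERR", "RD", "WR", "DRD", "DWR", "IRD",
           "PREFETCH", "EVICT", "SNOOP"]                  # RRRR, bits 7:4
PARTICIPATION = ["Source", "Responder", "Observer", "???"]  # PP, bits 10:9


def decode_status(status):
    """Decode flag bits and a bus/interconnect error code (hypothetical helper)."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "corrected": "UC" not in flags, "summary": None}

    # Only the bus/interconnect pattern is handled in this sketch.
    if (code & 0xF800) == 0x0800:
        req = (code >> 4) & 0xF
        parts = [
            "BUS" + LEVEL[code & 0x3],
            PARTICIPATION[(code >> 9) & 0x3],
            REQUEST[req] if req < len(REQUEST) else "???",
            TARGET[(code >> 2) & 0x3],
        ]
        if code & 0x0100:          # T bit: the request timed out
            parts.append("timed out")
        result["summary"] = " ".join(parts)
    return result


if __name__ == "__main__":
    print(decode_status(0x940C4001FE080813))
```

Under those assumptions, the value from the log has VAL, EN and ADDRV set and UC clear (a corrected error), and its error code reads as a generic-level bus error on a memory read, which is what the "COR BUSLG Source RD Memory" line says. Nothing in that code names a cache level or a DIMM slot; the bank number only identifies which MCA reporting bank logged the event.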