Date: Sat, 17 Jan 2015 13:46:53 -0800
From: Jeremy Chadwick
To: Matthias Apitz, Eric van Gyzen, freebsd-current@freebsd.org
Subject: Re: kernel: MCA: CPU 0 COR (1) internal parity error
Message-ID: <20150117214653.GA65337@icarus.home.lan>

On Sat, Jan 17, 2015 at 06:43:26PM +0100, Matthias Apitz wrote:
> On Friday, January 16, 2015 at 03:04:52PM -0500, Eric van Gyzen wrote:
>
> > On 01/16/2015 14:45, Matthias Apitz wrote:
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error
> >
> > Try ports/sysutils/mcelog.
>
> I have installed that port and launched it as
>
> # mcelog > mcelog.txt
> ...
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> ...
>
> (the messages are STDERR);
>
> in 'mcelog.txt' it has for the last event from /var/log/messages:
>
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0
> Jan 17 18:23:54 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error
>
> the following lines (the uptime matches):
>
> ...
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> MCE 32
> CPU 0 BANK 0 TSC 36eec80fd688 [at 1397 Mhz 0 days 12:0:41 uptime (unreliable)]
> MCG status:
> MCi status:
> Error enabled
> MCA: Unknown Error 5
> STATUS 90000040000f0005 MCGSTATUS 0
> MCGCAP c07 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 69
>
> Questions:
> a) Is the output of mcelog valid (regardless of the msg on STDERR of
> 'unsupported model')?

It may or may not be reliable. For MCE decoding to work accurately, the
software (read: kernel) needs to have full support for the processor
model and revision in question. mcelog simply tries to decode the
output that the kernel spits out and provides a more "user-friendly"
explanation. Adding support for a new CPU isn't as simple as just
modifying some table of supported CPUs; it involves reading Intel
documentation and implementing what can be figured out through that.
VMware has a small KB about this, to give you some insight into the
complexity:

http://kb.vmware.com/kb/1005184

There are some capabilities of MCA that are "semi-universal" across
series of CPUs, so sometimes those can be decoded (mostly) accurately,
but other times they cannot. Sometimes there are certain MCEs that have
to be ignored by the kernel (i.e. the kernel MCE support has to be
updated to reflect changes in MCEs for that newer model of processor).
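
For anyone who wants to sanity-check the raw status word by hand, here is a minimal sketch that splits an IA32_MCi_STATUS value into its architectural fields. The bit positions are taken from my reading of the machine-check chapter of the Intel SDM; the function name, the dictionary keys, and the (deliberately incomplete) table of simple error codes are hypothetical and only for illustration. This is not what the kernel or mcelog actually runs. The corrected-error count field is only meaningful when the CPU advertises it in MCG_CAP, which the sketch does not check.

```python
# Sketch only: field layout per my reading of the Intel SDM, not kernel code.

# A few architectural "simple" error codes (bits 15:0); intentionally partial.
SIMPLE_ERROR_CODES = {
    0x0001: "unclassified",
    0x0002: "microcode ROM parity error",
    0x0003: "external error",
    0x0004: "FRC error",
    0x0005: "internal parity error",
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its architectural fields."""

    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    code = status & 0xFFFF
    return {
        "valid": bit(63),            # VAL: register holds a logged error
        "overflow": bit(62),         # OVER: an earlier error was overwritten
        "uncorrected": bit(61),      # UC: hardware could not correct the error
        "enabled": bit(60),          # EN: reporting was enabled for this error
        "misc_valid": bit(59),       # MISCV: MCi_MISC holds extra information
        "addr_valid": bit(58),       # ADDRV: MCi_ADDR holds an address
        "context_corrupt": bit(57),  # PCC: processor context may be corrupt
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": code,                          # bits 15:0
        "description": SIMPLE_ERROR_CODES.get(code, "not in this partial table"),
    }


fields = decode_mci_status(0x90000040000F0005)
for name, value in fields.items():
    print(f"{name:20} {value}")
```

For the status value quoted above, the sketch reports a valid, corrected (not uncorrected) error with a corrected count of 1 and error code 5, which lines up with the kernel's "COR (1) internal parity error". The model-specific code (0xf here) is exactly the part that needs per-model knowledge to interpret.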

The version of mcelog available in ports is extremely old, and I
imagine the amount of work needed to upgrade it to the latest Linux
mcelog (1.08) would be quite large:

http://git.kernel.org/cgit/utils/cpu/mce/mcelog.git

The existing FreeBSD port involves a large number of patches written by
John Baldwin, and whether or not those can be correctly carried forward
to newer mcelog releases is unknown. I really need to give up my
maintainer flag on that port and let someone else take care of it.

> b) Is it worth to contact the dealer or wait until it is broken
> completely?

To me, the above message indicates that one of the CPU cores is
damaged/misbehaving. I cannot determine whether it's referring to L1,
L2, or L3 cache; I don't see any clear indicator of that (possibly for
the accuracy reasons I explained above).

However, I will point you to this thread, which may indicate that the
model of CPU in question (or a series of models of Intel CPUs) has MCEs
that are considered "normal" and are thus not being decoded correctly:

https://lists.freebsd.org/pipermail/freebsd-questions/2014-January/255873.html

I would suggest providing the relevant dmesg lines about the exact
processor in this system and possibly asking for help from either John
Baldwin or someone on freebsd-hackers@. I myself cannot help with this.

The dmesg lines I'm referring to, by the way, look like this (all of
them matter, particularly the first two):

CPU: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (2833.59-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x10677  Family = 0x6  Model = 0x17  Stepping = 7
  Features=0xbfebfbff
  Features2=0x8e3fd
  AMD Features=0x20100800
  AMD Features2=0x1
  TSC: P-state invariant, performance statistics

The OP of that freebsd-questions thread should have provided this but
didn't (instead it just says "Intel i3-4310" -- this isn't precise
enough), so whether or not you two are using the same CPU is unknown.
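
As a side note on why the "Id" value matters: the family, model, and stepping can all be recovered from it. The following is a rough sketch, assuming the usual Intel convention for the CPUID signature (extended model is folded in for family 6 and 15, extended family only for family 15). The function name is made up for this example.

```python
# Sketch only: decodes a CPUID signature (the "Id"/"ID" value in the kernel
# output) under the Intel convention as I understand it.


def decode_cpuid_signature(sig: int) -> dict:
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # Extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family

    # Extended model supplies the high nibble for families 6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model

    return {"family": family, "model": model, "stepping": stepping}


for sig in (0x10677, 0x40651):
    d = decode_cpuid_signature(sig)
    print(f"Id = {sig:#x}  Family = {d['family']:#x}  "
          f"Model = {d['model']:#x} ({d['model']})  Stepping = {d['stepping']}")
```

Applied to the ID 0x40651 from the MCA messages, this gives family 6, model 0x45, stepping 1. Hex 0x45 is decimal 69, so the "Model 45" in mcelog's STDERR warning and the "Model 69" in its decoded output appear to be the same model printed in two different bases.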

There simply could be "new MCEs" or changes to the MCA that Intel
implemented in some newer models of Core iX that aren't being handled
correctly by the kernel (i.e. misreporting or mis-decoding).

Good luck!

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |