Date: Thu, 22 Oct 2015 11:09:50 -0700
From: John Baldwin <jhb@freebsd.org>
To: freebsd-hardware@freebsd.org
Cc: Dieter BSD <dieterbsd@gmail.com>, freebsd-hackers@freebsd.org
Subject: Re: ECC support
Message-ID: <1492434.22kxSKhHEJ@ralph.baldwin.cx>
In-Reply-To: <CAA3ZYrDjTNM7AShdpFOjT-3wZnEV2u-2X6MnLksON61bw7=XiQ@mail.gmail.com>
References: <CAA3ZYrDjTNM7AShdpFOjT-3wZnEV2u-2X6MnLksON61bw7=XiQ@mail.gmail.com>
On Wednesday, September 16, 2015 10:56:52 AM Dieter BSD wrote:
> Chris:
> > MCA: Bank 1, Status 0x9400000000000151
> > MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
> > MCA: Vendor "AuthenticAMD", ID 0x100f52, APIC ID 2
> >
> > MCA: Address 0x81cc0e9f0
> >
> > Kind of freaky. I've never had this error on this board before.
> > On others tho.
> >
> > Try a search for MCA instead.
>
> Is there a decoder ring for those messages? I don't recall seeing
> messages like that, although I wasn't looking for them, and they
> don't leap out at you screaming ERROR! ERROR! Digital Unix had its
> problems, but at least the error messages were fairly clear.
> Something like "single bit memory error at address 0x12345..."
> A simple edit to sys/x86/x86/mca.c
> s/printf("UNCOR ");/printf("Uncorrectable ");/
> s/printf("COR ");/printf("Correctable ");/
> would make the messages at least slightly more meaningful to a viewer
> who isn't intimently(sp) familiar with the mca. Which most people aren't.

The problem is that there are other fields to decode and you can only
fit so much in one line. Also, there is not a CPU-independent way to
know the address of an ECC error. On Intel Core i3/5/7 (anything with
QPI) you can identify the individual DIMM at least, but the label that
the motherboard manufacturer uses varies by manufacturer. (You can
maybe scrape that text from the SMBIOS tables, but only if they aren't
wrong which they sometimes are, and good luck knowing if they are wrong
or right.) Digital UNIX had the luxury of running on hardware built by
the same company, not on a random assortment of boards built by various
vendors. FreeBSD does not.

sysutils/mcelog does some more verbose decoding of MCA records, but I
find it to be equally gibberish for anyone not intimately familiar with
a specific CPU.
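As a rough illustration of the CPU-independent part of such a decoder ring, the sketch below reads only the architectural flag bits in the top of the bank status word (valid, overflow, uncorrected, address valid, processor context corrupt) plus the low 16-bit MCA error code. The function name mca_describe and the wording of its output are hypothetical and are not taken from mca.c or mcelog. It deliberately leaves the error code undecoded, since that part is vendor- and model-specific.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Architectural flag bits in an MCi_STATUS register value. */
#define MC_STATUS_VAL   (1ULL << 63)    /* record is valid */
#define MC_STATUS_OVER  (1ULL << 62)    /* an earlier error was overwritten */
#define MC_STATUS_UC    (1ULL << 61)    /* error was not corrected */
#define MC_STATUS_ADDRV (1ULL << 58)    /* address register is valid */
#define MC_STATUS_PCC   (1ULL << 57)    /* processor context corrupt */

/*
 * Hypothetical helper: write a one-line summary of a bank status word
 * into buf.  Returns the snprintf() result (length of the full summary).
 */
int
mca_describe(int bank, uint64_t status, uint64_t addr, char *buf, size_t len)
{
    char addrbuf[32] = "";

    assert(buf != NULL && len > 0);

    if ((status & MC_STATUS_VAL) == 0)
        return (snprintf(buf, len, "bank %d: no valid error recorded",
            bank));

    /* Only report the address when the hardware says it is meaningful. */
    if (status & MC_STATUS_ADDRV)
        snprintf(addrbuf, sizeof(addrbuf), ", address 0x%" PRIx64, addr);

    return (snprintf(buf, len, "bank %d: %s error, MCA code 0x%04x%s%s%s",
        bank,
        (status & MC_STATUS_UC) ? "uncorrectable" : "correctable",
        (unsigned)(status & 0xffff),   /* vendor-specific, left undecoded */
        addrbuf,
        (status & MC_STATUS_OVER) ? ", earlier error overwritten" : "",
        (status & MC_STATUS_PCC) ? ", processor context corrupt" : ""));
}
```

Calling mca_describe(1, 0x9400000000000151ULL, 0x81cc0e9f0ULL, buf, sizeof(buf)) on the record quoted above yields "bank 1: correctable error, MCA code 0x0151, address 0x81cc0e9f0". Turning 0x0151 into something a person can act on still requires the per-CPU tables.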
I wrote a tool for a previous employer that was able to do some simple
parsing of MCA errors for Supermicro X7-X10 boards (Intel CPUs) and give
a short summary that was used in a nagios check. However, it only
handles a narrow set of systems.

https://github.com/freebsd/freebsd/compare/master...bsdjhb:ecc

-- 
John Baldwin