Date:      Thu, 22 Oct 2015 14:17:07 -0700
From:      John Baldwin <jhb@freebsd.org>
To:        Bob Bishop <rb@gid.co.uk>
Cc:        freebsd-hardware@freebsd.org, freebsd-hackers@freebsd.org, Dieter BSD <dieterbsd@gmail.com>
Subject:   Re: ECC support
Message-ID:  <1483396.WZc3qgD2yz@ralph.baldwin.cx>
In-Reply-To: <74705089-408A-4FD3-899E-CA677390F855@gid.co.uk>
References:  <CAA3ZYrDjTNM7AShdpFOjT-3wZnEV2u-2X6MnLksON61bw7=XiQ@mail.gmail.com> <1492434.22kxSKhHEJ@ralph.baldwin.cx> <74705089-408A-4FD3-899E-CA677390F855@gid.co.uk>

On Thursday, October 22, 2015 07:49:13 PM Bob Bishop wrote:
> Hi,
> 
> > On 22 Oct 2015, at 19:09, John Baldwin <jhb@freebsd.org> wrote:
> > 
> > On Wednesday, September 16, 2015 10:56:52 AM Dieter BSD wrote:
> >> Chris:
> >>> MCA: Bank 1, Status 0x9400000000000151
> >>> MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
> >>> MCA: Vendor "AuthenticAMD", ID 0x100f52, APIC ID 2
> >>> 
> >>> MCA: Address 0x81cc0e9f0
> >>> 
> >>> Kind of freaky. I've never had this error on this board before.
> >>> On others tho.
> >>> 
> >>> Try a search for MCA instead.
> >> 
> >> Is there a decoder ring for those messages?  I don't recall seeing
> >> messages like that, although I wasn't looking for them, and they
> >> don't leap out at you screaming ERROR! ERROR!  Digital Unix had its
> >> problems, but at least the error messages were fairly clear.
> >> Something like "single bit memory error at address 0x12345..."
> >> A simple edit to sys/x86/x86/mca.c
> >>   s/printf("UNCOR ");/printf("Uncorrectable ");/
> >>   s/printf("COR ");/printf("Correctable ");/
> >> would make the messages at least slightly more meaningful to a viewer
> >> who isn't intimately familiar with MCA.  Which most people aren't.
> > 
> > The problem is that there are other fields to decode and you can only fit so
> > much in one line.  Also, there is not a CPU-independent way to know the
> > address of an ECC error. [etc]
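
For reference, the CPU-independent part of a status word can be decoded
mechanically.  A minimal sketch in Python (not a copy of what
sys/x86/x86/mca.c does), assuming only the architectural MCi_STATUS flag
bits that Intel and AMD share.  The function name and the layout of the
returned dict are made up for illustration, and the two 16-bit code fields
are returned as raw numbers because interpreting them depends on the
vendor and CPU model:

```python
# Architectural flag bits in the upper part of an MCi_STATUS register.
MCI_STATUS_FLAGS = [
    (63, "VAL"),    # register contains a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # reporting was enabled for this error
    (59, "MISCV"),  # MCi_MISC holds additional information
    (58, "ADDRV"),  # MCi_ADDR holds a valid address
    (57, "PCC"),    # processor context may be corrupt
]


def decode_mci_status(status):
    """Split a 64-bit status word into flags and raw code fields.

    Hypothetical helper: only the generic flag bits are named; the
    code fields are left undecoded on purpose.
    """
    flags = [name for bit, name in MCI_STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "corrected": "VAL" in flags and "UC" not in flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # Status word from the report quoted earlier in this thread.
    info = decode_mci_status(0x9400000000000151)
    print(info["flags"], "corrected" if info["corrected"] else "uncorrected",
          hex(info["mca_error_code"]))
```

For the status quoted above, 0x9400000000000151, this gives VAL, EN and
ADDRV with UC clear, i.e. a corrected error with a valid address, and an
error code of 0x151.  Turning that code (or the address) into a DIMM is
the part that is not CPU-independent.
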
> 
> On server-class hardware, the platform management (BMC or whatever) is probably decoding this stuff for event logs and can be interrogated via IPMI (or whatever).

Not always well, and not always without unwanted side effects.  On Core 2
and Nehalem i7 class hardware I measured on the order of 400 milliseconds
(milli, not micro) spent in SMM (System Management Mode, during which your
entire OS is halted) to write each log entry to NVRAM.  At least one place
I worked turned the BIOS ECC logging off because that delay was too costly.

Also, even though your BMC may log it, the format for doing so isn't
standard.  Details such as the affected DIMM are in the OEM bits of the
log record, so they are not something you can easily extract from, say,
ipmitool sel elist.  You'd have to go into the BIOS itself (or the BMC's
web UI) to see which DIMM is affected.  Neither of those is really great
for automated reporting.
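
To illustrate what automated reporting can and cannot get from the SEL,
here is a rough Python sketch that filters ECC events out of saved
"ipmitool sel elist" output.  It assumes the usual pipe-separated column
layout (record id, date, time, sensor, event, direction); the sample lines
are invented for the example rather than captured from real hardware, and
the function names are hypothetical:

```python
# Invented sample in the assumed "ipmitool sel elist" column layout.
SAMPLE = """\
   1 | 10/22/2015 | 14:02:11 | Memory #0x02 | Correctable ECC | Asserted
   2 | 10/22/2015 | 14:05:40 | Fan #0x30 | Lower Critical going low | Asserted
   3 | 10/22/2015 | 14:17:07 | Memory #0x02 | Uncorrectable ECC | Asserted
"""


def parse_sel_elist(text):
    """Parse pipe-separated SEL lines into dicts; skip anything else."""
    records = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue  # not a record line in the assumed layout
        records.append({
            "id": fields[0],
            "date": fields[1],
            "time": fields[2],
            "sensor": fields[3],
            "event": fields[4],
        })
    return records


def ecc_events(records):
    """Keep only records whose event text mentions ECC."""
    return [r for r in records if "ecc" in r["event"].lower()]


if __name__ == "__main__":
    for rec in ecc_events(parse_sel_elist(SAMPLE)):
        print(rec["date"], rec["time"], rec["sensor"], rec["event"])
```

This gets you a timestamp and a generic sensor name per event, which is
enough to count errors, but nothing in these columns names the DIMM.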

-- 
John Baldwin


