Date:      Fri, 23 Oct 2015 12:37:31 +0100
From:      Bob Bishop <rb@gid.co.uk>
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-hackers@freebsd.org, Dieter BSD <dieterbsd@gmail.com>, freebsd-hardware@freebsd.org
Subject:   Re: ECC support
Message-ID:  <97482413-D2AA-4C32-AEFF-EB65D5D8542B@gid.co.uk>
In-Reply-To: <1483396.WZc3qgD2yz@ralph.baldwin.cx>
References:  <CAA3ZYrDjTNM7AShdpFOjT-3wZnEV2u-2X6MnLksON61bw7=XiQ@mail.gmail.com> <1492434.22kxSKhHEJ@ralph.baldwin.cx> <74705089-408A-4FD3-899E-CA677390F855@gid.co.uk> <1483396.WZc3qgD2yz@ralph.baldwin.cx>

Hi,

> On 22 Oct 2015, at 22:17, John Baldwin <jhb@freebsd.org> wrote:
>
> On Thursday, October 22, 2015 07:49:13 PM Bob Bishop wrote:
>> Hi,
>>
>>> On 22 Oct 2015, at 19:09, John Baldwin <jhb@freebsd.org> wrote:
>>>
>>> On Wednesday, September 16, 2015 10:56:52 AM Dieter BSD wrote:
>>>> Chris:
>>>>> MCA: Bank 1, Status 0x9400000000000151
>>>>> MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
>>>>> MCA: Vendor "AuthenticAMD", ID 0x100f52, APIC ID 2
>>>>>
>>>>> MCA: Address 0x81cc0e9f0
>>>>>
>>>>> Kind of freaky. I've never had this error on this board before.
>>>>> On others tho.
>>>>>
>>>>> Try a search for MCA instead.
>>>>
>>>> Is there a decoder ring for those messages?  I don't recall seeing
>>>> messages like that, although I wasn't looking for them, and they
>>>> don't leap out at you screaming ERROR! ERROR!  Digital Unix had its
>>>> problems, but at least the error messages were fairly clear.
>>>> Something like "single bit memory error at address 0x12345..."
>>>> A simple edit to sys/x86/x86/mca.c
>>>>  s/printf("UNCOR ");/printf("Uncorrectable ");/
>>>>  s/printf("COR ");/printf("Correctable ");/
>>>> would make the messages at least slightly more meaningful to a viewer
>>>> who isn't intimately familiar with the MCA.  Which most people aren't.
>>>
>>> The problem is that there are other fields to decode and you can only fit so
>>> much in one line.  Also, there is not a CPU-independent way to know the
>>> address of an ECC error. [etc]
>>
>> On server-class hardware, the platform management (BMC or whatever)
>> is probably decoding this stuff for event logs and can be
>> interrogated via IPMI (or whatever).
>
> Not always well and not always with side effects you want.  On Core 2 and
> Nehalem i7 class hardware I measured that it took on the order of 400
> milliseconds (not micro) in SMM (system management mode, so your entire
> OS is halted) to write out each log entry to NVRAM.  At least one place I
> worked at turned the BIOS ECC logging off because that delay was too costly.
>
> Also, even though your BMC may log it, the format for doing so isn't
> standard.  The details such as the affected DIMM are in the OEM bits of
> the log record, so not something you can easily extract from, say,
> ipmitool sel elist.  You'd have to log into the BIOS itself (or the BMC's
> web UI) to see which DIMM is affected.  Neither of those are really great
> for automated reporting.

All agreed. I was just flagging up the existence of another possible
channel to get at ECC logging.
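
On the "decoder ring" question quoted above: the top flag bits of the
bank status word are architecturally defined on x86, so a rough
userland decoder is possible. What follows is a minimal sketch under
that assumption, not what sys/x86/x86/mca.c does. The names
decode_mca_status and summarize are hypothetical, and the meaning of
the two code fields is CPU-specific, so they are only printed raw here.

```python
# Flag bits in the upper part of an x86 machine-check bank status word.
# Bit positions are architectural; the code fields below them are not
# interpreted here because their meaning depends on the CPU.
STATUS_FLAGS = (
    (63, "VAL"),    # status word holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # reporting of this error was enabled
    (59, "MISCV"),  # the bank's MISC register is valid
    (58, "ADDRV"),  # the bank's ADDR register is valid
    (57, "PCC"),    # processor context may be corrupt
)


def decode_mca_status(status):
    """Split a 64-bit bank status value into flags and raw code fields."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "severity": "Uncorrectable" if "UC" in flags else "Correctable",
        "error_code": status & 0xFFFF,             # bits 0-15
        "extended_code": (status >> 16) & 0xFFFF,  # bits 16-31, CPU-specific
    }


def summarize(bank, status):
    """Render one bank status as a single human-readable line."""
    d = decode_mca_status(status)
    return "MCA bank %d: %s error, code 0x%04x, flags %s" % (
        bank, d["severity"], d["error_code"], ",".join(d["flags"]))


if __name__ == "__main__":
    # Status value taken from the log lines quoted earlier in the thread.
    print(summarize(1, 0x9400000000000151))
```

For the status quoted in the thread this reports a correctable error
with ADDRV set, which is consistent with the separate "MCA: Address"
line in that log. Mapping the address to a DIMM is still
platform-specific, as John says.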

> --
> John Baldwin

--
Bob Bishop
rb@gid.co.uk