Date:      Tue, 24 Aug 2010 08:14:29 +0200
From:      "Ronald Klop" <ronald-freebsd8@klop.yi.org>
To:        freebsd-stable@freebsd.org
Subject:   Re: kernel MCA messages
Message-ID:  <op.vhxiafp38527sy@212-123-145-58.ip.telfort.nl>
In-Reply-To: <201008230820.35260.jhb@freebsd.org>
References:  <4C71CC62.6060803@langille.org> <4C71D756.5080205@langille.org> <4C7218D6.6090408@icyb.net.ua> <201008230820.35260.jhb@freebsd.org>

On Mon, 23 Aug 2010 14:20:35 +0200, John Baldwin <jhb@freebsd.org> wrote:

> On Monday, August 23, 2010 2:44:38 am Andriy Gapon wrote:
>> on 23/08/2010 05:05 Dan Langille said the following:
>> > On 8/22/2010 9:18 PM, Dan Langille wrote:
>> >> What does this mean?
>> >>
>> >> kernel: MCA: Bank 4, Status 0x940c4001fe080813
>> >> kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
>> >> kernel: MCA: Vendor "AuthenticAMD", ID 0xf5a, APIC ID 0
>> >> kernel: MCA: CPU 0 COR BUSLG Source RD Memory
>> >> kernel: MCA: Address 0x7ff6b0
>> >>
>> >> FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
>> >
>> > And another one:
>> >
>> > kernel: MCA: Bank 4, Status 0x9459c0014a080813
>> > kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
>> > kernel: MCA: Vendor "AuthenticAMD", ID 0xf5a, APIC ID 0
>> > kernel: MCA: CPU 0 COR BUSLG Source RD Memory
>> > kernel: MCA: Address 0x7ff670
>>
>> I believe that you get correctable RAM ECC errors, but not entirely sure.
>> There is mcelog utility that decodes such messages into human-friendly
>> descriptions.
>> The utility is available on Linux-based systems.
>> John Baldwin has a port of it to FreeBSD, but it seems to be WIP and is
>> private so far.  Wait and watch John posting decoded text in this thread :-)
>
> It is not private, it is in //depot/projects/mcelog/... in p4.  It is not a
> complete port yet though (doesn't support the daemon and client modes for
> example).
>
> Details for these errors:
>
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 0 4 northbridge
> ADDR 7ff6b0
>   Northbridge RAM Chipkill ECC error
>   Chipkill ECC syndrome = fe18
>        bit32 = err cpu0
>        bit46 = corrected ecc error
>   bus error 'local node origin, request didn't time out
>              generic read mem transaction
>              memory access, level generic'
> STATUS 940c4001fe080813 MCGSTATUS 0
> MCGCAP 105 APICID 0 SOCKETID 0
> CPUID Vendor AMD Family 15 Model 5
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 0 4 northbridge
> ADDR 7ff670
>   Northbridge RAM Chipkill ECC error
>   Chipkill ECC syndrome = 4ab3
>        bit32 = err cpu0
>        bit46 = corrected ecc error
>   bus error 'local node origin, request didn't time out
>              generic read mem transaction
>              memory access, level generic'
> STATUS 9459c0014a080813 MCGSTATUS 0
> MCGCAP 105 APICID 0 SOCKETID 0
> CPUID Vendor AMD Family 15 Model 5
>
> As Andriy guessed, I believe both of these are corrected ECC errors.  You
> can likely ignore them as a low rate of corrected ECC errors is not
> unexpected.
>
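
The two Status words quoted above can be cross-checked by hand against the
decoded text. Below is a minimal sketch in Python, not the mcelog code: the
function name decode_mca_status is hypothetical, bits 63 to 57 are taken as
the generic machine-check status flags, and the AMD northbridge specifics
(bit 46 for a corrected ECC error, bit 32 for "err cpu0", syndrome split
across bits 54:47 and 31:24) are assumptions inferred from the decoded
output in this thread.

```python
def decode_mca_status(status: int) -> dict:
    """Split a 64-bit MCA bank status word into a few named fields.

    Bit positions for the northbridge-specific fields are assumptions
    inferred from the decoded text quoted in this thread.
    """
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    # Assumed chipkill syndrome layout: low byte in bits 54:47,
    # high byte in bits 31:24.
    syndrome = (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF)

    return {
        "valid": bit(63),            # entry holds a logged error
        "overflow": bit(62),         # an earlier error was overwritten
        "uncorrected": bit(61),      # False means the error was corrected
        "enabled": bit(60),          # reporting was enabled for this error
        "addr_valid": bit(58),       # the Address line is meaningful
        "context_corrupt": bit(57),  # processor context may be corrupt
        "corrected_ecc": bit(46),    # "bit46 = corrected ecc error"
        "err_cpu0": bit(32),         # "bit32 = err cpu0"
        "syndrome": syndrome,
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # The two Status values reported at the start of the thread.
    for raw in (0x940C4001FE080813, 0x9459C0014A080813):
        f = decode_mca_status(raw)
        print(f"{raw:#018x}: syndrome={f['syndrome']:04x} "
              f"corrected_ecc={f['corrected_ecc']} "
              f"uncorrected={f['uncorrected']}")
```

For both values the sketch reports the error as corrected rather than
uncorrected, and the syndromes come out as fe18 and 4ab3, which agrees with
the decoded text above.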

Hi,

A little off topic, but what is 'a low rate of corrected ECC errors'? At
work one machine has them about once per day, but runs OK. Is once per day
much?

Ronald.


