Date:      Wed, 24 Feb 2016 15:51:01 -0500
From:      Ultima <ultima1252@gmail.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-hardware@freebsd.org
Subject:   Re: MCA error, possible causes?
Message-ID:  <CANJ8om4PWNtP2jK5=RE_9w5qhn6EJGASoSoft5ZHzKnHpso%2BGA@mail.gmail.com>
In-Reply-To: <1599604.5jmidy9vDx@ralph.baldwin.cx>
References:  <CANJ8om7C2UreYEkm-=XxL222Gqmc9i5kQH2p=oc8ntgbkehn5A@mail.gmail.com> <1599604.5jmidy9vDx@ralph.baldwin.cx>

 Hi John,

 Thanks for the explanation. I ran some tests, and the cause ended up being a
power savings mode (aka unstable mode?). Disabling this feature put an end to
the freezes. I came to this conclusion by stress testing the box for 3 days
with no issues at all; then I stopped the stress test, and about 15-30 min
later it froze. The freezes seemed to occur only during periods of low load.
I have not received any of these errors since turning off this power savings
mode.

On Wed, Feb 24, 2016 at 3:14 PM, John Baldwin <jhb@freebsd.org> wrote:

> On Friday, February 12, 2016 08:11:37 PM Ultima wrote:
> >  I recently installed some CPUs and received two MCA errors. Using mcelog, I
> > found that the version in ports is about 5 years out of date and didn't
> > support my CPU. I decided to update it to the newest version (will post on
> > bugzilla shortly) to pull some more info. I am going to post the original
> > and decoded mcelog output.
> >
> >
> > Raw:
> > MCA: Bank 20, Status 0xc800084000310e0f
> > MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
> > MCA: Vendor "GenuineIntel", ID 0x306f1, APIC ID 0
> > MCA: CPU 0 COR (33) OVER BUSLG ??? ERR Other
> > MCA: Misc 0x1df87b000d9eff
> > MCA: Bank 5, Status 0xc800008000310e0f
> > MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
> > MCA: Vendor "GenuineIntel", ID 0x306f1, APIC ID 42
> > MCA: CPU 34 COR (2) OVER BUSLG ??? ERR Other
> > MCA: Misc 0xdf87b008d9eff
> >
> > mcelog v131:
> > Hardware event. This is not a software error.
> > CPU 0 BANK 20
> > MISC 1df87b000d9eff
> > MCG status:
> > QPI: Rx detected CRC error - successful LLR wihout Phy re-init
> > STATUS c800084000310e0f MCGSTATUS 0
> > MCGCAP 7000c16 APICID 0 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 63
> > Hardware event. This is not a software error.
> > CPU 34 BANK 5
> > MISC df87b008d9eff
> > MCG status:
> > QPI: Rx detected CRC error - successful LLR wihout Phy re-init
> > STATUS c800008000310e0f MCGSTATUS 0
> > MCGCAP 7000c16 APICID 2a SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 63
> >
> >  After receiving this error, the system was in a frozen state. Any ideas
> > what may cause this?
>
> Well, hardware causes it.  QPI is the interconnect bus between your
> CPUs and RAM.  "Rx detected CRC error" implies that a CPU detected a
> corrupted message on that bus, but when it requested a resend the
> resent message was ok.  Normally corrected errors shouldn't hang your
> machine, but perhaps your machine had another hardware error after this
> that broke it too badly to report and/or log the subsequent error.
>
> --
> John Baldwin
>
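
For anyone reading the raw "MCA: Bank N, Status 0x..." lines quoted above, here
is a minimal Python sketch of how the status word can be split into fields. The
bit positions are an assumption taken from the IA32_MCi_STATUS layout described
in the Intel SDM, and decode_mci_status is a hypothetical helper name, not part
of the FreeBSD kernel or mcelog. The corrected-error count field is only
meaningful when bit 10 of the global capability register is set, which is the
case for the 0x7000c16 value in the log.

```python
def decode_mci_status(status):
    """Split a 64-bit machine-check bank status word into named fields.

    Bit positions are assumed from the Intel SDM description of
    IA32_MCi_STATUS; this is an illustrative helper, not kernel code.
    """
    return {
        "valid": bool((status >> 63) & 1),        # VAL: register holds an error
        "overflow": bool((status >> 62) & 1),     # OVER: an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),  # UC: error was not corrected
        "enabled": bool((status >> 60) & 1),      # EN: reporting was enabled
        "misc_valid": bool((status >> 59) & 1),   # MISCV: the Misc register is valid
        "addr_valid": bool((status >> 58) & 1),   # ADDRV: the Addr register is valid
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # The two status values from the raw log in this thread.
    for raw in (0xC800084000310E0F, 0xC800008000310E0F):
        fields = decode_mci_status(raw)
        print(hex(raw))
        for name, value in fields.items():
            shown = hex(value) if name.endswith("_code") else value
            print(f"  {name}: {shown}")
```

With the two values from the log, the corrected counts come out as 33 and 2,
which agrees with the "COR (33)" and "COR (2)" lines the kernel printed, and
the uncorrected bit is clear in both, consistent with these being corrected
errors.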


