Date:      Tue, 18 Oct 2011 11:11:34 +0200
From:      John Hay <jhay@meraka.org.za>
To:        freebsd-stable@freebsd.org
Subject:   Re: MCA: CPU 0 UNCOR PCC DTLB L1 error
Message-ID:  <20111018091134.GA8700@zibbi.meraka.csir.co.za>
In-Reply-To: <20110516165123.GA30171@icarus.home.lan>
References:  <20110510125220.GA88338@zibbi.meraka.csir.co.za> <BANLkTik79gjQKsdrz_8mQdLc3e9KGiGzzQ@mail.gmail.com> <20110516162319.GA58581@zibbi.meraka.csir.co.za> <20110516165123.GA30171@icarus.home.lan>

Hi Guys,

On Mon, May 16, 2011 at 09:51:23AM -0700, Jeremy Chadwick wrote:
> On Mon, May 16, 2011 at 06:23:19PM +0200, John Hay wrote:
> > On Wed, May 11, 2011 at 05:26:50PM -0500, Alan Cox wrote:
> > > On Tue, May 10, 2011 at 7:52 AM, John Hay <jhay@meraka.org.za> wrote:
> > > 
> > > > Hi,
> > > >
> > > > I have seen this panic a few times on a Gigabyte E350N-USB3 running
> > > > 8-STABLE. I have only seen it while in X, but then the machine is
> > > > always in X. At first I just got these hangs, so I bought a
> > > > PCI-express RS232 card and could see these at last. For some reason
> > > > it does not go past this, so I have not been able to get a dump yet.
> > > >
> > > > Does anybody have an idea of why this is or how to debug it further? I
> > > > searched the archives and found something similar about a year ago,
> > > > but it looks like it was solved with a fix that got committed.
> > > >
> > > > http://www.freebsd.org/cgi/query-pr.cgi?pr=140338
> > > >
> > > > I have now disabled mca in loader.conf with 'hw.mca.enabled="0"' and I have
> > > > not seen that panic again. I do occasionally see a panic in devfs_open(),
> > > > but I guess that should be handled in another thread.
> > > >
> > > > The kernel is basically a GENERIC kernel with puc uncommented and the
> > > > following in loader.conf
> > > >
> > > > vm.kmem_size="12G"
> > > > hw.mca.enabled="0"
> > > > zfs_load="YES"
> > > > ahci_load="YES"
> > > > xhci_load="YES"
> > > > amdtemp_load="YES"
> > > > ng_ubt_load="YES"
> > > > uplcom_load="YES"
> > > >
> > > > Here is the panic message and after that dmesg.
> > > >
> > > > John
> > > > --
> > > > John Hay -- jhay@meraka.csir.co.za / jhay@FreeBSD.org
> > > >
> > > > ####################################################
> > > > MCA: Bank 0, Status 0xb600000000010015
> > > > MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
> > > > MCA: Vendor "AuthenticAMD", ID 0x500f10, APIC ID 0
> > > > MCA: CPU 0 UNCOR PCC DTLB L1 error
> > > > MCA: Address 0x8016c4000
> > > >
> > > >
> > > > Fatal trap 28: machine check trap while in user mode
> > > > cpuid = 0; apic id = 00
> > > > instruction pointer     = 0x43:0x80156af85
> > > > stack pointer           = 0x3b:0x7fffffffcb18
> > > > frame pointer           = 0x3b:0x80fe87800
> > > > code segment            = base 0x0, limit 0xfffff, type 0x1b
> > > >                        = DPL 3, pres 1, long 1, def32 0, gran 1
> > > > processor eflags        = interrupt enabled, IOPL = 0
> > > > current process         = 2484 (initial thread)
> > > > trap number             = 28
> > > > panic: machine check trap
> > > > cpuid = 0
> > > > KDB: stack backtrace:
> > > > #0 0xffffffff80608d5e at kdb_backtrace+0x5e
> > > > #1 0xffffffff805d6707 at panic+0x187
> > > > #2 0xffffffff808bf4c0 at trap_fatal+0x290
> > > > #3 0xffffffff808bfaa9 at trap+0x109
> > > > #4 0xffffffff808a7d94 at calltrap+0x8
> > > > ####################################################
> > > >
> > > >
> > > Please try the following patch:
> > > 
> > > Index: x86/x86/mca.c
> > > ===================================================================
> > > --- x86/x86/mca.c       (revision 219060)
> > > +++ x86/x86/mca.c       (working copy)
> > > @@ -665,7 +665,8 @@ mca_setup(uint64_t mcg_cap)
> > >          * for Erratum 383.
> > >          */
> > >         if (cpu_vendor_id == CPU_VENDOR_AMD &&
> > > -           CPUID_TO_FAMILY(cpu_id) == 0x10 && amd10h_L1TP)
> > > +           (CPUID_TO_FAMILY(cpu_id) == 0x10 ||
> > > +           CPUID_TO_FAMILY(cpu_id) == 0x14) && amd10h_L1TP)
> > >                 workaround_erratum383 = 1;
> > > 
> > >         mtx_init(&mca_lock, "mca", NULL, MTX_SPIN);
> > > Index: i386/i386/pmap.c
> > > ===================================================================
> > > --- i386/i386/pmap.c    (revision 219060)
> > > +++ i386/i386/pmap.c    (working copy)
> > > @@ -758,7 +758,8 @@ pmap_init(void)
> > >          * machine monitor.
> > >          */
> > >         if (vm_guest == VM_GUEST_VM && cpu_vendor_id == CPU_VENDOR_AMD &&
> > > -           CPUID_TO_FAMILY(cpu_id) == 0x10)
> > > +           (CPUID_TO_FAMILY(cpu_id) == 0x10 ||
> > > +           CPUID_TO_FAMILY(cpu_id) == 0x14))
> > >                 workaround_erratum383 = 1;
> > > 
> > >         /*
> > > Index: amd64/amd64/pmap.c
> > > ===================================================================
> > > --- amd64/amd64/pmap.c  (revision 219060)
> > > +++ amd64/amd64/pmap.c  (working copy)
> > > @@ -727,7 +727,8 @@ pmap_init(void)
> > >          * machine monitor.
> > >          */
> > >         if (vm_guest == VM_GUEST_VM && cpu_vendor_id == CPU_VENDOR_AMD &&
> > > -           CPUID_TO_FAMILY(cpu_id) == 0x10)
> > > +           (CPUID_TO_FAMILY(cpu_id) == 0x10 ||
> > > +           CPUID_TO_FAMILY(cpu_id) == 0x14))
> > >                 workaround_erratum383 = 1;
> > > 
> > >         /*
> > 
> > I have applied the patch, but got another one today. I still do not get
> > a prompt or dump. :-( It just gets stuck right after #4. If there is
> > anything more that I can try, just ask.
> > 
> > #####################################################################
> > MCA: Bank 0, Status 0xb600000000010015
> > MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
> > MCA: Vendor "AuthenticAMD", ID 0x500f10, APIC ID 0
> > MCA: CPU 0 UNCOR PCC DTLB L1 error
> > MCA: Address 0x808ace000
> > 
> > 
> > Fatal trap 28: machine check trap while in user mode
> > cpuid = 1; apic id = 01
> > instruction pointer	= 0x43:0x80af206d5
> > stack pointer	        = 0x3b:0x7fffffffb8e8
> > frame pointer	        = 0x3b:0x809b92450
> > code segment		= base 0x0, limit 0xfffff, type 0x1b
> > 			= DPL 3, pres 1, long 1, def32 0, gran 1
> > processor eflags	= interrupt enabled, IOPL = 0
> > current process		= 22228 (initial thread)
> > trap number		= 28
> > panic: machine check trap
> > cpuid = 1
> > KDB: stack backtrace:
> > #0 0xffffffff80608f6e at kdb_backtrace+0x5e
> > #1 0xffffffff805d6917 at panic+0x187
> > #2 0xffffffff808bf7c0 at trap_fatal+0x290
> > #3 0xffffffff808bfda9 at trap+0x109
> > #4 0xffffffff808a8084 at calltrap+0x8
> > #####################################################################
> 
> The backtrace doesn't help in this situation.  I'm not sure anyone has
> taken the time to explain to you what's going on here exactly.  I don't
> know if you're like me, but when a machine panics I generally like to
> know what's going on.  :-)
> 
> Use of MCA (see Wikipedia for Machine Check Architecture) is generating
> an MCE (see Wikipedia for Machine Check Exception).  MCEs are generated
> by hardware when "something happens" -- they usually indicate a
> failure (bad RAM, CPU cache failing, etc.).
> 
> Certain MCEs are considered "normal"; for example, L2 cache (on-die in
> the CPU) being auto-corrected by ECC (that's ECC on-die, not ECC RAM
> like system RAM; this feature is only available on certain classes of
> CPUs) may be normal if seen, say, once every few months.  A large sum of
> them, however, is not normal.
> 
> MCE handling is done in the kernel.  Certain MCEs have to be ignored,
> and therefore there are handlers for those in the kernel.
> 
> MCEs vary greatly per every model (not class, but model) of CPU.  For
> example, Intel's documentation on their MCEs is immense and very complex
> given all the different CPU models and series'.
> 
> Any MCE without a handler will generate an exception (kernel panic) like
> what you see above.  This is normal on FreeBSD, as well as Solaris and
> many other OSes.  It's basically mandatory.  The reason being, if the
> situation/condition isn't known to be something that can be ignored, the
> hardware may be in a state of disarray and cannot be trusted.  Hence,
> panic.  The backtrace will therefore always be very short and indicate
> an intentional panic.
> 
> The MCE messages shown in FreeBSD are not very user-friendly, meaning
> you can't take what you see and go "omg!!! L1 cache failure!!" because
> that's not necessarily what that message means.  MCA is complex, and
> again, like I said, varies per model of CPU.
> 
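
As an aside, the status word can be unpacked by hand. What follows is only
a sketch, assuming the architectural MCi_STATUS layout (VAL, OVER, UC, EN,
MISCV, ADDRV and PCC in bits 63..57, the MCA error code in bits 15..0, and
the TLB error code pattern 0000 0000 0001 TTLL). The macro, struct and
function names are made up for illustration and are not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Flag bits of an MCi_STATUS register (macro names are illustrative). */
#define ST_VAL   (1ULL << 63)	/* register holds a valid error */
#define ST_OVER  (1ULL << 62)	/* an earlier error was overwritten */
#define ST_UC    (1ULL << 61)	/* error was not corrected by hardware */
#define ST_EN    (1ULL << 60)	/* reporting was enabled for this error */
#define ST_MISCV (1ULL << 59)	/* MCi_MISC holds extra information */
#define ST_ADDRV (1ULL << 58)	/* MCi_ADDR holds a valid address */
#define ST_PCC   (1ULL << 57)	/* processor context may be corrupt */

struct mc_decoded {
	bool valid, overflow, uncorrected, enabled, miscv, addrv, pcc;
	uint16_t error_code;	/* bits 15:0, architectural MCA error code */
	uint16_t model_code;	/* bits 31:16, model-specific error code */
	bool is_tlb;		/* error code matches 0000 0000 0001 TTLL */
	unsigned tt;		/* 0 instruction, 1 data, 2 generic */
	unsigned ll;		/* 0 L0, 1 L1, 2 L2, 3 generic */
};

/* Split a raw status value into its flag bits and error code fields. */
static struct mc_decoded
mc_decode(uint64_t status)
{
	struct mc_decoded d;

	d.valid = (status & ST_VAL) != 0;
	d.overflow = (status & ST_OVER) != 0;
	d.uncorrected = (status & ST_UC) != 0;
	d.enabled = (status & ST_EN) != 0;
	d.miscv = (status & ST_MISCV) != 0;
	d.addrv = (status & ST_ADDRV) != 0;
	d.pcc = (status & ST_PCC) != 0;
	d.error_code = (uint16_t)(status & 0xffff);
	d.model_code = (uint16_t)((status >> 16) & 0xffff);
	d.is_tlb = (d.error_code & 0xfff0) == 0x0010;
	d.tt = (d.error_code >> 2) & 0x3;
	d.ll = d.error_code & 0x3;
	return (d);
}

/* Write a short summary into buf; returns what snprintf() returns. */
static int
mc_describe(const struct mc_decoded *d, char *buf, size_t len)
{
	static const char *tt_name[] = { "I", "D", "G", "?" };
	static const char *ll_name[] = { "L0", "L1", "L2", "LG" };
	const char *uc, *pcc;

	assert(d != NULL && buf != NULL);
	uc = d->uncorrected ? "UNCOR " : "COR ";
	pcc = d->pcc ? "PCC " : "";
	if (!d->is_tlb)	/* only the TLB error format is handled here */
		return (snprintf(buf, len, "%s%scode 0x%04x", uc, pcc,
		    (unsigned)d->error_code));
	return (snprintf(buf, len, "%s%s%sTLB %s", uc, pcc,
	    tt_name[d->tt], ll_name[d->ll]));
}
```

Feeding it 0xb600000000010015 gives VAL, UC, EN, ADDRV and PCC set, with a
data TLB, level 1 error code. That is consistent with the "UNCOR PCC DTLB
L1" line the kernel printed, but it says nothing about the root cause.
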
> There is a utility on Linux called mcelog that can decode the messages
> to some degree.  John Baldwin ported this to FreeBSD (it's not in ports)
> and I've been occasionally downloading it and ensuring the patches work
> correctly + utility compiles and works (I have patches for patches,
> basically; no I haven't put them up anywhere).  "mcelog --ascii" will
> read data from stdin, specifically the messages you see from the kernel,
> and it outputs something a little more friendly.
> 
> In your case, however, mcelog does not have support for your specific
> model of CPU.  Possibly too new?  Here's the output that is returned:
> 
> $ ./mcelog --no-dmi --ascii
> MCA: Bank 0, Status 0xb600000000010015
> MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
> MCA: Vendor "AuthenticAMD", ID 0x500f10, APIC ID 0
> MCA: CPU 0 UNCOR PCC DTLB L1 error
> MCA: Address 0x808ace000
> 
> mcelog: Unknown CPU type vendor 2 family 14 model 1
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 0 BANK 0
> ADDR 808ace000
> STATUS b600000000010015 MCGSTATUS 4
> MCGCAP 106 APICID 0 SOCKETID 0
> CPUID Vendor AMD Family 20 Model 1
> 
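
For what it is worth, the "family 14" versus "Family 20" difference in
that output looks like hex versus decimal for the same value. Here is a
rough sketch of unpacking the ID, assuming the usual CPUID signature
layout and the AMD rule that the extended family and model fields only
count when the base family is 0xf. The names are made up for illustration;
this is not the kernel's CPUID_TO_FAMILY macro, and the last function only
mirrors the family test from the patch quoted earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cpu_sig {
	unsigned family;
	unsigned model;
	unsigned stepping;
};

/* Decode a CPUID signature (the "ID 0x..." value in the MCA output). */
static void
cpu_sig_decode(uint32_t id, struct cpu_sig *out)
{
	unsigned base_family = (id >> 8) & 0xf;
	unsigned ext_family = (id >> 20) & 0xff;
	unsigned base_model = (id >> 4) & 0xf;
	unsigned ext_model = (id >> 16) & 0xf;

	assert(out != NULL);
	out->stepping = id & 0xf;
	if (base_family == 0xf) {
		/* Extended fields are only meaningful in this case. */
		out->family = base_family + ext_family;
		out->model = (ext_model << 4) | base_model;
	} else {
		out->family = base_family;
		out->model = base_model;
	}
}

/* Hypothetical helper: the family test the patch adds (0x10 or 0x14). */
static bool
patch_covers_family(unsigned family)
{
	return (family == 0x10 || family == 0x14);
}
```

With ID 0x500f10 this comes out as family 0x14 (20 decimal), model 1,
stepping 0, so the patched family check would match this CPU. Whether the
workaround flag actually gets set still depends on the other conditions
in those if statements.
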
> I'm not familiar with AMD CPUs so I can't really look up what's going on
> here or what the MCE indicates, but this information may help others on
> this list.
> 
> A workaround -- though risky -- may be to disable MCA entirely by
> setting hw.mca.enabled="0" in /boot/loader.conf and rebooting.  This
> will ensure your system won't panic whenever *any* MCE is seen.  Older
> FreeBSD defaulted to MCA being off.  However, since I don't know what
> the MCE indicates, it could be fatal (e.g. panic'ing might be a better
> choice).  Hard to say at this point.
> 
> Hope this helps educate in one way or another.  :-)
> 

Just to say that I have been running this box with hw.mca.enabled="0"
in loader.conf and it has been stable since. I do see the occasional
coredump of npviewer.bin, but I see that on other boxes too. So I
think this particular error might be a case where FreeBSD does
something in a way that AMD did not expect on these processors.

John
-- 
John Hay -- jhay@meraka.csir.co.za / jhay@FreeBSD.org


