Date: Fri, 31 Jan 2014 11:48:42 -0600
From: Tim Daneliuk <tundra@tundraware.com>
To: John Baldwin <jhb@freebsd.org>, freebsd-stable@freebsd.org
Cc: FreeBSD Hardware Mailing List <freebsd-hardware@freebsd.org>
Subject: Re: Need Help With MCA Code
Message-ID: <52EBE1FA.2040603@tundraware.com>
In-Reply-To: <201401311222.12136.jhb@freebsd.org>
References: <52E73717.3000503@tundraware.com> <52E99381.5050803@tundraware.com> <201401311222.12136.jhb@freebsd.org>

On 01/31/2014 11:22 AM, John Baldwin wrote:
> On Wednesday, January 29, 2014 6:49:21 pm Tim Daneliuk wrote:
>> Resending in hopes that people on one of the other lists will have some insight here:
>>
>> On 01/27/2014 10:50 PM, Tim Daneliuk wrote:
>>> I am running 9.2 stable i386 r261207.  As noted earlier:
>>>
>>>> I just replaced mobo/CPU on FBSD server (Gigabyte Z-87-D3HP with
>>>> an Intel i3-4130).  I am not overclocking ... but I continue to see this sort of thing:
>>>
>>>> MCA: CPU 0 COR (1) internal parity error
>>>
>>> Dmesg shows:
>>>
>>>> MCA: Vendor "GenuineIntel", ID 0x306c3, APIC ID 0
>>>> MCA: CPU 0 COR (1) internal parity error
>>>> MCA: Bank 0, Status 0x90000040000f0005
>>>> MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000_
>>>
>>> I've swapped CPUs (i5).  I've fiddled with an endless supply of
>>> mobo settings.  I've switched power supplies.  I've moved mem
>>> sticks around ....  No joy.
>>>
>>> So, I dug through the sources and found this:
>>>
>>>
>>>
>>> mca_log(const struct mca_record *rec)
>>> {
>>>         uint16_t mca_error;
>>>
>>>         printf("MCA: Bank %d, Status 0x%016llx\n", rec->mr_bank,
>>>             (long long)rec->mr_status);
>>>         printf("MCA: Global Cap 0x%016llx, Status 0x%016llx\n",
>>>             (long long)rec->mr_mcg_cap, (long long)rec->mr_mcg_status);
>>>         printf("MCA: Vendor \"%s\", ID 0x%x, APIC ID %d\n", cpu_vendor,
>>>             rec->mr_cpu_id, rec->mr_apic_id);
>>>         printf("MCA: CPU %d ", rec->mr_cpu);
>>>         if (rec->mr_status & MC_STATUS_UC)
>>>                 printf("UNCOR ");
>>>         else {
>>>                 printf("COR ");
>>>                 if (rec->mr_mcg_cap & MCG_CAP_CMCI_P)
>>>                         printf("(%lld) ", ((long long)rec->mr_status &
>>>                             MC_STATUS_COR_COUNT) >> 38);
>>>         }
>>>
>>>
>>> It looks like the trailing else clause is kicking out the error but I am
>>> unclear what the error means, beyond the fact that it appears to be a parity
>>> error somewhere within the CPU's internal memory (cache?).  Is this error
>>> getting corrected?  Is this benign, Should I get a different mobo?
>>>
>>> Um ....  Haaaaalp :)
>>
>>
>> I have now tried different motherboards, CPUs, memory, and power supplies and
>> this error is still showing up now and then.
>>
>> This points strongly to either FreeBSD bogus reporting, or these errors being
>> benign.  It's hard to believe that the exact same error might occur with
>> completely different hardware ... unless it's being caused by the case.
>
> Are they all the same model CPU?  Since it is a corrected error you can
> probably ignore it, but it is not bogus reporting.  FreeBSD only reports
> these errors because they show up in registers on your CPU.
>

It's looking like this is an artifact of running 9.2-STABLE i386 on that
hardware.  I just installed 10-STABLE x64 and am beating the hardware to
death and have yet to see an MCA check.

It *is* possible the 9.2 install is boogered up (I went to grad school to
learn how to say that), so I am pursuing a full rebuild of the server.
While painful, this will also finally move this machine to x64 which is
long overdue.

-- 
----------------------------------------------------------------------------
Tim Daneliuk     tundra@tundraware.com
PGP Key:         http://www.tundraware.com/PGP/
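
For reference, the status word from the dmesg output quoted above can be decoded by hand. What follows is only a minimal sketch, assuming the IA32_MCi_STATUS bit layout described in the Intel SDM machine-check chapter (valid bit 63, uncorrected bit 61, corrected-error count in bits 52:38, MCA error code in bits 15:0). The macro names mirror the ones used in the quoted mca_log() source, but their values are restated here from that layout, and the helper functions mca_is_valid(), mca_is_uncorrected(), mca_cor_count(), mca_error_code() and mca_print() are hypothetical names invented for this illustration, not kernel functions.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed IA32_MCi_STATUS layout (Intel SDM, machine-check architecture). */
#define MC_STATUS_VAL        (1ULL << 63)           /* record is valid      */
#define MC_STATUS_UC         (1ULL << 61)           /* error NOT corrected  */
#define MC_STATUS_COR_COUNT  0x001fffc000000000ULL  /* bits 52:38           */
#define MC_STATUS_MCA_ERROR  0x000000000000ffffULL  /* bits 15:0            */

/* Hypothetical helpers, for illustration only. */
static int mca_is_valid(uint64_t status)
{
        return (status & MC_STATUS_VAL) != 0;
}

static int mca_is_uncorrected(uint64_t status)
{
        return (status & MC_STATUS_UC) != 0;
}

/* Same mask-and-shift as the "(%lld)" printf in the quoted mca_log(). */
static uint64_t mca_cor_count(uint64_t status)
{
        return (status & MC_STATUS_COR_COUNT) >> 38;
}

static uint16_t mca_error_code(uint64_t status)
{
        return (uint16_t)(status & MC_STATUS_MCA_ERROR);
}

/* Print a one-line summary of a status word. */
static void mca_print(uint64_t status)
{
        printf("valid=%d %s count=%llu code=0x%04x\n",
            mca_is_valid(status),
            mca_is_uncorrected(status) ? "UNCOR" : "COR",
            (unsigned long long)mca_cor_count(status),
            (unsigned)mca_error_code(status));
}
```

Calling mca_print(0x90000040000f0005ULL) from a small main() reports a valid, corrected record with a count of 1 and error code 0x0005, which lines up with the "COR (1) internal parity error" line in the dmesg output: the UC bit is clear, so the hardware says it corrected the error.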