Date: Wed, 21 Jul 2010 02:57:07 +0200
From: Markus Gebert <markus.gebert@hostpoint.ch>
To: John Baldwin <jhb@freebsd.org>
Cc: freebsd-stable@freebsd.org
Subject: Re: 8.1-RC2 MCE caused by some LAPIC/clock changes? (was: 8.1-RC2 - PCI fatal error or MCE triggered by USB/ehci on Sun X4100M2?)
Message-ID: <6781BC8B-51E0-4F8B-9307-9C062DE70C21@hostpoint.ch>
In-Reply-To: <201007201559.45081.jhb@freebsd.org>
References: <6B57591F-9FA2-45EB-825F-1DB025C0635D@hostpoint.ch>
 <9DCFE2F6-D7CB-49CB-8EBC-06C1E5EBB727@hostpoint.ch>
 <F744F475-3D2B-4BC6-856A-A5D302AA8681@hostpoint.ch>
 <201007201559.45081.jhb@freebsd.org>

On 20.07.2010, at 21:59, John Baldwin wrote:

>> I started narrowing the revisions down until I
>> found out, that while on r202386 I'm still able to trigger the MCE, r202387
>> seems to solve the problem on CURRENT:
>>
>> http://svn.freebsd.org/viewvc/base?view=revision&revision=202387
>
> Although this change was MFC'd, it was later disabled by default because it
> causes issues on other machines.  I think there is a tunable you need to set
> in loader.conf to enable it for 8.1.  Attilio (the author of that commit)
> should know which tunable to set.

Might be this one in sys/amd64/amd64/clock.c:

----
static int lapic_allclocks = 1;
TUNABLE_INT("machdep.lapic_allclocks", &lapic_allclocks);
----

The r202387 changes put this into local_apic.c. I guess it was moved
later on (or after the MFC), and that's why I couldn't find it on
8-stable. And, indeed, this tunable seems to be gone again in current.
Testing with machdep.lapic_allclocks=0 right now. So far it looks very
promising. I'll let it run overnight.

Another thing though: Today I compared the verbose boot output from the
8-stable box and the current box. I saw that the ioapic sets up IRQ
routing differently on these two systems although the hardware is the
same. This did not seem very interesting at first, but then I noticed
that 8-stable sets up two routes (to lapic0 and lapic2, or sometimes
lapic3) for IRQ58 (mpt0), while current only uses one route (to lapic0).

I used 'cpuset -c -l 0 -x 58' in an attempt to make my 8-stable box
behave like the one running current. Indeed, this seems to have changed
IRQ58 to be routed to lapic0 only. And the box was running for hours
without showing the symptoms.

I just checked the verbose boot output of my 8-stable box again (booted
with machdep.lapic_allclocks=0 as mentioned above). And now it seems to
have set up IRQ routes just like the current box (one route for IRQ58 to
lapic0). So I don't get which issue came first...
If either one is ruled out, the problem seems to be gone. Was it the
clock issue causing a wrong IRQ routing setup, which in turn causes mpt
or the CPU to go nuts? Or is mpt having two interrupt routes actually a
normal thing (then why doesn't current behave this way?), but the mpt
driver causes strange things when operating with clock issues? Or have I
misinterpreted something?

Here's the verbose boot output of ioapic related to interrupts 56 (em0),
57 (em1) and 58 (mpt0):

----
1st X4100M2 - running 8-stable (machdep.lapic_allclocks=1, MCEs can be reproduced easily)
----
# egrep '^ioapic' boot.normal | egrep 'IRQ 5[678]' | sort
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 0 vector 55
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 1 vector 50
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 0 vector 56
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 2 vector 50
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 0 vector 57
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 3 vector 50
----

----
1st X4100M2 - running 8-stable (machdep.lapic_allclocks=0, test currently running, no MCEs so far)
----
# egrep '^ioapic' boot.lapic_allclocks0 | egrep 'IRQ 5[678]' | sort
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 0 vector 55
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 2 vector 50
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 0 vector 56
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 3 vector 50
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 0 vector 57
----

----
2nd X4100M2 - running current (MCEs cannot be reproduced)
----
# dmesg | egrep '^ioapic' | egrep 'IRQ 5[678]' | sort
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 0 vector 55
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 2 vector 50
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 0 vector 56
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 3 vector 50
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 0 vector 57
----

Markus
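
The by-eye comparison of the dumps above can also be done mechanically.
What follows is only a minimal Python sketch, not something from the
original message: the function names (routes_by_irq, diff_routes) and
the constants are hypothetical, and the regular expression assumes the
exact "ioapicN: routing intpin ... to lapic ... vector ..." line format
shown in the dumps. It groups the routes per IRQ and reports which IRQs
are routed to a different set of lapics in two boots. It restates the
data only and says nothing about the cause.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 3 vector 50
ROUTE_RE = re.compile(
    r"^ioapic\d+: routing intpin \d+ \(PCI IRQ (\d+)\) to lapic (\d+) vector (\d+)"
)


def routes_by_irq(lines):
    """Map each IRQ number to the list of (lapic, vector) routes logged for it."""
    routes = defaultdict(list)
    for line in lines:
        m = ROUTE_RE.match(line.strip())
        if m:
            irq, lapic, vector = (int(g) for g in m.groups())
            routes[irq].append((lapic, vector))
    return dict(routes)


def diff_routes(a, b):
    """Return {irq: (lapics_in_a, lapics_in_b)} for IRQs whose lapic sets differ."""
    out = {}
    for irq in sorted(set(a) | set(b)):
        lapics_a = sorted({lapic for lapic, _ in a.get(irq, [])})
        lapics_b = sorted({lapic for lapic, _ in b.get(irq, [])})
        if lapics_a != lapics_b:
            out[irq] = (lapics_a, lapics_b)
    return out


# Data copied from the two 8-stable dumps in the message.
ALLCLOCKS_1 = """\
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 0 vector 55
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 1 vector 50
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 0 vector 56
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 2 vector 50
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 0 vector 57
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 3 vector 50
"""

ALLCLOCKS_0 = """\
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 0 vector 55
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 2 vector 50
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 0 vector 56
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 3 vector 50
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 0 vector 57
"""

with_allclocks = routes_by_irq(ALLCLOCKS_1.splitlines())
without_allclocks = routes_by_irq(ALLCLOCKS_0.splitlines())

for irq, (before, after) in diff_routes(with_allclocks, without_allclocks).items():
    print(f"IRQ {irq}: lapics {before} -> {after}")
```

To use it on real logs, feed it the lines of the saved verbose boot
output (boot.normal and boot.lapic_allclocks0 in the commands above)
instead of the embedded strings.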