Date:      Wed, 21 Jul 2010 02:57:07 +0200
From:      Markus Gebert <markus.gebert@hostpoint.ch>
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: 8.1-RC2 MCE caused by some LAPIC/clock changes? (was: 8.1-RC2 - PCI fatal error or MCE triggered by USB/ehci on Sun X4100M2?)
Message-ID:  <6781BC8B-51E0-4F8B-9307-9C062DE70C21@hostpoint.ch>
In-Reply-To: <201007201559.45081.jhb@freebsd.org>
References:  <6B57591F-9FA2-45EB-825F-1DB025C0635D@hostpoint.ch> <9DCFE2F6-D7CB-49CB-8EBC-06C1E5EBB727@hostpoint.ch> <F744F475-3D2B-4BC6-856A-A5D302AA8681@hostpoint.ch> <201007201559.45081.jhb@freebsd.org>


On 20.07.2010, at 21:59, John Baldwin wrote:

>> I started narrowing the revisions down until I
>> found out, that while on r202386 I'm still able to trigger the MCE, r202387
>> seems to solve the problem on CURRENT:
>>
>> http://svn.freebsd.org/viewvc/base?view=revision&revision=202387
>
> Although this change was MFC'd, it was later disabled by default because it
> causes issues on other machines.  I think there is a tunable you need to set
> in loader.conf to enable it for 8.1.  Attilio (the author of that commit)
> should know which tunable to set.

Might be this one in sys/amd64/amd64/clock.c:

----
static int lapic_allclocks = 1;
TUNABLE_INT("machdep.lapic_allclocks", &lapic_allclocks);
----

The r202387 changes put this into local_apic.c; I guess it was moved
later on (or after the MFC), which is why I couldn't find it on
8-stable. And, indeed, this tunable seems to be gone again in current.
I'm testing with machdep.lapic_allclocks=0 right now. So far it looks
very promising. I'll let it run overnight.
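
For reference, a minimal sketch of how the tunable would be set at boot. It assumes the kernel being booted still carries the machdep.lapic_allclocks tunable shown above (it may not exist on every branch) and uses the usual /boot/loader.conf location and quoting:

```
# /boot/loader.conf
# Assumption: this kernel still has the machdep.lapic_allclocks tunable.
machdep.lapic_allclocks="0"
```

After a reboot, `kenv machdep.lapic_allclocks` should print the value the loader passed to the kernel.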

Another thing, though: today I compared the verbose boot output from the
8-stable box and the current box. I saw that the ioapic sets up IRQ
routing differently on these two systems although the hardware is the
same. This did not seem very interesting at first, but then I noticed
that 8-stable sets up two routes (to lapic0 and lapic2, or sometimes
lapic3) for IRQ58 (mpt0), while current only uses one route (to lapic0).

I used 'cpuset -c -l 0 -x 58' in an attempt to make my 8-stable box
behave like the one running current. Indeed, this seems to have changed
IRQ58 to be routed to lapic0 only. And the box was running for hours
without showing the symptoms.

I just checked the verbose boot output of my 8-stable box again (booted
with machdep.lapic_allclocks=0 as mentioned above). And now it seems to
have set up IRQ routes just like the current box (one route for IRQ58 to
lapic0).

So I don't get which issue came first... If either one is ruled out, the
problem seems to be gone. Was it the clock issue causing a wrong IRQ
routing setup, which in turn causes mpt or the CPU to go nuts? Or is mpt
having two interrupt routes actually a normal thing (then why doesn't
current behave this way?), but the mpt driver causes strange things when
operating with clock issues? Or have I misinterpreted something?

Here's the verbose boot output of ioapic related to interrupts 56 (em0),
57 (em1) and 58 (mpt0):

---- 1st X4100M2 - running 8-stable (machdep.lapic_allclocks=1, MCEs can be reproduced easily) ----
# egrep '^ioapic' boot.normal | egrep 'IRQ 5[678]' | sort
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 0 vector 55
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 1 vector 50
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 0 vector 56
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 2 vector 50
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 0 vector 57
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 3 vector 50
----

---- 1st X4100M2 - running 8-stable (machdep.lapic_allclocks=0, test currently running, no MCEs so far) ----
# egrep '^ioapic' boot.lapic_allclocks0 | egrep 'IRQ 5[678]' | sort
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 0 vector 55
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 2 vector 50
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 0 vector 56
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 3 vector 50
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 0 vector 57
----

---- 2nd X4100M2 - running current (MCEs cannot be reproduced) ----
# dmesg | egrep '^ioapic' | egrep 'IRQ 5[678]' | sort
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 0 vector 55
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 2 vector 50
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 0 vector 56
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 3 vector 50
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 0 vector 57
----
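
To make the comparison above less error-prone than reading the listings by eye, the routing lines per IRQ can be counted mechanically. The following is only a sketch: `count_routes` is a hypothetical helper name, and it assumes the exact "ioapicN: routing intpin ..." line format quoted above. It counts lines only; whether several lines for one IRQ mean simultaneous routes or successive re-routing is not something the script can tell.

```shell
#!/bin/sh
# Count "routing intpin" lines per PCI IRQ in verbose boot output.
# Assumes the line format quoted in this message, e.g.:
#   ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 0 vector 57
# count_routes is a hypothetical helper name, not an existing tool.
count_routes() {
    awk '
        /^ioapic[0-9]+: routing intpin/ {
            for (i = 1; i < NF; i++) {
                if ($i == "IRQ") {
                    irq = $(i + 1)
                    sub(/\)$/, "", irq)   # strip trailing ")" from "58)"
                    lines[irq]++
                }
            }
        }
        END { for (irq in lines) print irq, lines[irq] }
    ' "$@" | sort -n
}
```

It reads the files named as arguments, or standard input if none are given, so both `count_routes boot.normal` and `dmesg | count_routes` work. Each output line is an IRQ number followed by the number of routing lines seen for it.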


Markus



