Date:      Wed, 21 Jul 2010 15:36:53 +0300
From:      Andriy Gapon <avg@icyb.net.ua>
To:        Markus Gebert <markus.gebert@hostpoint.ch>
Cc:        freebsd-stable@freebsd.org, John Baldwin <jhb@freebsd.org>
Subject:   Re: 8.1-RC2 MCE caused by some LAPIC/clock changes?
Message-ID:  <4C46E9E5.8000204@icyb.net.ua>
In-Reply-To: <5CABE3EC-1EE7-4B6B-85EA-70AA2A107948@hostpoint.ch>
References:  <6B57591F-9FA2-45EB-825F-1DB025C0635D@hostpoint.ch>	<9DCFE2F6-D7CB-49CB-8EBC-06C1E5EBB727@hostpoint.ch>	<F744F475-3D2B-4BC6-856A-A5D302AA8681@hostpoint.ch>	<201007201559.45081.jhb@freebsd.org> <6781BC8B-51E0-4F8B-9307-9C062DE70C21@hostpoint.ch> <4C46B0C6.4020400@icyb.net.ua> <5CABE3EC-1EE7-4B6B-85EA-70AA2A107948@hostpoint.ch>

on 21/07/2010 15:25 Markus Gebert said the following:
> On 21.07.2010, at 10:33, Andriy Gapon wrote:
> 
>> on 21/07/2010 03:57 Markus Gebert said the following:
>>> Another thing though: Today I compared verbose boot output from 8-stable
>>> and the current box. I saw that the ioapic sets up IRQ routing differently
>>> on these two systems although the hardware is the same. This seemed not so 
>>> interesting at first, but then I noticed that 8-stable sets up two routes
>>> (to lapic0 and lapic2, or sometimes lapic3) for IRQ58 (mpt0), while current
>>> only uses one route (to lapic0).
>> My understanding is that it's not "two routes", but re-routing. During early
>> boot all interrupts are bound to BSP; later, when APs become online, the
>> interrupts are re-distributed among available CPUs.
> 
> I guess you're right, misinterpretation on my side. Thanks for clarifying this.
> 
> 
> Now being aware of this, it seems to me that in the machdep.lapic_allclocks=0
> case, there might just be more interrupts to be assigned/routed due to "more
> clocks being used". If that's true, maybe it's just "luck" that in this case
> the mpt interrupt gets assigned to lapic0/cpu0 and the box runs fine. I'm just
> guessing though, since I have no clue how interrupts are assigned to lapics
> exactly (round-robin? some logic?).

Yes, round-robin, for interrupts that are not explicitly bound to specific CPUs.
The process is deterministic, but hard to predict indeed.
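
To illustrate why a deterministic round-robin can still be hard to predict, here
is a minimal sketch. It is not the FreeBSD kernel code; the function name, the
IRQ numbers other than 58, and the CPU list are all made up for the example. It
only assumes what is said above: unbound interrupts are handed to CPUs in turn,
and explicitly bound ones keep their CPU.

```python
def assign_round_robin(irqs, cpus, bound=None):
    """Return {irq: cpu} for a hypothetical round-robin assignment.

    irqs  -- IRQ numbers in the order they get assigned
    cpus  -- CPU ids available once the APs are online
    bound -- optional {irq: cpu} for explicitly bound interrupts
    """
    bound = bound or {}
    routes = {}
    next_idx = 0  # position of the next CPU to hand out
    for irq in irqs:
        if irq in bound:
            # Explicitly bound interrupts do not consume a round-robin slot.
            routes[irq] = bound[irq]
            continue
        routes[irq] = cpus[next_idx]
        next_idx = (next_idx + 1) % len(cpus)
    return routes


cpus = [0, 1, 2, 3]
# Illustrative IRQ sets only: the second has two more sources ahead of IRQ 58.
few = assign_round_robin([14, 16, 58], cpus)
more = assign_round_robin([0, 8, 14, 16, 58], cpus)
print(few[58], more[58])  # prints "2 0"
```

The same rule gives IRQ 58 a different CPU as soon as the set of interrupts
assigned before it changes, which is the kind of shift a different clock setup
could cause.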

>>> I used 'cpuset -c -l 0 -x 58' in an attempt to make my 8-stable box behave 
>>> like the one running current. Indeed, this seems to have changed IRQ58 to
>>> be routed to lapic0 only. And the box was running for hours without showing
>>> the symptoms.
>>> 
>>> I just checked boot verbose output of my 8-stable box again (booted with
>>> machdep.lapic_allclocks=0 as mentioned above). And now it seems to have set
>>> up IRQ routes just like the current box (one route for IRQ58 to lapic0).
>> Not sure how to interpret this properly. One possibility is a hardware
>> problem where the interrupt message route between ioapic2 and the CPU to
>> which lapic3 belongs is flaky. Perhaps this might be a FreeBSD problem: it
>> could be that the system somehow tells us not to set up such routes, but we
>> don't listen. But this is far-fetched.
> 
> 
> I'm not sure either. If my "theory" above proved to be true, it would have been
> just luck that 6.x and 7.x (and current) run just fine on the X4100M2. A
> (short) test on Ubuntu didn't trigger the problem, so the Linux kernel is
> either lucky too by selecting an interrupt route that is "not flaky", or
> there's indeed some way to figure out not to use some lapics for some
> interrupts. Or we didn't test Linux thoroughly enough.

Yep, it would be interesting to see how interrupts were distributed among CPUs on
that Linux system.
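
On Linux the per-CPU interrupt counts are exposed in /proc/interrupts, so a
small script could summarize the distribution. The following is only a sketch:
the function name is made up, and it assumes the usual layout of a header line
of CPU names followed by "label: count count ... description" rows.

```python
def parse_proc_interrupts(text):
    """Return {irq_label: {cpu_name: count}} from /proc/interrupts text.

    Rows that do not carry one count per CPU (e.g. summary rows such as
    "ERR:") are skipped.
    """
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # e.g. ['CPU0', 'CPU1']
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for field in rest.split()[:len(cpus)]:
            if not field.isdigit():
                break  # reached the description part of the row
            counts.append(int(field))
        if len(counts) == len(cpus):
            table[label.strip()] = dict(zip(cpus, counts))
    return table
```

Running parse_proc_interrupts(open("/proc/interrupts").read()) on the Linux
install and looking up the row for the mpt controller would show which CPUs
actually service its interrupt.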

-- 
Andriy Gapon


