Date:      Sat, 10 Jul 2010 01:53:39 +0200
From:      Markus Gebert <markus.gebert@hostpoint.ch>
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: 8.1-RC2 - PCI fatal error or MCE triggered by USB/ehci on Sun X4100M2?
Message-ID:  <08562D52-02AA-46CF-BFCD-00D0A3C4DC34@hostpoint.ch>
In-Reply-To: <201007091603.31843.jhb@freebsd.org>
References:  <6B57591F-9FA2-45EB-825F-1DB025C0635D@hostpoint.ch> <201007091603.31843.jhb@freebsd.org>

Hi John

On 09.07.2010 at 22:03, John Baldwin wrote:

> On Friday, July 09, 2010 11:26:00 am Markus Gebert wrote:
>> --
>> MCA: Bank 4, Status 0xb400004000030c2b
>> MCA: Global Cap 0x0000000000000105, Status 0x0000000000000007
>> MCA: Vendor "AuthenticAMD", ID 0x40f13, APIC ID 2
>> MCA: CPU 2 UNCOR BUSLG Observer WR I/O
>> MCA: Address 0xfd00000000
>
> Using my local port of mcelog this is what I get for this check:
>
> CPU 2 4 northbridge
> ADDR fd00000000
>  Northbridge Master abort
>  link number = 4
>       bit61 = error uncorrected
>  bus error 'local node observed, request didn't time out
>             generic write mem transaction
>             i/o access, level generic'
> STATUS b400004000030c2b MCGSTATUS 7
> MCGCAP 105 APICID 2 SOCKETID 0
> CPUID Vendor AMD Family 15 Model 65
>
> I don't know what to tell you off hand.  Did you buy this hardware from Sun
> directly?  If so, I would try bugging them about this, especially given the
> error that the BIOS is logging.

Yes, this hardware comes directly from Sun, but getting Sun (/Oracle)
support for this issue is going to be tough. FreeBSD is unsupported, and
in a short test we couldn't reproduce the problem with a Linux kernel.
While I agree that a hardware issue has always been and still is a
possibility, the fact remains that we tested this on two machines and
that 6.x and 7.x do not show the behavior. Another possibility, of
course, is that the X4100 is prone to such issues and 6.x and 7.x
somehow have workarounds we're not aware of, or just do something
differently so that the issue does not get triggered.

>  It does sound like a hardware issue, but in
> the chipset, not in the RAM, so you might need to swap out the main board
> rather than the RAM.

Yep. The MCA report does not indicate RAM problems, and the MCE itself
was not our only reason to replace RAM. We found a Sun document about
the X4200 series getting HyperTransport errors when RAM from a certain
vendor is installed, so we swapped the RAM to rule this out.
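
For reference, the status word from the MCA report at the top of this
mail can be decoded by hand, which shows why it reads as an I/O bus
error rather than a memory one. The following is only a sketch in
Python, assuming the architectural MCi_STATUS layout (flag bits 63..57,
error code in bits 15..0) and the compound bus-error encoding
0000 1PPT RRRR IILL. The function and table names are made up for
illustration and do not come from mcelog or the FreeBSD kernel.

```python
# Sketch only: table and function names are hypothetical, chosen for this example.
# Assumed layout: architectural MCi_STATUS flag bits and the compound
# bus/interconnect error code format 0000 1PPT RRRR IILL.

STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # a previous error was overwritten
    61: "UC",     # error was not corrected
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds extra information
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}

PARTICIPATION = ["SRC", "RES", "OBS", "generic"]
REQUEST = ["generic", "RD", "WR", "DRD", "DWR", "IRD",
           "PREFETCH", "EVICT", "SNOOP"]
MEM_IO = ["MEM", "reserved", "IO", "generic"]
LEVEL = ["L0", "L1", "L2", "LG"]


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    flags = [name
             for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "error_code": code,
        # Bits 31..16 are model specific and left undecoded here.
        "model_specific": (status >> 16) & 0xFFFF,
    }
    # Bus/interconnect errors have the top five code bits equal to 00001.
    if code & 0xF800 == 0x0800:
        rrrr = (code >> 4) & 0xF
        result["bus"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 1),
            "request": REQUEST[rrrr] if rrrr < len(REQUEST) else "reserved",
            "target": MEM_IO[(code >> 2) & 0x3],
            "level": LEVEL[code & 0x3],
        }
    return result


if __name__ == "__main__":
    print(decode_mci_status(0xb400004000030c2b))
```

For 0xb400004000030c2b this gives the flags VAL, UC, EN and ADDRV, and a
bus error with participation OBS, no timeout, request WR, target IO and
level LG, which lines up with the "UNCOR BUSLG Observer WR I/O" line
printed by the kernel. The model-specific field (0x0003 here) is left
undecoded, since its meaning depends on the CPU family.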

We did not replace the mainboard, though; testing on a second X4100
should achieve about the same.


> I'm curious if disabling USB legacy support in the BIOS causes it to still die
> even with ehci not loaded.  If so, then the SMI# for the ehci controller must
> somehow prevent the issue, perhaps by triggering frequently enough to slow the
> rate of I/O requests down?


I disabled USB legacy support in the BIOS and booted a kernel with
usb+ohci+ukbd+ums but without ehci. Unfortunately, I cannot reproduce
the MCE in this configuration.

Just to make sure I understand you correctly: your theory is that when
we don't load the ehci driver, the ehci controller isn't taken over
during boot and is therefore handled through SMM, so that SMIs might
occur often enough to throttle the system just enough to keep the
problem from appearing? I'm not very familiar with USB legacy support
and SMM, but why would ehci be handled through SMM when the only USB
devices (the virtual keyboard and mouse) actually sit on ohci? And why
would disabling legacy support help generate more SMIs to throttle the
system? As I understand it, and I might be terribly wrong, legacy
support is what would cause SMIs in the first place.

Markus



