Date:      Tue, 20 Jul 2010 04:15:59 -0400
From:      jhell <jhell@DataIX.net>
To:        Markus Gebert <markus.gebert@hostpoint.ch>
Cc:        freebsd-stable <freebsd-stable@freebsd.org>
Subject:   Re: 8.1-RC2 MCE caused by some LAPIC/clock changes? (was: 8.1-RC2 - PCI fatal error or MCE triggered by USB/ehci on Sun X4100M2?)
Message-ID:  <alpine.BSF.2.00.1007200406140.27685@pragry.qngnvk.ybpny>
In-Reply-To: <F744F475-3D2B-4BC6-856A-A5D302AA8681@hostpoint.ch>
References:  <6B57591F-9FA2-45EB-825F-1DB025C0635D@hostpoint.ch> <201007091603.31843.jhb@freebsd.org> <08562D52-02AA-46CF-BFCD-00D0A3C4DC34@hostpoint.ch> <FFB367B2-232D-460D-82B8-C3F03F1B53BE@hostpoint.ch> <9DCFE2F6-D7CB-49CB-8EBC-06C1E5EBB727@hostpoint.ch> <F744F475-3D2B-4BC6-856A-A5D302AA8681@hostpoint.ch>

On Sat, 17 Jul 2010 14:35, Markus Gebert wrote:
In Message-Id: <F744F475-3D2B-4BC6-856A-A5D302AA8681@hostpoint.ch>

>
> On 13.07.2010, at 16:02, Markus Gebert wrote:
>
>> Unfortunately, I have not been able to get anything useful out of the svn 
>> commit logs, which could explain this. Maybe someone else has an idea 
>> what could have changed between 7 and 8 to break it, and again between 
>> 8 and CURRENT to magically fix it again.
>
> I tracked this down further. I couldn't easily downgrade my 8.1 
> installation to see when the problem was introduced because the zpool 
> version used is 14. So I tried to figure out when the problem was 
> solved in CURRENT.
>
> I started with the first possible revision that can boot off my v14 pool 
> (r201143, Dec 28, zfs v14 commit). With this revision, I was able to 
> trigger the MCE.
>
> Then I took some later revision (r206010, Apr 1, chosen randomly), and 
> I couldn't reproduce the problem. I started narrowing the revisions down 
> until I found that, while on r202386 I'm still able to trigger the 
> MCE, r202387 seems to solve the problem on CURRENT:
>
> http://svn.freebsd.org/viewvc/base?view=revision&revision=202387
>
> Since John Baldwin mentioned this problem could be timing related, it 
> seems reasonable that a clock-related change could fix it. But this 
> commit seems to have been MFC'd to 8-STABLE and 8.1 (at least as far as 
> I can tell) along with some other changes to amd64-specific code. I 
> thought that maybe these other changes that have been MFC'd could have 
> reintroduced the problem later on, but so far I could not reproduce the 
> problem with newer CURRENT revisions. So, I actually nailed this one 
> down to a single commit on CURRENT, but still cannot tell what the 
> actual difference is compared to 8-STABLE/8.1.
>
> Any ideas how to proceed?
>
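
The narrowing-down described above is a binary search over revisions. As a 
minimal sketch in Python: triggers_mce() is a hypothetical stand-in for 
"build and boot that revision, then try to provoke the MCE", and the search 
assumes the behaviour flips exactly once between the two end points. The 
revision numbers are the ones quoted above.

```python
def first_fixed_revision(bad, good, triggers_mce):
    """Return the lowest revision in (bad, good] where the MCE no longer triggers.

    Assumes triggers_mce(bad) is True, triggers_mce(good) is False, and the
    result changes exactly once somewhere in between.
    """
    while good - bad > 1:
        mid = (bad + good) // 2
        if triggers_mce(mid):
            bad = mid    # still broken: the fix landed after mid
        else:
            good = mid   # already fixed: the fix landed at or before mid
    return good


# Hypothetical stand-in for the real test, using the result reported above:
# r202386 still triggers the MCE, r202387 does not.
def simulated_triggers_mce(rev):
    return rev < 202387


print(first_fixed_revision(201143, 206010, simulated_triggers_mce))  # prints 202387
```

With r201143 as the known-bad and r206010 as the known-good end, this needs 
at most 13 build-and-boot cycles.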

Adding to this, I remembered some specific commits that caught my attention 
when they went in. They were all to mca.c: running locate mca on my machine 
gave me the file paths, and svn log gave me the commit logs.

When you mentioned April and I saw the log, it rang a bell.

These may be of interest to you:

------------------------------------------------------------------------
r210079 | jhb | 2010-07-14 17:10:14 -0400 (Wed, 14 Jul 2010) | 13 lines

MFC 208507,208556,208621:

Add support for corrected machine check interrupts.  CMCI is a new local 
APIC interrupt that fires when a threshold of corrected machine check 
events is reached.  CMCI also includes a count of events when reporting 
corrected errors in the bank's status register.  Note that individual 
banks may or may not support CMCI.  If they do, each bank includes its own 
threshold register that determines when the interrupt fires.  Currently 
the code uses a very simple strategy where it doubles the threshold on 
each interrupt until it succeeds in throttling the interrupt to occur only 
once a minute (this interval can be tuned via sysctl).  The threshold is 
also adjusted on each hourly poll which will lower the threshold once 
events stop occurring.

------------------------------------------------------------------------
r206183 | alc | 2010-04-05 12:11:42 -0400 (Mon, 05 Apr 2010) | 6 lines

MFC r204907, r204913, r205402, r205573, r205573
   Implement AMD's recommended workaround for Erratum 383 on Family 10h
   processors.

   Enable machine check exceptions by default.

------------------------------------------------------------------------
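
To illustrate the throttling strategy described in the r210079 log above, 
here is a rough Python model. It is not the code from mca.c: the class and 
method names are made up, the upper bound on the threshold is an assumed 
parameter, and halving the threshold on a quiet poll is an assumption, since 
the log only says that the threshold is lowered once events stop occurring.

```python
class CmciThrottle:
    """Toy model of a per-bank CMCI threshold (not the mca.c code)."""

    def __init__(self, min_interval=60.0, max_threshold=0x7FFF):
        self.min_interval = min_interval    # seconds; "once a minute" in the log
        self.max_threshold = max_threshold  # assumed upper bound for the threshold
        self.threshold = 1                  # corrected events needed to fire
        self.last_interrupt = None          # time of the previous interrupt

    def on_interrupt(self, now):
        """Double the threshold if this interrupt came too soon after the last."""
        if (self.last_interrupt is not None
                and now - self.last_interrupt < self.min_interval):
            self.threshold = min(self.threshold * 2, self.max_threshold)
        self.last_interrupt = now
        return self.threshold

    def on_poll(self, events_since_last_poll):
        """Periodic poll: back off once events stop (halving is an assumption)."""
        if events_since_last_poll == 0 and self.threshold > 1:
            self.threshold //= 2
        return self.threshold


bank = CmciThrottle()
for t in (0.0, 10.0, 20.0, 100.0):   # interrupt times in seconds
    bank.on_interrupt(t)
print(bank.threshold)  # prints 4
```

The two interrupts that arrive within a minute of their predecessor each 
double the threshold; the one that arrives 80 seconds later leaves it alone.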

And a list of the mca.c files within the stable/8 src tree:
/usr/src/sbin/mca/mca.c
/usr/src/sys/amd64/amd64/mca.c
/usr/src/sys/dev/aha/aha_mca.c
/usr/src/sys/dev/buslogic/bt_mca.c
/usr/src/sys/dev/ep/if_ep_mca.c
/usr/src/sys/i386/i386/mca.c
/usr/src/sys/ia64/ia64/mca.c


Regards & Good luck,

-- 

  jhell



