Date:      Thu, 11 Aug 2011 11:43:25 +0200
From:      Attilio Rao <attilio@freebsd.org>
To:        Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc:        freebsd-stable@freebsd.org, Steven Hartland <killing@multiplay.co.uk>, Andriy Gapon <avg@freebsd.org>
Subject:   Re: debugging frequent kernel panics on 8.2-RELEASE
Message-ID:  <CAJ-FndBfiHMemNfmXtWkzzZTkZ-Cw9oYd8D%2BCQtjSAOMf=0a8w@mail.gmail.com>
In-Reply-To: <20110811092858.GA94514@icarus.home.lan>
References:  <47F0D04ADF034695BC8B0AC166553371@multiplay.co.uk> <A71C3ACF01EC4D36871E49805C1A5321@multiplay.co.uk> <4E4380C0.7070908@FreeBSD.org> <CAJ-FndAq2ASHzg_%2B9S__x=vTAgzHowMrv1DFSbXwroX27PF36A@mail.gmail.com> <44DD20E1CFA949E8A1B15B3847769DCB@multiplay.co.uk> <20110811092858.GA94514@icarus.home.lan>

2011/8/11 Jeremy Chadwick <freebsd@jdc.parodius.com>:
> On Thu, Aug 11, 2011 at 09:59:36AM +0100, Steven Hartland wrote:
>> That's not the issue as its happening across board over 130 machines :(
>
> Agreed, bad hardware sounds unlikely here.  I could believe some strange
> incompatibility (e.g. BIOS quirk or the like[1]) that might cause problems
> en masse across many servers, but hardware issues are unlikely in this
> situation.
>
> [1]: I mention this because we had something similar happen at my
> workplace.  For months we used a specific model of system from our
> vendor which worked reliably, zero issues.  Then we got a new shipment
> of boxes (same model as prior) which started acting very odd (often AHCI
> timeout issues or MCEs which when decoded would usually turn out to be
> nonsensical).  It took weeks to determine the cause given how slow the
> vendor was to respond: root cause turned out to be that the vendor
> decided, on a whim, to start shipping a newer BIOS version which wasn't
> "as compatible" with Solaris as previous BIOSes.  Downgrading all the
> systems to the older BIOS fixed the problem.

That falls in the "hw problem" category for me.

Anyway, we would really need much more information in order to take
proactive action.

Would it be possible to get access to one of the panicking machines? Is
it always the same panic, or does it vary (e.g. a page fault one time, a
fatal double fault another, a fatal trap another)?

Whatever information you can provide may be valuable here.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


