Date:      Thu, 11 Aug 2011 11:38:17 +0100
From:      "Steven Hartland" <killing@multiplay.co.uk>
To:        "Jeremy Chadwick" <freebsd@jdc.parodius.com>
Cc:        Attilio Rao <attilio@freebsd.org>, freebsd-stable@freebsd.org, Andriy Gapon <avg@freebsd.org>
Subject:   Re: debugging frequent kernel panics on 8.2-RELEASE
Message-ID:  <D9D9B43EF8734A1893F58CF5EC03C24C@multiplay.co.uk>
References:  <47F0D04ADF034695BC8B0AC166553371@multiplay.co.uk> <A71C3ACF01EC4D36871E49805C1A5321@multiplay.co.uk> <4E4380C0.7070908@FreeBSD.org> <CAJ-FndAq2ASHzg_%2B9S__x=vTAgzHowMrv1DFSbXwroX27PF36A@mail.gmail.com> <44DD20E1CFA949E8A1B15B3847769DCB@multiplay.co.uk> <20110811092858.GA94514@icarus.home.lan>

----- Original Message ----- 
From: "Jeremy Chadwick" <freebsd@jdc.parodius.com>


> On Thu, Aug 11, 2011 at 09:59:36AM +0100, Steven Hartland wrote:
>> That's not the issue as its happening across board over 130 machines :(
> 
> Agreed, bad hardware sounds unlikely here.  I could believe some strange
> incompatibility (e.g. BIOS quirk or the like[1]) that might cause problems
> en masse across many servers, but hardware issues are unlikely in this
> situation.

It's affecting a range of hardware, from Supermicro blades and 2Us to
Dell blades, so it looks more like a software bug.

> [1]: I mention this because we had something similar happen at my
> workplace.  For months we used a specific model of system from our
> vendor which worked reliably, zero issues.  Then we got a new shipment
> of boxes (same model as prior) which started acting very odd (often AHCI
> timeout issues or MCEs which when decoded would usually turn out to be
> nonsensical).  It took weeks to determine the cause given how slow the
> vendor was to respond: root cause turned out to be that the vendor
> decided, on a whim, to start shipping a newer BIOS version which wasn't
> "as compatible" with Solaris as previous BIOSes.  Downgrading all the
> systems to the older BIOS fixed the problem.

The machines have been working fine for months; the panics only started
last week.

We've been looking at the changes made last week to see if we can identify
the cause. The only change made in that time frame was the rollout
of the change to kern.ipc.nmbclusters to work around the TCP reassembly
issue.

In this case we raised the value from the default of 25600 to 262144.

We've used this value for a long time on our core webservers, which are
also running 8.2, so I'd be very surprised if this was the cause. That said,
we're looking to roll out kern.ipc.nmbclusters=51200 to try and rule it
out.
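
For scale, here is a rough sketch of what each of those values allows in
mbuf cluster memory. It assumes the standard 2 KiB cluster size
(MCLBYTES = 2048), and the function name is made up for illustration. It
only computes the ceiling implied by the limit, not what a machine
actually uses:

```python
# Rough ceiling on mbuf cluster memory for each kern.ipc.nmbclusters
# value mentioned above.
# Assumption: standard 2 KiB clusters (MCLBYTES = 2048).
MCLBYTES = 2048


def cluster_ceiling_mib(nmbclusters: int) -> float:
    """Return the cluster pool ceiling in MiB for a given limit."""
    return nmbclusters * MCLBYTES / (1024 * 1024)


settings = {
    "default": 25600,
    "rolled out last week": 262144,
    "planned test value": 51200,
}

for label, value in settings.items():
    print(f"{label:>22}: {value:>7} clusters -> {cluster_ceiling_mib(value):6.1f} MiB")
```

So the change raised the ceiling roughly tenfold, and the planned test
value would double the default.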

Prior to this, 1-2 weeks earlier, we rolled out a significant update which
included:-
1. Adding IPv6 to the kernel (although no machines are configured with it yet)
2. Adding the ipmi module to the kernel, although it is not loaded
3. Rebuilding ALL ports to the latest version
4. Restructuring the server layout to be one jail per java server (~60
servers per machine)
5. Restructuring the filesystem to be a base nullfs mount + devfs +
zfs volume per server

This update had been in testing for 2 weeks prior to that, so in total 3-4
weeks passed before any panics were seen, but that doesn't mean the issue
didn't exist at that time.

Currently we're seeing 1-4 panics a day across all machines.
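
To put that rate in per-machine terms, a quick back-of-the-envelope
sketch. It assumes a fleet of about 130 machines (the figure quoted
earlier) and panics spread evenly across them, which may well not hold.
The names are illustrative only:

```python
# Average interval between panics on any single machine, given a
# fleet-wide panic rate.
# Assumptions: ~130 machines, panics spread evenly across the fleet.
FLEET_SIZE = 130


def days_between_panics_per_machine(panics_per_day: float,
                                    fleet_size: int = FLEET_SIZE) -> float:
    """Return the mean number of days between panics on one machine."""
    return fleet_size / panics_per_day


for rate in (1, 4):
    days = days_between_panics_per_machine(rate)
    print(f"{rate} panic(s)/day fleet-wide -> one panic per machine every {days:.1f} days")
```

At that rate a single machine could run for a month or more without
panicking, so a short test on a handful of machines could easily miss it.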

So currently the most likely suspects are:-
1. kern.ipc.nmbclusters
2. nullfs
3. ipv6
4. a package update, most likely being openjdk6-b23
5. jail
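
One way to narrow the list down is to split the fleet into groups, revert
one suspect per group and keep a control group unchanged. A minimal
sketch of the assignment follows. The hostnames are placeholders, and it
assumes each change can actually be reverted in isolation, which may not
be true for all of them:

```python
# Round-robin assignment of hosts to test groups: one group per suspect
# plus an unchanged control group.
# Assumptions: hostnames are placeholders; each suspect can be reverted
# on its own.
SUSPECTS = ["kern.ipc.nmbclusters", "nullfs", "ipv6", "openjdk6-b23", "jail"]


def assign_groups(hosts, suspects=SUSPECTS):
    """Spread hosts evenly over one group per suspect plus 'control'."""
    names = list(suspects) + ["control"]
    groups = {name: [] for name in names}
    for index, host in enumerate(hosts):
        groups[names[index % len(names)]].append(host)
    return groups


hosts = [f"host{i:03d}" for i in range(1, 131)]  # hypothetical names
groups = assign_groups(hosts)

for name, members in groups.items():
    print(f"{name:>22}: {len(members)} hosts, e.g. {members[0]}")
```

If the panics stop in one group but continue in the control group and
the rest, that points at the change reverted there.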

> In Steve's case this is unlikely to be the situation, but I thought I'd
> share the story anyway.  "SKU ABCXYZ-1" from August 2009 is not
> necessarily the same thing as "SKU ABCXYZ-1" from May 2010.  ;-)  This
> is also why I prefer to buy/build my own systems, since I cannot trust
> vendors to not mess about with settings w/out changing SKUs, P/Ns, or
> revision numbers.

This caused us much head scratching when looking into that TCP issue
the other day, as it seemed to be affecting the newer machines more than
the old. We even found two machines with the same "version" of the BIOS
that were clearly different builds, as the date and available options
were different. Quite frustrating!

    Regards
    Steve
