From owner-freebsd-stable@FreeBSD.ORG Thu Aug 11 10:38:44 2011
From: "Steven Hartland"
To: "Jeremy Chadwick"
Cc: Attilio Rao, freebsd-stable@freebsd.org, Andriy Gapon
Date: Thu, 11 Aug 2011 11:38:17 +0100
Subject: Re: debugging frequent kernel panics on 8.2-RELEASE
References: <47F0D04ADF034695BC8B0AC166553371@multiplay.co.uk> <4E4380C0.7070908@FreeBSD.org> <44DD20E1CFA949E8A1B15B3847769DCB@multiplay.co.uk> <20110811092858.GA94514@icarus.home.lan>
List-Id: Production branch of FreeBSD source code

----- Original Message -----
From: "Jeremy Chadwick"

> On Thu, Aug 11, 2011 at 09:59:36AM +0100, Steven Hartland wrote:
>> That's not the issue as its happening across board over 130 machines :(
>
> Agreed, bad hardware sounds unlikely here. I could believe some strange
> incompatibility (e.g. BIOS quirk or the like[1]) that might cause problems
> en masse across many servers, but hardware issues are unlikely in this
> situation.

It's affecting a range of hardware, from Supermicro blades and 2Us to Dell
blades, so it seems more like a software bug.

> [1]: I mention this because we had something similar happen at my
> workplace. For months we used a specific model of system from our
> vendor which worked reliably, zero issues. Then we got a new shipment
> of boxes (same model as prior) which started acting very odd (often AHCI
> timeout issues or MCEs which when decoded would usually turn out to be
> nonsensical). It took weeks to determine the cause given how slow the
> vendor was to respond: root cause turned out to be that the vendor
> decided, on a whim, to start shipping a newer BIOS version which wasn't
> "as compatible" with Solaris as previous BIOSes. Downgrading all the
> systems to the older BIOS fixed the problem.

The machines had been working fine for months; the panics only started
last week.

We've been looking at the changes made last week to see if we can identify
the cause. The only change made in that time frame was the rollout of the
change to kern.ipc.nmbclusters to work around the TCP reassembly issue.
In this case we raised the value from the default of 25600 to 262144.

We've used this value for a long time on our core webservers, which are
also running 8.2, so I'd be very surprised if this was the cause. That said,
we're looking to roll out kern.ipc.nmbclusters=51200 to try and rule it out.

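As a rough sanity check on the three kern.ipc.nmbclusters values mentioned above, the sketch below estimates the upper bound on memory the cluster pool could hold at each setting. It assumes standard 2 KiB mbuf clusters (MCLBYTES = 2048) and ignores mbuf headers and jumbo clusters; the function name and labels are illustrative only, and the limit is a ceiling rather than memory that is allocated up front.

```python
# Upper bound on memory the mbuf cluster pool could hold for each
# kern.ipc.nmbclusters value discussed in this thread.
# Assumption: standard 2 KiB clusters (MCLBYTES = 2048); mbuf headers
# and jumbo clusters are not counted.
MCLBYTES = 2048


def cluster_pool_mib(nmbclusters: int, cluster_bytes: int = MCLBYTES) -> float:
    """Return the maximum cluster pool size in MiB for a given limit."""
    return nmbclusters * cluster_bytes / (1024 * 1024)


# Labels are illustrative; the numbers come from the message itself.
SETTINGS = {
    "default": 25600,
    "proposed": 51200,
    "current": 262144,
}

for label, value in SETTINGS.items():
    print(f"{label:>8}: nmbclusters={value:<7} up to {cluster_pool_mib(value):.0f} MiB")
```

Under these assumptions the three settings cap the pool at roughly 50, 100 and 512 MiB respectively, which gives a sense of scale when weighing nmbclusters against the other suspects.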
Prior to this, 1-2 weeks earlier, we rolled out a significant update
which included:-
1. Adding IPv6 to the kernel (although no machines are configured with it yet)
2. Adding the ipmi module to the kernel, although not loaded.
3. Rebuilding ALL ports to the latest version
4. Restructuring the server layout to be one jail per java server
   (~60 servers per machine)
5. Restructuring the filesystem to be a base nullfs mount + devfs + zfs
   volume per server

This update had been in testing for 2 weeks prior to that, so in total it
ran for 3-4 weeks before any panics were seen, but that doesn't mean the
issue didn't exist at that time.

Currently we're seeing 1-4 panics a day across all machines.

So currently the most likely suspects are:-
1. kern.ipc.nmbclusters
2. nullfs
3. ipv6
4. a package update, most likely being openjdk6-b23
5. jail

> In Steve's case this is unlikely to be the situation, but I thought I'd
> share the story anyway. "SKU ABCXYZ-1" from August 2009 is not
> necessarily the same thing as "SKU ABCXYZ-1" from May 2010. ;-) This
> is also why I prefer to buy/build my own systems, since I cannot trust
> vendors to not mess about with settings w/out changing SKUs, P/Ns, or
> revision numbers.

This caused us much scratching of heads when looking for that TCP issue
the other day, as it seemed to be affecting the newer machines more than
the old. We even found two machines with the same "version" of the BIOS
that were clearly different builds, as the dates and available options
were different. Quite frustrating!

    Regards
    Steve