From: Jeremy Chadwick
To: Adam Vande More
Cc: Matthew Lear, freebsd-stable@freebsd.org
Date: Fri, 25 Jun 2010 00:16:45 -0700
Subject: Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
Message-ID: <20100625071644.GA75910@icarus.home.lan>
User-Agent: Mutt/1.5.20 (2009-06-14)
List-Id: Production branch of FreeBSD source code

On Thu, Jun 24, 2010 at 05:22:41PM -0500, Adam Vande More wrote:
> Haven't followed the entire thread, but wanted to point out something
> important to remember.  SMART is not a reliable indicator of failure.
> It's certainly better than listening to it but it picks up less than
> 1/2 of drive failures.  Google released a study of their disks in data
> centers a few years ago that was fairly in depth look into drive
> failure rate.  You might find it interesting.

Anyone who relies on "overall SMART health" to determine the status of a
drive will be disappointed when they see what thresholds vendors are
choosing for most attributes.  But that's as far as I'll go when it
comes to agreeing with the "SMART is not a reliable indicator of X"
argument.

Because of the thresholds vendors choose, it's best to use SMART as an
indicator of overall drive health *at that moment* rather than as a
predictive tool (though I have seen it work predictively, especially on
SCSI disks; I'd be more than happy to provide some examples if need
be).

Google's study was half-assed in some regards (I remember reading it and
being left with more questions than answers), and I'm also aware of
folks like Scott Moulton who insist SMART is an unreliable method of
analysis.  I like Scott's work in general, but I disagree with his view
of SMART.  You can see some of his presentations on YouTube; look up
"Shmoocon 2010 DIY Hard Drive Diagnostics".

We've already done the SMART analysis for this issue -- the disk isn't
showing any signs of problems from a SMART perspective.  Meaning,
there's no indication of bad or reallocated sectors, or any other signs
of internal drive failure.
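
For anyone wanting to repeat the "no bad or reallocated sectors" check,
here is a minimal Python sketch.  It assumes smartmontools is installed
and that `smartctl -A` prints its usual ten-column ATA attribute table.
The function names, the SAMPLE text, and the device path /dev/ad4 are
hypothetical illustrations, not values taken from this thread.

```python
import subprocess

# ATA attribute IDs commonly associated with bad or remapped sectors.
SECTOR_ATTRS = {
    5: "Reallocated_Sector_Ct",
    196: "Reallocated_Event_Count",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

# Made-up sample in the layout `smartctl -A` normally prints (illustration only).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       52
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       50
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""


def parse_raw_values(smartctl_output):
    """Return {attribute_id: raw_value} parsed from `smartctl -A` text."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            values[int(fields[0])] = int(fields[9])
        except ValueError:
            continue  # raw value is not a plain integer; skip the row
    return values


def sector_problems(values):
    """Return the sector-related attributes whose raw value is non-zero."""
    return {
        name: values[attr_id]
        for attr_id, name in SECTOR_ATTRS.items()
        if values.get(attr_id, 0) > 0
    }


def read_attributes(device):
    """Run `smartctl -A` on a device; usually needs root privileges."""
    # smartctl's exit status is a bitmask, so a non-zero code is not
    # treated as a hard failure here.
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Swap SAMPLE for read_attributes("/dev/ad4") to query a real disk
    # (the device path is a hypothetical example).
    print(sector_problems(parse_raw_values(SAMPLE)))
```

An empty result matches the "no signs of problems from a SMART
perspective" case described above.  The same parsed dictionary also
exposes Attributes 4 and 12, so comparing their raw values between runs
is one way to notice unexpected power cycles.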

There are a lot of things SMART can't catch -- drive PCB flakiness
(appears as literally anything, take your pick), drive cache going bad
(usually shows itself as abysmal performance), or power-related problems
(though SMART can help catch these by watching Attributes 4 and 12,
assuming the drive is losing power entirely; if there's dirty power or
excessive ripple, or internal drive power circuitry problems, these can
appear as practically anything).

All in all, replacing a drive is a completely reasonable action when
there's evidence confirming the need for its replacement.  I don't like
replacing hardware when there's no indication that replacing it will
fix the problem; I'd rather understand the problem.

Matthew, if you're able to take the system down for 2-3 hours, I would
recommend downloading Western Digital's Data Lifeguard Diagnostics
software (for DOS; you'll need a CD burner to burn the ISO) and running
that on your drive.  If the drive fails a Long/Extended test, then yes,
replace the disk.  Said utility tests a lot more than just SMART.

If it passes the test, then we're back at square one, and you can try
replacing the disk if you'd like (then boot from the 2nd disk in the
RAID-1 array).  My concern is that replacing it isn't going to fix
anything (meaning you might have a SATA port that's going bad, or the
controller itself is broken).

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |