From: Jeremy Chadwick
To: Dan Langille
Cc: freebsd-stable@freebsd.org
Date: Sat, 20 Aug 2011 13:19:13 -0700
Subject: Re: bad sector in gmirror HDD
Message-ID: <20110820201913.GA39827@icarus.home.lan>
In-Reply-To: <20110820195702.GA39109@icarus.home.lan>

A follow-up given that I just viewed the SMART attribute data at the very bottom of this page as of this writing (Sat Aug 20
13:00:09 PDT 2011):

http://beta.freebsddiary.org/smart-fixing-bad-sector.php

And I see this:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       2
  9 Power_On_Hours          0x0012   059   059   001    Old_age   Always       -       27440
196 Reallocated_Event_Count 0x0010   099   099   020    Old_age   Offline      -       1
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   Offline      -       0

These attributes USUALLY mean:

1) Reallocated_Sector_Ct == There are 2 remapped LBAs.
2) Reallocated_Event_Count == There is 1 remapping event which has been noticed (either failure or success).
3) Current_Pending_Sector == There are 2 LBAs which are suspect.

Now, given my previous statement about this particular model of drive, Maxtor may have a firmware quirk or other oddities that keep Current_Pending_Sector from dropping to 0 or Reallocated_Event_Count from matching reality. I simply don't know. But keep reading.
And remember, this is what we started with:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       1
  9 Power_On_Hours          0x0012   059   059   001    Old_age   Always       -       27416
196 Reallocated_Event_Count 0x0010   100   100   020    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   Offline      -       0

Anyway, in the SMART error log, I see 3 entries (2 new ones since the last time I saw the web page):

* Error 3 occurred at disk power-on lifetime: 27422 hours (1142 days + 14 hours)
    40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440

* Error 2 occurred at disk power-on lifetime: 27421 hours (1142 days + 13 hours)
    40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440

* Error 1 occurred at disk power-on lifetime: 27400 hours (1141 days + 16 hours)
    40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440

These are all for the same LBA -- 5566440. "Error 1" was something we already saw on the page the first time. So where did the other two come from? Earlier on the web page I saw these commands being executed:

    sh ./bad_block_scan /dev/ad2 5566400 5566500   <-- will hit bad LBA
    sh ./bad_block_scan /dev/ad2 5566000 5566500   <-- will hit bad LBA
    sh ./bad_block_scan /dev/ad2 5560000 5566000   <-- will not hit bad LBA
    sh ./bad_block_scan /dev/ad2 5560000 5566000   <-- will not hit bad LBA

So there's the explanation for the two newly-added entries in the SMART error log. I'd be very surprised if bad_block_scan did not echo that it had encountered read errors on LBA 5566440. It should have, unless I left the script in some weird state.
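Whether a given scan range covers LBA 5566440 is a simple bounds check. A minimal sketch, assuming bad_block_scan treats its two numeric arguments as an inclusive start and end LBA (the script's actual behavior is not shown in this message); range_hits_lba is a hypothetical helper name, not part of that script:

```sh
#!/bin/sh
# Hypothetical helper: succeeds if LBA $3 lies within [$1, $2].
# Assumes an inclusive range; bad_block_scan itself may differ.
range_hits_lba() {
    [ "$1" -le "$3" ] && [ "$3" -le "$2" ]
}

bad_lba=5566440

# The three distinct ranges scanned on the web page.
while read -r start end; do
    if range_hits_lba "$start" "$end" "$bad_lba"; then
        echo "$start-$end: will hit LBA $bad_lba"
    else
        echo "$start-$end: will not hit LBA $bad_lba"
    fi
done <<EOF
5566400 5566500
5566000 5566500
5560000 5566000
EOF
```

Only the first two ranges include the bad LBA, which lines up with the two new entries in the SMART error log.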
The commands to use to verify the read error on LBA 5566440 would be:

    dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566439
    dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566440
    dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566441

(I tend to check "around" that LBA area as well, just to make sure; that's why there are 3 commands, covering the -1 and +1 LBAs.)

One of these should return an I/O error, unless the LBA has been remapped already, in which case it shouldn't.

Finally, there's this very interesting piece of information in the SMART self-test log (not the selective scan log, but the self-test log; meaning this was the result of "smartctl -t long /dev/ad2" at some point):

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     27416         786767

So it seems this is one of those drives which does do a surface scan on a long test. But that's interesting -- LBA 786767. If that's true, then issuing the same dd commands as above (but with "skip" changed appropriately) should return an I/O error as well. Naturally, check the SMART error log for verification.

So, it's possible that there are actually two bad LBAs on this drive -- LBA 5566440 and LBA 786767. I simply don't know about the latter, but the former is confirmed in the SMART error log.

If either of these LBAs is one of those which Current_Pending_Sector is referring to, then writes to them should be sufficient to induce re-analysis. E.g.:

    dd if=/dev/zero of=/dev/ad2 bs=512 count=1 seek=5566440
    dd if=/dev/zero of=/dev/ad2 bs=512 count=1 seek=786767

The offsets for seek (not skip!!!) should probably be based on what the dd reads done earlier would show.

Unless of course what we're seeing is just a batch of LBAs in a small region that are getting worse the more they're read from (possible). No idea if LBA 5566440 and LBA 786767 are anywhere near one another on the physical media. I don't have a way to determine that (way too complex).

That's about all the light I can shed on this for now.
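The same "check around the LBA" read test can be generated for LBA 786767 from the self-test log. A minimal sketch, assuming 512-byte sectors as in the commands above; print_read_checks is a hypothetical helper name, and it only prints the dd commands rather than running them, so nothing touches the disk until you choose to:

```sh
#!/bin/sh
# Hypothetical helper (not from the original thread): print, but do not
# run, single-sector read checks for LBA-1, LBA, and LBA+1 on a device.
# Assumes 512-byte sectors, matching bs=512 in the dd commands above.
print_read_checks() {
    dev=$1
    lba=$2
    for n in $((lba - 1)) "$lba" $((lba + 1)); do
        echo "dd if=$dev of=/dev/null bs=512 count=1 skip=$n"
    done
}

print_read_checks /dev/ad2 786767
```

Reviewing the printed commands before pasting them also makes it harder to confuse skip (read offset) with seek (write offset).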
-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |