From: Dan Langille
To: Jeremy Chadwick
Cc: freebsd-stable@freebsd.org
Date: Sat, 20 Aug 2011 13:34:41 -0400
Subject: Re: bad sector in gmirror HDD
Message-Id: <4774BC00-F32B-4BF4-A955-3728F885CAA1@langille.org>
In-Reply-To: <20110820032438.GA21925@icarus.home.lan>

On Aug 19, 2011, at 11:24 PM, Jeremy Chadwick wrote:

> On Fri, Aug 19, 2011 at
> 09:39:17PM -0400, Dan Langille wrote:
>>
>> On Aug 19, 2011, at 7:21 PM, Jeremy Chadwick wrote:
>>
>>> On Fri, Aug 19, 2011 at 04:50:01PM -0400, Dan Langille wrote:
>>>> System in question: FreeBSD 8.2-STABLE #3: Thu Mar 3 04:52:04 GMT 2011
>>>>
>>>> After a recent power failure, I'm seeing this in my logs:
>>>>
>>>> Aug 19 20:36:34 bast smartd[1575]: Device: /dev/ad2, 2 Currently unreadable (pending) sectors
>>>
>>> I doubt this is related to a power failure.
>>>
>>>> Searching on that error message, I was led to believe that identifying the bad sector and
>>>> running dd to read it would cause the HDD to reallocate that bad block.
>>>>
>>>> http://smartmontools.sourceforge.net/badblockhowto.html
>>>
>>> This is incorrect (meaning you've misunderstood what's written there).
>>>
>>> Unreadable LBAs can be a result of the LBA being actually bad (as in
>>> uncorrectable), or the LBA being marked "suspect". In either case the
>>> LBA will return an I/O error when read.
>>>
>>> If the LBAs are marked "suspect", the drive will perform re-analysis of
>>> the LBA (to determine if the LBA can be read and the data re-mapped, or
>>> if it cannot then the LBA is marked uncorrectable) when you **write** to
>>> the LBA.
>>>
>>> The above smartd output doesn't tell me much. Providing actual SMART
>>> attribute data (smartctl -a) for the drive would help. The brand of the
>>> drive, the firmware version, and the model all matter -- every drive
>>> behaves a little differently.
>>
>> Information such as this? http://beta.freebsddiary.org/smart-fixing-bad-sector.php
>
> Yes, perfect. Thank you. First thing first: upgrade smartmontools to
> 5.41. Your attributes will be the same after you do this (the drive is
> already in smartmontools' internal drive DB), but I often have to remind
> people that they really need to keep smartmontools updated as often as
> possible.
> The changes between versions are vast; this is especially
> important for people with SSDs (I'm responsible for submitting some
> recent improvements for Intel 320 and 510 SSDs).

Done.

> Anyway, the drive (albeit an old PATA Maxtor) appears to have three
> anomalies:
>
> 1) One confirmed reallocated LBA (SMART attribute 5)
>
> 2) One "suspect" LBA (SMART attribute 197)
>
> 3) A very high temperature of 51C (SMART attribute 194). If this drive
> is in an enclosure or in a system with no fans this would be
> understandable, otherwise this is a bit high. My home workstation which
> has only one case fan has a drive with more platters than your Maxtor,
> and it idles at ~38C. Possibly this drive has been undergoing constant
> I/O recently (which does greatly increase drive temperature)? Not sure.
> I'm not going to focus too much on this one.

This is an older system. I suspect insufficient ventilation. I'll look at
getting a new case fan, if not some HDD fans.

> The SMART error log also indicates an LBA failure at the 26000 hour mark
> (which is 16 hours prior to when you did smartctl -a /dev/ad2). Whether
> that LBA is the remapped one or the suspect one is unknown. The LBA was
> 5566440.
>
> The SMART tests you did didn't really amount to anything; no surprise.
> short and long tests usually do not test the surface of the disk. There
> are some drives which do it on a long test, but as I said before,
> everything varies from drive to drive.
>
> Furthermore, on this model of drive, you cannot do a surface scans via
> SMART. Bummer. That's indicated in the "Offline data collection
> capabilities" section at the top, where it reads:
>
> No Selective Self-test supported.
>
> So you'll have to use the dd method. This takes longer than if surface
> scanning was supported by the drive, but is acceptable. I'll get to how
> to go about that in a moment.

FWIW, I've done a dd read of the entire suspect disk already. Just two errors.
From the URL mentioned above:

[root@bast:~] # dd of=/dev/null if=/dev/ad2 bs=1m conv=noerror
dd: /dev/ad2: Input/output error
2717+0 records in
2717+0 records out
2848980992 bytes transferred in 127.128503 secs (22410246 bytes/sec)
dd: /dev/ad2: Input/output error
38170+1 records in
38170+1 records out
40025063424 bytes transferred in 1544.671423 secs (25911701 bytes/sec)
[root@bast:~] #

That seems to indicate two problems. Are those the values I should be using
with dd?

I did some more precise testing:

# time dd of=/dev/null if=/dev/ad2 bs=512 iseek=5566440
dd: /dev/ad2: Input/output error
9+0 records in
9+0 records out
4608 bytes transferred in 5.368668 secs (858 bytes/sec)

real    0m5.429s
user    0m0.000s
sys     0m0.010s

NOTE: that's 9 blocks later than the LBA mentioned in the smartctl output.

The above generated this in /var/log/messages:

Aug 20 17:29:25 bast kernel: ad2: FAILURE - READ_DMA status=51 error=40 LBA=5566449

> [stuff snipped]

> That said:
>
> http://jdc.parodius.com/freebsd/bad_block_scan
>
> If you run this on your ad2 drive, I'm hoping what you'll find are two
> LBAs which can't be read -- one will be the remapped LBA and one will be
> the "suspect" LBA. If you only get one LBA error then that's fine too,
> and will be the "suspect" LBA.

> Once you have the LBA(s), you can submit writes to them to get the drive
> to re-analyse them (assuming they're "suspect"):
>
> dd if=/dev/zero of=/dev/XXX bs=512 count=1 seek=NNNNN
>
> Where XXX is the device and NNNNN is the LBA number.
>
> If this works properly, the dd command should sit there for a little bit
> (as the drive does its re-analysis magic) and then should complete.

ad2 is part of a gmirror with ad0. Does this change things?

I haven't tried the dd yet.

>
> You'll want to check SMART stats after that; you should see
> Current_Pending_Sector drop to 0. If Offline_Uncorrectable incremented
> then the LBA could not be re-read/remapped.
It did increment:

197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always       -       2  [was 1]

> If Reallocated_Sector_Ct
> incremented then you now have a total of 2 LBAs which are remapped.

It did increment:

$ diff smarctl.1 smarctl.3 | grep Reallocated_Sector_Ct
<   5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       1
>   5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       2

Full output of smartctl has been appended to http://beta.freebsddiary.org/smart-fixing-bad-sector.php

> In
> the case of remapping, you get to deal with the UFS/FFS thing above.
> To get the stats to update in this situation you *might* (but probably
> not) have to run "smartctl -t offline /dev/XXX".

I didn't try that...

>
> You might also be wondering "that dd command writes 512 bytes of zero to
> that LBA; what about the old data that was there, in the case that the
> drive remaps the LBA?" This is a great question, and one I've never
> actually taken the time to answer because at this present time I have
> absolutely *no* bad disks in my possession. I'm under the impression
> that the write does in fact write zeros if the LBA is remapped, but that
> might not be true at all. I've been waiting to test this for quite some
> time and document it/write about it.
>
> I still suggest you replace the drive, although given its age I doubt
> you'll be able to find a suitable replacement. I tend to keep disks
> like this around for testing/experimental purposes and not for actual
> use.

I have several unused 80GB HDD I can place into this system. I think that's
what I'll wind up doing. But I'd like to follow this process through and get
it documented for future reference.

-- 
Dan Langille - http://langille.org
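
For future reference, the dd figures quoted in this message can be cross-checked with a little arithmetic. What follows is only a minimal sketch in Python, assuming 512-byte sectors (the thread never states the sector size, although the bs=512 read lining up with the kernel's LBA suggests it). The function name is made up for illustration and is not part of any tool mentioned here.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def lba_range_for_failed_record(records_ok, block_size, sector_size=SECTOR_SIZE):
    """Return (first_lba, last_lba) covered by the dd record that failed.

    records_ok is the "N+0 records in" count dd printed when it hit the
    error, i.e. the number of full records read before the failing one.
    """
    if block_size % sector_size != 0:
        raise ValueError("block size must be a multiple of the sector size")
    sectors_per_record = block_size // sector_size
    first_lba = records_ok * sectors_per_record
    last_lba = first_lba + sectors_per_record - 1
    return first_lba, last_lba


# Whole-disk pass: bs=1m, first error after "2717+0 records in".
first_lba, last_lba = lba_range_for_failed_record(2717, 1024 * 1024)

# Narrow pass: bs=512 iseek=5566440, error after "9+0 records in".
narrow_lba = 5566440 + 9

print(first_lba, last_lba, narrow_lba, first_lba <= narrow_lba <= last_lba)
```

Under that assumption, the record that failed in the bs=1m pass covers LBAs 5564416 through 5566463, and the narrow pass lands on 5566449, which is inside that range and matches the LBA the kernel logged in /var/log/messages. So the first error from the whole-disk read and the READ_DMA failure appear to be the same sector. The second error from the whole-disk read is not covered by this check.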