From owner-freebsd-stable@FreeBSD.ORG  Sat Aug 20 19:57:05 2011
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 01657106564A
	for <freebsd-stable@freebsd.org>; Sat, 20 Aug 2011 19:57:05 +0000 (UTC)
	(envelope-from jdc@koitsu.dyndns.org)
Received: from qmta07.emeryville.ca.mail.comcast.net
	(qmta07.emeryville.ca.mail.comcast.net [76.96.30.64])
	by mx1.freebsd.org (Postfix) with ESMTP id DA1778FC16
	for <freebsd-stable@freebsd.org>; Sat, 20 Aug 2011 19:57:04 +0000 (UTC)
Received: from omta18.emeryville.ca.mail.comcast.net ([76.96.30.74])
	by qmta07.emeryville.ca.mail.comcast.net with comcast
	id Njss1h0011bwxycA7jx06k; Sat, 20 Aug 2011 19:57:00 +0000
Received: from koitsu.dyndns.org ([67.180.84.87])
	by omta18.emeryville.ca.mail.comcast.net with comcast
	id NjwX1h0051t3BNj8ejwXUX; Sat, 20 Aug 2011 19:56:31 +0000
Received: by icarus.home.lan (Postfix, from userid 1000)
	id A4ACA102C1A; Sat, 20 Aug 2011 12:57:02 -0700 (PDT)
Date: Sat, 20 Aug 2011 12:57:02 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Dan Langille <dan@langille.org>
Message-ID: <20110820195702.GA39109@icarus.home.lan>
References: <1B4FC0D8-60E6-49DA-BC52-688052C4DA51@langille.org>
	<20110819232125.GA4965@icarus.home.lan>
	<B6B0AD0F-A74C-4F2C-88B0-101443D7831A@langille.org>
	<20110820032438.GA21925@icarus.home.lan>
	<4774BC00-F32B-4BF4-A955-3728F885CAA1@langille.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4774BC00-F32B-4BF4-A955-3728F885CAA1@langille.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-stable@freebsd.org
Subject: Re: bad sector in gmirror HDD
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 20 Aug 2011 19:57:05 -0000

Dan, sorry for the previous mail.  Seems my schedule today has just
unexpectedly changed; I had social events to deal with, but as I found
out a few minutes ago those events are cancelled, which means I have
time today to look at your mail.

On Sat, Aug 20, 2011 at 01:34:41PM -0400, Dan Langille wrote:
> On Aug 19, 2011, at 11:24 PM, Jeremy Chadwick wrote:
> > The SMART error log also indicates an LBA failure at the 26000 hour mark
> > (which is 16 hours prior to when you did smartctl -a /dev/ad2).  Whether
> > that LBA is the remapped one or the suspect one is unknown.  The LBA was
> > 5566440.
> > 
> > The SMART tests you did didn't really amount to anything; no surprise.
> > short and long tests usually do not test the surface of the disk.  There
> > are some drives which do it on a long test, but as I said before,
> > everything varies from drive to drive.
> > 
> > Furthermore, on this model of drive, you cannot do a surface scan via
> > SMART.  Bummer.  That's indicated in the "Offline data collection
> > capabilities" section at the top, where it reads:
> > 
> > 	No Selective Self-test supported.
> > 
> > So you'll have to use the dd method.  This takes longer than if surface
> > scanning was supported by the drive, but is acceptable.  I'll get to how
> > to go about that in a moment.
> 
> FWIW, I've done a dd read of the entire suspect disk already.  Just two errors.

Actually one error -- keep reading.

> From the URL mentioned above:
> 
> [root@bast:~] # dd of=/dev/null if=/dev/ad2 bs=1m conv=noerror
> dd: /dev/ad2: Input/output error
> 2717+0 records in
> 2717+0 records out
> 2848980992 bytes transferred in 127.128503 secs (22410246 bytes/sec)
> dd: /dev/ad2: Input/output error
> 38170+1 records in
> 38170+1 records out
> 40025063424 bytes transferred in 1544.671423 secs (25911701 bytes/sec)
> [root@bast:~] # 
> 
> That seems to indicate two problems.  Are those the values I should be using 
> with dd?

The "values" you refer to are byte offsets, not LBAs.  Furthermore, you
used a block size of 1 megabyte (not sure why people keep doing this).
LBA size on your drive is 512 bytes; asking for 1 megabyte in dd is
going to make the drive try to read() 1MByte, and an I/O error could
happen anywhere within that 1MByte range.  (1024*1024) / 512 == 2048
LBAs make up 1MByte.
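
As a rough sketch of that arithmetic (assuming 512-byte LBAs and the
bs=1m from your dd run; the function name lba_range is just something
made up for illustration), the byte count dd printed at the first error
maps to a window of LBAs like this:

```shell
#!/bin/sh
# Map a dd byte offset to the range of LBAs covered by the dd block
# that failed.  Assumes 512-byte LBAs; lba_range is a hypothetical name.
lba_range() {
    offset=$1        # bytes transferred before the I/O error
    bs=$2            # dd block size in bytes
    sector=512
    first=$((offset / sector))
    last=$((first + bs / sector - 1))
    echo "$first $last"
}

lba_range 2848980992 $((1024 * 1024))    # prints: 5564416 5566463
```

The LBA from the SMART error log (5566440) falls inside that window,
which is about as precise as a 1MByte block size can get.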

Next, remember that conv=noerror has some quirks associated with it
that need to be kept in mind; for example, without conv=sync a block
that fails to read is simply omitted from the output.  The man page
discusses these.

Finally, I believe the last I/O error you see (at byte 40025063424) is
normal given what you told dd to do.  It was trying to use bs=1m, and
your drive has a capacity of 40027029504 bytes.  I'm left to believe
you had a "short read" (less than 1MByte) at the end of the device, so
this is normal.  40027029504 / (1024*1024) == 38172.75, which is not a
whole number, hence the error.
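
The same division, sketched in sh so the leftover is visible (the
capacity figure is the one quoted above; the variable names are only
illustrative):

```shell
#!/bin/sh
# How much of the device is left over once it is divided into 1MByte
# dd blocks.  Capacity is the figure quoted in the text above.
capacity=40027029504          # drive capacity in bytes
bs=$((1024 * 1024))           # dd block size (bs=1m)
full_blocks=$((capacity / bs))
tail_bytes=$((capacity % bs))
echo "full 1MByte blocks: $full_blocks"
echo "bytes left over:    $tail_bytes"
echo "LBAs left over:     $((tail_bytes / 512))"
```

The leftover is 786432 bytes (1536 LBAs), i.e. the ".75" in 38172.75,
so the final read can never be a full 1MByte.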

> I did some more precise testing:
> 
> # time dd of=/dev/null if=/dev/ad2 bs=512 iseek=5566440
> dd: /dev/ad2: Input/output error
> 9+0 records in
> 9+0 records out
> 4608 bytes transferred in 5.368668 secs (858 bytes/sec)
> 
> real	0m5.429s
> user	0m0.000s
> sys	0m0.010s
> 
> NOTE: that's 9 blocks later than mentioned in smarctl
> 
> The above generated this in /var/log/messages:
> 
> Aug 20 17:29:25 bast kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=5566449

Your dd command above is saying "use a block size of 512 bytes, and read
indefinitely from /dev/ad2, starting after a seek of 5566440 blocks
(i.e. at LBA 5566440)".  You then get an I/O error "somewhere" between
where you start and where the device ends.  You're assuming that the
"number of bytes transferred" indicates where the actual error happened,
which in my experience is not always true.

What really needs to happen here is to use count=1 and adjust iseek
manually for each LBA.  Or you could use the script I wrote and let the
computer do it for you.  :-)
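
For illustration only, here is a minimal sketch of the count=1 approach.
It is NOT the bad_block_scan script; scan_lbas is a hypothetical name,
and it assumes 512-byte LBAs and a dd that exits non-zero on a read
error (which is the case when conv=noerror is not used):

```shell
#!/bin/sh
# Read LBAs one at a time and report the ones that fail to read.
# Hypothetical stand-in, not the bad_block_scan script.
scan_lbas() {
    dev=$1; first=$2; last=$3
    lba=$first
    while [ "$lba" -le "$last" ]; do
        # skip= is the portable spelling; FreeBSD dd accepts iseek=
        # as a synonym.  count=1 limits each read to a single LBA.
        if ! dd if="$dev" of=/dev/null bs=512 skip="$lba" count=1 2>/dev/null; then
            echo "read error at LBA $lba"
        fi
        lba=$((lba + 1))
    done
}

# Example (not run here): scan_lbas /dev/ad2 5566440 5566463
```

Reading one LBA per dd invocation is slow, but each failure is then
tied to exactly one LBA rather than to a range.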

I understand what you're getting at, re: "that's 9 blocks later".  But
the OS sometimes caches I/O, or aggregates reads into requests larger
than the physical LBA size, so that may be what's going on here.
However, if you keep reading, you may find the answer is that you have
(I'm still unsure) other LBAs which are now marked suspect.

> > That said:
> > 
> > http://jdc.parodius.com/freebsd/bad_block_scan
> > 
> > If you run this on your ad2 drive, I'm hoping what you'll find are two
> > LBAs which can't be read -- one will be the remapped LBA and one will be
> > the "suspect" LBA.  If you only get one LBA error then that's fine too,
> > and will be the "suspect" LBA.
> 
> > Once you have the LBA(s), you can submit writes to them to get the drive
> > to re-analyse them (assuming they're "suspect"):
> > 
> > dd if=/dev/zero of=/dev/XXX bs=512 count=1 seek=NNNNN
> > 
> > Where XXX is the device and NNNNN is the LBA number.
> > 
> > If this works properly, the dd command should sit there for a little bit
> > (as the drive does its re-analysis magic) and then should complete.
> 
> ad2 is part of a gmirror with ad0.   Does this change things?
> 
> I haven't tried the dd yet.

It does not change things, but I don't know what's going to happen if
you issue write commands to the device directly while the drive is
still attached to the gmirror.

When I encounter a disk that's behaving like this, I immediately remove
it from the pool/mirror so I can work on it.  I do not trust the OS to
do things like not panic/crash/behave weirdly when doing these things.

> > You'll want to check SMART stats after that; you should see
> > Current_Pending_Sector drop to 0.  If Offline_Uncorrectable incremented
> > then the LBA could not be re-read/remapped.
> 
> It did increment:
> 
> 197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always       -       2
> 
> [was 1]

What this means is that you have *another* LBA the drive found and
marked suspect.  This could have happened any time; possibly during the
above dd you did, possibly during normal read operation (assuming the
drive is still handling I/O as part of your mirror).

> >  If Reallocated_Sector_Ct
> > incremented then you now have a total of 2 LBAs which are remapped.
> 
> It did increment:
> 
> $ diff smarctl.1 smarctl.3 | grep Reallocated_Sector_Ct
> <   5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       1
> >   5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       2
>
> Full output of smartctl has been appended to http://beta.freebsddiary.org/smart-fixing-bad-sector.php

But you didn't issue any writes to the drive (quote: "I haven't tried
the dd yet"), so I cannot explain why this attribute would increment.
Unless you *did* try the dd?  I don't know; there's not enough
information here for me to ascertain what may have happened between this
paragraph and a couple paragraphs up.
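
One way to take the guesswork out of this is to record the relevant
attribute values before and after every step.  A minimal sketch,
assuming "smartctl -A" output in the usual ten-column layout with the
raw value in the tenth column (smart_raw is a hypothetical helper
name):

```shell
#!/bin/sh
# Print the raw value (10th column) of one SMART attribute, reading
# "smartctl -A" style output on stdin.  smart_raw is a made-up name.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Example (not run here):
#   smartctl -A /dev/ad2 | smart_raw Current_Pending_Sector
#   smartctl -A /dev/ad2 | smart_raw Reallocated_Sector_Ct
```

Logging those two numbers with a timestamp around each dd would show
which operation, if any, caused an attribute to change.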

To me, this looks like a write to the drive was issued either manually
(with the dd or if the drive is still in use for I/O by gmirror) and
happened to hit an LBA which was previously marked suspect -- and
induced a remap.

Alternatively -- and this is just as plausible as what I just described
-- the drive may have a firmware quirk/bug/behaviour different from what
I'm used to, where Current_Pending_Sector acts as a counter (e.g. it
will never reset to zero).  Maxtor "should" be using
Reallocated_Event_Count for this (since that's what it's for; it
indicates failures OR successes), but as I've said time and time again,
the behaviour varies from drive to drive, model to model, and firmware
to firmware.

Another possibility is the whole "smartctl -t offline" ordeal, which
might update the attribute data, but the attribute's update column reads
Always, not Offline, so I don't think this would be the case (unless
there's a bug in the firmware or mislabeling of the attribute in the
firmware for this drive).

The thing about bad LBAs is that they often come in groups/bunches; dust
on the drive, some region loses its magnetic integrity, etc...  Your
drive is "old" (27416 hours = 1142 days = 3.1 years) so it's
understandable IMO.

The only way to know for sure would be to do a surface scan on the drive
and see if any more I/O errors show up.  If they do, I would recommend
just writing zeros from LBA 0 all the way to the end of the drive, then
afterward see what the SMART attributes look like.  "dd if=/dev/zero
of=/dev/ad2 bs=64k" would do the trick (in this case 'bs' doesn't matter
since all you're trying to do is zero the drive; doesn't matter if
writes get aggregated or not).

> > In
> > the case of remapping, you get to deal with the UFS/FFS thing above.
> > To get the stats to update in this situation you *might* (but probably
> > not) have to run "smartctl -t offline /dev/XXX".
> 
> I didn't try that...
> 
> > You might also be wondering "that dd command writes 512 bytes of zero to
> > that LBA; what about the old data that was there, in the case that the
> > drive remaps the LBA?"  This is a great question, and one I've never
> > actually taken the time to answer because at this present time I have
> > absolutely *no* bad disks in my possession.  I'm under the impression
> > that the write does in fact write zeros if the LBA is remapped, but that
> > might not be true at all.  I've been waiting to test this for quite some
> > time and document it/write about it.
> > 
> > I still suggest you replace the drive, although given its age I doubt
> > you'll be able to find a suitable replacement.  I tend to keep disks
> > like this around for testing/experimental purposes and not for actual
> > use.
> 
> I have several unused 80GB HDD I can place into this system.  I think that's
> what I'll wind up doing.  But I'd like to follow this process through and get it documented
> for future reference.

Yes, given the behaviour of the drive I would recommend you simply
replace it at this point in time.  What concerns me the most is
Current_Pending_Sector incrementing, but it's impossible for me to
determine if that incrementing means there are other LBAs which are bad,
or if the drive is simply behaving as its firmware was designed to.

Keep the drive around for further experiments/tinkering if you're
interested.  Stuff like this is always interesting/fun as long as your
data isn't at risk, so doing the replacement first would be best
(especially if both drives in your mirror were bought at the same time
from the same place and have similar manufacturing plants/dates on
them).

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |