Date:      Wed, 31 Oct 2012 13:58:29 -0400
From:      Zaphod Beeblebrox <zbeeble@gmail.com>
To:        Ronald Klop <ronald-freebsd8@klop.yi.org>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: ZFS RaidZ-2 problems
Message-ID:  <CACpH0MeJpSg3ti-QUgT=XwaC0jkEo5JeBAfRGPTFfUE6eLJFJg@mail.gmail.com>
In-Reply-To: <op.wm1axoqv8527sy@ronaldradial.versatec.local>
References:  <508F98F9.3040604@fletchermoorland.co.uk> <1351598684.88435.19.camel@btw.pki2.com> <508FE643.4090107@fletchermoorland.co.uk> <op.wmz1vtrd8527sy@ronaldradial.versatec.local> <5090010A.4050109@fletchermoorland.co.uk> <op.wm1axoqv8527sy@ronaldradial.versatec.local>

I'd start off by saying "SMART is your friend."  Install smartmontools
and study the somewhat opaque "smartctl -a /dev/mydisk" output
carefully.  Try running a short and/or long self-test, too.  Many
times the disk can tell you what the problem is.  If too many blocks
are being reallocated, your drive is dying.  If the drive sees errors
in the commands it receives, the cable or the controller is at fault.
ZFS itself does _exceptionally_ well at trying to use what it has.
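
For example (the device name is just a placeholder; on FreeBSD a SATA
disk will typically show up as something like /dev/ada0):

    smartctl -a /dev/ada0           # full report: attributes, error log
    smartctl -t short /dev/ada0     # start a short self-test
    smartctl -t long /dev/ada0      # or a long one
    smartctl -l selftest /dev/ada0  # read the results once it's done

Attributes like Reallocated_Sector_Ct and Current_Pending_Sector point
at a dying drive; a climbing UDMA_CRC_Error_Count usually points at
the cable or the controller instead.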

I'll also say that bad power supplies make for bad disks.  Replacing a
power supply has often been the solution to bad disk problems I've
had.  Disks are sensitive to under-voltage problems, and brown-outs
can exacerbate them.  My parents live out where the power is very
flaky.  Cheap UPSs didn't help much ... but a good power supply can
make all the difference.

But I've had bad controllers of late, too.  My most recent problem
had my 9-disk raidZ1 array lose a disk.  Smartctl said that it was
losing blocks fast, so I RMA'd the disk.  When the new disk came, the
array just wouldn't heal... it kept losing the disks attached to a
certain controller.  Now it's possible the controller was bad before
the disk died ... or that it died during the first attempt at
resilvering ... or that the FreeBSD drivers don't like it anymore ...
I don't know.
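
For reference, swapping in the RMA'd disk is just something like this
(the pool and device names here are made up):

    zpool status tank        # identify the failed disk
    zpool replace tank ada3  # resilver onto the new disk in the same slot
    zpool status tank        # watch the resilver progress

In my case it was the resilver that kept failing, because the
controller kept dropping the other disks partway through.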

My solution was to get two more 4-drive "pro box" SATA enclosures.
They use a 1-to-4 SATA breakout, and the 6 motherboard ports I have
are a revision of the Intel ICH11 chipset that supports SATA port
multipliers (I already had two of these boxes).  In this manner I
could remove the defective controller and put all the disks onto the
motherboard ICH11 (it also allowed me to later expand the array...
but that's not part of this story).
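
If you end up shuffling disks between controllers like this, ZFS finds
them again by their on-disk labels rather than by device name, so it's
mostly a matter of checking that the OS sees everything and
re-importing the pool (the pool name below is made up):

    camcontrol devlist    # every disk behind the multipliers should be listed
    zpool import          # show pools found on the attached disks
    zpool import -f tank  # import; -f if it wasn't cleanly exported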

The upshot was that I now had all the disks present for a raidZ
array, but tons of errors had occurred while there were not enough
disks.  zpool status -v listed hundreds of thousands of files and
directories that were "bad" or lost.  But I'd seen this before and
started a scrub.  The result of the scrub was: perfect recovery.
Actually... it took a 2nd scrub --- I don't know why.  It was happy
after the 1st scrub, but then some checksum errors were found --- and
then fixed, so I scrubbed again ... and that fixed it.
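
For anyone following along, the commands involved are roughly these
(pool name is illustrative):

    zpool status -v tank   # long list of damaged files after the import
    zpool scrub tank       # first pass: everything repaired
    zpool status -v tank   # ...but new checksum errors were counted
    zpool clear tank       # reset the error counters
    zpool scrub tank       # second pass came back clean

zpool clear just resets the counters; the scrubs are what actually
rewrite bad copies from good ones.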

How does it do it?  Unlike other RAID systems, ZFS checksums every
block, so it can tell a bad block from a good one.  When it is asked
to recover after really bad multiple failures, it can verify each
block it reads or reconstructs.  This means it can choose among
alternate or partially recovered versions and keep the right one.
With any other RAID technology, my experience above would have meant
a dead array ... or an array with a lot of data loss.

What does this mean?  Well... one thing it means is that for
non-essential systems (say, my home media array), using cheap
hardware is less risky than it would be with other RAID systems.
None of this is enterprise-level gear, but none of it costs anywhere
near enterprise-level prices, either.


