From owner-freebsd-questions@FreeBSD.ORG Tue Jul 29 16:01:46 2014
From: Paul Kraus
Subject: Re: gvinum raid5 vs. ZFS raidz
Date: Tue, 29 Jul 2014 12:01:36 -0400
To: freebsd-questions@freebsd.org
In-Reply-To: <201407290827.s6T8RCrl014461@sdf.org>
Message-Id: <5BBEA547-A284-4685-8F2F-4BC3401BC1CA@kraus-haus.org>

On Jul 29, 2014, at 4:27, Scott Bennett wrote:

>      I want to set up a couple of software-based RAID devices across
> identically sized partitions on several disks.  At first I thought that
> gvinum's raid5 would be the way to go, but now that I have finally found
> and read some information about raidz, I am unsure which to choose.  My
> current, and possibly wrong, understanding about the two methods' most
> important features (to me, at least) can be summarized as follows.

Disclaimer: I have experience with ZFS but not with your other alternative.
https://www.listbox.com/subscribe/?listname=zfs@lists.illumos.org

>              raid5                                raidz
>
> Has parity checking, but any parity       Has parity checking *and*
> errors identified are assumed to be       frequently spaced checksums

ZFS checksums all data for errors. If there is redundancy (mirror, raidz, copies > 1) ZFS will transparently repair damaged data (and increment the "checksum" error count, so the zpool status command will show you that you *are* hitting errors).
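For reference, checking for (and repairing) such errors looks roughly like this; this is only a sketch, and the pool name "tank" is a placeholder:

    # read and verify every block in the pool against its checksum;
    # anything that can be repaired from redundancy is rewritten
    zpool scrub tank

    # show scrub progress and the per-device READ/WRITE/CKSUM error counters
    zpool status -v tank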
> Can be expanded by the addition of more   Can only be expanded by
> spindles via a "gvinum grow" operation.   replacing all components with
>                                           larger components.  The number

Every ZFS pool is built from what are called top level vdevs (virtual devices). The data is striped across all of the top level vdevs. Each vdev may be composed of a single drive, a mirror, or a raidz (z1, z2, or z3). So you can create a mixed zpool (not recommended for a variety of reasons) with a different type for each vdev. The way to expand any zpool is to add additional vdevs (beyond replacing all drives in a single vdev and then growing to fill the new drives). So you can create a zpool with one raidz1 vdev and then later add a second raidz1 vdev (see the example below). Or, more commonly, start with a mirror vdev and then add a second, third, fourth (etc.) mirror vdev.

It is this two-tier structure that is one of ZFS's strengths. It is also a feature that is not well understood.

> Does not support migration to any other   Does not support migration
> RAID levels or their equivalents.         between raidz levels, even by

Correct. Once you have created a vdev, that vdev must remain the same type. You can add mirrors to a mirror vdev, but you cannot add drives to, or change the raid level of, raidz1, raidz2, or raidz3 vdevs.

> Does not support additional parity        Supports one (raidz2) or two
> dimensions a la RAID6.                    (raidz3) additional parity

ZFS handles parity slightly differently than traditional raid-5 does (as well as the striping of data / parity blocks). So you cannot just count on losing 1, 2, or 3 drives' worth of space to parity. See Matt Ahrens' blog entry here http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for (probably) more data on this than you want :-) And here https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674 is his spreadsheet that relates the space lost to parity to the number of drives in the raidz vdev and the data block size (yes, the amount of space lost to parity varies with the data block size, not the configured filesystem block size!). There is a separate tab for each of RAIDz1, RAIDz2, and RAIDz3.

> Fast performance because each block       Slower performance because each
> is on a separate spindle from the         block is spread across all
> previous and next blocks.                 spindles a la RAID3, so many
>                                           simultaneous I/O operations are
>                                           required for each block.

ZFS performance is never that simple, as I/O is issued to the drives in parallel. Unless you are saturating the controller you should be able to keep all the drives busy at once. Also note that ZFS does NOT suffer the RAID-5 read-modify-write penalty on writes, as every write is a new write to disk (there is no modification of existing disk blocks); this is referred to as Copy On Write (COW).

> -----------------------
>      I hoped to start with a minimal number of components and eventually
> add more components to increase the space available in the raid5 or raidz
> devices.  Increasing their sizes that way would also increase the total
> percentage of space in the devices devoted to data rather than parity, as
> well as improving the performance enhancement of the striping.  For various
> reasons, having to replace all component spindles with larger-capacity
> components is not a viable method of increasing the size of the raid5 or
> raidz devices in my case.  That would appear to rule out raidz.

Yup.
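To make the vdev growth path mentioned above concrete, this is roughly what it looks like from the command line. It is only a sketch; "tank" and the da0-da5 device names are placeholders, not anything from your setup:

    # create a pool with a single 3-disk raidz1 vdev
    zpool create tank raidz1 da0 da1 da2

    # later, grow the pool by striping a second raidz1 vdev alongside the first
    zpool add tank raidz1 da3 da4 da5

The existing vdev is untouched; after the add, new writes are spread across both vdevs.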
>     OTOH, the very large-capacity drives available in the last two or
> three years appear not to be very reliable(*) compared to older drives of
> 1 TB or smaller capacities.  gvinum's raid5 appears not to offer good
> protection against, nor any repair of, damaged data blocks.

Yup. Unless you use ZFS, plan on suffering silent data corruption due to the uncorrectable (and undetectable by the drive) error rate of large drives. All drives suffer uncorrectable errors, read errors that the drive itself does not realize are errors. With a traditional filesystem this bad data is returned to the OS; in some cases it may cause a filesystem panic, and in others just bad data is returned to the application. This is one of the HUGE benefits of ZFS: it catches those errors.

> Thanks to three failed external drives and
> apparently not fully reliable replacements, compounded by a bad ports
> update two or three months ago, I have no functioning X11 and no space
> set up any longer in which to build ports to fix the X11 problem, so I
> really want to get the disk situation settled ASAP.  Trying to keep track
> of everything using only syscons and window(1) is wearing my patience
> awfully thin.

My home server is ZFS only. I have 2 drives mirrored for the OS and 5 drives in a raidz2 for data, with one hot spare. I have suffered 3 drive failures (all Seagate), two of which took the four drives in my external enclosure offline (damn sata port multipliers). I have had NO data loss or corruption!

I started like you, wanting to have some drives and add more later. I started with a pair of 1TB drives mirrored, then added a second pair to double my capacity. The problem with 2-way mirrors is that the MTTDL (Mean Time To Data Loss) is much lower than with RAIDz2, for a similar cost in space in a 4 disk configuration. After I had a drive fail in the mirror configuration, I ordered a replacement and crossed my fingers that the other half of *that* mirror would not fail (the pairs of drives in the mirrors were the same make / model bought at the same time ... not a good bet for reliability). When I got the replacement drive(s) I took some time and rebuilt my configuration to better handle growth and reliability, going from a 4 disk 2-way mirror configuration to a 5 disk RAIDz2. I went from about 2TB net to about 3TB net capacity, plus a hot spare.

If being able to easily grow capacity is the primary goal, I would go with a 2-way mirror configuration and always include a hot spare (so that *when* a drive fails it immediately starts resilvering (the ZFS term for syncing) the vdev); see the sketch below. Then you can simply add pairs of drives to add capacity. Just make sure that the hot spare is at least as large as the largest drive in use. When you buy drives, always buy from as many different manufacturers and models as you can. I just bought four 2TB drives for my backup server. One is a WD and the other three are HGST, but they are four different models, so they did not come off the same production line in the same week as each other. If I could have, I would have gotten four different manufacturers. I also only buy server class drives (rated for 24x7 operation with a 5 year warranty). The additional cost has been offset by the savings from being able to have a failed drive replaced under warranty.
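That grow-by-mirror-pairs approach looks roughly like this (again only a sketch; "tank" and the da* names are placeholders):

    # start with one mirrored pair plus a hot spare
    zpool create tank mirror da0 da1 spare da2

    # add capacity later by adding a second mirrored pair
    zpool add tank mirror da3 da4

    # when a warranty replacement arrives, swap out the failed disk
    zpool replace tank da1 da5

A spare added this way is available to any vdev in the pool, which is why it needs to be at least as large as the largest drive it might stand in for.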
> (*) [Last year I got two defective 3 TB drives in a row from Seagate.

Wow, the only time I have seen that kind of failure rate was buying from Newegg back when they were packing drives badly.

> I ended up settling for a 2 TB Seagate that is still running fine AFAIK.
> While that process was going on, I bought three 2 TB Seagate drives in
> external cases with USB 3.0 interfaces, two of which failed outright
> after about 12 months and have been replaced with two refurbished drives
> under warranty.

Yup, they all replace failed drives with refurbs.

As a side note, I have had 6 Seagate ES.2 or ES.3 drives, 2 HGST UltraStar drives, and 2 WD RE4 drives in service on my home server. I have had 3 of the Seagates fail (and one of the Seagate replacements has failed, still under warranty). I have not had any HGST or WD drives fail (and both have better performance than the Seagates). This does not mean that I do not buy Seagate drives. I spread my purchases around, keep to the 24x7, 5 year warranty drives, and follow up when I have a failure.

> While waiting for those replacements to arrive, I bought
> a 2 TB Samsung drive in an external case with a USB 3.0 interface.  I
> discovered by chance that copying very large files to these drives is an
> error-prone process.

I would suspect a problem in the USB 3.0 layer, but that is just a guess.

> A roughly 1.1 TB file on the one surviving external
> Seagate drive from last year's purchase of three, when copied to the
> Samsung drive, showed no I/O errors during the copy operation.  However,
> a comparison check using "cmp -l -z originalfile copyoforiginal" shows
> quite a few places where the contents don't match.

ZFS would not tolerate those kinds of errors. On reading the file, ZFS would know via the checksum that the file was bad.

> The same procedure
> applied to one of the refurbished Seagates gives similar results, although
> the locations and numbers of differing bytes are different from those on
> the Samsung drive.  The same procedure applied to the other refurbished
> drive resulted in a good copy the first time, but a later repetition ended
> up with a copied file that differed from the original by a single bit in
> each of two widely separated places in the files.  These problems have
> raised the priority of a self-healing RAID device in my mind.

Self-healing RAID by itself will be of little help here ... see more below.

>     I have to say that these are new experiences to me.  The disk drives,
> controllers, etc. that I grew up with all had parity checking in the hardware,
> including the data encoded on the disks, so single-bit errors anywhere in
> the process showed up as hardware I/O errors instantly.  If the errors were
> not eliminated during a limited number of retries, they ended up as permanent
> I/O errors that a human would have to resolve at some point.

What controllers and drives? I have never seen a drive that does NOT have uncorrectable errors (these are undetectable by the drive). I have also never seen a controller that checksums the data; the controllers rely on the drive to report errors. If the drive does not report an error, the controller trusts the data.

The big difference is that with drives under 1TB the odds of running into an uncorrectable error over the life of the drive are very, very small. The uncorrectable error rate does NOT scale down as the drives scale up. It has been stable at about 1 error per 10^14 bits read (for cheap drives) to 1 per 10^15 bits read (for good drives) for over 10 years (which is when I started looking at that drive spec). So if the rate is not changing, and the total amount of data written / read over the life of the drive has gone up by, in some cases, orders of magnitude, then the real world occurrence of such errors is increasing.
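A back-of-the-envelope illustration of that, using the 1-per-10^14 figure above (my arithmetic, not a measurement):

    3 TB drive          ~ 3 x 10^12 bytes  = 2.4 x 10^13 bits
    cheap drive (10^14) : 2.4 x 10^13 / 10^14 ~ 0.24 expected errors per full read
    good drive  (10^15) : 2.4 x 10^13 / 10^15 ~ 0.02 expected errors per full read

In other words, on the consumer spec a single end-to-end read of a full 3 TB drive has very roughly a 20% chance of hitting at least one unrecoverable read error, which is exactly the kind of thing ZFS's checksums are there to catch.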
>     FWIW, I also discovered that I cannot run two such multi-hour-long
> copy operations in parallel using two separate pairs of drives.  Running
> them together seems to go okay for a while, but eventually always results
> in a panic.  This is on 9.2-STABLE (r264339).  I know that that is not up
> to date, but I can't do anything about that until my disk hardware situation
> is settled.]

I have had mixed luck with large copy operations via USB on FreeBSD 9.x. Under 9.1 I found it to be completely unreliable; with 9.2 I have managed without too many errors. USB really does not seem to be a good transport for large quantities of data at fast rates. See my rant on USB hubs here: http://pk1048.com/usb-beware/

--
Paul Kraus
paul@kraus-haus.org