From owner-freebsd-questions@FreeBSD.ORG Tue Jul 29 16:01:46 2014
From: Paul Kraus
Subject: Re: gvinum raid5 vs. ZFS raidz
Date: Tue, 29 Jul 2014 12:01:36 -0400
To: freebsd-questions@freebsd.org
In-Reply-To: <201407290827.s6T8RCrl014461@sdf.org>
Message-Id: <5BBEA547-A284-4685-8F2F-4BC3401BC1CA@kraus-haus.org>

On Jul 29, 2014, at 4:27, Scott Bennett wrote:

>      I want to set up a couple of software-based RAID devices across
> identically sized partitions on several disks.  At first I thought that
> gvinum's raid5 would be the way to go, but now that I have finally found
> and read some information about raidz, I am unsure which to choose.  My
> current, and possibly wrong, understanding about the two methods' most
> important features (to me, at least) can be summarized as follows.

Disclaimer: I have experience with ZFS but not with your other alternative.
https://www.listbox.com/subscribe/?listname=zfs@lists.illumos.org

>              raid5                                raidz
>
> Has parity checking, but any parity       Has parity checking *and*
> errors identified are assumed to be       frequently spaced checksums

ZFS checksums all data for errors. If there is redundancy (mirror, raidz, copies > 1) ZFS will transparently repair damaged data (and increment the "checksum" error count, so the zpool status command will show you that you *are* hitting errors).
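For reference, checking for (and repairing) such errors looks roughly like this; this is only a sketch, and the pool name "tank" is a placeholder:

    # read and verify every block in the pool against its checksum;
    # anything that can be repaired from redundancy is rewritten
    zpool scrub tank

    # show scrub progress and the per-device READ/WRITE/CKSUM error counters
    zpool status -v tank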
> Can be expanded by the addition of more   Can only be expanded by
> spindles via a "gvinum grow" operation.   replacing all components with
>                                           larger components.  The number

Every ZFS pool is built from what are called top level vdevs (virtual devices). The data is striped across all of the top level vdevs. Each vdev may be composed of a single drive, a mirror, or a raidz (z1, z2, or z3). So you can create a mixed zpool (not recommended for a variety of reasons) with a different type for each vdev. The way to expand any zpool is to add additional vdevs (beyond replacing all drives in a single vdev and then growing to fill the new drives). So you can create a zpool with one raidz1 vdev and then later add a second raidz1 vdev (see the example below). Or, more commonly, start with a mirror vdev and then add a second, third, fourth (etc.) mirror vdev.

It is this two-tier structure that is one of ZFS's strengths. It is also a feature that is not well understood.

> Does not support migration to any other   Does not support migration
> RAID levels or their equivalents.         between raidz levels, even by

Correct. Once you have created a vdev, that vdev must remain the same type. You can add mirrors to a mirror vdev, but you cannot add drives to, or change the raid level of, raidz1, raidz2, or raidz3 vdevs.

> Does not support additional parity        Supports one (raidz2) or two
> dimensions a la RAID6.                    (raidz3) additional parity

ZFS handles parity slightly differently than traditional raid-5 does (as well as the striping of data / parity blocks). So you cannot just count on losing 1, 2, or 3 drives' worth of space to parity. See Matt Ahrens' blog entry here http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for (probably) more data on this than you want :-) And here https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674 is his spreadsheet that relates the space lost to parity to the number of drives in the raidz vdev and the data block size (yes, the amount of space lost to parity varies with the data block size, not the configured filesystem block size!). There is a separate tab for each of RAIDz1, RAIDz2, and RAIDz3.

> Fast performance because each block       Slower performance because each
> is on a separate spindle from the         block is spread across all
> previous and next blocks.                 spindles a la RAID3, so many
>                                           simultaneous I/O operations are
>                                           required for each block.

ZFS performance is never that simple, as I/O is issued to the drives in parallel. Unless you are saturating the controller you should be able to keep all the drives busy at once. Also note that ZFS does NOT suffer the RAID-5 read-modify-write penalty on writes, as every write is a new write to disk (there is no modification of existing disk blocks); this is referred to as Copy On Write (COW).

> -----------------------
>      I hoped to start with a minimal number of components and eventually
> add more components to increase the space available in the raid5 or raidz
> devices.  Increasing their sizes that way would also increase the total
> percentage of space in the devices devoted to data rather than parity, as
> well as improving the performance enhancement of the striping.  For various
> reasons, having to replace all component spindles with larger-capacity
> components is not a viable method of increasing the size of the raid5 or
> raidz devices in my case.  That would appear to rule out raidz.

Yup.
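To make the vdev growth path mentioned above concrete, this is roughly what it looks like from the command line. It is only a sketch; "tank" and the da0-da5 device names are placeholders, not anything from your setup:

    # create a pool with a single 3-disk raidz1 vdev
    zpool create tank raidz1 da0 da1 da2

    # later, grow the pool by striping a second raidz1 vdev alongside the first
    zpool add tank raidz1 da3 da4 da5

The existing vdev is untouched; after the add, new writes are spread across both vdevs.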
>     OTOH, the very large-capacity drives available in the last two or
> three years appear not to be very reliable(*) compared to older drives of
> 1 TB or smaller capacities.  gvinum's raid5 appears not to offer good
> protection against, nor any repair of, damaged data blocks.

Yup. Unless you use ZFS, plan on suffering silent data corruption due to the uncorrectable (and undetectable by the drive) error rate of large drives. All drives suffer uncorrectable errors, read errors that the drive itself does not realize are errors. With a traditional filesystem this bad data is returned to the OS; in some cases it may cause a filesystem panic, and in others just bad data is returned to the application. This is one of the HUGE benefits of ZFS: it catches those errors.

> Thanks to three failed external drives and
> apparently not fully reliable replacements, compounded by a bad ports
> update two or three months ago, I have no functioning X11 and no space
> set up any longer in which to build ports to fix the X11 problem, so I
> really want to get the disk situation settled ASAP.  Trying to keep track
> of everything using only syscons and window(1) is wearing my patience
> awfully thin.

My home server is ZFS only. I have 2 drives mirrored for the OS and 5 drives in a raidz2 for data, with one hot spare. I have suffered 3 drive failures (all Seagate), two of which took the four drives in my external enclosure offline (damn sata port multipliers). I have had NO data loss or corruption!

I started like you, wanting to have some drives and add more later. I started with a pair of 1TB drives mirrored, then added a second pair to double my capacity. The problem with 2-way mirrors is that the MTTDL (Mean Time To Data Loss) is much lower than with RAIDz2, for a similar cost in space in a 4 disk configuration. After I had a drive fail in the mirror configuration, I ordered a replacement and crossed my fingers that the other half of *that* mirror would not fail (the pairs of drives in the mirrors were the same make / model bought at the same time ... not a good bet for reliability). When I got the replacement drive(s) I took some time and rebuilt my configuration to better handle growth and reliability, going from a 4 disk 2-way mirror configuration to a 5 disk RAIDz2. I went from about 2TB net to about 3TB net capacity, plus a hot spare.

If being able to easily grow capacity is the primary goal, I would go with a 2-way mirror configuration and always include a hot spare (so that *when* a drive fails it immediately starts resilvering (the ZFS term for syncing) the vdev); see the sketch below. Then you can simply add pairs of drives to add capacity. Just make sure that the hot spare is at least as large as the largest drive in use. When you buy drives, always buy from as many different manufacturers and models as you can. I just bought four 2TB drives for my backup server. One is a WD and the other three are HGST, but they are four different models, so they did not come off the same production line in the same week as each other. If I could have, I would have gotten four different manufacturers. I also only buy server class drives (rated for 24x7 operation with a 5 year warranty). The additional cost has been offset by the savings from being able to have a failed drive replaced under warranty.
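That grow-by-mirror-pairs approach looks roughly like this (again only a sketch; "tank" and the da* names are placeholders):

    # start with one mirrored pair plus a hot spare
    zpool create tank mirror da0 da1 spare da2

    # add capacity later by adding a second mirrored pair
    zpool add tank mirror da3 da4

    # when a warranty replacement arrives, swap out the failed disk
    zpool replace tank da1 da5

A spare added this way is available to any vdev in the pool, which is why it needs to be at least as large as the largest drive it might stand in for.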
> (*) [Last year I got two defective 3 TB drives in a row from Seagate.

Wow, the only time I have seen that kind of failure rate was buying from Newegg back when they were packing drives badly.

> I ended up settling for a 2 TB Seagate that is still running fine AFAIK.
> While that process was going on, I bought three 2 TB Seagate drives in
> external cases with USB 3.0 interfaces, two of which failed outright
> after about 12 months and have been replaced with two refurbished drives
> under warranty.

Yup, they all replace failed drives with refurbs.

As a side note, I have had 6 Seagate ES.2 or ES.3 drives, 2 HGST UltraStar drives, and 2 WD RE4 drives in service on my home server. I have had 3 of the Seagates fail (and one of the Seagate replacements has failed, still under warranty). I have not had any HGST or WD drives fail (and both have better performance than the Seagates). This does not mean that I do not buy Seagate drives. I spread my purchases around, keep to the 24x7, 5 year warranty drives, and follow up when I have a failure.

> While waiting for those replacements to arrive, I bought
> a 2 TB Samsung drive in an external case with a USB 3.0 interface.  I
> discovered by chance that copying very large files to these drives is an
> error-prone process.

I would suspect a problem in the USB 3.0 layer, but that is just a guess.

> A roughly 1.1 TB file on the one surviving external
> Seagate drive from last year's purchase of three, when copied to the
> Samsung drive, showed no I/O errors during the copy operation.  However,
> a comparison check using "cmp -l -z originalfile copyoforiginal" shows
> quite a few places where the contents don't match.

ZFS would not tolerate those kinds of errors. On reading the file, ZFS would know via the checksum that the file was bad.

> The same procedure
> applied to one of the refurbished Seagates gives similar results, although
> the locations and numbers of differing bytes are different from those on
> the Samsung drive.  The same procedure applied to the other refurbished
> drive resulted in a good copy the first time, but a later repetition ended
> up with a copied file that differed from the original by a single bit in
> each of two widely separated places in the files.  These problems have
> raised the priority of a self-healing RAID device in my mind.

Self-healing RAID by itself will be of little help here ... see more below.

>     I have to say that these are new experiences to me.  The disk drives,
> controllers, etc. that I grew up with all had parity checking in the hardware,
> including the data encoded on the disks, so single-bit errors anywhere in
> the process showed up as hardware I/O errors instantly.  If the errors were
> not eliminated during a limited number of retries, they ended up as permanent
> I/O errors that a human would have to resolve at some point.

What controllers and drives? I have never seen a drive that does NOT have uncorrectable errors (these are undetectable by the drive). I have also never seen a controller that checksums the data; the controllers rely on the drive to report errors. If the drive does not report an error, the controller trusts the data.

The big difference is that with drives under 1TB the odds of running into an uncorrectable error over the life of the drive are very, very small. The uncorrectable error rate does NOT scale down as the drives scale up. It has been stable at about 1 error per 10^14 bits read (for cheap drives) to 1 per 10^15 bits read (for good drives) for over 10 years (which is when I started looking at that drive spec). So if the rate is not changing, and the total amount of data written / read over the life of the drive has gone up by, in some cases, orders of magnitude, then the real world occurrence of such errors is increasing.
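A back-of-the-envelope illustration of that, using the 1-per-10^14 figure above (my arithmetic, not a measurement):

    3 TB drive          ~ 3 x 10^12 bytes  = 2.4 x 10^13 bits
    cheap drive (10^14) : 2.4 x 10^13 / 10^14 ~ 0.24 expected errors per full read
    good drive  (10^15) : 2.4 x 10^13 / 10^15 ~ 0.02 expected errors per full read

In other words, on the consumer spec a single end-to-end read of a full 3 TB drive has very roughly a 20% chance of hitting at least one unrecoverable read error, which is exactly the kind of thing ZFS's checksums are there to catch.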
>     FWIW, I also discovered that I cannot run two such multi-hour-long
> copy operations in parallel using two separate pairs of drives.  Running
> them together seems to go okay for a while, but eventually always results
> in a panic.  This is on 9.2-STABLE (r264339).  I know that that is not up
> to date, but I can't do anything about that until my disk hardware situation
> is settled.]

I have had mixed luck with large copy operations via USB on FreeBSD 9.x. Under 9.1 I found it to be completely unreliable; with 9.2 I have managed without too many errors. USB really does not seem to be a good transport for large quantities of data at fast rates. See my rant on USB hubs here: http://pk1048.com/usb-beware/

--
Paul Kraus
paul@kraus-haus.org