From owner-freebsd-stable@freebsd.org Wed May  8 01:01:24 2019
Subject: Re: ZFS...
To: Karl Denninger, freebsd-stable@freebsd.org
From: Michelle Sullivan <michelle@sorbs.net>
Date: Wed, 08 May 2019 11:01:19 +1000

Karl Denninger wrote:
> On 5/7/2019 00:02, Michelle Sullivan wrote:
>> The problem I see with that statement is that the zfs dev mailing
>> lists constantly and consistently follow the line of "the data is
>> always right, there is no need for a 'fsck'" (which I actually get),
>> but it's used to shut down every thread...
>> The irony is I'm now installing Windows 7 and SP1 on a USB stick
>> (well, it's actually installed, but SP1 isn't finished yet) so I can
>> install a ZFS data recovery tool which purports to be able to "walk
>> the data" to retrieve all the files... the irony, eh... install
>> Windows 7 on a USB stick to recover a FreeBSD-installed ZFS
>> filesystem... will let you know if the tool works, but as it was
>> recommended by a dev I'm hopeful... have another array (with ZFS, I
>> might add) loaded and ready to go... if the data recovery is
>> successful I'll blow away the original machine and work out what OS
>> and drive setup will be safe for the data in the future.  I might
>> even put FreeBSD and ZFS back on it, but if I do it won't be in the
>> current RAIDZ2 config.
> Meh.
>
> Hardware failure is, well, hardware failure.  Yes, power-related
> failures are hardware failures.
>
> Never mind the potential for /software/ failures.  Bugs are, well,
> bugs.  And they're a real thing.  Never had the shortcomings of UFS
> bite you on an "unexpected" power loss?  Well, I have.  Is ZFS
> absolutely safe against any such event?  No, but it's safe*r*.

Yes and no ... I'll explain...

> I've yet to have ZFS lose an entire pool due to something bad
> happening, but the same basic risk (entire filesystem being gone)

Every time I have seen this issue (and it's been more than once - though
until now recoverable, even if extremely painful) it's always been
during a resilver of a failed drive and something happening... panic,
another drive failure, power, etc.  Any other time it's rock solid...
which is the yes and no... under normal circumstances ZFS is very, very
good and seems as safe as or safer than UFS... but in my experience ZFS
has one really bad flaw: if there is a corruption in the metadata - even
if the stored data is 100% correct - it will fault the pool and that's
it, it's gone, barring some luck and painful recovery (backups aside).
Other filesystems suffer from this too, but there are tools that the
*majority of the time* will get you out of the s**t with little pain.
Barring this Windows-based tool I haven't been able to run yet, ZFS
appears to have nothing.

> has occurred more than once in my IT career with other filesystems --
> including UFS, lowly MSDOS and NTFS, never mind their predecessors all
> the way back to floppy disks and the first 5Mb Winchesters.

Absolutely, been there, done that.. and btrfs... *ouch*, still as bad..
however, with the only btrfs install I had (I didn't know it was btrfs
underneath - Netgear NAS...) I was still able to recover the data, even
though it had screwed the filesystem so badly I vowed never to consider
or use it again on anything, ever...

> I learned a long time ago that two is one and one is none when it
> comes to data, and WHEN two becomes one you SWEAT, because that second
> failure CAN happen at the worst possible time.

and does..

> As for RaidZ2 vs. mirrored, it's not as simple as you might think.
> Mirrored vdevs can only lose one member per mirror set, unless you use
> three-member mirrors.  That sounds insane but actually it isn't in
> certain circumstances, such as very-read-heavy and
> high-performance-read environments.

I know - this is why I don't use mirrors - because wear patterns will
ensure both sides of the mirror are closely matched.
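For reference, a minimal sketch of the mirror layouts being discussed
(a hypothetical six-disk pool; the device names da0-da5 and the pool
name "tank" are assumptions, not anyone's actual hardware):

  # Three striped two-way mirrors: each vdev tolerates losing ONE member.
  zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5

  # Two striped three-way mirrors: each vdev tolerates losing TWO members,
  # but only a third of the raw capacity is usable.
  zpool create tank mirror da0 da1 da2 mirror da3 da4 da5

"zpool status tank" then lists each mirror vdev separately, which is
where the per-vdev failure tolerance described above comes from.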
> The short answer is that a 2-way mirrored set is materially faster on
> reads but has no acceleration on writes, and can lose one member per
> mirror.  If the SECOND one fails before you can resilver, and that
> resilver takes quite a long while if the disks are large, you're dead.
> However, if you do six drives as a 2x3 way mirror (that is, 3 vdevs
> each of a 2-way mirror) you now have three parallel data paths going
> at once and potentially six for reads -- and performance is MUCH
> better.  A 3-way mirror can lose two members (and could be organized
> as 3x2) but obviously requires lots of drive slots, 3x as much *power*
> per gigabyte stored (and you pay for power twice; once to buy it and
> again to get the heat out of the room where the machine is.)

My problem (as always) is slots, not so much the power.

> Raidz2 can also lose 2 drives without being dead.  However, it doesn't
> get any of the read performance improvement *and* takes a write
> performance penalty; Z2 has more write penalty than Z1 since it has to
> compute and write two parity entries instead of one, although in
> theory at least it can parallel those parity writes -- albeit at the
> cost of drive bandwidth congestion (e.g. interfering with other
> accesses to the same disk at the same time.)  In short RaidZx performs
> about as "well" as the *slowest* disk in the set.

Which is why I built mine with identical drives (though different
production batches :) ) ... the majority of the data in my storage
array is write once (or twice), read many.

> So why use it (particularly Z2) at all?  Because for "N" drives you
> get the protection of a 3-way mirror and *much* more storage.  A
> six-member RaidZ2 setup returns ~4Tb of usable space, where with a
> 2-way mirror it returns 3Tb and with a 3-way mirror (which provides
> the same protection against drive failure as Z2) you have only *half*
> the storage.  IMHO ordinary Raidz isn't worth the trade-offs, but Z2
> frequently is.
>
> In addition more spindles means more failures, all other things being
> equal, so if you need "X" TB of storage and organize it as 3-way
> mirrors you now have twice as many physical spindles, which means on
> average you'll take twice as many faults.  If performance is more
> important then the choice is obvious.  If density is more important
> (that is, a lot or even most of the data is rarely accessed at all)
> then the choice is fairly simple too.  In many workloads you have some
> of both, and thus the correct choice is a hybrid arrangement; that's
> what I do here, because I have a lot of data that is rarely-to-never
> accessed and read-only but also have some data that is frequently
> accessed and frequently written.  One size does not fit all in such a
> workload.

This is where I came to two systems (with different data).. one was for
density, the other performance.  Storage vs working, etc..

> MOST systems, by the way, have this sort of paradigm (a huge
> percentage of the data is rarely read and never written) but it
> doesn't become economic or sane to try to separate them until you get
> well into the terabytes of storage range and a half-dozen or so
> physical volumes.  There's a very clean argument that prior to that
> point, but with greater than one drive, mirrored is always the better
> choice.
>
> Note that if you have an *adapter* go insane (and as I've noted here
> I've had it happen TWICE in my IT career!) then *all* of the data on
> the disks served by that adapter is screwed.

100% with you - been there, done that... and it doesn't matter what OS
or filesystem; hardware failure where silent data corruption happens
because of an adapter will always take you out (and ZFS will not save
you in many cases of that either.)
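Referring back to the six-member comparison above, a minimal sketch of
the capacity arithmetic (assuming 1Tb drives and hypothetical device
names; real pools lose a little more to metadata and slop space):

  # Six-drive RAIDZ2: ~4Tb usable (6 drives - 2 parity), survives any two failures.
  zpool create tank raidz2 da0 da1 da2 da3 da4 da5

  # The same six drives as three 2-way mirrors: ~3Tb usable (6 / 2).
  # The same six drives as two 3-way mirrors:  ~2Tb usable (6 / 3),
  # i.e. half the RAIDZ2 figure, for the same two-failure tolerance per vdev.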
> It doesn't make a bit of difference what filesystem you're using in
> that scenario and thus you had better have a backup scheme and make
> sure it works as well, never mind software bugs or administrator
> stupidity ("dd" as root to the wrong target, for example, will
> reliably screw you every single time!)
>
> For a single-disk machine ZFS is no *less* safe than UFS and provides
> a number of advantages, with arguably the most-important being
> easily-used snapshots.

Depends.. in normal operation I agree... but when it comes to all or
nothing, that is a matter of perspective.  Personally I prefer to have
in-place recovery options and/or multiple *possible* recovery options
rather than... "destroy the pool and recreate it from scratch, hope you
have backups"...

> Not only does this simplify backups, since coherency during the backup
> is never at issue and incremental backups become fast and easily done;
> in addition, boot environments make roll-forward and even *roll-back*
> reasonable to implement for software updates -- a critical capability
> if you ever run an OS version update and something goes seriously
> wrong with it.  If you've never had that happen then consider yourself
> blessed;

I have been there (especially in the early days - pre-0.83 kernel
versions of Linux :) )

> it's NOT fun to manage in a UFS environment and often winds up leading
> to a "restore from backup" scenario.  (To be fair it can be with ZFS
> too if you're foolish enough to upgrade the pool before being sure
> you're happy with the new OS rev.)

Actually I have a simple way with UFS (and ext2/3/4, etc.)... split the
boot disk almost down the center.. create 3 partitions: root, swap,
altroot.  root and altroot are almost identical; one is always active,
the new OS goes on the other, and you switch to make the other active
(primary) when you've tested it... it only gives one level of roll
forward/roll back, but it works for me and has never failed (boot
disk/OS wise) since I implemented it... but then I don't let anyone
else in the company have root access, so they cannot dd or "rm -r . /"
or "rm -r .*" (both of which are the only ways I have done that before
- back in 1994, and never done it since - it's something you learn or
you get out of IT :P .. and for those who didn't get the latter, it
should have been 'rm -r .??*' - and why are you on '-stable'...? :P )

Regards,

-- 
Michelle Sullivan
http://www.mhix.org/
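A minimal sketch of the dual-root arrangement described above, assuming
a GPT-partitioned disk with UFS; the partition indexes are hypothetical
and gptboot's bootme/bootonce attributes are one possible way to do the
switch, not necessarily the one described:

  # Hypothetical layout: ada0p2 = root, ada0p3 = swap, ada0p4 = altroot.
  gpart show ada0

  # Install the new OS onto the inactive root (here ada0p4), e.g. via
  # dump/restore or a fresh install targeted at that partition, and
  # point its /etc/fstab at ada0p4 for /.

  # Try the new root once; if it never comes up cleanly the loader
  # falls back to the old root on the following boot.
  gpart set -a bootonce -i 4 ada0

  # Once happy with it, make the switch permanent.
  gpart set -a bootme -i 4 ada0
  gpart unset -a bootme -i 2 ada0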