Date:      Wed, 08 May 2019 11:01:19 +1000
From:      Michelle Sullivan <michelle@sorbs.net>
To:        Karl Denninger <karl@denninger.net>, freebsd-stable@freebsd.org
Subject:   Re: ZFS...
Message-ID:  <a1b78a63-0ef1-af51-4e33-a9a97a257c8b@sorbs.net>
In-Reply-To: <a82bfabe-a8c3-fd9a-55ec-52530d4eafff@denninger.net>
References:  <30506b3d-64fb-b327-94ae-d9da522f3a48@sorbs.net> <CAOtMX2gf3AZr1-QOX_6yYQoqE-H%2B8MjOWc=eK1tcwt5M3dCzdw@mail.gmail.com> <56833732-2945-4BD3-95A6-7AF55AB87674@sorbs.net> <3d0f6436-f3d7-6fee-ed81-a24d44223f2f@netfence.it> <17B373DA-4AFC-4D25-B776-0D0DED98B320@sorbs.net> <70fac2fe3f23f85dd442d93ffea368e1@ultra-secure.de> <70C87D93-D1F9-458E-9723-19F9777E6F12@sorbs.net> <CAGMYy3tYqvrKgk2c==WTwrH03uTN1xQifPRNxXccMsRE1spaRA@mail.gmail.com> <5ED8BADE-7B2C-4B73-93BC-70739911C5E3@sorbs.net> <d0118f7e-7cfc-8bf1-308c-823bce088039@denninger.net> <2e4941bf-999a-7f16-f4fe-1a520f2187c0@sorbs.net> <20190430102024.E84286@mulder.mintsol.com> <41FA461B-40AE-4D34-B280-214B5C5868B5@punkt.de> <20190506080804.Y87441@mulder.mintsol.com> <08E46EBF-154F-4670-B411-482DCE6F395D@sorbs.net> <33D7EFC4-5C15-4FE0-970B-E6034EF80BEF@gromit.dlib.vt.edu> <A535026E-F9F6-4BBA-8287-87EFD02CF207@sorbs.net> <a82bfabe-a8c3-fd9a-55ec-52530d4eafff@denninger.net>

Karl Denninger wrote:
> On 5/7/2019 00:02, Michelle Sullivan wrote:
>> The problem I see with that statement is that the zfs dev mailing list is
>> constantly and consistently following the line of "the data is always
>> right, there is no need for a fsck" (which I actually get), but it's used
>> to shut down every thread... the irony is I'm now installing Windows 7 and
>> SP1 on a USB stick (well, it's actually installed, but SP1 isn't finished
>> yet) so I can install a zfs data recovery tool which purports to be able
>> to "walk the data" to retrieve all the files... the irony, eh... install
>> Windows 7 on a USB stick to recover a FreeBSD-installed zfs filesystem...
>> will let you know if the tool works, but as it was recommended by a dev
>> I'm hopeful... have another array (with zfs I might add) loaded and ready
>> to go... if the data recovery is successful I'll blow away the original
>> machine and work out what OS and drive setup will be safe for the data in
>> the future.  I might even put FreeBSD and zfs back on it, but if I do it
>> won't be in the current Zraid2 config.
> Meh.
>
> Hardware failure is, well, hardware failure.  Yes, power-related
> failures are hardware failures.
>
> Never mind the potential for /software /failures.  Bugs are, well,
> bugs.  And they're a real thing.  Never had the shortcomings of UFS bite
> you on an "unexpected" power loss?  Well, I have.  Is ZFS absolutely
> safe against any such event?  No, but it's safe*r*.

Yes and no ... I'll explain...

>
> I've yet to have ZFS lose an entire pool due to something bad happening,
> but the same basic risk (entire filesystem being gone)

Every time I have seen this issue (and it's been more than once - though
until now recoverable, even if extremely painful) it's always been during
a resilver of a failed drive with something else happening... a panic,
another drive failure, power loss, etc.  Any other time it's rock solid...
which is the "yes and no"... under normal circumstances zfs is very, very
good and seems as safe as or safer than UFS... but my experience is ZFS
has one really bad flaw: if there is a corruption in the metadata - even
if the stored data is 100% correct - it will fault the pool and that's it,
it's gone, barring some luck and painful recovery (backups aside).  Other
file systems suffer from this too, but there are tools that, the majority
of the time, will get you out of the s**t with little pain.  Barring this
Windows-based tool I haven't been able to run yet, zfs appears to have
nothing.
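
For completeness, and not claiming they would have helped in this case: the
options usually suggested before reaching for an external tool are the
read-only and rewind import modes.  A rough sketch ("tank" is a placeholder
pool name):

    # try importing the pool read-only, so nothing gets written to it
    zpool import -o readonly=on tank

    # ask ZFS to rewind to an earlier transaction group; -n is a dry run
    # that only reports whether the rewind would succeed
    zpool import -F -n tank
    zpool import -F tank

    # inspect the uberblocks of the exported pool without importing it
    zdb -e -u tank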

> has occurred more
> than once in my IT career with other filesystems -- including UFS, lowly
> MSDOS and NTFS, never mind their predecessors all the way back to floppy
> disks and the first 5MB Winchesters.

Absolutely, been there, done that... and btrfs... *ouch*, still as bad...
however with the only btrfs install I had (I didn't know it was btrfs
underneath - it was a Netgear NAS) I was still able to recover the data,
even though it had screwed the file system so badly that I vowed never to
consider or use it again on anything, ever...

>
> I learned a long time ago that two is one and one is none when it comes
> to data, and WHEN two becomes one you SWEAT, because that second failure
> CAN happen at the worst possible time.

and does..

>
> As for RaidZ2 .vs. mirrored it's not as simple as you might think.
> Mirrored vdevs can only lose one member per mirror set, unless you use
> three-member mirrors.  That sounds insane but actually it isn't in
> certain circumstances, such as very-read-heavy and high-performance-read
> environments.

I know - this is why I don't use mirrors - because wear patterns will
ensure both sides of the mirror are closely matched.

>
> The short answer is that a 2-way mirrored set is materially faster on
> reads but has no acceleration on writes, and can lose one member per
> mirror.  If the SECOND one fails before you can resilver, and that
> resilver takes quite a long while if the disks are large, you're dead.
> However, if you do six drives as a 2x3 way mirror (that is, 3 vdevs each
> of a 2-way mirror) you now have three parallel data paths going at once
> and potentially six for reads -- and performance is MUCH better.  A
> 3-way mirror can lose two members (and could be organized as 3x2) but
> obviously requires lots of drive slots, 3x as much *power* per gigabyte
> stored (and you pay for power twice; once to buy it and again to get the
> heat out of the room where the machine is.)

My problem (as always) is slots, not so much the power.
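
For reference, the layouts being compared would be created roughly like this
(pool and device names are purely illustrative):

    # three striped 2-way mirrors ("2x3"): ~3 drives' worth of usable space,
    # one failure tolerated per mirror vdev
    zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5

    # two striped 3-way mirrors ("3x2"): ~2 drives' worth of usable space,
    # two failures tolerated per mirror vdev
    zpool create tank mirror da0 da1 da2 mirror da3 da4 da5

    # one six-drive raidz2 vdev: ~4 drives' worth of usable space,
    # any two drives in the vdev can fail
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5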

>
> Raidz2 can also lose 2 drives without being dead.  However, it doesn't
> get any of the read performance improvement *and* takes a write
> performance penalty; Z2 has more write penalty than Z1 since it has to
> compute and write two parity entries instead of one, although in theory
> at least it can parallel those parity writes -- albeit at the cost of
> drive bandwidth congestion (e.g. interfering with other accesses to the
> same disk at the same time.)  In short RaidZx performs about as "well"
> as the *slowest* disk in the set.
Which is why I built mine with identical drives (though from different
production batches :) )... the majority of the data in my storage array is
write once (or twice), read many.
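
On the write-once-read-many point: the usual way to catch latent corruption
in data that is rarely read is a periodic scrub.  A sketch, again assuming a
pool named "tank":

    # walk every block in the pool, verify checksums, and repair from
    # redundancy where possible
    zpool scrub tank

    # FreeBSD can schedule this via periodic(8); in /etc/periodic.conf:
    daily_scrub_zfs_enable="YES"
    daily_scrub_zfs_default_threshold="35"   # days between scrubs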

>    So why use it (particularly Z2) at
> all?  Because for "N" drives you get the protection of a 3-way mirror
> and *much* more storage.  A six-member RaidZ2 setup returns ~4TB of
> usable space, where with a 2-way mirror it returns 3TB and a 3-way
> mirror (which provides the same protection against drive failure as Z2)
> you have only *half* the storage.  IMHO ordinary Raidz isn't worth the
> trade-offs, but Z2 frequently is.
>
> In addition more spindles means more failures, all other things being
> equal, so if you need "X" TB of storage and organize it as 3-way mirrors
> you now have twice as many physical spindles which means on average
> you'll take twice as many faults.  If performance is more important then
> the choice is obvious.  If density is more important (that is, a lot or
> even most of the data is rarely accessed at all) then the choice is
> fairly simple too.  In many workloads you have some of both, and thus
> the correct choice is a hybrid arrangement; that's what I do here,
> because I have a lot of data that is rarely-to-never accessed and
> read-only but also have some data that is frequently accessed and
> frequently written.  One size does not fit all in such a workload.
This is where I came to two systems (with different data)... one was for
density, the other for performance.  Storage vs. working data, etc.
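
(To make the capacity numbers above explicit, assuming six 1 TB drives and
ignoring overhead: raidz2 yields (6 - 2) x 1 TB = 4 TB usable; three 2-way
mirrors yield 3 x 1 TB = 3 TB; two 3-way mirrors yield 2 x 1 TB = 2 TB -
half the raidz2 figure for the same two-drive failure tolerance per vdev.)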

> MOST systems, by the way, have this sort of paradigm (a huge percentage
> of the data is rarely read and never written) but it doesn't become
> economic or sane to try to separate them until you get well into the
> terabytes of storage range and a half-dozen or so physical volumes.
> There's a very clean argument that prior to that point but with greater
> than one drive mirrored is always the better choice.
>
> Note that if you have an *adapter* go insane (and as I've noted here
> I've had it happen TWICE in my IT career!) then *all* of the data on the
> disks served by that adapter is screwed.

100% with you - been there, done that... and it doesn't matter what OS or
filesystem: a hardware failure where silent data corruption happens because
of an adapter will always take you out (and zfs will not save you in many
cases of that either).
>
> It doesn't make a bit of difference what filesystem you're using in that
> scenario and thus you had better have a backup scheme and make sure it
> works as well, never mind software bugs or administrator stupidity ("dd"
> as root to the wrong target, for example, will reliably screw you every
> single time!)
>
> For a single-disk machine ZFS is no *less* safe than UFS and provides a
> number of advantages, with arguably the most-important being easily-used
> snapshots.

It depends... in normal operation I agree, but when it comes to all or
nothing, that is a matter of perspective.  Personally I prefer to have
in-place recovery options and/or multiple *possible* recovery options
rather than "destroy the pool and recreate it from scratch, hope you
have backups"...

>    Not only does this simplify backups, since coherency during
> the backup is never at issue and incremental backups become fast and
> easily done; in addition, boot environments make roll-forward and even
> *roll-back* reasonable to implement for software updates -- a critical
> capability if you ever run an OS version update and something goes
> seriously wrong with it.  If you've never had that happen then consider
> yourself blessed;

I have been there (especially in the early days of Linux - the pre-0.83
kernel versions :) )
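
For anyone who hasn't used them, the snapshot/boot-environment workflow being
described looks roughly like this (dataset, snapshot and host names are
placeholders; bectl ships with FreeBSD 12, beadm is the older port):

    # take a recursive snapshot of the root pool before an upgrade
    zfs snapshot -r zroot@pre-upgrade

    # or create a whole boot environment
    bectl create pre-upgrade

    # incremental backup: send only what changed since the previous snapshot
    zfs send -R -i zroot@monday zroot@tuesday | \
        ssh backuphost zfs receive -d backup/zroot

    # if the upgrade goes badly, activate the old environment and reboot
    bectl activate pre-upgrade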

>   it's NOT fun to manage in a UFS environment and often
> winds up leading to a "restore from backup" scenario.  (To be fair it
> can be with ZFS too if you're foolish enough to upgrade the pool before
> being sure you're happy with the new OS rev.)
>
Actually I have a simple way with UFS (and ext2/3/4 etc)... split the
boot disk almost down the center: create 3 partitions - root, swap,
altroot.  root and altroot are almost identical; one is always active,
the new OS goes on the other, and you switch to make the other active
(primary) once you've tested it... it only gives one level of roll
forward/roll back, but it works for me and has never failed (boot
disk/OS wise) since I implemented it... but then I don't let anyone else
in the company have root access so they cannot dd or "rm -r . /" or
"rm -r .*" (both of which are the only ways I have done that before -
back in 1994, and never done it since - it's something you learn or you
get out of IT :P .. and for those who didn't get the latter, it should
have been 'rm -r .??*' - and why are you on '-stable'...? :P )
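
A sketch of that layout on a GPT-partitioned FreeBSD boot disk (sizes and
device names are only illustrative):

    # two nearly identical root partitions plus swap on one boot disk
    gpart create -s gpt ada0
    gpart add -t freebsd-boot -s 512k ada0
    gpart add -t freebsd-ufs  -s 60G -l root0 ada0   # active root
    gpart add -t freebsd-swap -s 16G -l swap0 ada0
    gpart add -t freebsd-ufs         -l root1 ada0   # altroot, rest of disk
    gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 ada0

    # after installing and testing the new OS on the altroot partition,
    # tell gptboot(8) to prefer it on the next boot (and fix /etc/fstab there)
    gpart set -a bootme -i 4 ada0
    gpart unset -a bootme -i 2 ada0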

Regards,

-- 
Michelle Sullivan
http://www.mhix.org/




