Date:      Tue, 30 Apr 2019 17:37:17 +0200
From:      Borja Marcos <borjam@sarenet.es>
To:        Michelle Sullivan <michelle@sorbs.net>
Cc:        Karl Denninger <karl@denninger.net>, freebsd-stable@freebsd.org
Subject:   Re: ZFS...
Message-ID:  <75A78DAC-DF85-481B-ABC5-70E5E3960341@sarenet.es>
In-Reply-To: <2e4941bf-999a-7f16-f4fe-1a520f2187c0@sorbs.net>
References:  <30506b3d-64fb-b327-94ae-d9da522f3a48@sorbs.net> <CAOtMX2gf3AZr1-QOX_6yYQoqE-H%2B8MjOWc=eK1tcwt5M3dCzdw@mail.gmail.com> <56833732-2945-4BD3-95A6-7AF55AB87674@sorbs.net> <3d0f6436-f3d7-6fee-ed81-a24d44223f2f@netfence.it> <17B373DA-4AFC-4D25-B776-0D0DED98B320@sorbs.net> <70fac2fe3f23f85dd442d93ffea368e1@ultra-secure.de> <70C87D93-D1F9-458E-9723-19F9777E6F12@sorbs.net> <CAGMYy3tYqvrKgk2c==WTwrH03uTN1xQifPRNxXccMsRE1spaRA@mail.gmail.com> <5ED8BADE-7B2C-4B73-93BC-70739911C5E3@sorbs.net> <d0118f7e-7cfc-8bf1-308c-823bce088039@denninger.net> <2e4941bf-999a-7f16-f4fe-1a520f2187c0@sorbs.net>



> On 30 Apr 2019, at 15:30, Michelle Sullivan <michelle@sorbs.net> wrote:
>
>> I'm sorry, but that may well be what nailed you.
>>
>> ECC is not just about the random cosmic ray.  It also saves your bacon
>> when there are power glitches.
>
> No. Sorry no.  If the data is only half to disk, ECC isn't going to
> save you at all... it's all about power on the drives to complete the
> write.

Not necessarily. Depending on the kind of outage, things can get really
funny during the power loss event itself. More than 25 years ago I
witnessed a severe two-second voltage drop, and during that time the
hard disk in our SCO Unix server went really crazy. Even the low-level
format was corrupted; the damage went way beyond mere filesystem
corruption.

At the start of a power outage (especially when it’s not a clean power
cut but is preceded by some voltage swings), data corruption can be
extensive. As far as I know, high-end systems include power management
elements to reduce the impact.

I have other war stories about UPS systems providing an extremely dirty
waveform and causing format problems in disks. That happened in 1995 or
so.


>>
>> Unfortunately however there is also cache memory on most modern hard
>> drives, most of the time (unless you explicitly shut it off) it's on for
>> write caching, and it'll nail you too.  Oh, and it's never, in my
>> experience, ECC.
>
> No comment on that - you're right in the first part, I can't comment
> if there are drives with ECC.

Even with cache corruption, ZFS, being transaction oriented, should
offer a reasonable guarantee of integrity. You may lose 1 minute or
5 minutes of changes, but there should be stable, committed data on the
disk.

Unless the electronics went insane for a few milliseconds during the
outage event (see above).

>> Oh that is definitely NOT true.... again, from hard experience,
>> including (but not limited to) on FreeBSD.
>>
>> My experience is that ZFS is materially more-resilient but there is no
>> such thing as "can never be corrupted by any set of events."
>
> The latter part is true - and my blog and my current situation is not
> limited to or aimed at FreeBSD specifically,  FreeBSD is my experience.
> The former part... it has been very resilient, but I think (based on
> this certain set of events) it is easily corruptible and I have just
> been lucky.  You just have to hit a certain write to activate the issue,
> and whilst that write and issue might be very very difficult (read: hit
> and miss) to hit in normal every day scenarios it can and will
> eventually happen.

>
>>   Backup
>> strategies for moderately large (e.g. many Terabytes) to very large
>> (e.g. Petabytes and beyond) get quite complex but they're also very
>> necessary.
>>
> and there in lies the problem.  If you don't have a many 10's of
> thousands of dollars backup solutions, you're either:
>
> 1/ down for a looooong time.
> 2/ losing all data and starting again...
>
> ..and that's the problem... ufs you can recover most (in most
> situations) and providing the *data* is there uncorrupted by the fault
> you can get it all off with various tools even if it is a complete
> mess....  here I am with the data that is apparently ok, but the
> metadata is corrupt (and note: as I had stopped writing to the drive
> when it started resilvering the data - all of it - should be intact...
> even if a mess.)

The advantage of ZFS is that it makes it feasible to replicate data. If
you keep a mirror storage server, your disaster recovery actions won’t
require restoring a full backup (which can take an inordinate amount of
time), only reconfiguring the replica server to take over the role of
the master.

Again, being transaction based somewhat reduces the likelihood that a
software bug on the master propagates to the slave and causes extensive
corruption. If that happens, rewinding to the previous snapshot should
help.
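
As a rough illustration of the replication idea above, here is a minimal
Python sketch that only builds the zfs command lines for one incremental
replication cycle and for a rollback on the replica. The dataset names
(tank/data, backup/data) and the snapshot names are made up, and how the
send stream reaches the replica (ssh, for example) is left out. Treat it
as a sketch under those assumptions, not a tested procedure.

```python
# Sketch only: these helpers build argv lists and never run zfs.
# Dataset and snapshot names used below are hypothetical.

def snapshot_cmd(dataset, snap):
    """Command to take the snapshot dataset@snap on the master."""
    return ["zfs", "snapshot", f"{dataset}@{snap}"]


def incremental_send_cmd(dataset, prev_snap, new_snap):
    """Command to stream the changes between two snapshots of a dataset."""
    return ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]


def receive_cmd(replica_dataset):
    """Command to apply a stream on the replica.

    -F rolls the replica back to its most recent snapshot before receiving.
    """
    return ["zfs", "receive", "-F", replica_dataset]


def rollback_cmd(dataset, snap):
    """Command to rewind a dataset to an earlier snapshot.

    -r also destroys any snapshots newer than the target.
    """
    return ["zfs", "rollback", "-r", f"{dataset}@{snap}"]


if __name__ == "__main__":
    # One replication cycle, then a rollback on the replica.
    steps = [
        snapshot_cmd("tank/data", "hourly-02"),
        incremental_send_cmd("tank/data", "hourly-01", "hourly-02"),
        receive_cmd("backup/data"),
        rollback_cmd("backup/data", "hourly-01"),
    ]
    for cmd in steps:
        print(" ".join(cmd))
```

If a bad change on the master does get replicated, the rollback command
is the "rewind" step: the replica goes back to the last snapshot known to
be good, at the cost of the changes made after it.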






Borja.



