Date:      Thu, 02 May 2019 00:53:23 +1000
From:      Michelle Sullivan <michelle@sorbs.net>
To:        Paul Mather <paul@gromit.dlib.vt.edu>
Cc:        freebsd-stable <freebsd-stable@freebsd.org>
Subject:   Re: ZFS...
Message-ID:  <5c458075-351f-6eb6-44aa-1bd268398343@sorbs.net>
In-Reply-To: <7DBA7907-BE8F-4944-9A71-86E5AC1B85CA@gromit.dlib.vt.edu>
References:  <30506b3d-64fb-b327-94ae-d9da522f3a48@sorbs.net> <3d0f6436-f3d7-6fee-ed81-a24d44223f2f@netfence.it> <17B373DA-4AFC-4D25-B776-0D0DED98B320@sorbs.net> <70fac2fe3f23f85dd442d93ffea368e1@ultra-secure.de> <70C87D93-D1F9-458E-9723-19F9777E6F12@sorbs.net> <CAGMYy3tYqvrKgk2c==WTwrH03uTN1xQifPRNxXccMsRE1spaRA@mail.gmail.com> <5ED8BADE-7B2C-4B73-93BC-70739911C5E3@sorbs.net> <d0118f7e-7cfc-8bf1-308c-823bce088039@denninger.net> <2e4941bf-999a-7f16-f4fe-1a520f2187c0@sorbs.net> <CAOtMX2gOwwZuGft2vPpR-LmTpMVRy6hM_dYy9cNiw%2Bg1kDYpXg@mail.gmail.com> <34539589-162B-4891-A68F-88F879B59650@sorbs.net> <CAOtMX2iB7xJszO8nT_KU%2BrFuSkTyiraMHddz1fVooe23bEZguA@mail.gmail.com> <576857a5-a5ab-eeb8-2391-992159d9c4f2@denninger.net> <A7928311-8F51-4C72-839C-C9C2BA62C66E@sorbs.net> <b0fa0f8e-dc45-9d66-cc48-c733cbb9645b@denninger.net> <FD9802E0-E2E4-464A-8ABD-83B0A21C08F2@sorbs.net> <bf63007@sorbs.net> <CB86C16D-87D9-4D3F-9291-1E2586246E04@sorbs.net> <7DBA7907-BE8F-4944-9A71-86E5AC1B85CA@gromit.dlib.vt.edu>

Paul Mather wrote:
> On Apr 30, 2019, at 11:17 PM, Michelle Sullivan <michelle@sorbs.net> 
> wrote:
>
>> Been there done that though with ext2 rather than UFS..  still got 
>> all my data back... even though it was a nightmare..
>
>
> Is that an implication that had all your data been on UFS (or ext2:) 
> this time around you would have got it all back?  (I've got that 
> impression through this thread from things you've written.) That sort 
> of makes it sound like UFS is bulletproof to me.

It's definitely not bulletproof (far from it) - however, when the 
data on disk is not corrupt I have managed to recover it, even if it 
has been a nightmare: no structure, all files in lost+found, etc., or 
even resorting to R-Studio in the event of lost RAID information.
>
> There are levels of corruption.  Maybe what you suffered would have 
> taken down UFS, too? 

Pretty sure it would not have - and even if it had, with the files 
intact I have always been able to recover them, R-Studio being the last 
resort.

> I guess there's no way to know unless there's some way you can 
> recreate exactly the circumstances that took down your original system 
> (but this time your data on UFS). ;-)

True.

In this case - from what my limited knowledge has managed to fathom - a 
spacemap has become corrupt due to a partial write during the hard power 
failure. This was the second hard outage during the resilver that 
followed a drive platter failure (on a RAIDZ2, so a single drive 
failure should be completely recoverable in all cases, barring HBA 
failure or other corruption, which does not appear to be the case 
here). The spacemap fails its checksum - no surprise there, given it 
was part-written - however it cannot be repaired, for whatever reason.
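
For the record, the sort of read-only triage I have been attempting 
looks roughly like this (the pool name 'storage' and the mountpoint are 
hypothetical; the rewind import only helps if an older uberblock 
predates the damage):

    # Inspect the metaslabs/spacemaps of the exported pool without
    # importing it
    zdb -e -m storage

    # -AAA makes zdb press on past assertion failures while dumping state
    zdb -e -AAA -m storage

    # Attempt a rewind import, read-only and under an alternate root, so
    # nothing further is written to the damaged pool
    zpool import -f -F -o readonly=on -R /mnt/rescue storage
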
Now, I get that this is an interesting case: one cannot just assume 
anything about the corrupt spacemap. It could be complete with only the 
checksum wrong, or it could be completely corrupt and ignorable. But 
from what I understand of ZFS (and please, watchers, chime in if I'm 
wrong), the spacemap is just the free-space map. If it is corrupt or 
missing, one cannot just 'fix it', because there is a very good chance 
the 'fix' would corrupt something that is actually allocated. The 
safest repair would therefore be to consider the affected region 100% 
full and write it off as dead space - but ZFS doesn't do that (probably 
a good thing), and the result is that a pool that is supposed to be 
good (zdb reports some 36M+ objects there) becomes completely 
unreadable.
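
For completeness: there are FreeBSD loader tunables that are sometimes 
suggested for exactly this kind of import-time spacemap panic. They do 
not repair anything - they merely downgrade certain fatal assertions so 
the import can limp far enough to copy data off (set in 
/boot/loader.conf before the attempt):

    # /boot/loader.conf
    vfs.zfs.recover=1   # turn some fatal ZFS assertions into warnings
    vfs.zfs.debug=1     # enable extra ZFS debugging output
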
My thought (desire/want) for a 'walk' tool is a last-resort tool that 
could walk the datasets and send them elsewhere (like zfs send), so 
that I could create a new pool elsewhere, send the data it knows about 
to that pool, and then blow away the original. If there are corruptions 
or missing data, that's my problem - it's a last resort. But when the 
critical structures become corrupt, it means a local recovery option 
exists: if the data is all there and the corruption is just a spacemap, 
one can transfer the entire drive's worth of data to a new pool whilst 
the original host is rebuilt. This would *significantly* help most 
people with large pools who currently have to blow them away and 
re-create them because of errors/corruption etc... And with the 
addition of rsync-style checksumming of files, it would be trivial to 
'fix' just the corrupted or missing data from a mirror host rather than 
transferring the entire pool from (possibly) offsite....
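
Where the pool will still import, the existing tools already 
approximate that workflow - a sketch, with 'storage', 'newpool' and 
'mirrorhost' as hypothetical names (the snapshot step assumes a 
writable import; a read-only import can only send snapshots that 
already exist):

    # Take a recursive snapshot, then replicate the whole dataset
    # hierarchy, properties included, to the new pool
    zfs snapshot -r storage@evac
    zfs send -R storage@evac | zfs receive -Fdu newpool

    # Repair individual files that arrived damaged or missing by
    # comparing checksums against a mirror host, instead of re-copying
    # everything
    rsync -a --checksum mirrorhost:/data/ /newpool/data/
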

Regards,


-- 
Michelle Sullivan
http://www.mhix.org/



