Date:      Tue, 9 Apr 2019 20:09:48 -0500
From:      Karl Denninger <karl@denninger.net>
To:        freebsd-stable@freebsd.org
Subject:   Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
Message-ID:  <1866e238-e2a1-ef4e-bee5-5a2f14e35b22@denninger.net>
In-Reply-To: <CACpH0MdLNQ_dqH%2Bto=amJbUuWprx3LYrOLO0rQi7eKw-ZcqWJw@mail.gmail.com>
References:  <f87f32f2-b8c5-75d3-4105-856d9f4752ef@denninger.net> <c96e31ad-6731-332e-5d2d-7be4889716e1@FreeBSD.org> <9a96b1b5-9337-fcae-1a2a-69d7bb24a5b3@denninger.net> <CACpH0MdLNQ_dqH%2Bto=amJbUuWprx3LYrOLO0rQi7eKw-ZcqWJw@mail.gmail.com>

On 4/9/2019 16:27, Zaphod Beeblebrox wrote:
> I have a "Ghetto" home RAID array.  It's built on compromises and makes use
> of RAID-Z2 to survive.  It consists of two plexes of 8x 4T units of
> "spinning rust".  It's been upgraded and upgraded.  It started as 8x 2T,
> then 8x 2T + 8x 4T, then the current 16x 4T.  The first 8 disks are
> connected to motherboard SATA.  IIRC, there are 10.  Two ports are used for
> a mirror that it boots from.  There's also an SSD in there somehow, so it
> might be 12 ports on the motherboard.
>
> The other 8 disks started life in eSATA port multiplier boxes.  That was
> doubleplusungood, so I got a RAID card based on LSI pulled from a fujitsu
> server in Japan.  That's been upgraded a couple of times... not always a
> good experience.  One problem is that cheap or refurbished drives don't
> always "like" SAS controllers and FreeBSD.  YMMV.
>
> Anyways, this is all to introduce the fact that I've seen this behaviour
> multiple times. You have a drive that leaves the array for some amount of
> time, and after resilvering, a scrub will find a small amount of bad data.
> 32 k or 40k or somesuch.  In my cranial schema of things, I've chalked it
> up to out-of-order writing of the drives ... or other such behavior s.t.
> ZFS doesn't know exactly what has been written.  I've often wondered if the
> fix would be to add an amount of fuzz to the transaction range that is
> resilvered.
>
>
> On Tue, Apr 9, 2019 at 4:32 PM Karl Denninger <karl@denninger.net> wrote:
>
>> On 4/9/2019 15:04, Andriy Gapon wrote:
>>> On 09/04/2019 22:01, Karl Denninger wrote:
>>>> the resilver JUST COMPLETED with no errors which means the ENTIRE DISK'S
>>>> IN USE AREA was examined, compared, and blocks not on the "new member"
>>>> or changed copied over.
>>> I think that that's not entirely correct.
>>> ZFS maintains something called DTL, a dirty-time log, for a missing /
>>> offlined / removed device.  When the device re-appears and gets
>>> resilvered, ZFS walks only those blocks that were born within the TXG
>>> range(s) when the device was missing.
>>> In any case, I do not have an explanation for what you are seeing.
>> That implies something much more-serious could be wrong such as given
>> enough time -- a week, say -- that the DTL marker is incorrect and some
>> TXGs that were in fact changed since the OFFLINE are not walked through
>> and synchronized.  That would explain why it gets caught by a scrub --
>> the resilver is in fact not actually copying all the blocks that got
>> changed and so when you scrub the blocks are not identical.  Assuming
>> the detached disk is consistent that's not catastrophically bad IF
>> CAUGHT; where you'd get screwed HARD is in the situation where (for
>> example) you had a 2-unit mirror, detached one, re-attached it, resilver
>> says all is well, there is no scrub performed and then the
>> *non-detached* disk fails before there is a scrub.  In that case you
>> will have permanently destroyed or corrupted data since the other disk
>> is allegedly consistent but there are blocks *missing* that were never
>> copied over.
>>
>> Again this just showed up on 12.x; it definitely was *not* at issue in
>> 11.1 at all.  I never ran 11.2 in production for a material amount of
>> time (I went from 11.1 to 12.0 STABLE after the IPv6 fixes were posted
>> to 12.x) so I don't know if it is in play on 11.2 or not.
>>
>> I'll see if it shows up again with 20.00.07.00 card firmware.
>>
>> Of note I cannot reproduce this on my test box with EITHER 19.00.00.00
>> or 20.00.07.00 firmware when I set up a 3-unit mirror, offline one, make
>> a crap-ton of changes, offline the second and reattach the third (in
>> effect mirroring the "take one to the vault" thing) with a couple of
>> hours elapsed time and a synthetic (e.g. "dd if=/dev/random of=outfile
>> bs=1m" sort of thing) "make me some new data that has to be resilvered"
>> workload.  I don't know if that's because I need more entropy in the
>> filesystem than I can reasonably generate this way (e.g. more
>> fragmentation of files, etc) or whether it's a time-based issue (e.g.
>> something's wrong with the DTL/TXG thing as you note above in terms of
>> how it functions and it only happens if the time elapsed causes
>> something to be subject to a rollover or similar problem.)
>>
>> I spent quite a lot of time trying to reproduce the issue on my
>> "sandbox" machine and was unable to -- and of note it is never a large
>> quantity of data that is impacted; it's usually only a couple of dozen
>> checksums that show as bad and fixed.  Of note it's also never just one;
>> if there was a single random hit on a data block due to ordinary bitrot
>> sort of issues I'd expect only one checksum to be bad.  But generating a
>> realistic synthetic workload over the amount of time involved on a
>> sandbox is not trivial at all; the system on which this is now happening
>> handles a lot of email and routine processing of various sorts including
>> a fair bit of database activity associated with network monitoring and
>> statistical analysis.
>>
>> I'm assuming that using "offline" as a means of doing this sort of thing
>> hasn't become "invalid" or no longer considered "ok".... it certainly has
>> worked perfectly well for a very long time!
>>
>> --
>> Karl Denninger
>> karl@denninger.net <mailto:karl@denninger.net>
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/

The problem with the theory you have put forward, Zaphod (which is
logical), is that in my *specific* case it shouldn't happen -- which
implies (strongly) that there's a code problem somewhere and that what
you're seeing is not a result of random drive write-reordering.

Specifically, I *explicitly* OFFLINE the disk in question, which is a
controlled operation and *should* result in a cache flush out of the ZFS
code into the drive before it is OFFLINE'd.

This should result in the "last written" TXG that the remaining online
members have, and the one in the offline member, being consistent.

Then I "camcontrol standby" the involved drive, which forces a writeback
cache flush and a spindown; in other words, re-ordered or not, the
on-platter data *should* be consistent with what the system thinks
happened before I yank the physical device.
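
(For concreteness, the controlled removal amounts to roughly the following --
"tank" and "da5" here are placeholders, not the actual pool or device names
on my system:

    zpool offline tank da5      # take the mirror member offline in a controlled fashion
    camcontrol standby da5      # flush the drive's write-back cache and spin it down

after which the drive gets pulled and rotated off-site.)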

But... it's not.

And these are not "ghetto" devices in my case either -- they're all
either NAS or enterprise drives and all of them are on an LSI/Avago
SATA/SAS adapter -- with no expanders involved (any more, anyway; under
11.1 the expander was ENTIRELY reliable, but under 11.2 and 12.0 it's
not, so I removed it.)  No "grab whatever" sort of disks are involved in
this, and I've seen the same behavior across multiple makes and models,
which strongly implies it's not a random firmware bug in some specific
model of disk.

In the case of a *random* detach I can see it, because the disk that
detaches and goes offline might have some number of transactions in the
write cache which are lost when the power fails on said device, and thus
the committed TXG number is inconsistent with what the remaining online
devices have, and write re-ordering could screw you with no way for the
code to know it occurred.

But in this case the offline was intentional and thus any queued writes
should have been flushed down from the OS to the device, and the standby
command should flush any on-drive cache to the physical media before the
spindle is spun down -- thus there should be no possible way for an
out-of-sync "last written" TXG to happen.

Unless there's a code problem somewhere....
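
(For anyone who wants to try to reproduce this, the sandbox test quoted above
boils down to roughly the following, simplified to a single offline/online
cycle; "testpool" and da1/da2/da3 are placeholder names:

    zpool create testpool mirror da1 da2 da3
    zpool offline testpool da3
    camcontrol standby da3
    # generate a pile of new data so there is plenty to resilver
    dd if=/dev/random of=/testpool/junk bs=1m count=10240
    # hours later, bring the member back and let it resilver
    zpool online testpool da3
    zpool status testpool          # check resilver progress; wait for it to finish
    zpool scrub testpool
    zpool status -v testpool       # on the production pool this is where the bad checksums show up

On the sandbox that sequence comes back clean under both 19.00.00.00 and
20.00.07.00 firmware; on the production pool the equivalent rotation does not.)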

-- 
Karl Denninger
karl@denninger.net <mailto:karl@denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/



