Date:      Thu, 22 May 2014 09:00:33 -0500
From:      Karl Denninger <karl@denninger.net>
To:        freebsd-fs@freebsd.org
Subject:   Re: Turn off RAID read and write caching with ZFS? [SB QUAR: Thu May 22 08:33:59 2014]
Message-ID:  <537E0301.4010509@denninger.net>
In-Reply-To: <alpine.GSO.2.01.1405220825290.1735@freddy.simplesystems.org>
References:  <719056985.20140522033824@supranet.net> <537DF2F3.10604@denninger.net> <alpine.GSO.2.01.1405220825290.1735@freddy.simplesystems.org>


On 5/22/2014 8:33 AM, Bob Friesenhahn wrote:
> On Thu, 22 May 2014, Karl Denninger wrote:
>>
>> Write-caching is very evil in a ZFS world, because ZFS checksums each
>> block. If the filesystem gets back an "OK" for a block not actually
>> on the disk ZFS will presume the checksum is ok.  If that assumption
>> proves to be false down the road you're going to have a very bad day.
>
> I don't agree with the above statement.  Non-volatile write caching is
> very beneficial for zfs since it allows transactions (particularly
> synchronous zil writes) to complete much quicker. This is important
> for NFS servers and for databases.  What is important is that the
> cache either be non-volatile (e.g. battery-backed RAM) or absolutely
> observe zfs's cache flush requests.  Volatile caches which don't obey
> cache flush requests can result in a corrupted pool on power loss,
> system panic, or controller failure.
>
> Some plug-in RAID cards have poorly performing firmware which causes
> problems.  Only testing or experience from other users can help
> identify such cards so that they can be avoided or set to their least
> harmful configuration.
>
Let's think this one through.

You have said disk on said controller.

It has a battery-backed RAM cache and JBOD drives on it.

Your database says "Write/Commit" and the controller does, to cache, and
says "ok, done."  The data is now in the battery-backed cache. Let's
further assume the cache is ECC-corrected and we'll accept the risk of
an undetected ECC failure (very, very long odds on that one so that
seems reasonable.)

Some time passes and other I/O takes place without incident.

Now the *DRIVE* returns an unrecoverable data error during the actual
write to spinning rust when the controller (eventually) flushes its cache.

Note that the controller can't rebuild the drive as it doesn't have a
second copy; it's JBOD.  When does the operating system find out about
the fault and what locality of the fault does it learn about?

Be very careful with your assumptions here.  If there is more than one
filesystem on that drive the I/O that actually returns a fault (because
of when it is detected) may in fact be to a *different filesystem* than
the one that actually faulted!
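
The misattribution described above can be sketched with a toy model in
Python.  This is an illustration only: the class, its methods and the
filesystem names are hypothetical, and reporting a deferred failure on
the next incoming request is an assumed behavior, not a description of
any particular controller's firmware.

```python
class ToyWriteBackController:
    """Toy model: acknowledges writes from cache, flushes to the drive later."""

    def __init__(self, bad_blocks):
        self.bad_blocks = set(bad_blocks)  # blocks the drive cannot write
        self.cache = []                    # pending (filesystem, block) writes
        self.pending_error = None          # deferred failure not yet reported

    def write(self, filesystem, block):
        # Assumed behavior: a failure found during an earlier flush is
        # reported on whichever request happens to arrive next.
        if self.pending_error is not None:
            failed, self.pending_error = self.pending_error, None
            return (f"ERROR returned to {filesystem} "
                    f"(failed write belonged to {failed[0]}, block {failed[1]})")
        self.cache.append((filesystem, block))
        return f"OK to {filesystem}"       # acknowledged before reaching the disk

    def flush(self):
        # Deferred write-out: only now does the drive get a chance to fail.
        for fs, block in self.cache:
            if block in self.bad_blocks and self.pending_error is None:
                self.pending_error = (fs, block)
        self.cache.clear()


ctl = ToyWriteBackController(bad_blocks={7})
results = [ctl.write("fs_a", 7)]        # acknowledged although block 7 is bad
ctl.flush()                             # the drive fails; nobody is waiting on it
results.append(ctl.write("fs_b", 42))   # the error lands on an unrelated filesystem
for line in results:
    print(line)
```

In this model fs_a was told "OK" and never hears otherwise, while fs_b
receives an error for a write that was not its own.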

The only safe thing for the adapter to do if it detects a failure on a
deferred (battery-backed) write is to declare the entire *disk* dead and
return error for all subsequent I/O attempts to it, effectively forcing
all data on that pack to be declared "gone" at the OS level.  You better
hope the adapter does that (are you sure yours does?) or you're going to
get a surprise of a most-unpleasant sort because there is no way for the
adapter to go back and declare a formerly-committed-and-confirmed I/O
invalid.

At a minimum by doing this you have multiplied a single-block failure
into a failure of *all* blocks on the media as soon as the first one
fails.  In practice that may not be all that far off the mark (drives
have a distressing habit of failing far more than one block at a time)
but to force that behavior is something you should be aware of.
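
As a rough illustration of that multiplication, here is a minimal Python
sketch of the fail-stop policy.  The class name, its methods and the
block count are made up for the example; it models the policy, not a
real adapter.

```python
class FailStopDisk:
    """Toy model of a disk the adapter marks dead after one deferred failure."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.failed_block = None
        self.dead = False

    def deferred_write_failed(self, block):
        # The adapter cannot retract its earlier "OK", so it fails the whole disk.
        self.failed_block = block
        self.dead = True

    def read(self, block):
        if self.dead:
            raise IOError(f"disk offline, block {block} unavailable")
        return f"data for block {block}"

    def unreadable_blocks(self):
        # Once the disk is declared dead, every block is lost to the OS.
        return self.total_blocks if self.dead else 0


disk = FailStopDisk(total_blocks=1_000_000)   # hypothetical disk size
before = disk.unreadable_blocks()
disk.deferred_write_failed(block=7)           # one block actually failed
after = disk.unreadable_blocks()
print(before, after)
```

One bad block on the media turns into every block on the disk being
unavailable to the operating system.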

There is a very good argument for what amounts to a battery-backed RAM
"disk" for ZIL for the reasons you noted.  And I do agree there are
significant performance improvements to be had from battery-backed RAM
adapters in a ZFS environment (by the way, set the zfs logbias to
"throughput" rather than "latency" if you're using a controller cache
since ZFS is incapable of deterministically predicting latency and that
can lead to some really odd behavior) but in terms of operational
integrity you are taking risk by doing this.
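
For reference, the logbias setting mentioned above is a per-dataset
property changed with "zfs set".  The Python sketch below only builds
the command line; the helper names are invented for the example and
"tank/db" is a hypothetical dataset name.  Actually applying it needs a
host with ZFS and sufficient privileges.

```python
import subprocess


def logbias_command(dataset, value="throughput"):
    """Build the argv for `zfs set logbias=<value> <dataset>`."""
    if value not in ("latency", "throughput"):
        raise ValueError("logbias accepts 'latency' or 'throughput'")
    return ["zfs", "set", f"logbias={value}", dataset]


def apply_logbias(dataset, value="throughput"):
    # Not called here: this runs the real command and raises on failure.
    subprocess.run(logbias_command(dataset, value), check=True)


cmd = logbias_command("tank/db")   # "tank/db" is a hypothetical dataset name
print(" ".join(cmd))
```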

Then again we lived with that risk in the world before ZFS and
hardware-backed RAID in that an *undetected* sector fault was
potentially ruinous, and since individual blocks were not checksummed it
did occasionally happen.
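
A small Python sketch of the difference per-block checksums make.
SHA-256 is used here purely for illustration and is not a claim about
which checksum a given pool uses; the payload and variable names are
invented for the example.

```python
import hashlib


def checksum(block: bytes) -> str:
    # SHA-256 stands in for the filesystem's block checksum (illustrative choice).
    return hashlib.sha256(block).hexdigest()


original = b"payload written by the application"
stored_sum = checksum(original)          # kept apart from the data block

# Simulate a sector that silently comes back with one bit flipped.
damaged = bytearray(original)
damaged[0] ^= 0x01
returned = bytes(damaged)

without_checksum = "accepted"            # nothing to compare against
with_checksum = "accepted" if checksum(returned) == stored_sum else "rejected"
print(without_checksum, with_checksum)
```

Without a stored checksum the damaged block is handed to the
application as if it were good; with one, the mismatch is caught on
read.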

All configurations carry risk and you have to evaluate which ones you're
willing to live with and which ones you simply cannot accept.

-- 
-- Karl
karl@denninger.net







