Date:      Thu, 22 May 2014 09:19:32 -0500
From:      Rick Romero <rick@havokmon.com>
To:        freebsd-fs@freebsd.org
Subject:   Re: Turn off RAID read and write caching with ZFS?
Message-ID:  <20140522091932.Horde.hsT5LUjnShIYq2YrtCVdnA1@www.vfemail.net>
In-Reply-To: <537E0301.4010509@denninger.net>
References:  <719056985.20140522033824@supranet.net> <537DF2F3.10604@denninger.net> <alpine.GSO.2.01.1405220825290.1735@freddy.simplesystems.org> <537E0301.4010509@denninger.net>

  Quoting Karl Denninger <karl@denninger.net>:

> On 5/22/2014 8:33 AM, Bob Friesenhahn wrote:
>> On Thu, 22 May 2014, Karl Denninger wrote:
>>> Write-caching is very evil in a ZFS world, because ZFS checksums each
>>> block. If the filesystem gets back an "OK" for a block not actually on
>>> the disk ZFS will presume the checksum is ok.  If that assumption
>>> proves to be false down the road you're going to have a very bad day.
>>
>> I don't agree with the above statement.  Non-volatile write caching is
>> very beneficial for zfs since it allows transactions (particularly
>> synchronous zil writes) to complete much quicker. This is important for
>> NFS servers and for databases.  What is important is that the cache
>> either be non-volatile (e.g. battery-backed RAM) or absolutely observe
>> zfs's cache flush requests.  Volatile caches which don't obey cache
>> flush requests can result in a corrupted pool on power loss, system
>> panic, or controller failure.
>>
>> Some plug-in RAID cards have poorly performing firmware which causes
>> problems.  Only testing or experience from other users can help
>> identify such cards so that they can be avoided or set to their least
>> harmful configuration.
>
> Let's think this one through.
>
> You have said disk on said controller.
>
> It has a battery-backed RAM cache and JBOD drives on it.
>
> Your database says "Write/Commit" and the controller does, to cache, and
> says "ok, done."  The data is now in the battery-backed cache. Let's
> further assume the cache is ECC-corrected and we'll accept the risk of
> an undetected ECC failure (very, very long odds on that one so that
> seems reasonable.)
>
> Some time passes and other I/O takes place without incident.
>
> Now the *DRIVE* returns an unrecoverable data error during the actual
> write to spinning rust when the controller (eventually) flushes its
> cache.

Technically, you have the same problem with the local drive's own cache. But
disabling the write cache on every device just to satisfy ZFS makes things
ungodly slow - IMHO.
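
To make that trust chain concrete, here's a minimal sketch (my own
illustration, not anything out of ZFS - it assumes a POSIX system, and
"journal.log" is just a placeholder path) of the kind of synchronous commit
a database or NFS server issues. The fsync() return value is only as good
as every cache layer below it:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char rec[] = "committed record\n";
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, rec, sizeof(rec) - 1) != (ssize_t)(sizeof(rec) - 1)) {
        perror("write"); return 1;
    }
    /* A volatile cache that ignores flush requests can return success
     * here for data that evaporates on power loss - exactly the
     * corruption case Bob describes. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }

    return close(fd);
}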

Also, IMHO, your scenario is a bit overstated. In that case the drive
should mark the sector as bad and write its cached data to a spare sector,
instead of going down the path of having the controller disable the entire
disk as you described.

And in the case where the controller does disable the entire drive, that is
actually safer under a controller-based RAID scenario, because the
controller cache can commit the data to a different drive if that entire
drive fails. When run as cached JBOD - then sure, you could be hosed if the
entire drive fails and it's not caught before a write.

So basically, IMHO again, if you run write cache on the controller and have
a battery-backed cache (BBC) plus a UPS, then use controller-based RAID.
Don't disable the drive cache in either case, unless you want complete ZFS
protection at the cost of performance.
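
If you want to know which camp your particular stack falls into before
trusting it, a rough probe like the following (again my own sketch, nothing
official - "testfile" is a placeholder on the pool under test) times
flushed writes. A bare 7200 RPM disk can't complete much more than roughly
120 honest flushes per second, so thousands per second on spinning rust
means some cache below is acknowledging writes it hasn't persisted:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 100

int main(void)
{
    char block[512] = { 0 };
    struct timespec start, end;
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        /* Overwrite the same block and demand it be on stable storage
         * before continuing - each iteration costs at least one
         * platter rotation if the flush is honored. */
        if (pwrite(fd, block, sizeof(block), 0) != (ssize_t)sizeof(block) ||
            fsync(fd) != 0) {
            perror("pwrite/fsync"); return 1;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) +
                  (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%d flushed writes in %.2fs (%.0f/sec)\n",
           ITERATIONS, secs, ITERATIONS / secs);
    return 0;
}

Note that a BBC will legitimately post big numbers here - that's what you
paid for - so the result is only damning on a bare drive or a cache you
have reason to believe is volatile.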

I have had ZFS flag a power supply issue by repeatedly disabling drives -
visibility a controller-based RAID layer would have hidden from me - so I
don't recommend controller-based RAID + write cache; just take the
performance hit.

Rick


