Date:      Fri, 14 Mar 2014 16:28:59 -0400
From:      Richard Yao <ryao@gentoo.org>
To:        Edward Tomasz Napierała <trasz@FreeBSD.org>
Cc:        "freebsd-hackers@FreeBSD.org" <freebsd-hackers@FreeBSD.org>, RW <rwmaillists@googlemail.com>, Ian Lepore <ian@FreeBSD.org>
Subject:   Re: GSoC proposition: multiplatform UFS2 driver
Message-ID:  <F5E8863B-7889-4B3A-9D3E-DC70EAC031C2@gentoo.org>
In-Reply-To: <9DA009CD-0629-4402-A2A0-0A6BDE1E86FD@FreeBSD.org>
References:  <CAA3ZYrCPJ1AydSS9n4dDBMFjHh5Ug6WDvTzncTtTw4eYrmcywg@mail.gmail.com> <20140314152732.0f6fdb02@gumby.homeunix.com> <1394811577.1149.543.camel@revolution.hippie.lan> <0405D29C-D74B-4343-82C7-57EA8BEEF370@FreeBSD.org> <53235014.1040003@gentoo.org> <9DA009CD-0629-4402-A2A0-0A6BDE1E86FD@FreeBSD.org>

On Mar 14, 2014, at 3:18 PM, Edward Tomasz Napierała <trasz@FreeBSD.org> wrote:

> Message written by Richard Yao on 14 Mar 2014, at 19:53:
>> On 03/14/2014 02:36 PM, Edward Tomasz Napierała wrote:
>>> Message written by Ian Lepore on 14 Mar 2014, at 16:39:
>>>> On Fri, 2014-03-14 at 15:27 +0000, RW wrote:
>>>>> On Thu, 13 Mar 2014 18:22:10 -0800
>>>>> Dieter BSD wrote:
>>>>>
>>>>>> Julio writes,
>>>>>>> That being said, I do not like the idea of using NetBSD's UFS2
>>>>>>> code. It lacks Soft-Updates, which I consider to make FreeBSD UFS2
>>>>>>> second only to ZFS in desirability.
>>>>>>
>>>>>> FFS has been in production use for decades.  ZFS is still wet behind
>>>>>> the ears. Older versions of NetBSD have soft updates, and they work
>>>>>> fine for me. I believe that NetBSD 6.0 is the first release without
>>>>>> soft updates.  They claimed that soft updates was "too difficult" to
>>>>>> maintain.  I find that soft updates are *essential* for data
>>>>>> integrity (I don't know *why*, I'm not a FFS guru).
>>>>>
>>>>> NetBSD didn't simply drop soft-updates, they replaced it with
>>>>> journalling, which is the approach used by practically all modern
>>>>> filesystems.
>>>>>
>>>>> A number of people on the questions list have said that they find
>>>>> UFS+SU to be considerably less robust than the journalled filesystems
>>>>> of other OS's.
>>>
>>> Let me remind you that some other OS-es had problems such as truncation
>>> of files which were _not_ written (XFS), silently corrupting metadata when
>>> there were too many files in a single directory (ext3), and panicking instead
>>> of returning ENOSPC (btrfs).  ;->
>>
>> Let's be clear that such problems live between the VFS and block layer
>> and therefore are isolated to specific filesystems. Such problems
>> disappear when using ZFS.
>
> Such problems disappear after fixing bugs that caused them.  Just like
> with ZFS - some people _have_ lost zpools in the past.

People who get in touch with me about problems can usually save their
pools; I cannot recall an incident where a user came to me for help and
suffered complete loss of a pool. However, there have been incidents of
partial data loss involving user error (running zfs destroy on data you
want to keep is bad), faulty memory (one user ignored my warnings about
non-ECC memory, put it into production without running memtest, and then
blamed ZFS) and two incidents caused by bugs in ZoL's autotools checks
that disabled flushing to disk. Regression tests have since been put in
place to catch the errors that permitted the latter two.

>
>>>> What I've seen claimed is that UFS+SUJ is less robust.  That's a very
>>>> different thing than UFS+SU.  Journaling was nailed onto the side of
>>>> UFS+SU as an afterthought, and it shows.
>>>
>>> Not really - it was developed rather recently, and with filesystems it usually
>>> shows, but it's not "nailed onto the side": it complements SU operation
>>> by journalling the few things which SU doesn't really handle and which
>>> used to require background fsck.
>>>
>>> One problem with SU is that it depends on hardware not lying about
>>> write completion.  Journalling filesystems usually just issue flushes
>>> instead.
>>
>> This point, that write completion is reported for unflushed data and no
>> flushes are issued, could explain the disconnect between RW's statements
>> and what Soft Updates should accomplish. However, it does not change my
>> assertion that placing UFS SU on a ZFS zvol will avoid such failure
>> modes.
>
> Assuming everything between UFS and ZFS below behaves correctly.

For ZFS, this means that the hardware honors flushes and does not
deduplicate data (e.g., SandForce controllers), so that ditto blocks have
an effect. The latter failure mode does not appear to have been observed
in the wild. The former has never been observed, to my knowledge, when
ZFS is given the physical disks and the SAS/SATA controller does not do
write caching of its own. It has been observed on certain iSCSI targets,
though.
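
To make the flush requirement concrete, here is a minimal userland sketch
(not ZFS code; the path and the program itself are just an illustration)
of what "honoring flushes" means: once fsync() returns, the data must be
on stable media, which on a well-behaved stack means the drive was asked
to flush its cache (BIO_FLUSH at the FreeBSD block layer) and did not
acknowledge that request early.

#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char buf[] = "committed data\n";
	int fd = open("/tmp/flush-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd == -1)
		err(1, "open");
	if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf))
		err(1, "write");
	/* At this point the data may still sit in a volatile drive cache. */
	if (fsync(fd) == -1)
		err(1, "fsync");
	/* Only after a successful fsync() may it be reported as durable. */
	close(fd);
	return (0);
}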

>> In ZFS, we have a two-stage transaction commit that issues a
>> flush at each stage to ensure that data goes to disk, no matter what the
>> drive reported. Unless the hardware disobeys flushes, the second stage
>> cannot happen if the first stage does not complete, and if the second
>> stage does not complete, all changes are ignored.
>>
>> What keeps soft updates from issuing a flush following write completion?
>> If there are no pending writes, it is a noop. If the hardware lies, then
>> this will force the write. The internal dependency tracking mechanisms
>> in Soft Updates should make it rather simple to figure out when a flush
>> needs to be issued should hardware have lied about completion. At a high
>> level, what needs to be done is to batch the things that can be done
>> simultaneously and separate those that cannot by flushes. If such
>> behavior is implemented, it should have a mount option for toggling it.
>> It simply is not needed on well-behaved devices, such as ZFS zvols.
>
> As you say, it's not needed on well-behaved devices.  While it could
> help with crappy hardware, I think it would be either very complicated
> (batching, as described), or would perform very poorly.

For ZFS, a well-behaved device is a device that honors flushes. As long
as flush semantics are obeyed, ZFS should be fine. The only exceptions
known to me involve drives that deduplicate ZFS ditto blocks (so far
unobserved in the wild), non-ECC RAM (which breaks everything equally)
and driver bugs (ZFS does not replace backups). UFS Soft Updates seems to
have stricter requirements than ZFS in that I/O completion reports must
be honest, but the end result is not as good because there are no ditto
blocks or checksums forming a Merkle tree. Also, in all fairness, ZFS
relies on completion reports too, but for performance purposes, not
consistency.
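
To illustrate the flush-bracketed, two-stage commit quoted above, here is
a self-contained sketch of the ordering; the struct and helper functions
are hypothetical stand-ins for illustration only, not actual ZFS code:

#include <stdio.h>

struct txg { int id; };	/* hypothetical transaction group handle */

/* Stand-in: write all new data and metadata blocks into free space. */
static int write_blocks(struct txg *tx) { printf("txg %d: write blocks\n", tx->id); return 0; }
/* Stand-in: issue a cache flush and wait until the device completes it. */
static int flush_cache(struct txg *tx) { printf("txg %d: flush\n", tx->id); return 0; }
/* Stand-in: write the commit record (uberblock) that makes the new tree live. */
static int write_uberblock(struct txg *tx) { printf("txg %d: write uberblock\n", tx->id); return 0; }

static int
commit_txg(struct txg *tx)
{
	if (write_blocks(tx) != 0)	/* stage 1 */
		return (-1);
	if (flush_cache(tx) != 0)	/* stage 1 is now on stable media */
		return (-1);
	/*
	 * Stage 2 only starts after stage 1 is durable.  A crash before
	 * the final flush leaves the old tree intact, so the transaction
	 * simply never happened; it is never seen half-applied.
	 */
	if (write_uberblock(tx) != 0)
		return (-1);
	return (flush_cache(tx));	/* stage 2 is now durable */
}

int
main(void)
{
	struct txg tx = { 1 };

	return (commit_txg(&tx) == 0 ? 0 : 1);
}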

> To be honest, I wonder how many problems could be avoided by
> disabling write cache by default.  With NCQ it shouldn't cause
> performance problems, right?

I think you need to specify which cache causes the problem. There is the
buffer cache (removed in recent FreeBSD and bypassed in Linux by
ZFSOnLinux), the RAID controller cache (using this gives good performance
numbers, but is terrible for reliability) and the actual drive cache (ZFS
is okay with this; UFS2 with SU possibly not).
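
As a concrete example of the last knob: if my reading of ada(4) is right,
FreeBSD exposes the ATA drive write cache policy as the sysctl
kern.cam.ada.write_cache (1 to enable, 0 to disable, -1 to leave the
drive's default alone). A minimal sketch that just reads the current
setting follows; setting it would be the same call with a new value and
needs root.

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
	int wc;
	size_t len = sizeof(wc);

	/* Read the global ATA write cache policy (see ada(4)). */
	if (sysctlbyname("kern.cam.ada.write_cache", &wc, &len, NULL, 0) == -1)
		err(1, "sysctlbyname");
	printf("kern.cam.ada.write_cache = %d\n", wc);
	return (0);
}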


