Date:      Thu, 14 Mar 2002 12:35:52 +0300
From:      "Parity Error" <bootup@mail.ru>
To:        "Terry Lambert" <tlambert2@mindspring.com>
Cc:        freebsd-fs@FreeBSD.org
Subject:   Re[2]: metadata update durability ordering/soft updates
Message-ID:  <E16lReK-000C3T-00@f10.mail.ru>
In-Reply-To: <3C8FA1E4.A89F52FF@mindspring.com>

I am referring not to file data but to filesystem metadata, which is now
_delayed_ write.
When we did synchronous writes to sequence the multiple metadata updates
belonging to one operation, to ensure recoverability of that one operation,
we also got inter-operation ordering for free (and apps/users could have
started depending on it). Unix provides no guarantees regarding the order in
which file data will become stable, and apps should use fsync/O_SYNC or
logging or whatever to ensure the consistency of their data stores.

But the order in which different metadata operations become stable, if not
enforced, could result in the following scenario.

mkdir a
touch a/file{0,1}{0,1}{0,1}{0,1}
mkdir a/b
touch a/b/file{0,1}{0,1}{0,1}{0,1}

< a crash happens sometime later >

After recovery, it could turn out that all of a/b/file* are there, but only a
few of a/file* are there (possibly those in the first dir block). This kind of
thing would not occur when we did synchronous writes of metadata (disk
scheduling would not affect this). unlink could possibly produce even more
dramatic effects. Now the question is whether this kind of behaviour from the
filesystem is acceptable, and whether some applications can actually fail
badly because of it.
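
For illustration only, here is a minimal sketch of how an application that
really did depend on a/file* being stable before a/b/file* might ask for that
explicitly instead of relying on the filesystem. It assumes that fsync() on a
descriptor opened on a directory pushes that directory's entries to disk; that
is a common technique but not something this thread confirms for every
filesystem. The names sync_dir, touch_file, make_dir and make_tree_in_order
are hypothetical, and only two files per directory are created to keep it short.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fsync a directory so the entries created in it
 * so far are requested to be on disk (assumption: the filesystem
 * honours fsync on a directory descriptor). */
int sync_dir(const char *dir)
{
    int fd = open(dir, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/* Create an empty file if it does not exist, roughly like touch(1). */
int touch_file(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    return close(fd);
}

/* Create a directory; an already existing one is not an error. */
static int make_dir(const char *path)
{
    if (mkdir(path, 0755) < 0 && errno != EEXIST)
        return -1;
    return 0;
}

/* Build root/file{0,1} first, force them out, and only then build
 * root/b/file{0,1}. Returns 0 on success, -1 on the first failure. */
int make_tree_in_order(const char *root)
{
    char path[1024];

    if (make_dir(root) < 0)
        return -1;
    for (int i = 0; i < 2; i++) {
        snprintf(path, sizeof path, "%s/file%d", root, i);
        if (touch_file(path) < 0)
            return -1;
    }
    /* Barrier: do not start the second batch until the first is synced. */
    if (sync_dir(root) < 0)
        return -1;

    snprintf(path, sizeof path, "%s/b", root);
    if (make_dir(path) < 0)
        return -1;
    for (int i = 0; i < 2; i++) {
        snprintf(path, sizeof path, "%s/b/file%d", root, i);
        if (touch_file(path) < 0)
            return -1;
    }
    snprintf(path, sizeof path, "%s/b", root);
    if (sync_dir(path) < 0)
        return -1;
    return sync_dir(root); /* the new entry "b" lives in root */
}
```

Calling make_tree_in_order("a") would mirror the scenario above. The cost is
the one soft updates was meant to avoid: each sync_dir call waits for the disk.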


-----Original Message-----
From: Terry Lambert <tlambert2@mindspring.com>
To: Parity Error <bootup@mail.ru>
Date: Wed, 13 Mar 2002 11:00:52 -0800
Subject: Re: metadata update durability ordering/soft updates


Parity Error wrote:
> with soft-updates metadata updates are delayed write. I am
> wondering if, say there are two independent structural changes,
> one after another, and then a crash happens.
> 
> Is there a possibility that the latter structural change got
> written to disk before the former due to some memory replacement
> policy ?

Independent writes are independent, by definition.  They
are permitted to occur in either order.  Metadata updates
are only ordered by soft updates insofar as necessary to
satisfy dependencies.  Thus independent writes can occur
in any order, but will *usually* occur in order, due to
the way that a scheduled write can not be reordered once it
is given to the disk controller.

This is due to a locking issue on the disk operations queue
in the driver, and is arguably a bug.  It's likely that some
work currently in progress will proceed to the point that the
"likely ordering" of independent operations will "go away" in
the future, so you can't even safely depend on it being likely.

This is normally an issue only for updates that do things
like update both an index and a record file, and imply a
dependency order in the operation.  In other words, there
is implied metadata between the two files, and therefore an
implied dependency.

It's the application's responsibility to signal the dependency
to the OS, so that the updates are ordered.  The normal way to
do this is to use a two stage commit operation (per standard
database theory, circa IBM, 1965).  In UNIX this is done by
requesting that the first operation be committed, before making
the request to begin the second operation (e.g. a software
barrier instruction).  To find out more about this, you should
use "man fsync" and "man open" (in the "open" page, look for
"O_FSYNC").
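
As a rough sketch of the index-and-record case described above (not code from
this thread): the record is appended and fsync'ed first, and only then is the
index entry written, so the index should never point at a record that is not
yet on disk. The function names add_record, append_durably and write_all, the
file layout, and the single-writer assumption are all made up for
illustration. Opening the files with the O_FSYNC flag mentioned above would be
an alternative to the explicit fsync() calls.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all of buf, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: append len bytes to path and fsync before
 * returning. On success, stores the offset where the data starts in
 * *start (if start is not NULL). Assumes a single writer. */
static int append_durably(const char *path, const void *buf, size_t len,
                          off_t *start)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    off_t off = lseek(fd, 0, SEEK_END);
    int rc = 0;
    if (off < 0 || write_all(fd, buf, len) < 0 || fsync(fd) < 0)
        rc = -1;
    if (close(fd) < 0)
        rc = -1;
    if (rc == 0 && start != NULL)
        *start = off;
    return rc;
}

/* Stage 1: commit the record. Stage 2: commit the index entry that
 * refers to it. Stage 2 is not started until stage 1 has returned. */
int add_record(const char *record_path, const char *index_path,
               const void *rec, size_t len)
{
    off_t where;

    if (append_durably(record_path, rec, len, &where) < 0)
        return -1;
    /* The record has been fsync'ed; now the index may refer to it. */
    return append_durably(index_path, &where, sizeof where, NULL);
}
```

If a crash hits between the two stages, the worst case under these assumptions
is a record with no index entry, which the application can detect or ignore.
As the rest of this message explains, this holds only if the drive does not
lie about fsync.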


As to misordering of dependent writes, even if you use
synchronous I/O properly...

Yes, this can happen due to the memory replacement policy
on many IDE hard drives, which lie about data having been
committed to stable storage, when in fact it has only been
written to the disk write cache, which is far from stable
storage, being as it's not battery backed, and it is not
guaranteed to be written to the disk after a power failure,
except on some IBM and Quantum drives which are no longer
manufactured.

You can ensure this doesn't happen to you by using only
disks which can correctly support cache flush primitives
and tagged command queues, or disabling write caching on
the device.  SCSI devices don't have this problem.

Another potential problem is that some IDE disks will
acknowledge disabling write caching, but will in fact not
disable it, no matter what commands you spit at them.  For
some of these disks, there are firmware updates available,
but if you are unlucky enough to own one of these disks,
then there is usually no option but to buy a good disk
instead.  May I recommend SCSI?


> could this affect the correctness of some applications ?

The disk caching issue could.  The implied metadata could
not.

If you have an application that uses implied metadata, but
does not take the necessary steps for UNIX to ensure that
the OS is signalled about the implied ordering dependency,
then by definition, your application can't have its
correctness affected... since it has no correctness to lose.

8-).

-- Terry


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message



