FreeBSD Mail Archives

Date:      Thu, 14 Mar 2002 12:47:19 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Parity Error <bootup@mail.ru>
Cc:        freebsd-fs@FreeBSD.org
Subject:   Re: metadata update durability ordering/soft updates
Message-ID:  <3C910C57.71C2D823@mindspring.com>
References:  <E16lReK-000C3T-00@f10.mail.ru>

Parity Error wrote:
> i am referring not to file data, but filesystem metadata, which
> is now _delayed_ write.

I understand this.  Do you understand that delaying the metadata
writes in soft updates does not affect the dependency ordering, but
may affect the time ordering?

If I have two dependent lists of operations, A-B-C and D-B-E,
then I am only guaranteed that A and D will occur before B,
and C and E will occur after B, but there is no guarantee on
the order of [A,D] vs. [D,A] or [C,E] vs. [E,C].

If I have two OTHER dependent lists of operations, Q-R and S-T,
then I am only guaranteed that Q will occur before R, and S
will occur before T, but there is no guarantee on the order of
[ [Q,S], [Q,T], [R,S], [R,T] ] vs. [ [S,Q], [T,Q], [S,R], [T,R] ];
Q-R-S-T is a valid order, as is S-T-Q-R, as is Q-S-T-R, as is
Q-S-R-T, etc.
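
As an illustration of the two cases above, here is a minimal Python
sketch that enumerates every ordering consistent with a set of
"must occur before" pairs.  The function name valid_orders and the
brute-force approach are my own choices for illustration; this is not
how soft updates tracks dependencies internally.

```python
from itertools import permutations

def valid_orders(ops, before):
    """Return every ordering of ops that respects each (first, second) pair."""
    result = []
    for perm in permutations(ops):
        pos = {op: i for i, op in enumerate(perm)}
        # Keep the ordering only if every dependency pair is honoured.
        if all(pos[a] < pos[b] for a, b in before):
            result.append("-".join(perm))
    return result

# Two independent chains: Q before R, S before T.
independent = valid_orders("QRST", [("Q", "R"), ("S", "T")])

# Two chains that share B: A-B-C and D-B-E.
shared = valid_orders("ADBCE", [("A", "B"), ("B", "C"), ("D", "B"), ("B", "E")])

print(len(independent), independent)
print(len(shared), shared)
```

The independent chains admit six orderings, including the four named
above; the chains sharing B admit four, differing only in [A,D] vs.
[D,A] and [C,E] vs. [E,C].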

> When we did synch write to sequence multiple metadata updates
> belonging to one operation for ensuring recoverability of that
> one operation, we also got inter-operation ordering for free

Yes.

> (and apps/users could have started depending on it) .

No.  Only misinformed users.  The system *never* made *any*
guarantees with regard to implied metadata.  Your statement
"multiple metadata updates belonging to one operation" is
bogus.  There is no such thing as "one operation" in this
context.  Multiple metadata updates are multiple operations,
and the filesystem guarantees are only that the operations
will not return to the user until they have completed in
the guaranteed order, not that they have completed in any
time-relative order compared to each other.

> Unix provides no guarantess reg the order in which file data
> will become stable, and apps should use fsync/O_SYNC or logging
> or whatever to ensure the consistency of their data stores.

That's nice, but it's irrelevant to this discussion, since
file data was never guaranteed for write anyway.

The reason fsync/O_SYNC work to serialize the metadata
operations is that the operations are guaranteed to occur
using synchronous I/O, before they return.

In other words, they are stall barriers instituted by the
application programmer in order to get the behaviour the
users ..."could have started depending on"... on purpose,
rather than getting it as a result of an accident of the
implementation of the underlying primitives.
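
A minimal sketch of such a stall barrier, in Python, for two file
creations that the application wants ordered.  The helper names
(fsync_dir, create_durably, create_in_order) are hypothetical, and
the extra fsync on the parent directory is an assumption on my part:
whether it is needed to make the directory entry stable depends on
the filesystem.

```python
import os

def fsync_dir(dirpath):
    """Ask the OS to flush a directory's entries to stable storage."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def create_durably(path):
    """Create an empty file and stall until the creation has been flushed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.fsync(fd)  # stall barrier for the file itself
    finally:
        os.close(fd)
    # Assumption: also flush the parent directory that holds the new entry.
    fsync_dir(os.path.dirname(path) or ".")

def create_in_order(first, second):
    """Request 'second' only after 'first' has passed its barrier."""
    create_durably(first)
    create_durably(second)
```

The ordering here comes from the application not issuing the second
request until the first barrier has returned, not from anything the
filesystem promises about independent updates.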

> But, the ordering in which different metadata operations becomes
> stables, if not enforced could result in the following scenario.

[ ... demonstration of failure of bogus assumptions ... ]

Yes.  Bogus assumptions are bogus.  That's a circular argument.
One must not make bogus assumptions, if one wants one's code
to operate reliably.

Your example is poor, as well, unless you intended the "touch"
operations to occur concurrently.

>  These kind of things would not occur when we did synch write of
> metadata (disk scheduling would not affect this). unlink could
> possibly produce even more dramatic effects.  Now the question is
> whether this kind of behaviour from the filesystem is acceptable
> and whether some applications can actually fail badly due to this.

A1: The behaviour is acceptable, since the behaviour guarantees
for metadata stability are mandated by operational guarantees.

To boil this down to layman's language: the OS provides a set of
services upon which reliable services can be built, if they are
correctly engineered.  It is up to the people building the layers
of services on top of the OS services to provide those facilities
that do not exist within the OS proper, such that they are reliable.

In other words, the purpose of the OS is to provide an unconstrained
foundation.  So long as you don't mount the FS in a way that keeps
the metadata updates from being carried out in the correct order
(e.g. async), you can create a system in which the ordering guarantees
are maintained from end-to-end.  Following a crash, you can then
reliably know the state you would have been in had you not crashed,
and can recover by rolling the operation forward, if all necessary
data is available, or backward, if it is not.
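
One way an application layer might build that roll-forward or
roll-backward behaviour is sketched below in Python.  Everything in
it is an assumption for illustration: the ".intent" file naming, the
SHA-256 check used to decide whether "all necessary data is
available", and the function names.  It is a sketch of the general
idea, not a description of any existing tool.

```python
import hashlib
import os

def _fsync_dir(dirpath):
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def _write_durably(path, payload):
    """Write payload and stall until file and directory are flushed."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    _fsync_dir(os.path.dirname(path) or ".")

def begin_update(target, data):
    """Record the intended contents durably before touching the target."""
    digest = hashlib.sha256(data).hexdigest().encode("ascii")
    _write_durably(target + ".intent", digest + b"\n" + data)

def finish_update(target):
    """Apply a recorded intent to the target, then retire the record."""
    with open(target + ".intent", "rb") as f:
        _, _, data = f.read().partition(b"\n")
    _write_durably(target, data)
    os.unlink(target + ".intent")
    _fsync_dir(os.path.dirname(target) or ".")

def recover(target):
    """After a crash: roll forward if the intent is complete, else back."""
    intent = target + ".intent"
    if not os.path.exists(intent):
        return "clean"
    with open(intent, "rb") as f:
        digest, _, data = f.read().partition(b"\n")
    if hashlib.sha256(data).hexdigest().encode("ascii") == digest:
        _write_durably(target, data)   # all data available: roll forward
        outcome = "rolled forward"
    else:
        outcome = "rolled back"        # torn record: target was never touched
    os.unlink(intent)
    _fsync_dir(os.path.dirname(target) or ".")
    return outcome
```

Rolling back is safe here only because finish_update is never called
until begin_update has returned, so a torn intent record implies the
target still holds its old contents.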

A2: Applications which expect behaviour other than that guaranteed
by the API definitions can be expected to fail badly when their
assumptions are proven to be unfounded in reality.

STANDARDS COMPLIANCE AND METADATA UPDATES, WITH A SURVEY OF OS/FS's

Certain metadata updates, such as those to ctime, mtime, and
atime, are guaranteed by the POSIX standard.  These, in turn, imply
that the containers for these objects are similarly guaranteed, to
the root operation, such that the guaranteed operations are always
reliable.  Any OS which fails to make these guarantees is, by its
definition, non-compliant with POSIX.

You can intentionally choose to operate certain filesystems in a
POSIX-non-compliant mode; for example, you can use an MFS, or you
can mount a filesystem async, such that metadata update guarantees
required for conformance to the standard are not observed.  But you
knowingly give up standards compliance when you do this.

For example, Linux running EXT2FS mounted asynchronously fails
to comply with the POSIX standard with regard to ctime, atime,
and mtime updates, both because of the direct failure for
such updates to be committed to stable storage, and because of the
indirect failure of the updates to be committed, since the containers
are not committed, thus making the containers in which the commits
are taking place fail to comply with the definition of "stable
storage".

Another example would be FreeBSD running FFS, if you went out of your
way to mount it async, rather than sync (or with more recent
installations, with soft updates).  Similarly, mounting it noatime
also fails this test.

If you were to mount a System V UFS in SVR4.2 by default, without
specifying "sync" or "async", then you get a behaviour called DOW
(Delayed Ordered Writes), in which an intentional stall point is
inserted between dependency convergences.  This is similar to soft
updates, in that the stall point requires synchronization of the
stable storage at the point where the intersection would occur, but
it provides only non-commutability on non-commutable operations in
a given edge, and does not permit reordering of associativity, even
though operations are associative, and efficiency might be gained,
thereby.  Thus the original A-B-C, D-B-E operation actually *must*
occur in A-B B-E ordering, with a stall between the "B" and the "B".
This only coincidentally makes a *partial* ordering guarantee on the
order of independent metadata updates -- so even here, you can not
rely on the system ordering independent updates, only on it being
standards compliant in the API guarantees.

If you want this behaviour on Linux, ReiserFS uses the USL patented
DOW technology without a license.  If you are outside the US, and
don't plan on selling into the US until at least 2018, you could
use ReiserFS to get metadata update ordering within standards
guaranteed operations, and it will only stall out as often as the
SVR4.2 UFS with DOW.  But you will have the same problem with your
software that assumes -- incorrectly -- that serially requested
independent metadata updates will take place serially... when, in
fact, there is no such guarantee.

PS: FWIW, it's *possible* to generalize the soft updates mechanism
to export a transactioning interface -- actually, a dependency edge
that can be used to implement transactioning -- to user space.  The
effect of doing this would be to also export an edge of the dependency
graph upward.  For two independent graphs, implying an edge between
the top nodes establishes a precedence order on completion, and
therefore guarantees ordering of operations within a transaction.
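
To make the effect of such an implied edge concrete, here is a small
Python sketch using the standard library's graphlib module (Python
3.9 and later).  It models only the graph, not any real interface:
no such user-space interface is described in this message, and
placing the edge from R (the last operation of one graph) to S (the
first operation of the other) is just one reading of "an edge between
the top nodes".

```python
from graphlib import TopologicalSorter

# Each key maps to the set of operations that must complete before it.
# Two independent graphs, Q-R and S-T: many completion orders are valid.
independent = {"R": {"Q"}, "T": {"S"}}

# Hypothetical exported edge: S may not start until R has completed.
linked = {"R": {"Q"}, "T": {"S"}, "S": {"R"}}

# With the extra edge the graph is a single chain, so only one order remains.
order = list(TopologicalSorter(linked).static_order())
print("-".join(order))
```

With the added edge the only ordering left is Q-R-S-T, which is the
precedence a transaction would need.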

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3C910C57.71C2D823>
