Date:      Mon, 10 Mar 1997 15:10:38 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        steve@visint.co.uk (Stephen Roome)
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: Hard Link Count too small!
Message-ID:  <199703102210.PAA23655@phaeton.artisoft.com>
In-Reply-To: <Pine.BSF.3.91.970310131830.23316B-100000@bagpuss.visint.co.uk> from "Stephen Roome" at Mar 10, 97 01:23:39 pm

> Asynchronously mounted filesystems might not be POSIX compliant, but 
> synchronous is unbelievably slow at times. I'm not sure how much it
> is dependent on the actual disk/driver/buffering, but for me rm -rf
> seems to be at least an order of magnitude faster when the filesystem
> is mounted asynchronously.
> 
> Is there not some way that perhaps mount could be more sensible and
> perhaps do some of its I/O async, or is it really that unsafe?

It is dangerous not to guarantee order of operation for metadata,
since it is impossible to disambiguate what source state led to the
current FS state if you can't know that there was only a single
metadata operation outstanding before a crash.

You can run async if you want, with this risk.  This is the same risk
that Linux users using Ext2fs face (which runs async by default).
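
(For reference, the usual way to take that risk deliberately is
"mount -o async", or an "async" option in /etc/fstab.  The sketch
below shows roughly the same thing through the mount(2) interface on
a 4.4BSD-derived system; the device and mount point are made up, and
the exact struct ufs_args layout should be checked against your own
headers.)

	/*
	 * Sketch only: flip an already-mounted UFS filesystem to async
	 * via mount(2) with MNT_UPDATE | MNT_ASYNC.  Device and mount
	 * point are hypothetical; normally you would just run
	 * "mount -u -o async /news".
	 */
	#include <sys/param.h>
	#include <sys/mount.h>
	#include <stdio.h>
	#include <string.h>

	int
	main(void)
	{
		struct ufs_args args;

		memset(&args, 0, sizeof(args));
		args.fspec = "/dev/sd0s1e";	/* hypothetical device */

		if (mount("ufs", "/news", MNT_UPDATE | MNT_ASYNC, &args) < 0) {
			perror("mount");
			return (1);
		}
		return (0);
	}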

The most correct way to guarantee order of operation that has been
found so far is soft updates, from the UKY paper by Ganger and Patt.
The implementation in Appendix A of the online version of their
paper is SVR4-specific.  It also has a number of omissions and typos
(obvious when you try to compile it), though these are easy to fix.

Keith Bostic is known to be working on integrating soft updates into
BSD 4.4-Lite2's FFS.

The problem with the Ganger/Patt approach, which from the reports I
have heard is echoed by Bostic's work, is that it is implementation
specific.  A directed acyclic dependency graph for a single FS is
defined, transitive closure is computed over the graph once, and the
code is then generated from that closure computation.  It looks more
like a simplified routing graph because they aren't trying to solve
the problem generally.
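
(To make the "transitive closure over the graph" step concrete, here
is a minimal illustration using Warshall's algorithm on a made-up
four-node dependency matrix.  None of this is taken from the
Ganger/Patt code; it only shows the shape of the computation being
described.)

	/*
	 * Sketch: Warshall's algorithm computing transitive closure over
	 * a small, made-up metadata dependency graph.  reach[i][j] != 0
	 * means update i must reach the disk before update j.
	 */
	#include <stdio.h>

	#define	NNODES	4	/* illustrative node count */

	int
	main(void)
	{
		int reach[NNODES][NNODES] = {
			{ 0, 1, 0, 0 },		/* 0 -> 1 */
			{ 0, 0, 1, 0 },		/* 1 -> 2 */
			{ 0, 0, 0, 1 },		/* 2 -> 3 */
			{ 0, 0, 0, 0 },
		};
		int i, j, k;

		for (k = 0; k < NNODES; k++)
			for (i = 0; i < NNODES; i++)
				for (j = 0; j < NNODES; j++)
					if (reach[i][k] && reach[k][j])
						reach[i][j] = 1;

		for (i = 0; i < NNODES; i++) {
			for (j = 0; j < NNODES; j++)
				printf("%d ", reach[i][j]);
			printf("\n");
		}
		return (0);
	}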

The first obvious consequence of this approach is that it works for
only a single file system: all of the other FSs which currently
share code with FFS have to be broken out as separate code, and
therefore separate kernel bloat: LFS, MFS, the BSD EXT2FS, etc.

A less obvious consequence is that the directed graph in the FreeBSD
case will be edge-opaque as a result of the VM/buffer cache unification
(I have discussed this extensively with a number of professional kernel
engineers, including several USL people, several Network Appliance
people, and one of the engineers responsible for the Dell UNIX product).
The net effect is that the graph in the FreeBSD case deteriorates to
the "worst-case" soft update behaviour in all cases, which makes it
effectively equivalent to DOW (Delayed Ordered Writes, used in
UnixWare 2.x SMP).  This will still perform better than a non-unified
cache without the "worst-case" penalty (for instance, BSDI), since
the effect of cache unification is marginally greater than the effect
of moving from DOW to soft updates.


> I know people seem to want POSIX compliance, but for some things (like this)
> is it really worth it to be losing so much speed?

The POSIX compliance had to do with whether or not the times were
marked for update vs. actually being updated.  If you made soft update
or DOW ordering guarantees (which the async case CAN NOT) for the
metadata writes, then you would maintain POSIX compliance.  As it is,
the order of change/mark and change/update is ambiguous.  (And Bruce
is wrong, even though I think the particular place he is wrong was
snuck into POSIX by some overzealous VMS weenie, so it isn't Bruce's
fault -- it has to do with access time updates in the getdents case
and modification time updates in the inode metadata case; in
particular, size and other metadata updates that occur as a result of
extending the file, a very special case, but a frequent one for news
servers.)


> Also, exactly how 'dangerous' is async? As far as I know Linux is async
> and I don't hear that many people complaining that their filesystems
> have all been destroyed. (I'm probably wrong about this, as I don't keep
> up with Linux much anymore)

The insidious part of this is that, after a crash, the FS is returned
to a consistent state.  The problem is that you can't know if it's the
*right* consistent state.  With the number of operations that can be
outstanding, you can only guarantee one order of unroll:

	2^(n) -> 2^(n-1) for n >= 1

For any n > 1, there are 2^(n-1) - 1 ambiguous base states that could
have led to the ambiguous state that you want to "repair" from.

When you mount async, you let n get larger than 1.  (For the non-async
case, n = 1, and 2^(1-1) - 1 = 0 means that there is no ambiguity if
you only allow one metadata operation at a time to occur, in order.)

For instance, say you async write 9 metadata operations, and the system
crashes after doing 4 of them, in any order (async == "in any order"),
leaving 5 outstanding.  This leaves you with 2^(5-1) - 1, or 15 possible
incorrect states out of 16 possible source states to which you will
"repair" your FS.  You have a 1 in 16 chance of guessing the right one.
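
(If you want to see how fast that ambiguity grows under the 2^(n-1)
model above, a throwaway loop is enough; the cutoff of 9 is arbitrary.)

	/*
	 * Sketch: ambiguous source-state counts from the 2^(n-1) model
	 * above, for n outstanding metadata operations.
	 */
	#include <stdio.h>

	int
	main(void)
	{
		int n;

		for (n = 1; n <= 9; n++) {
			unsigned long states = 1UL << (n - 1);
			printf("n = %d: %lu source states, %lu ambiguous\n",
			    n, states, states - 1);
		}
		return (0);
	}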

No matter which one you guess, all 16 are "consistent".  But only one
is "right".  So the FS "looks" repaired after the crash.  If you are
using an index file of some kind, then you have "implied state" across
two files: you imply a guarantee that if you update one file, you
will update both files or neither in the event of a crash.  This may
mean regenerating the index each time you reboot following a crash.
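
(One cheap way an application can defend against that implied-state
problem is to treat the index as disposable and rebuild it whenever it
looks stale after a reboot.  The file names and rebuild_index() below
are hypothetical.)

	/*
	 * Sketch: rebuild a derived index file after a crash if it is
	 * older than the data file it summarizes.  Names are made up;
	 * rebuild_index() is a placeholder for application logic.
	 */
	#include <sys/stat.h>
	#include <stdio.h>

	static void
	rebuild_index(const char *data, const char *index)
	{
		printf("regenerating %s from %s\n", index, data);
	}

	int
	main(void)
	{
		struct stat d, x;

		if (stat("history.dat", &d) < 0 ||
		    stat("history.idx", &x) < 0 ||
		    x.st_mtime < d.st_mtime)
			rebuild_index("history.dat", "history.idx");
		return (0);
	}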



The answer to "exactly how 'dangerous' is async" depends on a lot of
factors.  You can calculate it for yourself; start with the MTBF for
your system for all components in the write path (motherboard, CPU,
RAM, cache, disk controller, disk, power supply, etc.).  Then you
will need the probability scaling factor for the propagation latency
of a write through the machine (how long does it stay in the queue
before it is forcibly flushed, etc.).  This is, basically, the amount
of time between the call and the actual write taking place.
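
(A back-of-the-envelope version of that calculation, with entirely
made-up MTBF and exposure figures, looks something like the following;
the point is the shape of the estimate, not the numbers.)

	/*
	 * Sketch: rough risk estimate.  All inputs are invented; plug in
	 * your own MTBF for the write path and your own flush latency.
	 */
	#include <stdio.h>

	int
	main(void)
	{
		double mtbf_hours = 10000.0;	/* combined write-path MTBF */
		double exposure_sec = 30.0;	/* time a write can sit unflushed */
		double writes_per_day = 100000.0;

		/* chance a failure lands inside any one write's window */
		double p_per_write = (exposure_sec / 3600.0) / mtbf_hours;

		printf("per-write loss probability: %g\n", p_per_write);
		printf("expected lost writes/day:   %g\n",
		    p_per_write * writes_per_day);
		return (0);
	}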


The easy answer is "it depends".

The harder answer is "if you don't know the real answer, are you
comfortable taking chances with the data?".


If you are doing a news server, is this a posting host, or only a
reading host?  If a reading host, then you can regenerate all of
your news data following a crash, so the probability bears on
latency to restart: is it unacceptable for your probability?  If a
posting host, then the news posted by your users can be lost, and
will be unrecoverable.  Is this type of denial of service unacceptable
for your probability?  Etc.

For different applications than news servers, the risk is proportional
to the calculated probability of a failure, the ability to recover
the data, and the service expectations of the people using the service.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


