From owner-freebsd-hackers Fri Jan 10 12:18:33 1997
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.4/8.8.4) id MAA29824 for hackers-outgoing; Fri, 10 Jan 1997 12:18:33 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.8.4/8.8.4) with SMTP id MAA29814 for ; Fri, 10 Jan 1997 12:18:30 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id NAA20414; Fri, 10 Jan 1997 13:06:51 -0700
From: Terry Lambert
Message-Id: <199701102006.NAA20414@phaeton.artisoft.com>
Subject: Re: mount -o async on a news server
To: scrappy@hub.org (The Hermit Hacker)
Date: Fri, 10 Jan 1997 13:06:51 -0700 (MST)
Cc: hackers@freebsd.org
In-Reply-To: from "The Hermit Hacker" at Jan 10, 97 06:59:08 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

More information than you probably want; on the other hand, this is
the first time I've laid out explicit examples of failure cases in
anything but the broadest terms.  (I've always felt they should be
blatantly obvious to anyone who has looked at the problem enough to be
willing to post comments; unfortunately, not enough people have that
much self control.)

> Exactly *how* dangerous is setting a file system to async?

Depends on how much you will miss the data if it disappears.

For passive news servers, not very.  For servers your users post
articles to: when you rebuild the FS after a crash, you will get back
the articles that have already propagated, but your users' articles
from the propagation-delay window (times a 1.5 probability vector)
could be lost.  So Bob posts something, your server crashes before the
article propagates, and Bob's article is now lost.  If you provide
posting services, this denial of service may in fact annoy some
people.  It would annoy me, if it happened to me.
For a regular FS, it is nearly too dangerous to contemplate, unless
you can guarantee that the only time your system will go down is when
you shut it down on purpose (ie: you have a UPS, adequate
environmental controls for the room the machine lives in, etc.).
Effectively, then, you are engaging in a risk/reward analysis.

> 	In my case, I'm braving it on a news server, just the ccd
> device that contains both the news admin and spool directories.  The
> drive is local to the system.
>
> 	My understanding of asynchronous I/O is that it doesn't wait
> for an acknowledgement from the system before going to the next
> write (only, it's a very basic understanding), so I'm curious as to
> how it would handle writing the history file itself?

The write returns before it is committed to stable storage.  If you
have a failure in this situation, then you will lose some (random)
number of writes.

Because UFS/FFS/EXT2FS are not logging or journalling file systems
(note: I did not say "not log structured"; structuring is irrelevant
to this discussion), this means that you have:

o	The state that you wanted the FS in
o	The state the FS is actually in
o	An unknown number of intermediate states between the two

File system repair utilities (like fsck) take an FS from:

o	The state the FS is actually in

To:

o	A state wherein the FS is internally consistent

With async on, you can't guarantee that:

o	A state wherein the FS is internally consistent

Is equal to:

o	The state that you wanted the FS in

Once the number of unknown intermediate states exceeds 2, the intended
target state of an inconsistent state is no longer determinate.  There
are multiple potential paths to consistency (2^(number of pending
async operations - 1), in fact), but only one of them is correct.  If
you had 11 writes outstanding at the time of the crash, you would have
only a 1 in 1024 chance of doing the right thing...
...less than 1/10th of 1% chance of getting the FS corrected to where
it's supposed to be, instead of only getting the FS corrected to the
point where it's simply "usable".

Consider the example of a database engaged in a two-stage commit.  The
procedure it will follow to change a record is:

--- BEGIN TRANSACTION ---
1.	Read the index record for the data record to change into memory
2.	Read the data record to change into memory
3.	Modify the data record in memory
4.	Obtain exclusive access to an unused data record
5.	Write the modified data record from memory into the unused
	data record on disk
--- COMMIT STAGE 1 COMPLETED ---
6.	Modify the index record in memory
7.	Write the modified index record to disk
--- COMMIT STAGE 2 COMPLETED ---
--- END TRANSACTION ---

(A logging or journalling FS would write transaction records and keep
the writes pending; it is a much simpler implementation, since no disk
data would change until the transaction was ended, and the writes
would be guaranteed to be ordered.)

Now understand that async writes do not guarantee ordering.  That is
their nature.  This means it is possible for the index record to be
written before the data record.  If that happens, and the machine
crashes, you have a valid (but incorrect) data record (whatever was
there before) pointed to by a valid index record, and a valid (but
orphaned) data record not pointed to by anyone.

Questions:

o	How do I back the index record transaction out so that the
	index record points to the orphaned record?
o	How do I *know* that it wasn't the data record getting written
	before the index record, and that I shouldn't be advancing the
	index record instead of backing it up, to run the transaction
	to completion?
o	Will they ever find poor Aunt Nell, lying at the edge of town
	in a ditch?

...tune in tomorrow...

The problem, of course, is that there is *implied* state generated by
the program.
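The two-stage protocol above only protects you if the stage boundaries
are actually enforced against the disk.  Here is a minimal sketch (in
Python, with a made-up 64-byte flat-file record layout; the file names
and record format are illustrative, not from any real database) of
enforcing the ordering with fsync() at each commit stage, so the index
can never reach stable storage before the data record it points to:

```python
import os
import struct
import tempfile

RECORD_SIZE = 64  # hypothetical fixed-size data records

def commit_record(data_f, index_f, record_no, payload):
    """Change a record using the two-stage commit described above."""
    # Stage 1: write the new data record into the unused slot and
    # force it to stable storage before touching the index.
    data_f.seek(record_no * RECORD_SIZE)
    data_f.write(payload.ljust(RECORD_SIZE, b"\0"))
    data_f.flush()
    os.fsync(data_f.fileno())      # --- COMMIT STAGE 1 COMPLETED ---

    # Stage 2: only now point the index at the new record.  A crash
    # before this fsync leaves the old index intact and the new data
    # record merely orphaned -- never an index pointing at garbage.
    index_f.seek(0)
    index_f.write(struct.pack("<I", record_no))
    index_f.flush()
    os.fsync(index_f.fileno())     # --- COMMIT STAGE 2 COMPLETED ---

# Demonstration against throwaway files:
with tempfile.TemporaryFile() as data_f, tempfile.TemporaryFile() as idx_f:
    commit_record(data_f, idx_f, 3, b"new version of the record")
    idx_f.seek(0)
    rec = struct.unpack("<I", idx_f.read(4))[0]
    data_f.seek(rec * RECORD_SIZE)
    back = data_f.read(25)
    print(back)  # b'new version of the record'
```

With async mounts, the two fsync()-enforced barriers are exactly what
you lose: both writes may be reordered behind your back, which is what
produces the unanswerable "back out or run forward?" questions above.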
Short of implementing a transaction tracking system (like NetWare TTS,
or, to be more UNIX-like, USL's Tuxedo), you can't arbitrarily wind
and unwind state.  The INN history file would probably need to be
rebuilt after each crash (if that were even possible) to ensure that
it was in a good state.

> 	The only risk that I can see is that if the system crashes,
> I'll have (might have) a corrupted file system, but is that my only
> risk?  ie. if I turned on async for a period of time (long enough for
> a backlog from one system to catch up?) and then turned it back off
> again, would that be reasonably safe?

It wouldn't matter; it would reduce the duration of your exposure to
the time async was on, but it wouldn't eliminate the exposure
entirely, or reduce it overall.

The problem still boils down to whether you can deterministically put
the FS in the state it would have been in had the system not crashed.
Effectively, this means ordering the writes.  There are a number of
ways to do this:

o	Order them explicitly by not doing async writes.

	This is what the non-async FFS/EXT2FS implements; it is the
	simplest method, the easiest to implement, and it reduces
	concurrency the most.  It is the worst way to implement a
	good thing.

o	Order them explicitly by doing async writes to stable cache.

	This requires specialized hardware and is not a generic
	solution.  Sun's PrestoServe hardware does this; so do Auspex
	NFS servers.  You would be hard put to find PC controllers
	even capable of this, let alone drivers that could handle it.
	The BIO subsystem would require significant changes to be able
	to put stable commit incursions up to the point where an FS
	could choose between a stable commit to cache, a stable commit
	to disk, and an unstable commit to disk (a necessary
	optimization for non-metadata data originating locally instead
	of as the result of a network transaction).  This is an OK, if
	very expensive, way to implement a good thing.

o	Order them implicitly by time and idempotence.
	This is what Delayed Ordered Writes (DOW -- a USL-patented
	technology) does.  Basically, I can do unordered writes as
	much as I want for "unimportant" things, but when I need to do
	writes involving FS structure, I commit all outstanding I/O's
	(drain them to the disk).  In SVR4.2 ES/MP (UnixWare 2.x),
	this results in a 60% increase in throughput to the disk, even
	in the uniprocessor case.  This is a patented, and therefore
	unusable in a public project, way of implementing a good
	thing.

o	Order them implicitly by graph relation; this may or may not
	include time relationships, if they happen to be members of
	the graph hierarchy.

	This is what Soft Updates (from the Ganger/Patt paper) does.
	Basically, I can do writes in any order, and when I have a
	write that depends on another write being committed before it,
	I insert a queue synchronization so that associative and
	commutative transformations on the graph node relationships
	won't cross the synchronization boundaries.  This results in
	speeds approaching 5% of memory speeds (until cache limits
	impose disk latency slowdowns).  In fact, this method tends to
	result in faster code than simple async writes, because of the
	increased locality from separating metadata and non-metadata
	operations.  It is the best way (so far) to implement a good
	thing.

Obviously, I would use Soft Updates if I could.  But in no case would
I run for a long time with async; if you want more speed, you should
be willing to pay for it in code, and not kludge things in such a way
that you could potentially lose data.  It is the difference between
doing the right thing the right way and doing the right thing the
wrong way.

> 	BTW...what exactly does fastfs do?  My understanding is that
> this is basically what it does: turns the file system to async vs.
> sync...
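The graph-relation idea in the last bullet can be sketched in a few
lines.  This is a toy model, not the Ganger/Patt implementation (the
class and function names are invented for illustration): each cached
buffer records which other buffers must reach stable storage before it
may, and flushing a buffer walks those dependency edges first, so
writes within a dependency chain stay ordered while unrelated writes
remain free to go in any order:

```python
# Toy sketch of Soft Updates-style dependency ordering (hypothetical
# names; real Soft Updates tracks dependencies at much finer grain).

class Buf:
    """A dirty cached block with write-ordering dependencies."""
    def __init__(self, name):
        self.name = name
        self.deps = []      # buffers that must be written before us
        self.clean = False  # True once "on disk"

    def depends_on(self, other):
        self.deps.append(other)

flushed = []  # the order in which blocks actually hit "disk"

def flush(buf):
    """Write a buffer, committing its dependencies first."""
    if buf.clean:
        return
    for dep in buf.deps:
        flush(dep)              # queue synchronization: dependencies
                                # may not cross this boundary
    flushed.append(buf.name)    # now safe to write this buffer
    buf.clean = True

# A new file: its directory entry must never reach disk before the
# inode block it references, or a crash leaves a dangling dirent.
inode = Buf("inode block")
directory = Buf("directory block")
directory.depends_on(inode)

flush(directory)
print(flushed)  # ['inode block', 'directory block']
```

Flushing the directory block automatically forces the inode block out
first; buffers with no edges between them impose no ordering on each
other, which is where the concurrency win over plain sync writes comes
from.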
Yes; that is exactly what it does.  There is a minor difference in the
gathering of directory operations (there is an #ifdef'ed DEBUG
sysctl() variable for directory ordering which is also affected).  I
really don't know why the directory ordering was omitted in the
standard sync case; I believe it is an error in the UFS
implementation...


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.