From owner-freebsd-hackers Fri Jan 10 12:18:33 1997
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.4/8.8.4) id MAA29824 for hackers-outgoing; Fri, 10 Jan 1997 12:18:33 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.8.4/8.8.4) with SMTP id MAA29814 for ; Fri, 10 Jan 1997 12:18:30 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id NAA20414; Fri, 10 Jan 1997 13:06:51 -0700
From: Terry Lambert
Message-Id: <199701102006.NAA20414@phaeton.artisoft.com>
Subject: Re: mount -o async on a news server
To: scrappy@hub.org (The Hermit Hacker)
Date: Fri, 10 Jan 1997 13:06:51 -0700 (MST)
Cc: hackers@freebsd.org
In-Reply-To: from "The Hermit Hacker" at Jan 10, 97 06:59:08 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

More information than you probably want; on the other hand, this is
the first time I've laid out explicit examples of failure cases in
anything but the broadest terms.  (I've always felt they should be
blatantly obvious to anyone who has looked at the problem enough to be
willing to post comments; unfortunately, not enough people have that
much self control.)

> Exactly *how* dangerous is setting a file system to async?

Depends on how much you will miss the data if it disappears.

For passive news servers, not very.  For servers your users post
articles to: when you rebuild the FS after a crash, you will get back
the articles that have already propagated, but your users' articles
from the propagation-delay window (times a 1.5 probability vector)
could be lost.  So Bob posts something, your server crashes before the
article propagates, and Bob's article is now lost.  If you provide
posting services, this denial of service may in fact annoy some
people.  It would annoy me, if it happened to me.
For a regular FS, it is nearly too dangerous to contemplate, unless
you can guarantee that the only time your system will go down is when
you shut it down on purpose (ie: you have a UPS, adequate
environmental controls for the room the machine lives in, etc.).
Effectively, then, you are engaging in a risk/reward analysis.

> 	In my case, I'm braving it on a news server, just the ccd
> device that contains both the news admin and spool directories.  The
> drive is local to the system.
>
> 	My understanding of asynchronous I/O is that it doesn't wait
> for an acknowledgement from the system before going to the next
> write (only, it's a very basic understanding), so I'm curious as to
> how it would handle writing the history file itself?

The write returns before it is committed to stable storage.  If you
have a failure in this situation, then you will lose some (random)
number of writes.

Because UFS/FFS/EXT2FS are not logging or journalling file systems
(note: I did not say "not log structured"; structuring is irrelevant
to this discussion), this means that you have:

o	The state that you wanted the FS in
o	The state the FS is actually in
o	An unknown number of intermediate states between the two

File system repair utilities (like fsck) take an FS from:

o	The state the FS is actually in

To:

o	A state wherein the FS is internally consistent

With async on, you can't guarantee that:

o	A state wherein the FS is internally consistent

Is equal to:

o	The state that you wanted the FS in

Once the number of unknown intermediate states exceeds 2, the intended
target state of an inconsistent state is no longer determinate.  There
are multiple potential paths to consistency (2^(number of pending
async operations - 1), in fact), but only one of them is correct.  If
you had 11 writes outstanding at the time of the crash, you would have
only a 1 in 1024 chance of doing the right thing...
...less than 1/10th of 1% chance of getting the FS corrected to where
it's supposed to be, instead of only getting the FS corrected to the
point where it's simply "usable".

Consider the example of a database engaged in a two-stage commit.  The
procedure it will follow to change a record is:

--- BEGIN TRANSACTION ---
1.	Read the index record for the data record to change into memory
2.	Read the data record to change into memory
3.	Modify the data record in memory
4.	Obtain exclusive access to an unused data record
5.	Write the modified data record from memory into the unused
	data record on disk
--- COMMIT STAGE 1 COMPLETED ---
6.	Modify the index record in memory
7.	Write the modified index record to disk
--- COMMIT STAGE 2 COMPLETED ---
--- END TRANSACTION ---

(A logging or journalling FS would write transaction records and keep
the writes pending; it is a much simpler implementation, since no disk
data would change until the transaction was ended, and the writes
would be guaranteed to be ordered.)

Now understand that async writes do not guarantee ordering.  That is
their nature.  This means it is possible for the index record to be
written before the data record.  If that happens, and the machine
crashes, you have a valid (but incorrect) data record (whatever was
there before) pointed to by a valid index record, and a valid (but
orphaned) data record not pointed to by anyone.

Questions:

o	How do I back the index record transaction out so that the
	index record points to the orphaned record?
o	How do I *know* that it wasn't the data record getting written
	before the index record, and that I shouldn't be advancing the
	index record instead of backing it up, to run the transaction
	to completion?
o	Will they ever find poor Aunt Nell, lying at the edge of town
	in a ditch?

...tune in tomorrow...

The problem, of course, is that there is *implied* state generated by
the program.
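The two-stage protocol above only protects you if the stage boundaries
are actually enforced against the disk.  Here is a minimal sketch (in
Python, with a made-up 64-byte flat-file record layout; the file names
and record format are illustrative, not from any real database) of
enforcing the ordering with fsync() at each commit stage, so the index
can never reach stable storage before the data record it points to:

```python
import os
import struct
import tempfile

RECORD_SIZE = 64  # hypothetical fixed-size data records

def commit_record(data_f, index_f, record_no, payload):
    """Change a record using the two-stage commit described above."""
    # Stage 1: write the new data record into the unused slot and
    # force it to stable storage before touching the index.
    data_f.seek(record_no * RECORD_SIZE)
    data_f.write(payload.ljust(RECORD_SIZE, b"\0"))
    data_f.flush()
    os.fsync(data_f.fileno())      # --- COMMIT STAGE 1 COMPLETED ---

    # Stage 2: only now point the index at the new record.  A crash
    # before this fsync leaves the old index intact and the new data
    # record merely orphaned -- never an index pointing at garbage.
    index_f.seek(0)
    index_f.write(struct.pack("<I", record_no))
    index_f.flush()
    os.fsync(index_f.fileno())     # --- COMMIT STAGE 2 COMPLETED ---

# Demonstration against throwaway files:
with tempfile.TemporaryFile() as data_f, tempfile.TemporaryFile() as idx_f:
    commit_record(data_f, idx_f, 3, b"new version of the record")
    idx_f.seek(0)
    rec = struct.unpack("<I", idx_f.read(4))[0]
    data_f.seek(rec * RECORD_SIZE)
    back = data_f.read(25)
    print(back)  # b'new version of the record'
```

With async mounts, the two fsync()-enforced barriers are exactly what
you lose: both writes may be reordered behind your back, which is what
produces the unanswerable "back out or run forward?" questions above.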
Short of implementing a transaction tracking system (like NetWare TTS,
or, to be more UNIX-like, USL's Tuxedo), you can't arbitrarily wind
and unwind state.  The INN history file would probably need to be
rebuilt after each crash (if that were even possible) to ensure that
it was in a good state.

> 	The only risk that I can see is that if the system crashes,
> I'll have (might have) a corrupted file system, but is that my only
> risk?  ie. if I turned on async for a period of time (long enough for
> a backlog from one system to catch up?) and then turned it back off
> again, would that be reasonably safe?

It wouldn't matter; it would reduce the duration of your exposure to
the time async was on, but it wouldn't eliminate the exposure
entirely, or reduce it overall.

The problem still boils down to whether you can deterministically put
the FS in the state it would have been in had the system not crashed.
Effectively, this means ordering the writes.  There are a number of
ways to do this:

o	Order them explicitly by not doing async writes.

	This is what the non-async FFS/EXT2FS implements; it is the
	simplest method, the easiest to implement, and it reduces
	concurrency the most.  It is the worst way to implement a
	good thing.

o	Order them explicitly by doing async writes to stable cache.

	This requires specialized hardware and is not a generic
	solution.  Sun's PrestoServe hardware does this; so do Auspex
	NFS servers.  You would be hard put to find PC controllers
	even capable of this, let alone drivers that could handle it.
	The BIO subsystem would require significant changes to be able
	to put stable commit incursions up to the point where an FS
	could choose between a stable commit to cache, a stable commit
	to disk, and an unstable commit to disk (a necessary
	optimization for non-metadata data originating locally instead
	of as the result of a network transaction).  This is an OK, if
	very expensive, way to implement a good thing.

o	Order them implicitly by time and idempotence.
	This is what Delayed Ordered Writes (DOW -- a USL-patented
	technology) does.  Basically, I can do unordered writes as
	much as I want for "unimportant" things, but when I need to do
	writes involving FS structure, I commit all outstanding I/O's
	(drain them to the disk).  In SVR4.2 ES/MP (UnixWare 2.x),
	this results in a 60% increase in throughput to the disk, even
	in the uniprocessor case.  This is a patented, and therefore
	unusable in a public project, way of implementing a good
	thing.

o	Order them implicitly by graph relation; this may or may not
	include time relationships, if they happen to be members of
	the graph hierarchy.

	This is what Soft Updates (from the Ganger/Patt paper) does.
	Basically, I can do writes in any order, and when I have a
	write that depends on another write being committed before it,
	I insert a queue synchronization so that associative and
	commutative transformations on the graph node relationships
	won't cross the synchronization boundaries.  This results in
	speeds approaching 5% of memory speeds (until cache limits
	impose disk latency slowdowns).  In fact, this method tends to
	result in faster code than simple async writes, because of the
	increased locality from separating metadata and non-metadata
	operations.  It is the best way (so far) to implement a good
	thing.

Obviously, I would use Soft Updates if I could.  But in no case would
I run for a long time with async; if you want more speed, you should
be willing to pay for it in code, and not kludge things in such a way
that you could potentially lose data.  It is the difference between
doing the right thing the right way and doing the right thing the
wrong way.

> 	BTW...what exactly does fastfs do?  My understanding is that
> this is basically what it does: turns the file system to async vs.
> sync...
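The graph-relation idea in the last bullet can be sketched in a few
lines.  This is a toy model, not the Ganger/Patt implementation (the
class and function names are invented for illustration): each cached
buffer records which other buffers must reach stable storage before it
may, and flushing a buffer walks those dependency edges first, so
writes within a dependency chain stay ordered while unrelated writes
remain free to go in any order:

```python
# Toy sketch of Soft Updates-style dependency ordering (hypothetical
# names; real Soft Updates tracks dependencies at much finer grain).

class Buf:
    """A dirty cached block with write-ordering dependencies."""
    def __init__(self, name):
        self.name = name
        self.deps = []      # buffers that must be written before us
        self.clean = False  # True once "on disk"

    def depends_on(self, other):
        self.deps.append(other)

flushed = []  # the order in which blocks actually hit "disk"

def flush(buf):
    """Write a buffer, committing its dependencies first."""
    if buf.clean:
        return
    for dep in buf.deps:
        flush(dep)              # queue synchronization: dependencies
                                # may not cross this boundary
    flushed.append(buf.name)    # now safe to write this buffer
    buf.clean = True

# A new file: its directory entry must never reach disk before the
# inode block it references, or a crash leaves a dangling dirent.
inode = Buf("inode block")
directory = Buf("directory block")
directory.depends_on(inode)

flush(directory)
print(flushed)  # ['inode block', 'directory block']
```

Flushing the directory block automatically forces the inode block out
first; buffers with no edges between them impose no ordering on each
other, which is where the concurrency win over plain sync writes comes
from.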
Yes; that is exactly what it does.  There is a minor difference in the
gathering of directory operations (there is an #ifdef'ed DEBUG
sysctl() variable for directory ordering which is also affected).  I
really don't know why the directory ordering was omitted in the
standard sync case; I believe it is an error in the UFS
implementation...


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.