Date:      Sat, 25 May 2002 08:30:04 -0700 (PDT)
From:      Salvo Bartolotta <bartequi@neomedia.it>
To:        freebsd-doc@FreeBSD.org
Subject:   Re: docs/30008: This document should be translated, commented and added
Message-ID:  <200205251530.g4PFU4b99187@freefall.freebsd.org>

The following reply was made to PR docs/30008; it has been noted by GNATS.

From: Salvo Bartolotta <bartequi@neomedia.it>
To: freebsd-gnats-submit@FreeBSD.org, 3d@FreeBSD.org
Cc:  
Subject: Re: docs/30008: This document should be translated, commented and added
Date: Sat, 25 May 2002 17:29:10 +0200 (CEST)

 This message is in MIME format.
 
 ---MOQ1022340550bbe564e284cb9c4c0461b687f576a955
 Content-Type: text/plain; charset=ISO-8859-1
 Content-Transfer-Encoding: 8bit
 
 Dear FreeBSD doc'ers,
 
 I've translated the central part (i.e. part III) of the document.  This draft, 
 which I submit for your review/comments/flames/whatever, will (hopefully) give 
 you the gist of Pornin's article.
 
 Although I have benefited from a number of effective suggestions from Giorgos 
 (very kind and helpful, as always), I alone am to blame for 
 anything wrong/queer/inconsistent.  Shame on me (if any :-)
 ---MOQ1022340550bbe564e284cb9c4c0461b687f576a955
 Content-Type: text/html; name="x47.html"; charset=ISO-8859-1
 Content-Transfer-Encoding: 8bit
 Content-Disposition: inline; filename="x47.html"
 
 
 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html>
   <head>
     <meta name="generator" content="HTML Tidy, see www.w3.org">
     <title>Advanced Fault Tolerant Methods</title>
     <meta name="GENERATOR" content=
     "Modular DocBook HTML Stylesheet Version 1.71 ">
     <link rel="HOME" title=
     "Softupdates and Journaling Filesystems" href=
     "index.html">
     <link rel="PREVIOUS" title=
     "Write Caching and reboot" href="x30.html">
     <link rel="NEXT" title="Other Questions" href="x95.html">
   </head>
 
   <body class="SECT1" bgcolor="#FFFFFF" text="#000000" link=
   "#0000FF" vlink="#840084" alink="#0000FF">
     <div class="NAVHEADER">
       <table summary="Header navigation table" width="100%" border=
       "0" cellpadding="0" cellspacing="0">
         <tr>
           <th colspan="3" align="center">Softupdates and Journaling filesystems</th>
         </tr>
 
         <tr>
           <td width="10%" align="left" valign="bottom"><a href=
           "x30.html" accesskey="P">Previous</a></td>
 
           <td width="80%" align="center" valign="bottom">
           </td>
 
           <td width="10%" align="right" valign="bottom"><a href=
           "x95.html" accesskey="N">Next</a></td>
         </tr>
       </table>
       <hr align="LEFT" width="100%">
     </div>
 
     <div class="SECT1">
       <h1 class="SECT1"><a name="AEN47">3. Advanced Fault Tolerant Methods</a></h1>
 
       <p>Let us specify, incidentally, what the hardware can guarantee:
 	each write of a sector (512 bytes) is atomic, i.e. once it has 
 	been started, it is completed even if the power goes down,
 	the kernel crashes, or the processor catches fire.</p>
 
       <div class="SECT2">
         <h2 class="SECT2"><a name="AEN50">3.1. Deferred ordered
         write</a></h2>
 
         <p>First of all, people proposed the "deferred ordered write": 
 	metadata updates are asynchronous, but they are performed in
 	the correct order.  That is, the system quickly returns
 	execution to applications, saying "ok, everything is all right,
 	writes have been carried out", but it performs writes in the 
 	background at disk speed, paying attention to order; e.g. the 
 	creation of numerous files in the same directory actually
 	involves numerous updates of the same disk portion, and the 
 	system can group them together and carry them out in one single
 	access.  Yet "metadata updates" are ordered, that is, there 
 	are dependencies between various updates: when a file is created
 	in a directory A and then another file is created in the same
 	directory, this second operation needs to take place at the same
 	time, or after the first one -- certainly not before.</p>
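
         <p>The grouping and ordering described above can be sketched 
 	as follows.  This is only a toy model in Python (3.9 or later, 
 	for the standard graphlib module), not the code of any real 
 	filesystem: the update names, the block names, and the rule 
 	that an inode is initialized before its directory entry is 
 	written are assumptions made for the illustration.</p>

```python
from graphlib import TopologicalSorter


def plan_writes(updates, depends_on):
    """Return block writes in an order that respects update dependencies.

    updates:    dict mapping each update name to the disk block it modifies
    depends_on: dict mapping an update name to the set of updates that
                must reach the disk no later than it does
    """
    # Updates that touch the same block are merged into one write,
    # so the graph is built between blocks, not between updates.
    graph = {block: set() for block in updates.values()}
    for later, earlier_set in depends_on.items():
        for earlier in earlier_set:
            if updates[earlier] != updates[later]:
                graph[updates[later]].add(updates[earlier])
    # static_order() yields every block after the blocks it depends on.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical example: two files created in the same directory A.
updates = {
    "init inode f1": "inode block",
    "init inode f2": "inode block",
    "create A/f1": "dir A block",
    "create A/f2": "dir A block",
}
depends_on = {
    "create A/f1": {"init inode f1"},
    "create A/f2": {"init inode f2", "create A/f1"},
}
order = plan_writes(updates, depends_on)
print(order)
```

         <p>In this sketch, four metadata updates collapse into two 
 	disk accesses, and the dependency between the two creations 
 	is satisfied "at the same time" because both land in the same 
 	block write.</p>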
 
         <p>Deferred ordered writes pose the following problem: it is
 	easy to create cyclic dependencies, which block the system or   
 	else require a non-atomic update, and so a crash at the "wrong"
 	moment puts us in a delicate position.  This is rare, but Murphy 
 	arranges for it to happen.  Typical example: I move a file from
 	directory A to directory B, and, almost simultaneously, I 
 	move a file from directory B to directory A.</p>
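
         <p>The two crossed moves can be modelled as a small dependency 
 	graph.  The sketch below (Python 3.9 or later) assumes, for 
 	illustration only, that each move must write the new entry in 
 	the destination directory before removing the old entry from 
 	the source directory; the block names are hypothetical.</p>

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(graph):
    """Return a list of blocks forming a dependency loop, or None.

    graph maps each block to the set of blocks that must be written first.
    """
    try:
        TopologicalSorter(graph).prepare()
    except CycleError as err:
        # The second element of args is the cycle; its first and
        # last nodes are the same block.
        return err.args[1]
    return None


# Move A/x to B: the removal in A waits for the new entry in B.
# Move B/y to A: the removal in B waits for the new entry in A.
crossed_moves = {
    "dir A block": {"dir B block"},
    "dir B block": {"dir A block"},
}
single_move = {
    "dir A block": {"dir B block"},
    "dir B block": set(),
}
cycle = find_cycle(crossed_moves)
print(cycle)
print(find_cycle(single_move))
```

         <p>With a single move there is a valid write order; with the 
 	two crossed moves neither block can be written first without 
 	breaking one of the ordering rules.</p>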
       </div>
 
       <div class="SECT2">
         <h2 class="SECT2"><a name="AEN54">3.2. Softupdates</a></h2>
 
         <p>To get around this problem, people developed "softupdates".
 	The technique is derived from a paper by Ganger and Patt (from the 
 	University of Michigan).  The *BSD implementation comes from
 	a certain Mr. McKusick (a key player in the original BSD 
 	project).  As far as I understand, the work was reportedly 
 	sponsored by Sun (which is interested in its inclusion in 
 	Solaris), and an agreement was reportedly made: once the code 
 	has been debugged, it will pass to the BSD license, which is 
 	not yet the case at present.  As soon as the change of 
 	license has taken place, FreeBSD will include the code in its
 	kernel by default; currently, it is necessary to recompile the 
 	kernel in order to get softupdates.  I don't know what NetBSD 
 	and OpenBSD will do.  Probably the same.</p>
 
         <p>The principle of softupdates consists in maintaining a 
 	two-level wait queue: updates arrive in a wait buffer first, and 
 	then they pass, one by one, to a second buffer, where dependencies
 	are checked.  If an update completes a dependency loop, it is
 	sent back to the wait buffer to await better times; the rest
 	of the cycle passes to a list with higher update priority.
 	This algorithm is similar to what CVS does in order to merge
 	various modifications of the same file.</p>
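
         <p>A very rough sketch of this two-level scheme, as it is 
 	described in the paragraph above, might look like the following 
 	Python.  It is a toy model of the prose, not of the actual FFS 
 	code: the function names, the tuple layout, and the choice to 
 	treat the blocks a postponed update is waiting for as "the rest 
 	of the cycle" are all assumptions.</p>

```python
from collections import deque


def reaches(deps, start_nodes, target):
    """Return True if target can be reached from start_nodes through deps."""
    stack, seen = list(start_nodes), set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(deps.get(node, ()))
    return False


def stage_updates(wait_buffer):
    """Move updates from the wait buffer to the second-level buffer.

    Each update is (name, block, needs): it modifies `block` and wants
    the blocks in `needs` to be written first.
    """
    deps = {}           # block: blocks that must be written before it
    staged = []         # second-level buffer, in arrival order
    deferred = deque()  # updates sent back to wait
    urgent = set()      # blocks to flush with higher priority
    while wait_buffer:
        name, block, needs = wait_buffer.popleft()
        if reaches(deps, needs, block):
            # This update would close a dependency loop: postpone it
            # and give priority to the blocks it is waiting for.
            deferred.append((name, block, needs))
            urgent.update(needs)
        else:
            deps.setdefault(block, set()).update(needs)
            staged.append(name)
    return staged, deferred, urgent


# Hypothetical crossed moves between directories A and B.
wait_buffer = deque([
    ("move A/x to B", "dir A block", {"dir B block"}),
    ("move B/y to A", "dir B block", {"dir A block"}),
])
staged, deferred, urgent = stage_updates(wait_buffer)
print(staged, [name for name, _, _ in deferred], urgent)
```

         <p>Here the first move is staged normally, while the second, 
 	which would close the loop, goes back to wait until the first 
 	has reached the disk.</p>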
 
         <p>In fact, softupdates entails this:</p>
 
         <ul>
           <li>
             <p>Good filesystem performance, even in the decompression
 	    of numerous small files.  My benchmarks show that decompressing 
 	    the egcs sources takes 10% more time than on ext2, on 
 	    the same machine and on the same portion of the disk -- your
 	    mileage may vary, as the 'Mericans say, with your
 	    hardware.</p>
           </li>
 
           <li>
             <p>Excellent crash tolerance. I would even say that fsck
 	    is guaranteed to recover all by itself, unless the crash
 	    is due to the disk itself (in which case whatever it does
 	    before stopping is immaterial; however, no filesystem
 	    can tolerate that).</p>
           </li>
 
           <li>
             <p>Specifically, the FFS implementation ensures upward 
 	    compatibility: the filesystem is unmounted, then it is 
 	    remounted without softupdates, and this works perfectly. 
 	    That's a painless upgrade.</p>
           </li>
 
           <li>
             <p>Fsck takes a long time, since it has to traverse the entire
 	     filesystem.</p>
           </li>
 
           <li>
             <p>When a file is deleted, its space is not immediately 
 	     freed for reuse; this can take as long as 30 seconds.
 	     This is because the wait buffer is unsorted; therefore the
 	     system does not traverse the information contained therein
 	     when seeking free blocks for its files; when a file is
 	     deleted, its blocks can thus be reallocated only when  
 	     the related update reaches the second-level buffer.
 	     In practice, it is not a big deal, but you might run into
 	     trouble when you do a "make world" (recompilation and 
 	     reinstallation of the base system, on every good BSD 
 	     system): since the whole system is reinstalled in a short
 	     time, the binaries in /bin, in particular, are deleted, 
 	     and new ones are immediately placed there again, which 
 	     produces "frictional occupation" [Cf. "frictional
 	     unemployment": here English parallels French.  N.o.T.]. If
 	     the saturation point is reached, it means trouble. I myself
 	     have run into this case; it is not fiction.  This is 
 	     typical of systems with a small / partition that is 
 	     separate from /var, /usr, and /tmp.</p>
           </li>
         </ul>
 
         <p>Let us note that in the case of fsck it is theoretically 
 	possible to accelerate recovery significantly.  Essentially,
  	it would be a matter of performing updates in such a way that,
 	in case of crash, the only inconvenience consisted in missing 
 	blocks, that is, blocks that had not come back to the free 
 	blocks spool yet, albeit not referenced elsewhere in the 
 	filesystem.  In this case, the filesystem could be reutilized
 	immediately, and fsck could be run in the background.  Here is 
 	what would fulfil point 3.  This possibility has been suggested; 
 	I do not know whether it will be carried out, but it clearly should be
 	[The feature is already implemented in FreeBSD 5-CURRENT.  N.o.T.].</p>
 
         <p>As a whole, softupdates is a fine mechanism, elegant and 
 	effective. Cf. <a href=
         "http://www.ece.cmu.edu/~ganger/papers/CSE-TR-254-95/"
         target=
         "_top">http://www.ece.cmu.edu/~ganger/papers/CSE-TR-254-95/</a></p>
       </div>
 
       <div class="SECT2">
         <h2 class="SECT2"><a name="AEN73">3.3. Log-structured
         filesystems</a></h2>
 
         <p>There are also "log-structured filesystems".  The idea is 
 	simple: all writes (data and metadata) are done in an
 	uninterrupted flow of operations.  Effort is shifted onto  
 	reading, since finding a piece of data may be rather complicated
 	in such a scheme.  Actually, it is necessary to "garbage-collect"
 	the flow of operations (the log) retrospectively in order to 
 	find the requisite information.  There exist some more or less
 	prototypal implementations for BSD and Linux, named LFS
         (cf. <a href=
         "http://collective.cpoint.net/prof/lfs/" target=
         "_top">http://collective.cpoint.net/prof/lfs/</a> for Linux). 
 
         In a filesystem, reads are usually more frequent than writes.
 	This is not the case for what lives in /var/log, where LFS
 	can be practical.  Nevertheless, the use of LFS is marginal.</p>
       </div>
 
       <div class="SECT2">
         <h2 class="SECT2"><a name="AEN77">3.4.
         Journaling</a></h2>
 
         <p>Finally, there is journaling.  Journaling is, as it were,
 	"transactional": when the system wants to make a series of
 	updates, it builds a new version of the related metadata in a
 	different place in the filesystem; then, when this new version
 	(called a "transaction", a concept borrowed from databases) is 
 	ready, it switches to the new version in one "atomic" 
 	operation [atomic relates to "atomos", a Greek word meaning 
 	"indivisible".  Here it indicates that the system switches to
 	the new version (when it is ready) in one single operation, 
 	therefore preventing any possible data corruption or "intermediate" 
 	states.  N.o.T.].  Thus the filesystem is always in a consistent
 	state.</p>
 
         <p>To be more precise [warning: several technical details follow. 
 	N.o.T.]: when metadata updates need to be performed, the new 
 	version is built in a particular region of the disk, namely the
 	journal.</p>  
 
 	<p>Incidentally, in ext3, the journal is a file like 
 	any other, referenced by a special superblock field.  The 
 	final version of ext3 will automatically create the journal
 	if it is not present, and will not show it in the filesystem,
 	which will prevent its accidental deletion.</p>
 
 	<p>The preparation of the new version entails the inclusion of
 	all the requisite items; in particular, if there are any circular
 	dependencies, the whole cycle is within.  Once the new version
 	is ready, a commit operation is performed: the "good" sector 
 	is modified so as to point to the new version instead of the 
 	old one.  Next, the new version is copied over the old one,
 	and a second commit is performed to free its place in the 
 	journal.</p>
 
 	<p>As a side note, you might consider simply leaving the new 
 	version in place and marking that part of the journal as 
 	modified, but this would fragment the journal too much; since 
 	it is always in use, this is not desirable at all.</p>
 
         <p>In case of accidental crash, recovery is necessary, 
 	which consists in traversing the journal in order to:</p>
 
         <ul>
           <li>
             <p>discard transactions not yet finished.</p>
           </li>
 
           <li>
             <p>finish copying transactions for which the first commit, 
             but not the second, has been performed.</p>
           </li>
         </ul>
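
         <p>The two-commit protocol and the recovery pass described 
 	above can be illustrated with a small in-memory model.  This 
 	is a sketch under stated assumptions, not the on-disk format 
 	of ext3 or of any other filesystem: the Journal class, its 
 	method names, and the block names are all hypothetical.</p>

```python
class Journal:
    """Toy model of a metadata journal with two commits per transaction."""

    def __init__(self, disk):
        self.disk = disk    # home locations: block name to contents
        self.entries = []   # transactions currently held in the journal

    def begin(self, new_blocks):
        # Build the new version of the metadata inside the journal.
        txn = {"blocks": dict(new_blocks), "committed": False}
        self.entries.append(txn)
        return txn

    def commit(self, txn):
        # First commit: the new version becomes the valid one.
        txn["committed"] = True

    def checkpoint(self, txn):
        # Copy the new version over the old one, then perform the
        # second commit, which frees the space used in the journal.
        self.disk.update(txn["blocks"])
        self.entries = [t for t in self.entries if t is not txn]

    def recover(self):
        # After a crash: finish copying committed transactions and
        # discard those that never reached their first commit.
        for txn in self.entries:
            if txn["committed"]:
                self.disk.update(txn["blocks"])
        self.entries.clear()


disk = {"inode 7": "old", "dir A": "old"}
journal = Journal(disk)

first = journal.begin({"inode 7": "new"})
journal.commit(first)                 # first commit done, copy not done
second = journal.begin({"dir A": "new"})
# Simulated crash here: `second` was never committed.

journal.recover()
print(disk)
```

         <p>After recovery, only the committed transaction has been 
 	applied, and the journal is empty again; the work done is 
 	proportional to the size of the journal, not to the size of 
 	the filesystem.</p>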
 
         <p>Since the journal is typically 100 times smaller than the 
 	filesystem, recovery is very fast (it's the difference between
 	30 seconds and an hour).</p>
 
         <p>You will notice that, at the end of the process, each piece
 	of metadata is written twice (and read once, but memory buffers 
 	are nevertheless useful in this instance).  In the case of 
 	metadata, that is not a serious issue, since metadata is small,
 	so it is the time to move the disk heads [i.e. seek latency] 
 	that is important.  Since everything works asynchronously 
 	between two commits, the kernel optimizes this sort of thing
 	very well. That's why a journaling filesystem is (nearly) as 
 	fast as an FFS with softupdates (in fact, it can be shown 
 	that softupdates remains faster so long as the system has a 
 	good amount of memory, but the reverse applies when it swaps
 	heavily), the difference in speed being very small, smaller
 	than that between ext2 and ffs/softupdates.</p>
 
         <p>On the other hand, ext3 in its present form (0.0.2d) is 
 	also a journaling filesystem.  In this instance, the problem
 	of double writes is noticeable [ext3 journals both data and
 	metadata.  N.o.T.], and solving it would in effect 
 	halve the time needed to write a file.  This problem
 	will be solved in a later version (0.0.4 in theory -- in fact,
 	the code already exists, but has not been sufficiently tested 
 	to be activated with reasonable safety).  There are various 
 	safety issues to be taken into account.  When an uncommitted 
 	transaction is rejected, problems may arise if the 
 	blocks have already begun to fill with data from another file.
 	It is rather awkward to find pieces of a privileged file
 	inside another file.  Stephen Tweedie (the developer of ext3) says 
 	that he has thought about this, and that the necessary framework 
 	has already been put in place.</p>
 
         <p>There are other journaling filesystems, apart from ext3.
         Linux has ReiserFS, whose latest version includes a journaling
 	layer handling only metadata.  Reiser, its author, has been 
 	heard ranting about a new form of super-journaling which cleans
 	all this up.  This super-journaling will be present in the next
 	version of ReiserFS.  Apart from those, at least two other 
 	operating systems have had journaling filesystems in their 
 	"production" versions for some time: Tru64 (the former OSF, 
 	Digital/Compaq's Unix for Alpha) has advfs, which works rather
 	well, and Windows NT has ntfs.  The latter has been present for at
 	least five years and is really robust [fortunately, since
 	NT has a tendency to crash often - N.o.A.].  Furthermore, SGI
 	is porting its journaling filesystem (XFS) to Linux, and it is
 	beginning to distribute the code under the GPL; IBM has also 
 	joined the party, with its JFS (which comes from AIX).</p>
 
 	<p>Journaling makes it possible to attain points 1 to 4.
 	On the other hand, ext3 remains compatible with ext2:  an ext3
 	filesystem can be unmounted and then remounted as ext2; which
 	works seamlessly.  In my opinion, ext3 will be superior to 
 	softupdates when pure metadata journaling has been implemented, 
 	unless "background fsck" has been set up for softupdates [it 
 	actually is, under FreeBSD 5.0 -CURRENT.  N.o.T.].  There 
 	might be other factors that can make a difference, though.  
 	For example, ext2/3 is simpler and requires less CPU and code
 	in order to run; but ffs has a better directory structure 
 	(binary tree instead of a linear list), which speeds up write
 	access to directories containing a large number of files (e.g.
 	a traditional news spool).  Even in this instance, the OS plays
 	an important role, Linux having a tendency to smooth over 
 	certain difficulties thanks to dcache [Linux's VFS layer 
 	maintains a cache of currently active and recently used names. 
 	This cache is referred to as the dcache.  N.o.T.].</p>
 
         <p>Journaling is also an elegant means of not losing one's 
 	metadata.  I am very fond of the transactional features.  
 	Moreover, I have been using ext3 for all my partitions 
 	(except /tmp) for several months, without any problems.  In 
 	this case, too, <i class="EMPHASIS">your mileage may vary</i>.</p>
       </div>
     </div>
 
     <div class="NAVFOOTER">
       <hr align="LEFT" width="100%">
 
       <table summary="Footer navigation table" width="100%" border=
       "0" cellpadding="0" cellspacing="0">
         <tr>
           <td width="33%" align="left" valign="top"><a href=
           "x30.html" accesskey="P">Previous</a></td>
 
           <td width="34%" align="center" valign="top"><a href=
           "index.html" accesskey="H">Summary</a></td>
 
           <td width="33%" align="right" valign="top"><a href=
           "x95.html" accesskey="N">Next</a></td>
         </tr>
 
         <tr>
           <td width="33%" align="left" valign="top">Write Caching 
           and reboot</td>
 
           <td width="34%" align="center" valign="top">&nbsp;</td>
 
           <td width="33%" align="right" valign="top">Other Questions</td>
         </tr>
       </table>
     </div>
   </body>
 </html>
 
 
 ---MOQ1022340550bbe564e284cb9c4c0461b687f576a955--
