Date:      Fri, 14 Dec 2001 12:58:58 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Carl Schmidt <carl@slackerbsd.org>
Cc:        Brett Glass <brett@lariat.org>, Hiten Pandya <hitmaster2k@yahoo.com>, Brad Knowles <brad.knowles@skynet.be>, chat@FreeBSD.ORG, phk@FreeBSD.ORG, grog@FreeBSD.ORG
Subject:   Re: IBM suing (was: RMS Suing was [SUGGESTION] - JFS for FreeBSD)
Message-ID:  <3C1A6812.A0EFCB73@mindspring.com>
References:  <a05101013b83fd20c4206@[10.0.1.22]> <4.3.2.7.2.20011214123703.02ad7290@localhost> <20011214194909.GA2943@Carbon.SlackerBSD.ORG>

carl@slackerbsd.org wrote:
> This has been beat to death over and over but people still do not understand
> that softupdates will not minimize data loss. It guarantees metadata to be
> written, not `normal' data. The quick intro to softupdates on Kirk McKusick's
> site clearly states this fact: http://www.mckusick.com/softdep/. Softupdates
> will delay writing of data which is why you get the speed increase but if you
> pull the plug on the machine in the middle of something trying to write to the
> disk you may lose the data it was trying to write. It is very simple to prove
> by doing it. Try doing something like extracting a tarball and powering the
> machine off in the middle of it then see what fsck says about the unclaimed
> blocks and whatnot.

FWIW: By default, JFS operates by journalling metadata updates,
but not data updates (it can operate in one of three modes).  See
the article on http://www.ibm.com/developerworks/ for details.


The major value in JFS is that it exports a transactioning interface
to user space.  Soft Updates could have done this (by implying an
edge to a synthetic dependency) but didn't.  This was one of my
original complaints with the soft updates implementation in FreeBSD,
since, as well as not exporting such an interface to user space, it
did not export such an interface at the VFS boundary layer, which
means that it can't span stacking modules, even if both of them
support soft updates, without introducing a serialization barrier.

The point in a transactioning interface to the application is that
you can know whether a given transaction has been committed to
stable storage, or not, and delay your response to one of many
clients until it has been committed.  In UNIX systems without such
an interface, you usually see a lot of "fsync" or "sync" operations.
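
To make the "fsync" pattern concrete, here is a minimal sketch of what
an application ends up doing when there is no transactioning interface.
It is an illustration only: commit_record() and handle_request() are
hypothetical names, and the only real calls are Python's os.open,
os.write, os.fsync and os.close.

```python
import os
import tempfile


def commit_record(path, payload):
    """Append payload to path; return only after fsync() has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(payload)
        while view:                      # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # ask the OS to push the file to stable storage
    finally:
        os.close(fd)


def handle_request(path, payload):
    """Hypothetical server handler: acknowledge only after the commit."""
    commit_record(path, payload)         # raises OSError if the flush fails
    return b"OK\n"                       # the reply is built only after fsync returned


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "journal.log")
        print(handle_request(log, b"client-1: update\n"))
```

Every client waits for its own flush of the whole file, which is the
cost a per-transaction commit notification would avoid.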


JFS also fails to solve the "chicken and egg" problem of recovery
following a failure (even if journalling of user data is enabled,
rather than the default of just metadata).  Soft updates has this
problem, too.

The problem is that you want to recover from a failure to a known
good state.  But you can't always tell the reason for the failure.

If the reason is a hardware or controller error, rather than, for
example, a power failure, then you need to perform a full fsck to
recover.  But how do you tell a power failure from some other data
corruption related failure (e.g. a panic from a wild pointer that
causes pending journal data to be written corrupted, or an unrecoverable
media error of some kind)?

Most high end hardware handles this by logging a failure code to
NVRAM, which it can then use to know whether recovery will require
a full check, or not (the default value at startup is "full check
required", so if it fails catastrophically, a full check is done).

For power failure, this requires specific power supply capabilities
to handle; it requires AC fail notification, with sufficient DC
holdup to write the failure cause out, before hard stop.  This is
usually done via lithium-ion battery-backed RAM, since CMOS takes
a lot of power and is slow to write... but I've seen CMOS used, as
well.

There is a semi-useful workaround, but it requires that the system
is relatively quiescent, so that at the time of failure, it can be
in a recoverable state.

The way it works is called "soft read-only", and it's implemented
by flushing all data out, marking the FS clean, and setting a
"soft read-only" flag on the in-core superblock.  Then, if anything
wants to write to the disk while it is in this state, the FS first
has to be marked dirty; only after that is committed to stable
storage is the soft read-only bit cleared and the write operation
allowed to continue.
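
The ordering is the whole trick, so here is a small simulation of it.
This is a sketch, not FreeBSD code: SoftReadOnlyFS and its fields are
hypothetical, and "stable storage" is just a Python list that records
the order in which things would have reached the disk.

```python
class SoftReadOnlyFS:
    """Toy model of the soft read-only state machine."""

    def __init__(self):
        self.disk_clean = False      # clean flag in the on-disk superblock
        self.soft_ro = False         # flag in the in-core superblock only
        self.dirty_buffers = []      # data not yet written out
        self.stable_log = []         # order in which writes reached the disk

    def go_quiescent(self):
        """Flush everything, mark the FS clean, enter soft read-only."""
        for buf in self.dirty_buffers:
            self.stable_log.append(("data", buf))
        self.dirty_buffers.clear()
        self.disk_clean = True
        self.stable_log.append(("superblock", "clean"))
        self.soft_ro = True

    def write(self, buf):
        """Accept a write; leave soft read-only first if necessary."""
        if self.soft_ro:
            # The dirty mark must be on stable storage before any data moves.
            self.disk_clean = False
            self.stable_log.append(("superblock", "dirty"))
            self.soft_ro = False
        self.dirty_buffers.append(buf)

    def needs_fsck_after_crash(self):
        """What a reboot would conclude from the on-disk superblock."""
        return not self.disk_clean


if __name__ == "__main__":
    fs = SoftReadOnlyFS()
    fs.write("block-a")
    fs.go_quiescent()
    print(fs.needs_fsck_after_crash())   # crash while idle
    fs.write("block-b")
    print(fs.needs_fsck_after_crash())   # crash after a new write
```

A crash while the flag is set finds a clean superblock on disk; a crash
at any point after the first new write finds it dirty.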

This is very trivial to implement; I'm very surprised that FreeBSD
doesn't have it already.


In any case, this doesn't help with servers where writes are common,
since they are, by the intrinsic nature of servers, rarely quiescent;
if, on the other hand, writes are rare (e.g. a web server serving
mostly static content), then it should be quite useful.

"Soft read-only" avoids the problem, since you only have to know the
failure cause in the case that you have a dirty FS; a clean FS will
not have bad data on it, so you are safe to start without a fsck.


It should be noted that the Soft Updates implementation and metadata-only
journalling share the implementation detail that, following a
soft recoverable crash (e.g. a power failure or non-FS, VM, or paging
path code related panic), they can clean in the background, since the
"uncleanliness" will be detectable overallocations (in soft updates,
the cylinder group bitmaps will have "allocated" bits falsely set,
which can be cleaned by locking access on a per cylinder group basis,
and running in the background, with little or no system impact,
depending on access locality while the cleaner is touching a particular
cylinder group).
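
As a rough picture of the background cleaner, consider the toy model
below.  It is a sketch under loud assumptions: CylinderGroup,
clean_group() and background_clean() are hypothetical names, block
numbers stand in for bitmap bits, and the set of referenced blocks is
assumed to come from a consistent view of the inodes (how that view is
obtained is outside the sketch).

```python
import threading


class CylinderGroup:
    """One cylinder group: its 'allocated' bitmap and a per-group lock."""

    def __init__(self, allocated):
        self.allocated = set(allocated)
        self.lock = threading.Lock()


def clean_group(cg, referenced):
    """Clear bits that are set in the bitmap but owned by no inode."""
    with cg.lock:                        # only this group is blocked
        leaked = cg.allocated - referenced
        cg.allocated -= leaked
    return leaked


def background_clean(groups, referenced):
    """Walk the groups one at a time, holding one lock at a time."""
    freed = set()
    for cg in groups:
        freed |= clean_group(cg, referenced)
    return freed


if __name__ == "__main__":
    groups = [CylinderGroup({1, 2, 3}), CylinderGroup({8, 9})]
    referenced = {1, 2, 8}               # blocks actually claimed by inodes
    print(sorted(background_clean(groups, referenced)))
```

Because the only damage is bits that are falsely set, the cleaner never
has to take anything away from a live file; it only returns space.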


Really, journalling and soft updates should be considered complementary
technologies (e.g. soft updates prevents disk accesses, which, if your
system is IDE based, will otherwise have to occur serially, and thus
slow down accesses; this is not usually a problem with JFS, since "big
iron" generally runs SCSI disks, anyway).  But they both fail to deal
adequately with unexpected failures that require a hard recovery.

As a final note, intention logging is antithetical to Soft Updates,
but is required if you want to be able to roll interrupted transactions
forward on recovery.  The reason it doesn't mix very well is that it
requires writing the intention to stable storage, and then making it
"active" at the end of the complete transaction.

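For anyone who has not met intention logging, the sketch below shows
the two-step shape of it.  It is a toy model, not the JFS log format:
the record layout and the names log_transaction() and recover() are
made up, and a Python list stands in for the stable log.

```python
def log_transaction(log, txid, updates):
    """Write every intention first, then the record that makes them active."""
    for key, value in updates:
        log.append(("intent", txid, key, value))
    log.append(("commit", txid))         # end of the complete transaction


def recover(log, state):
    """Roll forward committed transactions; ignore interrupted ones."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:
        if rec[0] == "intent" and rec[1] in committed:
            _, _, key, value = rec
            state[key] = value
    return state


if __name__ == "__main__":
    log = []
    log_transaction(log, 1, [("a", 1), ("b", 2)])
    log.append(("intent", 2, "a", 99))   # crash before transaction 2 committed
    print(recover(log, {}))
```

The intentions have to reach stable storage before the commit record
does, and that forced write is the part that sits badly with delaying
writes the way Soft Updates does.
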
-- Terry
