Date:      Mon, 1 Nov 1999 21:10:03 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        grog@lemis.com (Greg Lehey)
Cc:        don@calis.blacksun.org, bright@wintelcom.net, freebsd-fs@FreeBSD.ORG
Subject:   Re: Journaling
Message-ID:  <199911012110.OAA03339@usr02.primenet.com>
In-Reply-To: <19991027095431.45462@mojave.worldwide.lemis.com> from "Greg Lehey" at Oct 27, 99 09:54:31 am

> > Kirk McKusick has been working for the last year or so on
> > a combination of "soft-updates" (complete) and "snapshots"
> > (not released yet), once complete FFS will have the equivelant
> > of logging AND snapshots like the netapp appliance.
>
> I am familiar with softupdates but not with snapshots.

Snapshots are where you put a peg in the soft updates clock
and export the state as of the peg.  This lets you have a
consistent copy of the filesystem state, guaranteed, which will
not mutate out from under you while you are, for example, doing
a backup of the system.
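
As a rough illustration of the "peg" idea, here is a toy block store
that preserves the old contents of a block the first time it is
overwritten after a snapshot is taken.  This is a generic
copy-on-write sketch, not a description of how the FFS snapshot code
works; the class and method names are made up for the example.

```python
class ToyVolume:
    """Toy block store with copy-on-write snapshots (illustrative only)."""

    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks
        self.snapshots = []  # each snapshot: dict of block number -> old contents

    def snapshot(self):
        # "Put a peg in": from here on, old contents are preserved.
        snap = {}
        self.snapshots.append(snap)
        return snap

    def write(self, blkno, data):
        # Save the pre-write contents once per snapshot, then overwrite.
        for snap in self.snapshots:
            snap.setdefault(blkno, self.blocks[blkno])
        self.blocks[blkno] = data

    def read_snapshot(self, snap, blkno):
        # Blocks untouched since the peg are read from the live volume.
        return snap.get(blkno, self.blocks[blkno])


vol = ToyVolume(4)
vol.write(0, b"old")
peg = vol.snapshot()
vol.write(0, b"new")  # the live volume moves on; the snapshot does not
```

A backup that reads through read_snapshot(peg, ...) sees the volume
as of the peg, no matter what is written afterwards.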

This is a far cry from journalling, which, unless you do an
LRU on your journal allocations, doesn't have the capability
for "snapshots" (which would be "all journal entries prior to
the time of the snapshot").


> The reason for starting a new project was basically to once
> and for all get rid of UFS.

I assume you mean the on-disk structure.  Having been in the
bowels of VXFS (Veritas) in SVR4.2, I can guarantee you that
the on-disk directory structure is derived from the SVR4 UFS
implementation, and that the only real changes are to the
way inodes and inode data are stored.


> While there is nothing wrong with UFS it does have some limitations which
> I would like to eliminate such as a limit of 7 slices.

This is a limit of the disklabel partitioning scheme; you might
as well say you want to address the 4 partition limit in the FAT
FS, since it bears the same relation.

The big things that journalling buys you over soft updates or
logging are:

1)	The ability to come back up at the last valid journalled
	state, without checking the FS.  Like soft updates and
	LFS, this only works if you can tell the difference
	between a panic and a power failure; otherwise, you
	still need a full fsck.  If you know this, then it
	saves you the background cleanup of the cylinder group
	bitmaps that soft updates requires, and the background
	"cleanerd" that LFS requires.

2)	The ability to roll things forward following a crash,
	inasmuch as you know them to be true.  This saves
	you in the case of implied state between user files,
	without a synchronous commit process in effect (e.g.
	an index file for a record file).
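
To make point 2 concrete, here is a minimal roll-forward sketch.  It
assumes a journal of tuples with an explicit commit record per
transaction; the record layout, the replay() name and the "disk"
dictionary are all invented for the example and do not correspond to
any particular JFS.

```python
def replay(journal, disk):
    """Apply only committed transactions from `journal` to `disk`, in order."""
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    for rec in journal:
        if rec[0] == "write" and rec[1] in committed:
            _, txid, key, value = rec
            disk[key] = value
    return disk


# Transaction 1 updates a record file and its index together and commits.
# Transaction 2 wrote its record but crashed before the index update
# and the commit, so it must not be rolled forward.
journal = [
    ("write", 1, "record:7", "alice"),
    ("write", 1, "index:alice", 7),
    ("commit", 1),
    ("write", 2, "record:8", "bob"),
]
disk = replay(journal, {})
```

After the replay the index and the record file agree with each
other, which is the implied state between user files that the
journal protects.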

> I would also like to add functionality such as the ability to
> grow and shrink partitions etc.

You can actually grow partitions with FFS.  Der Mouse has written
a program to extend FFS size, and it is publicly available for
download.

The problem that arises is that the relative fragmentation rates
for the old and new zones are not constant.  If you think of the
block allocation process as a hashing process, you effectively
hash the blocks onto the disk.

The original reason for a large free reserve was based on Knuth's
"Sorting and Searching" (volume 3 of The Art of Computer
Programming), which states that a hash fill in excess of 85% is
the point of diminishing returns for a perfect hash.  This actually
means that the correct free reserve for a hard disk, for optimal
performance, is 15%, which is almost twice the 8% set by MINFREE
in fs.h (whose comments are wrong now, as well).
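
To put rough numbers on the fill argument, the sketch below uses
Knuth's estimate for the expected number of probes in an
unsuccessful search under linear probing, (1 + 1/(1 - a)^2) / 2 for
fill a.  Treating the block allocator as a linear-probing hash is an
assumption made purely for illustration; the real allocator is not
one, so only the shape of the curve is meaningful.

```python
def probes_unsuccessful(fill):
    """Knuth's linear-probing estimate for an unsuccessful search."""
    return 0.5 * (1.0 + 1.0 / (1.0 - fill) ** 2)


for reserve in (0.15, 0.08):
    fill = 1.0 - reserve
    print(f"reserve {reserve:.0%}: fill {fill:.0%}, "
          f"about {probes_unsuccessful(fill):.1f} probes")
```

Under that assumption the cost goes from roughly 23 probes at 85%
fill to roughly 79 at 92% fill, which is why the last few percent of
the disk are so expensive to use.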

So effectively, someone needs to write a defragger.  This is
actually quite trivial to do, it's just a lot of grunt work,
and the danger of a bug is rather amplified, so a lot of rigor
would be needed, as well.

The case of shrinking the available space is trivial, given a
defragger, since you can easily define a "no fly zone" for the
defragmentation process to get the data moved out of the
region that you are going to take away.
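
A toy version of the "no fly zone" idea: every allocated block at or
above a cutoff is moved into free space below it, after which the
tail can be cut off.  The block map layout and the evacuate() name
are hypothetical; a real tool would also have to rewrite inodes,
indirect blocks and the cylinder group bitmaps, and copy the data.

```python
def evacuate(block_map, free, cutoff):
    """Move allocated blocks at or above `cutoff` into free space below it.

    block_map: dict mapping (file, logical block) -> physical block number
    free:      set of free physical block numbers (updated in place)
    Returns the list of (old, new) physical block moves.
    """
    victims = sorted((phys, key) for key, phys in block_map.items()
                     if phys >= cutoff)
    targets = sorted(b for b in free if b < cutoff)
    if len(victims) > len(targets):
        raise RuntimeError("not enough free space below the cutoff")
    moves = []
    for (phys, key), new in zip(victims, targets):
        block_map[key] = new
        free.discard(new)
        moves.append((phys, new))
    # Blocks at or above the cutoff no longer belong to the file system.
    free.difference_update([b for b in free if b >= cutoff])
    return moves


# An 8-block disk being shrunk to 5 blocks.
block_map = {("a", 0): 1, ("a", 1): 6, ("b", 0): 7}
free = {0, 2, 3, 4, 5}
moves = evacuate(block_map, free, cutoff=5)
```

The capacity check is done before anything is moved, so a shrink
that cannot succeed leaves the map untouched.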


In any event, this is unrelated to the idea of journalling.


> Softupdates is also not recommended for use on the root partition and

This is actually a chicken-and-egg problem with setting the bit,
not really an issue of "not being recommended for the root fs";
it's a bit hard to tunefs /.  It's likely that the integration
of character and block devices will make it impossible, without
a separate boot, since you will no longer be able to "cheat".


> it still seems to be just a little flaky. Every once in a while I wind up
> with a problem which I have traced to softupdates but which I could 
> not recreate. (To be fair I have not had a problem in a month or two now)

I think these are more VM issues, than anything else; when things
change, they tend to break where they are most fragile.  The order
guarantees in soft updates must be rigidly enforced by the systems
on which it depends.

If you are not running a UPS, and you are using soft updates, you
should make sure to turn off write-caching on your disk drive,
since it doesn't do cache flush ("committed to stable storage")
notification, and the cache flush operation, if exported by the
drive, is not integrated, so soft updates can neither force a
flush at a synchronization point, nor can it intentionally stall
writes over a synchronization point, pending flush notification.

In any case, so long as you use it correctly, you should not be
experiencing any problems, and I'm sure many of us would be very
interested in knowing about any problems you see (Julian and Kirk,
especially).


Again, soft updates is a contention resolution technology that
is used to guarantee the ordering of metadata writes.  I believe
that there are good technical arguments why you might want to
use soft updates technology, even if you had journalled metadata,
to allow dependency ordered log data to be logged on a clock
tick rather than on a synchronization point, and to ensure that
the journalling process itself does not become a bottleneck.
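
As a very simplified picture of what "ordering of metadata writes"
means, the sketch below orders the writes for a file creation so
that nothing on disk points at a structure that has not been written
yet.  The three-step dependency chain is illustrative, and a plain
topological sort is a stand-in chosen for the example: the real soft
updates code tracks dependencies per update and is considerably more
involved.  graphlib needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each key must not reach the disk before the items in its set.
deps = {
    "directory entry": {"initialized inode"},
    "initialized inode": {"inode bitmap"},
}

# static_order() yields prerequisites before the writes that need them.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

If the machine dies between any two of these writes, the worst case
is an allocated but unreferenced inode, never a directory entry that
points at garbage.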

That said, without a distributed cache coherency protocol, you
would potentially have to give up some goals, such as multiple
machine access to the same filesystem over a shared SCSI bus,
like XFS, for example.



> > In so far as codebase there is the LFS project, currently 
> > fixed (afaik) in NetBSD, perhaps porting that to FreeBSD
> > would be worthwhile.
>
> This is indeed going to be the starting point for this project but
> I hope I would be able to take it far beyond this.

Logging and journalling are very different animals, even if some
of the tricks that both do are conceptually similar.

I would actually _discourage_ using the LFS as a starting point
for a JFS, since I believe that it would limit your options in a
number of subtle, but important ways.

Also note that XFS is log structured (they have posted their
logging code under GPL, up at SGI, as a "teaser" while they
"clean" the remainder of their code of encumbrances, presumably
USL).

Actually, AIX has a device driver writer's supplementary guide,
which comes with source code for an MFS for AIX, and goes into
great detail about the AIX GFS (think file system switch) abstraction,
and into some detail on the AIX JFS, as well.  I was able to, for
example, reverse engineer the entry points for the file locking
code, which was not externalized in AIX 4, in support of a shared
file descriptor pool that could be used by multiple processes -- a
poor man's "rfork".  You have to order this book separately, since
it doesn't come with the full documentation set.


You might also want to look at the NTFS implementation, as it
is described in the thin (about 1/4 inch thick) Helen Custer
book.

I believe that kernel changes, and in particular, changes to the
way VOP_ABORT has to be called and implemented for journalling,
will be necessary.  It may be easier for you to make these changes
with a partially working example, by making the existing NTFS code
read/write instead of read-only.  Don't despair: Linux is going to
require much more extensive VFS changes to support journalling
than FreeBSD, so you are ahead of the game, even though the Linux
JFS project is already under way.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message



