Date:      Wed, 25 Aug 1999 02:17:04 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        abial@webgiro.com (Andrzej Bialecki)
Cc:        dillon@apollo.backplane.com, freebsd-hackers@FreeBSD.ORG
Subject:   Re: Possibility of increasing default MAXPARTITIONS from 8 to 16
Message-ID:  <199908250217.TAA17644@usr01.primenet.com>
In-Reply-To: <Pine.BSF.4.05.9908250218220.61896-100000@freja.webgiro.com> from "Andrzej Bialecki" at Aug 25, 99 02:23:14 am

> > :I know it's not the answer, it's just related question: do you know
> > :perhaps of any initiatives (except XFS) that could significantly shorten
> > :time it takes fsck to check big filesystems, let's say 64GB? As it is now,
> > :it's almost unbearable. I naively thought softupdates would (almost)
> > :eliminate the need to do fsck...
> > :
> > :Andrzej Bialecki
> > 
> > Eventually Kirk is planning for softupdates to allow you to run a special 
> > version of fsck in the background to clean up the block bitmap on a live 
> > filesystem.  The time frame for this project is not known.
> >
> > Another possibility would be to mark individual cylinders clean/dirty
> > to reduce the amount of work fsck must do on a normal filesystem.  It 
> > would be a pretty hefty project for someone to take on, though.
> 
> Hmm.. If I understand you correctly:
> 
> * the ffs code would have to be modified to mark cylinder groups "dirty"
> when there are writes to that CG.
> 
> * on unmount, after the buffers are flushed they would be marked
> clean.
> 
> * on mount all "clean" flags in CGs would have to be checked (instead of
> the single bit)
> 
> * fsck would have to be modified to recognize CG "clean" flag and prune
> only those CGs.
> 
> Overall, doesn't sound _that_ complicated... but most probably I'm missing
> something.


When a system crashes, the dirty bit is left set.

With the dirty bit set, you can't trust the FS contents, so
you can't use them to distinguish between a crash that was the
result of a software failure and one that was the result of a
hardware failure.

Because of this, you must assume a hardware failure, and engage
in a full check.


In the case of a software failure, the cylinder group bitmaps
may, in fact, have bits indicating that things which are not
truly allocated have in fact been allocated.

The process of traversing these (locking each CG as you do so)
to clear the bits on things that were never truly allocated is
the "fsck in the background" operation which is permissible
following a software failure which leaves the dirty bit set for
the FS.
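
As a rough sketch of that traversal (not Kirk's code; the struct,
the 64-block CG, and the function name are all hypothetical), assume
a "referenced" bitmap has already been built by walking the inodes'
block pointers.  A block that is referenced but marked free is
treated here, as my assumption, as evidence of something worse than
a software failure, so the sketch refuses to touch the CG:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified cylinder group: one bit per block, 64 blocks. */
struct cg_sketch {
    uint64_t allocated;  /* bitmap as found on disk: 1 = marked allocated */
    int      locked;     /* stand-in for the per-CG lock */
};

/*
 * Clear bits for blocks that are marked allocated but that no inode
 * references.  Returns the number of blocks reclaimed, or -1 if some
 * referenced block is marked free (assumption: that needs a full fsck).
 */
static int
cg_reclaim_unreferenced(struct cg_sketch *cg, uint64_t referenced)
{
    uint64_t stale;
    int reclaimed = 0;

    cg->locked = 1;                       /* lock the CG while we work on it */
    if ((referenced & ~cg->allocated) != 0) {
        cg->locked = 0;
        return -1;                        /* bitmap left untouched */
    }
    stale = cg->allocated & ~referenced;  /* allocated, but never truly used */
    cg->allocated &= referenced;          /* clear exactly those bits */
    while (stale != 0) {                  /* count the bits we cleared */
        stale &= stale - 1;
        reclaimed++;
    }
    cg->locked = 0;
    return reclaimed;
}
```

The only repair made is clearing bits; nothing is ever marked
allocated by this pass, which is what makes it plausible to run
against a live FS one CG at a time.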



There are two rational methods for getting around this problem;
the first was suggested by Ganger and Patt, Matt Day, Mark
Muhlestein, myself, and others: "soft read-only".

A "soft read-only" implementation was done (by Kirk) for BSDI.
The basic idea is to mark the in-core superblock read-only
after there are no dirty buffers left associated with an FS,
and then mark the on-disk structure clean.  When a write (or
a read, since you must obey POSIX atime semantics) occurs,
you must mark the FS dirty and _be certain this write has
been committed to disk_, before clearing the "soft read-only"
flag and allowing the dirtying operation to complete.
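
The ordering can be sketched as a small state machine.  This is not
the BSDI code; every name here is hypothetical, and sb_write_sync()
merely stands in for a synchronous superblock write that does not
return until the data is on stable storage:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-FS state; none of these are real FFS field names. */
struct fs_sketch {
    bool soft_ro;      /* in-core "soft read-only" flag */
    bool disk_clean;   /* clean flag as it exists on disk */
    int  dirty_bufs;   /* dirty buffers still associated with the FS */
    int  sync_writes;  /* synchronous superblock writes issued so far */
};

/* Stand-in for a synchronous superblock write (assumed durable on return). */
static void
sb_write_sync(struct fs_sketch *fs, bool clean)
{
    fs->disk_clean = clean;
    fs->sync_writes++;
}

/* Called when the FS looks idle; returns true if it is now soft read-only. */
static bool
fs_try_quiesce(struct fs_sketch *fs)
{
    if (fs->soft_ro)
        return true;
    if (fs->dirty_bufs != 0)
        return false;          /* still dirty, cannot be marked clean */
    fs->soft_ro = true;        /* in-core flag first, so nothing slips in */
    sb_write_sync(fs, true);   /* then mark the on-disk structure clean */
    return true;
}

/* Called before any operation that dirties the FS (atime updates included). */
static void
fs_begin_dirty(struct fs_sketch *fs)
{
    if (fs->soft_ro) {
        sb_write_sync(fs, false);  /* dirty mark must reach the disk first */
        fs->soft_ro = false;       /* only now may the operation proceed */
    }
    fs->dirty_bufs++;
}

/* Called when one dirty buffer has been written back. */
static void
fs_buf_flushed(struct fs_sketch *fs)
{
    assert(fs->dirty_bufs > 0);
    fs->dirty_bufs--;
}
```

The cost is one extra synchronous write on the first dirtying
operation after an idle period, which is why the scheme pays off
mostly on disks that spend long stretches idle.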

An implementation of this is pretty trivial on a normal system,
and Matt, Mark, and I implemented such a beast for our
Windows 95 port of the Heidemann framework and the BSD FFS
(and the Ganger/Patt Soft updates code).

This gives you a sort of "statistical protection", which is
most useful for a single user desktop box (e.g. Windows 95),
where the box's disks are frequently idle for large stretches
of time, and therefore in the state "clean, soft-read-only".

For FreeBSD, the problem is complicated by the FS metadata's
dirty buffers being hung off the device vnode, rather than
being truly separate data.  This means that you must sync
out that data, as well, before you can mark the FS clean
(and you must resync out similar data to be sure the dirty bit
has been correctly set, before proceeding with other writes).
For the Windows 95 port of the code, there was no unified VM
and buffer cache to have to worry about in this regard.



Apart from "soft read-only", you can obtain, at the cylinder
group level, separate "clean bits" on a per cylinder group
basis.

For this to work in the face of a true hardware failure, you
must engage in a two stage commit process, in which you mark
the entire FS dirty, modify the state of the cylinder group
clean bit, and then mark the FS clean.
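
A minimal sketch of that ordering, with hypothetical names and a
toy FS of four CGs.  Each assignment stands for a synchronous disk
write; the "steps" argument exists only so a crash part way through
can be simulated.  The recovery rule (FS dirty means no per-CG bit
is trusted) is my reading of the argument above, not anything from
existing fsck code:

```c
#include <assert.h>
#include <stdbool.h>

#define NCG 4  /* toy number of cylinder groups */

/* Hypothetical on-disk flags. */
struct disk_sketch {
    bool fs_clean;       /* FS-wide clean bit */
    bool cg_clean[NCG];  /* per cylinder group clean bits */
};

/*
 * Change one CG clean bit under the protection of the FS-wide bit.
 * "steps" is how many of the three ordered writes complete before a
 * simulated crash; 3 means the update ran to completion.
 */
static void
cg_set_clean(struct disk_sketch *d, int cg, bool clean, int steps)
{
    assert(cg >= 0 && cg < NCG);
    if (steps >= 1)
        d->fs_clean = false;      /* stage 1: mark the entire FS dirty */
    if (steps >= 2)
        d->cg_clean[cg] = clean;  /* modify the CG clean bit */
    if (steps >= 3)
        d->fs_clean = true;       /* stage 2: mark the FS clean again */
}

/* After a crash: how many CGs does fsck have to examine? */
static int
cgs_to_check(const struct disk_sketch *d)
{
    int i, n = 0;

    if (!d->fs_clean)
        return NCG;  /* interrupted update: assume the worst, check all */
    for (i = 0; i < NCG; i++)
        if (!d->cg_clean[i])
            n++;
    return n;
}
```

With a completed update only the CGs marked dirty are examined; a
crash between the two stages falls back to the full check.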

This works in the face of software failures for cylinder
group operations.  To make it robust in the face of hardware
failures, you must have a separate "dirty-but-ok" bit for
the cylinder group, which is similarly protected, and which
is reset (after resetting the FS dirty bit, after resetting
the CG dirty bit) during updates to non-CG bitmap data.
Failure to support this leaves you unable to verify the state
of the non-bitmap data in the CG bitmap, particularly for
files whose block pointers span cylinder groups.

Processing of cleanup is further complicated by the fact that
any file that spans a "dirty, dirty" CG after a software
failure must be treated as if it had been involved in a hardware
failure.  With a large number of files, the benefits gained
by this approach are small.

Aside:

I was under the impression from the Usenix reports that Kirk's
checkpointing mechanism was a reference to the ability to stall
an image of the FS as an exposed "snapshot", to allow for backups
to occur on running FS's (and if the backups were "taking too long",
that regular soft-updates operations would eventually stall as a
result).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

