Date:      Tue, 07 Jun 2005 14:52:44 -0600
From:      Scott Long <scottl@samsco.org>
To:        David Malone <dwmalone@maths.tcd.ie>
Cc:        scottl@FreeBSD.org, hackers@FreeBSD.org, Pawel Jakub Dawidek <pjd@FreeBSD.org>, phk@FreeBSD.org, Ivan Voras <ivoras@fer.hr>
Subject:   Re: Google SoC idea
Message-ID:  <42A6091C.40409@samsco.org>
In-Reply-To: <20050607201642.GA58346@walton.maths.tcd.ie>
References:  <42A475AB.6020808@fer.hr> <20050607194005.GG837@darkness.comp.waw.pl> <20050607201642.GA58346@walton.maths.tcd.ie>

David Malone wrote:
> On Tue, Jun 07, 2005 at 09:40:05PM +0200, Pawel Jakub Dawidek wrote:
> 
>>+> Does it make sense to do it this way? Is it worth applying for the SoC?
>>
>>Not sure. Basically this is similar to what softupdates does, I think.
>>From another point of view softupdates are only available for UFS.
>>You probably want to hear scottl's and phk's opinions (CCed).
> 
> 
> I think that Ivan's idea is kind of different from softupdates. His
> idea is pretty clever, in that it could provide synchronous random
> writes at sequential write speeds for any filesystem, providing you
> replay the journal at startup.
> 
> However, our main problem these days is the fact that we do an fsck
> after every unclean reboot, not the speed of writes. I guess that
> you could skip the fsck (or run it very slowly in the background)
> if you knew the filesystem was clean 'cos of journal replay.
> 
> 	David.

/me jumps up and down and waves his hands

The problem with journalling at the block layer is that you pretty much 
become forced to journal metadata and data, since the block layer really 
doesn't know the distinction, and definitely not in a 
filesystem-independent way (yes, UFS does evil things to the buffer 
cache by representing metadata with negative block numbers, but that is 
just UFS).  Full journalling has many drawbacks from the viewpoint of 
speed and complexity, of course.  So you really want to be able to do 
just metadata journalling.
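
To make the cost concrete, here is a toy model of a journal that sits at the
block layer and cannot tell metadata from data. It is only a sketch under
stated assumptions: the class `BlockJournal` and its methods are made up for
illustration and do not correspond to any GEOM or buffer-cache interface.

```python
class BlockJournal:
    """Toy block-layer journal (hypothetical; not a real GEOM class)."""

    def __init__(self):
        self.log = []            # sequential journal area
        self.disk = {}           # home locations, block number -> contents
        self.blocks_written = 0  # total block writes issued to the device

    def write(self, blkno, data):
        # The block layer cannot classify the block, so everything is logged.
        self.log.append((blkno, data))
        self.blocks_written += 1

    def checkpoint(self):
        # Copy every logged block to its home location, then empty the log.
        for blkno, data in self.log:
            self.disk[blkno] = data
            self.blocks_written += 1
        self.log.clear()
```

In this model every block, file contents included, is written once to the log
and once more to its home location, so the device sees twice as many block
writes as the filesystem issued.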

Another hard part of distinguishing between metadata and data is that 
filesystems have a habit of migrating disk blocks from holding metadata 
to holding data, and vice versa (think indirect pointer blocks, not 
inode blocks).  If you are only replaying metadata, you want to make 
sure that you don't smash data blocks with old metadata.
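
A toy replay routine shows the hazard. It is a sketch only: the record layout
and the "revoke" record are assumptions for illustration. Revoke records are
a technique borrowed from other journalling designs, not something proposed
in this thread.

```python
def replay(journal, disk):
    """Replay a metadata-only journal onto disk (dict: block number -> bytes).

    Each journal entry is a hypothetical (kind, blkno, payload) tuple, where
    kind is "meta" (a logged metadata block) or "revoke" (the block was freed
    and must not be overwritten by anything logged earlier).
    """
    # First pass: remember the last position at which each block was revoked.
    revoked_at = {}
    for pos, (kind, blkno, _payload) in enumerate(journal):
        if kind == "revoke":
            revoked_at[blkno] = pos

    # Second pass: apply only metadata records logged after the last revoke.
    for pos, (kind, blkno, payload) in enumerate(journal):
        if kind == "meta" and pos > revoked_at.get(blkno, -1):
            disk[blkno] = payload
    return disk
```

Suppose block 7 was logged as an indirect block, then freed and reused for
file data that was written directly to disk. Replaying
`[("meta", 7, b"indirect-ptrs")]` over `{7: b"user-data"}` smashes the data;
appending `("revoke", 7, None)` to the journal leaves it intact.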

Coming up with a filesystem-independent way to represent all of this for
the block layer is not easy.  Filesystems would have to be modified to
provide proper metadata vs. data hints to the block layer.
And if you're going to do that, then why not just make it a library in 
VFS, like what Darwin does?

The UFS Journalling work is already well underway, and I expect it to 
follow the path of being a VFS library.  Note that I'm saying 'library' 
here, not 'layer'.  There really is no way to make journalling work with 
an arbitrary filesystem 'for free', whether as a VFS layer or a GEOM 
transform, since journalling is 100% dependent on the filesystem working 
with the buffer-cache to do sane operations in a defined order.
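
To illustrate the 'library' shape, here is a minimal sketch in which the
filesystem itself brackets its updates in transactions. All names (`Journal`,
`begin`, `log_block`, `commit`, `replay`) are hypothetical and are not taken
from the UFS journalling work or from Darwin.

```python
class Journal:
    """Toy journalling library that a filesystem calls explicitly."""

    def __init__(self):
        self.records = []  # append-only log, assumed to survive a crash

    def begin(self):
        self.records.append(("begin", None, None))

    def log_block(self, blkno, data):
        # The filesystem decides which blocks are metadata and logs only those.
        self.records.append(("block", blkno, data))

    def commit(self):
        self.records.append(("commit", None, None))

    def replay(self, disk):
        # Apply only transactions that reached their commit record.
        pending = []
        for kind, blkno, data in self.records:
            if kind == "begin":
                pending = []
            elif kind == "block":
                pending.append((blkno, data))
            else:  # "commit"
                for b, d in pending:
                    disk[b] = d
                pending = []
        return disk
```

The point of the sketch is that the calls to `begin()` and `commit()` live in
the filesystem code, which is the only place that knows which blocks make up
one consistent update.  A transaction that never reached `commit()` is simply
dropped at replay.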

An alternate SoC project that would be very useful is block-level 
snapshots.  I'm not sure if I'll be able to retain the filesystem 
snapshot functionality in UFS with journalling enabled, so moving to 
doing the snapshots in the block layer would be a good way to make up 
for this.  Beware that while the GEOM transform would be pretty 
straightforward to write, the real trick comes from being able to make
the consumer of a block device (a filesystem, maybe) flush itself to a 
consistent state while the snapshot is being taken.  The infrastructure 
for this is the part that is very interesting, but also the most work.
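
As a rough sketch of the idea, assuming a simple copy-on-write scheme: the
class `SnapshotDevice` and its `quiesce` callback are invented for
illustration and are not an existing GEOM interface.

```python
class SnapshotDevice:
    """Toy copy-on-write snapshot of a block device (hypothetical)."""

    def __init__(self, blocks, quiesce=None):
        self.blocks = dict(blocks)  # live contents, block number -> bytes
        self.saved = None           # pre-snapshot copies; None = no snapshot
        self.quiesce = quiesce      # callback that flushes the consumer

    def take_snapshot(self):
        # Ask the consumer (a filesystem, maybe) to flush itself first, so
        # the snapshot captures a consistent on-disk state.
        if self.quiesce is not None:
            self.quiesce(self)
        self.saved = {}

    def write(self, blkno, data):
        # Preserve the old contents the first time a block changes after
        # the snapshot was taken.
        if self.saved is not None and blkno not in self.saved:
            self.saved[blkno] = self.blocks.get(blkno)
        self.blocks[blkno] = data

    def read_snapshot(self, blkno):
        # Assumes a snapshot is active.
        if blkno in self.saved:
            return self.saved[blkno]
        return self.blocks.get(blkno)
```

The copy-on-write part is the easy half.  The `quiesce` callback stands in
for the hard half: some mechanism by which the block layer can tell whatever
sits on top of it to reach a consistent state before the snapshot is taken.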

Scott


