From: Bakul Shah <bakul@bitblocks.com>
To: Terry Lambert
Cc: Julian Elischer, FreeBSD current users, fs@FreeBSD.ORG
Subject: Re: Anyone working on fsck?
In-reply-to: Your message of "Mon, 17 Mar 2003 19:41:34 PST."
             <3E76956E.5E0FCC6B@mindspring.com>
Date: Mon, 17 Mar 2003 20:59:45 -0800
Message-ID: <200303180459.XAA15021@goliath.cnchost.com>

> > UFS is the real problem here, not fsck.  Its tradeoffs for
> > improving normal access latencies may have been right in the
> > past but not for modern big disks. ...

> Sorry, but the track-to-track seek latency optimizations you
> are referring to are turned off, given the newfs defaults, and
> have been for a very long time.

I was thinking of the basic idea of cylinder groups: good for a
normal load, not so good for fsck when you have too many CGs.  I
wasn't thinking of what fsck does or does not do.

> Basically, the problem is that for a BG fsck, it's not possible
> to lock access on a cylinder group basis, and then there is a
> hell of a lot of data on that drive.

Note that Julian said 6 hours to fsck a TB in the normal
foreground mode.

> What Julian needs is a Journalling or Log-structured FS.

All that Julian wants is a faster fsck, without mucking with the
FS!  While I agree with you that you do need a full consistency
check, it is worth thinking about how one can avoid that whenever
possible.  For example, if you can work out where the disk head
was at the time of the crash (based on what blocks were being
written), it should be possible to avoid a full check.

> The easiest way to do this is to provide some sort of domain
> control, so that you limit the files you have to check to a
> smaller set of files per CG, so that you don't have to check
> all the files to check a particular CG -- and then preload the
> CG's for the sets of files you *do* have to check.

If you have to visit a CG (during fsck) you have already paid the
cost of the seek and rotational latency.  Journalling wouldn't
help here if you still have a zillion CGs.

> Another way of saying this is... don't put all your space in a
> single FS. 8-).

Or, in effect, treat each CG (or group of CGs) as a self-contained
filesystem (for the purpose of physical allocation) and maintain
explicit import/export lists for files that span them.

> The problem with having to (effectively) read every inode and
> direct block on the disk is really insurmountable, I think.

That is why I was suggesting putting them in one (or a small
number of) contiguous areas.  On a modern ATA100 or better disk
you can read a GB in under a minute.  Once the data is in core you
can divide the checking among multiple processors.

This is sort of like a distributed graph collection: you only need
to worry about graphs that cross a node boundary; most structures
will be contained in one node.  Even for UFS it is probably worth
dividing fsck into two or more processes, one doing IO and one or
more doing computation.
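To make the IO/computation split concrete, something like the
sketch below is what I have in mind.  It is a toy only: the device
name is made up, check_buf() is an empty placeholder, and a real
implementation would seek to the actual inode area (and use shared
memory rather than a pipe), but the shape is the same -- the child
does nothing but sequential reads while the parent burns CPU.

/*
 * Toy sketch: split the scan into a reader process and a checking
 * process.  Hypothetical device path; check_buf() is a stub.
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK	(1024 * 1024)		/* 1 MB per transfer */

/* Stub: a real version would validate the inodes in buf. */
static void
check_buf(const char *buf, ssize_t n)
{
	(void)buf;
	(void)n;
}

int
main(void)
{
	const char *dev = "/dev/ad0s1e";	/* made-up device */
	char *buf;
	ssize_t n;
	int pfd[2];
	pid_t pid;

	if ((buf = malloc(CHUNK)) == NULL || pipe(pfd) == -1)
		err(1, "setup");
	if ((pid = fork()) == -1)
		err(1, "fork");

	if (pid == 0) {				/* child: pure IO */
		int fd;

		close(pfd[0]);
		if ((fd = open(dev, O_RDONLY)) == -1)
			err(1, "%s", dev);
		/* stream the (assumed contiguous) inode area in order */
		while ((n = read(fd, buf, CHUNK)) > 0)
			if (write(pfd[1], buf, n) != n)
				err(1, "write");
		_exit(0);
	}

	close(pfd[1]);				/* parent: computation */
	while ((n = read(pfd[0], buf, CHUNK)) > 0)
		check_buf(buf, n);
	waitpid(pid, NULL, 0);
	return (0);
}

The point is only that the reader never waits on computation and
the checker rarely waits on the disk; on a multiprocessor you can
fork several checkers and deal buffers out to them.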
> > Typically only a few cyl grps may be inconsistent in case of
> > a crash.  Maybe some info about which cyl groups need to be
> > checked can be stored so that brute force checking of all
> > grps can be avoided.

> This would work well... under laboratory test conditions.  In
> the field, if the machine has a more general workload (or even
> a heterogeneous workload, but with a hell of a lot of files,
> like a big mail server), this falls down, as the number of bits
> marking "unupdated cylinder groups" becomes large.

Possible -- it is one of the ideas I can think of.  I'd have to
actually model or simulate it, beyond handwaving, to know one way
or the other.  It may be useful in conjunction with other ideas.

> ...AND... The problem is still that you must scan everything on
> the disk (practically) to identify the inode or indirect block
> that references the block on the cylinder group in question, and
> THAT's the problem.  If you knew a small set of CGs that needed
> checking, vs. "all of them", it would *still* bust the cache,
> which is what takes all the time.
> Assume, on average, each file goes into indirect blocks.

On my machine the average file size is 21KB (averaged over
4,000,000 files).  With an 8KB blocksize the twelve direct block
pointers already cover 96KB, so very few files will have indirect
blocks at all.  I don't know how typical my file size distribution
is, but I suspect the typical case has even smaller files than
mine (I store lots of datasheets, manuals, databases, PDFs, MP3s,
cvs repositories, and compressed tars of old stuff).

But in any case, wouldn't going forward from the inodes make more
sense?  This is like a standard tracing garbage collection
algorithm: blocks that are not reachable are free.  Even for a
1 TB system with 8K blocks you need only a 2^(40-13-3) == 16 Mbyte
bitmap (2^27 blocks, one bit each), or some small multiple of that
if you want more than one bit of state per block.  A sketch of the
mark phase is appended at the end of this message.

> The problem is reverse mapping a bit in a CG bitmap to the file
> that references it... 8^p.

Why would you want to do that?!

> Multithreading fsck would be an incredibly bad idea.

It depends on the actual algorithm.

> Personally, I recommend using a different FS, if you are going
> to create a honking big FS as a single partition. 8-(.

There are other issues with smaller partitions.  I'd rather have a
single logical file system where all the space can be used; if it
is physically implemented as a number of smaller filesystems, that
is okay.  Also note that people can now create big honking files:
video streaming runs about 2GB per hour, and even then you are
compromising a bit on quality.
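To put some flesh on the tracing-collection analogy, here is the
promised sketch of the mark phase, using the sizes from above
(2^40 bytes, 2^13-byte blocks, so 2^27 blocks and a 2^24-byte ==
16 MB bitmap).  mark_inode() and the 3-block file at block 1000
are made-up stand-ins for walking real inode block pointers.

/*
 * Mark phase sketch: one reachability bit per 8K block of a 1 TB
 * filesystem.  The traversal is a stub.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BLKSHIFT	13			/* 8 KB blocks */
#define NBLOCKS		(1ULL << (40 - BLKSHIFT))

static uint8_t *reachable;			/* 16 MB bitmap */

static void
mark(uint64_t b)
{
	reachable[b >> 3] |= 1 << (b & 7);
}

static int
is_marked(uint64_t b)
{
	return (reachable[b >> 3] >> (b & 7)) & 1;
}

/* Stand-in for tracing one inode's direct/indirect pointers. */
static void
mark_inode(uint64_t first, uint64_t nblk)
{
	while (nblk-- > 0)
		mark(first++);
}

int
main(void)
{
	if ((reachable = calloc(NBLOCKS / 8, 1)) == NULL)
		return (1);

	/* Mark: trace every inode; here, one made-up 3-block file. */
	mark_inode(1000, 3);

	/*
	 * Sweep: a block allocated in a CG bitmap but unmarked here
	 * is leaked; marked but free in the CG bitmap is worse.
	 */
	printf("block 1001 reachable: %d\n", is_marked(1001));
	free(reachable);
	return (0);
}

With a second bit of state per block you can also catch blocks
claimed by two files, and the mark phase splits naturally across
processes if you deal out the inodes -- which is why I say whether
multithreading fsck is a bad idea depends on the algorithm.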