From owner-freebsd-fs Sun Aug 8 10:55:50 1999 Delivered-To: freebsd-fs@freebsd.org Received: from florence.pavilion.net (florence.pavilion.net [194.242.128.25]) by hub.freebsd.org (Postfix) with ESMTP id B4604150DA; Sun, 8 Aug 1999 10:55:36 -0700 (PDT) (envelope-from joe@florence.pavilion.net) Received: (from joe@localhost) by florence.pavilion.net (8.9.3/8.8.8) id SAA01243; Sun, 8 Aug 1999 18:51:12 +0100 (BST) (envelope-from joe) Date: Sun, 8 Aug 1999 18:51:12 +0100 From: Josef Karthauser To: hackers@freebsd.org, fs@freebsd.org Subject: Disk label recovery - request for suggestions. Message-ID: <19990808185112.A99557@pavilion.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4i X-NCC-RegID: uk.pavilion Organisation: Pavilion Internet plc, 24 The Old Steine, Brighton, BN1 1EL, England Phone: +44-845-333-5000 Fax: +44-845-333-5001 Mobile: +44-403-596893 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org A few weeks ago I had a problem with a missing partition table and disklabel. Niall Smart forwarded me a small C program for scanning a drive for superblocks and rewriting a disklabel table. I'd like to do some work on integrating this into FreeBSD because it seems too useful to leave out. At the very least it could be a stand-alone tool that works on UFS slices; that'd be easy. What I'm wondering, though, is whether it should be an extension to the disklabel program. If so, what extra work is required to make it work with non UFS file systems - is 'disklabel' used on non UFS fs's? Joe -- Josef Karthauser FreeBSD: How many times have you booted today? Technical Manager Viagra for your server (http://www.uk.freebsd.org) Pavilion Internet plc. [joe@pavilion.net, joe@uk.freebsd.org, joe@tao.org.uk] To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Aug 10 12:43: 0 1999 Delivered-To: freebsd-fs@freebsd.org Received: from crufty.research.bell-labs.com (crufty.research.bell-labs.com [204.178.16.49]) by hub.freebsd.org (Postfix) with SMTP id 3EA0914EE5 for ; Tue, 10 Aug 1999 12:42:51 -0700 (PDT) (envelope-from vernick@bell-labs.com) Received: from bronx.dnrc.bell-labs.com ([135.180.160.8]) by crufty; Tue Aug 10 15:40:49 EDT 1999 Received: from bell-labs.com (shortstop [135.180.181.58]) by bronx.dnrc.bell-labs.com (8.9.3/8.9.3) with ESMTP id PAA10504 for ; Tue, 10 Aug 1999 15:41:19 -0400 (EDT) Message-ID: <37B07E3D.16F2B334@bell-labs.com> Date: Tue, 10 Aug 1999 15:32:13 -0400 From: Michael Vernick X-Mailer: Mozilla 4.5 [en] (WinNT; U) X-Accept-Language: en MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Help with understanding file system performance Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Greetings, It's been a few years since I've hacked with FreeBSD, but I'm back and I need some help deciphering some of the file system performance numbers that I'm currently getting. This has probably been discussed before, but I haven't found any good related material. The machine is a P166 w/ 32MB RAM and two 1GB SCSI disks (one for OS and one for Data) running FreeBSD 3.2-RELEASE. The Kernel configuration uses all defaults. My experiment consists of the following two steps: 1.
Create a directory structure of files (depending on certain parameters like height and width of structure) where the files are randomly (uniform distribution) chosen to be between 10KB and 20KB. The total number of files is around 6400 for a total size of about 100MB. 2. Then a reader program is run that randomly reads a subset (3200) of the files. The reader program can have from 1 to 8 processes (fork() is used to create each process). Each process simply uses 'rand()' to get a random file, opens the file ('open()'), reads the file in its entirety using 1 'read(sizeOfFile)' call, then closes the file. Each experiment is run 8 times (varying the number of processes from 1-8) on each different directory structure. The structures, in a nutshell, can be deep (lots of subdirs with few files per directory) or wide (few subdirs and lots of files per directory). Both a single file system and two file systems on the same physical disk are compared. The performance metric is simply bytes/sec read. My results show that: 1. Performance degrades significantly (15-20%) when going from 1 to 2 processes then slowly increases as more processes are run. The same performance is achieved when running a single reader vs. running 8 readers. This happens for each type of directory structure. Is this because of the overhead of directory operations and context switches? I would have hoped to get more parallelism with more processes (i.e. keep the disk at fuller saturation because of Tagged Queuing) but the results don't show that. 2. Performance degrades about 15% for the 1 process experiment when the files are split across 2 file systems vs. a single file system. This one has me somewhat perplexed. Is it because there is more directory information thrashing from disk to memory? 3. On a per process basis, performance increases when the number of files per directory increases/number of subdirs decreases. Is this because there is a better chance the directory information about the file could be in memory? In general, my conjecture is that the more directory information that can be stored in memory, the better, thus leaving all disk activity for retrieving the actual files. Are there kernel parameters which configure how much memory is allocated to directory information (metadata) vs. actual file data? Our goal, of course, is to maximize performance. So any help in the tuning of our system (i.e. reading lots of ~15KB files) would be appreciated. I've started to look through the kernel source code to figure out what is going on, but it isn't easy. There is lots of indirection via function pointers. I've also just started looking through the 4.4BSD OS Design book. Is there any FreeBSD documentation about the file system code? I really didn't see anything in the handbook. Thanks for any help. It's good to be back. Michael Vernick, Ph.D. Multimedia Applications Research Lucent Bell Labs
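A minimal sketch in C of the reader program described above, for readers who want to reproduce the experiment. The data/file.N naming scheme, the per-process read count, and the omission of timing code are assumptions of this sketch; the actual harness was not posted.

---
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NFILES	6400		/* files created in step 1 */
#define NREADS	3200		/* subset read in step 2 */
#define NPROCS	4		/* vary 1..8 per experiment */

int
main(void)
{
	char path[64], buf[20 * 1024];	/* max file size is 20KB */
	int i, n, fd;

	for (i = 0; i < NPROCS; i++) {
		if (fork() == 0) {
			srand(getpid());
			for (n = 0; n < NREADS / NPROCS; n++) {
				/* assumed naming scheme, not Michael's */
				snprintf(path, sizeof(path),
				    "data/file.%d", rand() % NFILES);
				if ((fd = open(path, O_RDONLY)) < 0)
					continue;
				/* one read() for the whole file */
				(void)read(fd, buf, sizeof(buf));
				close(fd);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)		/* reap all readers */
		;
	return (0);
}
---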
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Aug 10 23:50:18 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131]) by hub.freebsd.org (Postfix) with ESMTP id 0B39714DE9 for ; Tue, 10 Aug 1999 23:50:09 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id IAA32011; Wed, 11 Aug 1999 08:48:39 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Michael Vernick Cc: freebsd-fs@FreeBSD.ORG Subject: Re: Help with understanding file system performance In-reply-to: Your message of "Tue, 10 Aug 1999 15:32:13 EDT." <37B07E3D.16F2B334@bell-labs.com> Date: Wed, 11 Aug 1999 08:48:38 +0200 Message-ID: <32009.934354118@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message <37B07E3D.16F2B334@bell-labs.com>, Michael Vernick writes: >My results show that: > >1. Performance degrades significantly (15-20%) when going from 1 to 2 >processes then slowly increases as more processes are run. The same >performance is achieved when running a single reader vs. running 8 >readers. This happens for each type of directory structure. That is a good sign: It means that you don't have to do unnatural things to your application to get full throughput out of our file system. >2. Performance degrades about 15% for the 1 process experiment when the >files are split across 2 file systems vs. a single file system. This >one has me somewhat perplexed. Is it because there is more directory >information thrashing from disk to memory? That sounds weird... Do you have twice as many directories this way? Or are the two filesystems on the same physical disk? If so, you are seeking much more. >3. On a per process basis, performance increases when the number of >files per directory increases/number of subdirs decreases. Is this >because there is a better chance the directory information about the >file could be in memory? Yes. The minimum directory size is the fragsize of the filesystem, filling the directories better means better performance. >Our goal, of course, is to maximize performance. So any help in the >tuning of our system (i.e. reading lots of ~15KB files) would be >appreciated Try fiddling the newfs parameters. I see 17% speedup using: newfs -b 16384 -f 4096 -c 100 Try to fill your directories so they are just below the fragment size of the filesystem (i.e. <1024 bytes with no newfs options, <4096 bytes with the above options). -- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far! To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 6:38:29 1999 Delivered-To: freebsd-fs@freebsd.org Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31]) by hub.freebsd.org (Postfix) with ESMTP id D252014D2E; Wed, 11 Aug 1999 06:38:23 -0700 (PDT) (envelope-from des@flood.ping.uio.no) Received: (from des@localhost) by flood.ping.uio.no (8.9.3/8.9.3) id PAA12263; Wed, 11 Aug 1999 15:38:06 +0200 (CEST) (envelope-from des) To: Josef Karthauser Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Disk label recovery - request for suggestions.
References: <19990808185112.A99557@pavilion.net> From: Dag-Erling Smorgrav Date: 11 Aug 1999 15:38:05 +0200 In-Reply-To: Josef Karthauser's message of "Sun, 8 Aug 1999 18:51:12 +0100" Message-ID: Lines: 10 X-Mailer: Gnus v5.5/Emacs 19.34 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Josef Karthauser writes: > If so, what extra work is required to make it work with non UFS file > systems - is 'disklabel' used on non UFS fs's? Disklabel doesn't work at the fs level, it works at the slice level - dividing slices into partitions, in which you can create file systems. DES -- Dag-Erling Smorgrav - des@flood.ping.uio.no To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 9:16:31 1999 Delivered-To: freebsd-fs@freebsd.org Received: from florence.pavilion.net (florence.pavilion.net [194.242.128.25]) by hub.freebsd.org (Postfix) with ESMTP id AB48A14EE4; Wed, 11 Aug 1999 09:16:23 -0700 (PDT) (envelope-from joe@florence.pavilion.net) Received: (from joe@localhost) by florence.pavilion.net (8.9.3/8.8.8) id RAA13400; Wed, 11 Aug 1999 17:15:14 +0100 (BST) (envelope-from joe) Date: Wed, 11 Aug 1999 17:15:14 +0100 From: Josef Karthauser To: Dag-Erling Smorgrav Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Disk label recovery - request for suggestions. Message-ID: <19990811171514.X88035@pavilion.net> References: <19990808185112.A99557@pavilion.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4i In-Reply-To: ; from Dag-Erling Smorgrav on Wed, Aug 11, 1999 at 03:38:05PM +0200 X-NCC-RegID: uk.pavilion Organisation: Pavilion Internet plc, 24 The Old Steine, Brighton, BN1 1EL, England Phone: +44-845-333-5000 Fax: +44-845-333-5001 Mobile: +44-403-596893 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, Aug 11, 1999 at 03:38:05PM +0200, Dag-Erling Smorgrav wrote: > Josef Karthauser writes: > > If so, what extra work is required to make it work with non UFS file > > systems - is 'disklabel' used on non UFS fs's? > > Disklabel doesn't work at the fs level, it works at the slice level - > dividing slices into partitions, in which you can create file systems. Ahha - of course. Ok, let me re-phrase the question then. By looking at the contents of the superblocks on a UFS file system it's possible to reconstruct a disklabel for a slice. Is this trick possible with other kinds of file systems too? (Does it even make sense to ask that question?). Should this recovery functionality be part of an already existing tool, like disklabel, or should it be a completely new tool? Opinions? Would it be possible to tag swap partitions with an equivalent of a superblock to make their recognition easier under failure conditions? Joe -- Josef Karthauser FreeBSD: How many times have you booted today? Technical Manager Viagra for your server (http://www.uk.freebsd.org) Pavilion Internet plc. 
[joe@pavilion.net, joe@uk.freebsd.org, joe@tao.org.uk] To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 9:23:43 1999 Delivered-To: freebsd-fs@freebsd.org Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31]) by hub.freebsd.org (Postfix) with ESMTP id 206E515531; Wed, 11 Aug 1999 09:23:33 -0700 (PDT) (envelope-from des@flood.ping.uio.no) Received: (from des@localhost) by flood.ping.uio.no (8.9.3/8.9.3) id SAA13015; Wed, 11 Aug 1999 18:23:25 +0200 (CEST) (envelope-from des) To: Josef Karthauser Cc: Dag-Erling Smorgrav , hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Disk label recovery - request for suggestions. References: <19990808185112.A99557@pavilion.net> <19990811171514.X88035@pavilion.net> From: Dag-Erling Smorgrav Date: 11 Aug 1999 18:23:24 +0200 In-Reply-To: Josef Karthauser's message of "Wed, 11 Aug 1999 17:15:14 +0100" Message-ID: Lines: 22 X-Mailer: Gnus v5.5/Emacs 19.34 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Josef Karthauser writes: > Ahha - of course. Ok, let me re-phrase the question then. By looking > at the contents of the superblocks on a UFS file system it's possible to > reconstruct a disklabel for a slice. Well, it's possible to reconstruct the label information for *that particular UFS file system*, since if you know the location of the superblock (or one of its backup copies), you can determine the offset and size of the FS. It won't tell you anything about *other* partitions though. > Is this trick possible with other > kinds of file systems too? That's totally dependent on the particular file system. For instance, a swap partition contains no metadata (that I know of), so all you can do is deduce its size and position from the sizes and positions of surrounding partitions, and of the slice they're in. DES -- Dag-Erling Smorgrav - des@flood.ping.uio.no To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 9:35:52 1999 Delivered-To: freebsd-fs@freebsd.org Received: from florence.pavilion.net (florence.pavilion.net [194.242.128.25]) by hub.freebsd.org (Postfix) with ESMTP id 0745815581; Wed, 11 Aug 1999 09:35:44 -0700 (PDT) (envelope-from joe@florence.pavilion.net) Received: (from joe@localhost) by florence.pavilion.net (8.9.3/8.8.8) id RAA16474; Wed, 11 Aug 1999 17:35:35 +0100 (BST) (envelope-from joe) Date: Wed, 11 Aug 1999 17:35:35 +0100 From: Josef Karthauser To: Dag-Erling Smorgrav Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Disk label recovery - request for suggestions. Message-ID: <19990811173535.Y88035@pavilion.net> References: <19990808185112.A99557@pavilion.net> <19990811171514.X88035@pavilion.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4i In-Reply-To: ; from Dag-Erling Smorgrav on Wed, Aug 11, 1999 at 06:23:24PM +0200 X-NCC-RegID: uk.pavilion Organisation: Pavilion Internet plc, 24 The Old Steine, Brighton, BN1 1EL, England Phone: +44-845-333-5000 Fax: +44-845-333-5001 Mobile: +44-403-596893 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, Aug 11, 1999 at 06:23:24PM +0200, Dag-Erling Smorgrav wrote: > Josef Karthauser writes: > > Ahha - of course. Ok, let me re-phrase the question then. By looking > > at the contents of the superblocks on a UFS file system it's possible to > > reconstruct a disklabel for a slice.
> > Well, it's possible to reconstruct the label information for *that > particular UFS file system*, since if you know the location of the > superblock (or one of its backup copies), you can determine the offset > and size of the FS. It won't tell you anything about *other* > partitions though. That's ok, because each slice has its _own_ label. If the BIOS partition table loses its mind that's a little more work :). > > Is this trick possible with other kinds of file systems too? > > That's totally dependent on the particular file system. For instance, > a swap partition contains no metadata (that I know of), so all you can > do is deduce its size and position from the sizes and positions of > surrounding partitions, and of the slice they're in. > What are the implications of adding a metadata structure to a swap partition? (It only needs a block :). [Although thinking out loud, it's complicated because there's no 'newfs' process that touches the partition; on the other hand, the size of the partition is known at swap-mounting time, so the metadata could be written at that point.] Joe -- Josef Karthauser FreeBSD: How many times have you booted today? Technical Manager Viagra for your server (http://www.uk.freebsd.org) Pavilion Internet plc. [joe@pavilion.net, joe@uk.freebsd.org, joe@tao.org.uk] To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 9:47:23 1999 Delivered-To: freebsd-fs@freebsd.org Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31]) by hub.freebsd.org (Postfix) with ESMTP id 0352014C59; Wed, 11 Aug 1999 09:47:17 -0700 (PDT) (envelope-from des@flood.ping.uio.no) Received: (from des@localhost) by flood.ping.uio.no (8.9.3/8.9.3) id SAA13180; Wed, 11 Aug 1999 18:46:51 +0200 (CEST) (envelope-from des) To: Josef Karthauser Cc: Dag-Erling Smorgrav , hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Disk label recovery - request for suggestions. References: <19990808185112.A99557@pavilion.net> <19990811171514.X88035@pavilion.net> <19990811173535.Y88035@pavilion.net> From: Dag-Erling Smorgrav Date: 11 Aug 1999 18:46:51 +0200 In-Reply-To: Josef Karthauser's message of "Wed, 11 Aug 1999 17:35:35 +0100" Message-ID: Lines: 19 X-Mailer: Gnus v5.5/Emacs 19.34 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Josef Karthauser writes: > On Wed, Aug 11, 1999 at 06:23:24PM +0200, Dag-Erling Smorgrav wrote: > > Josef Karthauser writes: > > > Ahha - of course. Ok, let me re-phrase the question then. By looking > > > at the contents of the superblocks on a UFS file system it's possible to > > > reconstruct a disklabel for a slice. > > Well, it's possible to reconstruct the label information for *that > > particular UFS file system*, since if you know the location of the > > superblock (or one of its backup copies), you can determine the offset > > and size of the FS. It won't tell you anything about *other* > > partitions though. > That's ok, because each slice has its _own_ label. If the BIOS partition > table loses its mind that's a little more work :). You're confusing partitions and slices. DES -- Dag-Erling Smorgrav - des@flood.ping.uio.no
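The superblock scan being discussed can be sketched roughly as follows for a FreeBSD of this vintage. SBOFF, SBSIZE, FS_MAGIC, and struct fs come from <ufs/ffs/fs.h>; the device name and the sector-at-a-time scan are illustrative assumptions, not Niall Smart's actual program.

---
#include <sys/param.h>
#include <sys/types.h>
#include <ufs/ffs/fs.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	char buf[SBSIZE];
	struct fs *fs = (struct fs *)buf;
	off_t off;
	int fd;

	if ((fd = open("/dev/rda0s1", O_RDONLY)) < 0)	/* example slice */
		return (1);
	for (off = 0; ; off += DEV_BSIZE) {
		if (lseek(fd, off, SEEK_SET) == -1 ||
		    read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			break;
		if (fs->fs_magic != FS_MAGIC)
			continue;
		/* a superblock normally sits SBOFF bytes into its partition */
		printf("superblock at sector %ld: partition at sector %ld, "
		    "%d frags of %d bytes\n",
		    (long)(off / DEV_BSIZE),
		    (long)((off - SBOFF) / DEV_BSIZE),
		    fs->fs_size, fs->fs_fsize);
	}
	close(fd);
	return (0);
}
---

Each hit pins down the offset and size of one UFS partition and, as Warner notes later in the thread, strongly hints at where the next one begins.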
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 10: 1:12 1999 Delivered-To: freebsd-fs@freebsd.org Received: from florence.pavilion.net (florence.pavilion.net [194.242.128.25]) by hub.freebsd.org (Postfix) with ESMTP id BC313155A1; Wed, 11 Aug 1999 10:01:01 -0700 (PDT) (envelope-from joe@florence.pavilion.net) Received: (from joe@localhost) by florence.pavilion.net (8.9.3/8.8.8) id SAA19912; Wed, 11 Aug 1999 18:00:48 +0100 (BST) (envelope-from joe) Date: Wed, 11 Aug 1999 18:00:48 +0100 From: Josef Karthauser To: Dag-Erling Smorgrav Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Disk label recovery - request for suggestions. Message-ID: <19990811180048.Z88035@pavilion.net> References: <19990808185112.A99557@pavilion.net> <19990811171514.X88035@pavilion.net> <19990811173535.Y88035@pavilion.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4i In-Reply-To: ; from Dag-Erling Smorgrav on Wed, Aug 11, 1999 at 06:46:51PM +0200 X-NCC-RegID: uk.pavilion Organisation: Pavilion Internet plc, 24 The Old Steine, Brighton, BN1 1EL, England Phone: +44-845-333-5000 Fax: +44-845-333-5001 Mobile: +44-403-596893 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, Aug 11, 1999 at 06:46:51PM +0200, Dag-Erling Smorgrav wrote: > Josef Karthauser writes: > > On Wed, Aug 11, 1999 at 06:23:24PM +0200, Dag-Erling Smorgrav wrote: > > > Josef Karthauser writes: > > > > Ahha - of course. Ok, let me re-phrase the question then. By looking > > > > at the contents of the superblocks on a UFS file system it's possible to > > > > reconstruct a disklabel for a slice. > > > Well, it's possible to reconstruct the label information for *that > > > particular UFS file system*, since if you know the location of the > > > superblock (or one of its backup copies), you can determine the offset > > > and size of the FS. It won't tell you anything about *other* > > > partitions though. > > That's ok, because each slice has its _own_ label. If the BIOS partition > > table loses its mind that's a little more work :). > > You're confusing partitions and slices. I don't think so - PCs have a partition table. We FreeBSDers call these partitions 'slices', and subdivide these into FreeBSD partitions. Each slice (PC partition) has a disklabel which denotes where the FreeBSD partitions live on the slice. I see what you were saying above now. I agree that the superblock for a UFS file system won't tell anything about other UFS partitions, but a block-by-block search of the whole slice will identify potential superblocks that will. Joe -- Josef Karthauser FreeBSD: How many times have you booted today? Technical Manager Viagra for your server (http://www.uk.freebsd.org) Pavilion Internet plc.
[joe@pavilion.net, joe@uk.freebsd.org, joe@tao.org.uk] To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 10:33:39 1999 Delivered-To: freebsd-fs@freebsd.org Received: from rover.village.org (rover.village.org [204.144.255.49]) by hub.freebsd.org (Postfix) with ESMTP id AEA5115776; Wed, 11 Aug 1999 10:33:19 -0700 (PDT) (envelope-from imp@harmony.village.org) Received: from harmony.village.org (harmony.village.org [10.0.0.6]) by rover.village.org (8.9.3/8.9.3) with ESMTP id LAA20642; Wed, 11 Aug 1999 11:33:05 -0600 (MDT) (envelope-from imp@harmony.village.org) Received: from harmony.village.org (localhost.village.org [127.0.0.1]) by harmony.village.org (8.9.3/8.8.3) with ESMTP id LAA18169; Wed, 11 Aug 1999 11:33:30 -0600 (MDT) Message-Id: <199908111733.LAA18169@harmony.village.org> To: Dag-Erling Smorgrav Subject: Re: Disk label recovery - request for suggestions. Cc: Josef Karthauser , hackers@FreeBSD.ORG, fs@FreeBSD.ORG In-reply-to: Your message of "11 Aug 1999 18:23:24 +0200." References: <19990808185112.A99557@pavilion.net> <19990811171514.X88035@pavilion.net> Date: Wed, 11 Aug 1999 11:33:30 -0600 From: Warner Losh Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message Dag-Erling Smorgrav writes: : superblock (or one of its backup copies), you can determine the offset : and size of the FS. It won't tell you anything about *other* : partitions though. It will give a fairly strong hint, however. If you know what is taken up by this partition, you can remove it from the pool of available space and guess with a relatively high degree of accuracy that the next partition begins where this one ends. : > Is this trick possible with other : > kinds of file systems too? : : That's totally dependent on the particular file system. For instance, : a swap partition contains no metadata (that I know of), so all you can : do is deduce it's size and position from the sizes and positions of : surrounding partitions, and of the slice they're in. Yes. This is true.... That's one of the problems of my disklabel reconstruction program that tries to run fast... It slows way down when it hits the swap area... 
Warner To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 14:48:19 1999 Delivered-To: freebsd-fs@freebsd.org Received: from gatewaya.anheuser-busch.com (gatewaya.anheuser-busch.com [151.145.250.252]) by hub.freebsd.org (Postfix) with SMTP id 61BFE14E31; Wed, 11 Aug 1999 14:48:01 -0700 (PDT) (envelope-from Matthew.Alton@anheuser-busch.com) Received: by gatewaya.anheuser-busch.com; id QAA24660; Wed, 11 Aug 1999 16:49:19 -0500 Received: from stlexggtw002-pozzoli.fw-users.busch.com(151.145.101.130) by gatewaya.anheuser-busch.com via smap (V5.0) id xma024583; Wed, 11 Aug 99 16:48:57 -0500 Received: from stlabcexg006.anheuser-busch.com ([151.145.101.161]) by 151.145.101.130 (Norton AntiVirus for Internet Email Gateways 1.0) ; Wed, 11 Aug 1999 21:46:50 0000 (GMT) Received: by stlabcexg006.anheuser-busch.com with Internet Mail Service (5.5.2448.0) id ; Wed, 11 Aug 1999 16:46:33 -0500 Message-ID: <0740CBD1D149D31193EB0008C7C56836EB8B05@STLABCEXG012> From: "Alton, Matthew" To: "'Hackers@FreeBSD.ORG'" , "'fs@FreeBSD.ORG'" Subject: BSD-XFS Update Date: Wed, 11 Aug 1999 16:46:46 -0500 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2448.0) Content-Type: text/plain Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org SGI has released a portion of the XFS source code under the GPL: http://oss.sgi.com/projects/xfs/download/ the source file is xfs_log.tar.gz. Of greater interest at this stage are the documents in: http://oss.sgi.com/projects/xfs/design_docs/ I am currently researching methods for implementing the 64-bit syscalls stat64(), fstat64(), lseek64() &etc. delineated in the SGI design doc _64 Bit File Access_ by Adam Sweeney. The BSD-XFS port will be made available as a patch to the RELEASE FreeBSD kernels. Matthew Alton Computer Services - UNIX Systems Administration (314)632-6644 matthew.alton@anheuser-busch.com alton@plantnet.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 15:36:38 1999 Delivered-To: freebsd-fs@freebsd.org Received: from alpo.whistle.com (alpo.whistle.com [207.76.204.38]) by hub.freebsd.org (Postfix) with ESMTP id A844C14DEE; Wed, 11 Aug 1999 15:36:33 -0700 (PDT) (envelope-from julian@whistle.com) Received: from current1.whistle.com (current1.whistle.com [207.76.205.22]) by alpo.whistle.com (8.9.1a/8.9.1) with SMTP id PAA71891; Wed, 11 Aug 1999 15:34:57 -0700 (PDT) Date: Wed, 11 Aug 1999 15:35:31 -0700 (PDT) From: Julian Elischer To: "Alton, Matthew" Cc: "'Hackers@FreeBSD.ORG'" , "'fs@FreeBSD.ORG'" Subject: Re: BSD-XFS Update In-Reply-To: <0740CBD1D149D31193EB0008C7C56836EB8B05@STLABCEXG012> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org stat, fstat, lseek are all already 64 bits in freebsd..... On Wed, 11 Aug 1999, Alton, Matthew wrote: > SGI has released a portion of the XFS source code under the GPL: > > http://oss.sgi.com/projects/xfs/download/ > > the source file is xfs_log.tar.gz. > > Of greater interest at this stage are the documents in: > > http://oss.sgi.com/projects/xfs/design_docs/ > > I am currently researching methods for implementing the 64-bit > syscalls stat64(), fstat64(), lseek64() &etc. delineated in the > SGI design doc _64 Bit File Access_ by Adam Sweeney. > > The BSD-XFS port will be made available as a patch to the RELEASE > FreeBSD kernels. 
> > > Matthew Alton > Computer Services - UNIX Systems Administration > (314)632-6644 matthew.alton@anheuser-busch.com > alton@plantnet.com > > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-hackers" in the body of the message > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 16: 4:21 1999 Delivered-To: freebsd-fs@freebsd.org Received: from gatewaya.anheuser-busch.com (gatewaya.anheuser-busch.com [151.145.250.252]) by hub.freebsd.org (Postfix) with SMTP id 65C5615636; Wed, 11 Aug 1999 16:04:06 -0700 (PDT) (envelope-from Matthew.Alton@anheuser-busch.com) Received: by gatewaya.anheuser-busch.com; id SAA01734; Wed, 11 Aug 1999 18:05:51 -0500 Received: from stlexggtw002-pozzoli.fw-users.busch.com(151.145.101.130) by gatewaya.anheuser-busch.com via smap (V5.0) id xma001698; Wed, 11 Aug 99 18:05:44 -0500 Received: from stlabcexg006.anheuser-busch.com ([151.145.101.161]) by 151.145.101.130 (Norton AntiVirus for Internet Email Gateways 1.0) ; Wed, 11 Aug 1999 23:03:37 0000 (GMT) Received: by stlabcexg006.anheuser-busch.com with Internet Mail Service (5.5.2448.0) id ; Wed, 11 Aug 1999 18:03:20 -0500 Message-ID: <0740CBD1D149D31193EB0008C7C56836EB8B06@STLABCEXG012> From: "Alton, Matthew" To: "'Julian Elischer'" Cc: "'Hackers@FreeBSD.ORG'" , "'fs@FreeBSD.ORG'" Subject: RE: BSD-XFS Update Date: Wed, 11 Aug 1999 18:03:33 -0500 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2448.0) Content-Type: text/plain Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Quite so. Thank you. I initially only looked at things like: 19 COMPAT POSIX { long lseek(int fd, long offset, int whence); } from /usr/src/sys/kern/syscalls.master and assumed a 32-bit long int. The easy way to deal with this is to change the calls in the XFS code. The syscall part is mostly done. > -----Original Message----- > From: Julian Elischer [SMTP:julian@whistle.com] > Sent: Wednesday, August 11, 1999 5:36 PM > To: Alton, Matthew > Cc: 'Hackers@FreeBSD.ORG'; 'fs@FreeBSD.ORG' > Subject: Re: BSD-XFS Update > > stat, fstat, lseek are all already 64 bits in freebsd..... > > > On Wed, 11 Aug 1999, Alton, Matthew wrote: > > > SGI has released a portion of the XFS source code under the GPL: > > > > http://oss.sgi.com/projects/xfs/download/ > > > > the source file is xfs_log.tar.gz. > > > > Of greater interest at this stage are the documents in: > > > > http://oss.sgi.com/projects/xfs/design_docs/ > > > > I am currently researching methods for implementing the 64-bit > > syscalls stat64(), fstat64(), lseek64() &etc. delineated in the > > SGI design doc _64 Bit File Access_ by Adam Sweeney. > > > > The BSD-XFS port will be made available as a patch to the RELEASE > > FreeBSD kernels. 
> > > > Matthew Alton > > Computer Services - UNIX Systems Administration > > (314)632-6644 matthew.alton@anheuser-busch.com > > alton@plantnet.com > > > > > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > > with "unsubscribe freebsd-hackers" in the body of the message > > > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 19:38:28 1999 Delivered-To: freebsd-fs@freebsd.org Received: from lestat.nas.nasa.gov (lestat.nas.nasa.gov [129.99.33.127]) by hub.freebsd.org (Postfix) with ESMTP id 0686414CFE; Wed, 11 Aug 1999 19:38:20 -0700 (PDT) (envelope-from thorpej@lestat.nas.nasa.gov) Received: from lestat (localhost [127.0.0.1]) by lestat.nas.nasa.gov (8.8.8/8.6.12) with ESMTP id TAA00416; Wed, 11 Aug 1999 19:37:23 -0700 (PDT) Message-Id: <199908120237.TAA00416@lestat.nas.nasa.gov> To: "Alton, Matthew" Cc: "'Hackers@FreeBSD.ORG'" , "'fs@FreeBSD.ORG'" Subject: Re: BSD-XFS Update Reply-To: Jason Thorpe From: Jason Thorpe Date: Wed, 11 Aug 1999 19:37:22 -0700 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, 11 Aug 1999 16:46:46 -0500 "Alton, Matthew" wrote: > I am currently researching methods for implementing the 64-bit > syscalls stat64(), fstat64(), lseek64() &etc. delineated in the > SGI design doc _64 Bit File Access_ by Adam Sweeney. ...which, of course, is completely unnecessary, as systems derived from 4.4BSD have always had 64-bit file offsets. -- Jason R. Thorpe To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 0:14: 7 1999 Delivered-To: freebsd-fs@freebsd.org Received: from raditex.se (gandalf.raditex.se [192.5.36.18]) by hub.freebsd.org (Postfix) with ESMTP id A9FFF14D43 for ; Thu, 12 Aug 1999 00:14:01 -0700 (PDT) (envelope-from ps@raditex.se) Received: (from ps@localhost) by raditex.se (8.9.3/8.9.3) id JAA14266 for freebsd-fs@freebsd.org; Thu, 12 Aug 1999 09:11:56 +0200 (CEST) (envelope-from ps) Date: Tue, 10 Aug 1999 16:32:45 +0200 From: Patrik Sundberg To: freebsd-fs@freebsd.org Subject: mfs and imagefile (/usr/src/sbin/newfs/mkfs.c) Message-ID: <19990810163402.B10448@radiac.sickla.raditex.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.6i Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Hi, I hope this goes to the right forum; otherwise I apologize. I have been trying to set up a FreeBSD box to avoid disc-writes. This led me to use mfs-filesystems for things like /var. In the process of doing this I wanted to initialize a mfs-fs from an image-file. I thought the -F option was the way to go, but after testing a bit and reading the source it seems like when using -F one always gets an empty filesystem - it doesn't care about the contents of the file given. We asked Andrzej Bialecki (picobsd) about it and he too thought the -F flag was the way to accomplish this, but later came to the same conclusion as we did.
The relevant source code (mkfs.c):

	if(filename) {
		unsigned char buf[BUFSIZ];
		unsigned long l,l1;
		fd = open(filename,O_RDWR|O_TRUNC|O_CREAT,0644);
		if(fd < 0)
			err(12, "%s", filename);
		for(l=0;l< fssize * sectorsize;l += l1) {
			l1 = fssize * sectorsize;
			if (BUFSIZ < l1)
				l1 = BUFSIZ;
			if (l1 != write(fd,buf,l1))
				err(12, "%s", filename);
		}
		membase = mmap(0, fssize * sectorsize,
		    PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
		if(membase == MAP_FAILED)
			err(12, "mmap");
		close(fd);
	} else {

It truncates the file to size 0 and then writes an uninitialized buffer to it until it is of the correct size(?). Is there any reason for not having the possibility to use the contents of the file to initialize the fs? Maybe we could have a flag which specifies the behaviour of -F? -- Patrik Sundberg - email: ps@raditex.se - PGP: finger ps@raditex.se ---> telefon: 08-636 59 39 - mobiltelefon: 070-760 22 40 <--- To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 3:47:27 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (wandering-wizard.cybercity.dk [212.242.41.238]) by hub.freebsd.org (Postfix) with ESMTP id A495914FDF for ; Thu, 12 Aug 1999 03:47:24 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id LAA00331; Thu, 12 Aug 1999 11:05:34 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Patrik Sundberg Cc: freebsd-fs@FreeBSD.ORG Subject: Re: mfs and imagefile (/usr/src/sbin/newfs/mkfs.c) In-reply-to: Your message of "Tue, 10 Aug 1999 16:32:45 +0200." <19990810163402.B10448@radiac.sickla.raditex.se> Date: Thu, 12 Aug 1999 11:05:33 +0200 Message-ID: <329.934448733@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org >We asked Andrzej Bialecki (picobsd) about it and he too thought the -F flag was >the way to accomplish this, but later came to the same conclusion as we did. The -F flag was added because we didn't have a working vn(4) at the time and Jordan and I were sick and tired of make release falling over because of sick floppy disks. It works the opposite of what you want: it preserves the contents of the MFS after you unmount it. Feel free to add code for what you suggest, but make it read the input from a file descriptor so that I can gunzip < mfs.image.gz | mount_mfs ... -- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
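The change Poul-Henning asks for might look roughly like this. preload_image() is a hypothetical helper, not committed mkfs.c code; called on the mmap'ed region with fd 0, it makes the gunzip pipeline above work, and a short image simply leaves the tail of the MFS zero-filled.

---
#include <sys/types.h>
#include <err.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy a filesystem image from a descriptor
 * (e.g. stdin) into the freshly mmap'ed MFS region.  A short image
 * leaves the rest of the region zero-filled.
 */
static void
preload_image(int fd, char *membase, size_t total)
{
	size_t done = 0;
	ssize_t n;

	while (done < total) {
		n = read(fd, membase + done, total - done);
		if (n < 0)
			err(12, "reading image");
		if (n == 0)
			break;		/* end of image */
		done += (size_t)n;
	}
}
---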
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 7:31:18 1999 Delivered-To: freebsd-fs@freebsd.org Received: from frmug.org (frmug-gw.frmug.org [193.56.58.252]) by hub.freebsd.org (Postfix) with ESMTP id 2BD601577D for ; Thu, 12 Aug 1999 07:31:09 -0700 (PDT) (envelope-from roberto@keltia.freenix.fr) Received: (from uucp@localhost) by frmug.org (8.9.1/frmug-2.3/nospam) with UUCP id QAA20199 for freebsd-fs@FreeBSD.ORG; Thu, 12 Aug 1999 16:31:16 +0200 (CEST) (envelope-from roberto@keltia.freenix.fr) Received: by keltia.freenix.fr (Postfix, from userid 101) id CBD3A870B; Thu, 12 Aug 1999 13:36:48 +0200 (CEST) Date: Thu, 12 Aug 1999 13:36:48 +0200 From: Ollivier Robert To: freebsd-fs@FreeBSD.ORG Subject: Re: Help with understanding file system performance Message-ID: <19990812133648.A64754@keltia.freenix.fr> Mail-Followup-To: freebsd-fs@FreeBSD.ORG References: <37B07E3D.16F2B334@bell-labs.com> <32009.934354118@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii User-Agent: Mutt/0.95.5i In-Reply-To: <32009.934354118@critter.freebsd.dk>; from Poul-Henning Kamp on Wed, Aug 11, 1999 at 08:48:38AM +0200 X-Operating-System: FreeBSD 4.0-CURRENT/ELF ctm#5543 AMD-K6 MMX @ 200 MHz Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org According to Poul-Henning Kamp:
> Yes. The minimum directory size is the fragsize of the filesystem,

I'm afraid it is not the case...

214 [13:35] root@tara:/src# df .
Filesystem  1K-blocks     Used    Avail Capacity  Mounted on
/dev/da0s2d   1375362   602742   662592    48%    /src
215 [13:35] root@tara:/src# dumpfs /dev/rda0s2d|more
magic   11954   time    Thu Aug 12 13:34:13 1999
id      [ 360d5f59 20984fb4 ]
cylgrp  dynamic inodes  4.4BSD
nbfree  68880   ndir    22396   nifree  217288  nffree  29178
ncg     44      ncyl    694     size    1419379 blocks  1375362
bsize   8192    shift   13      mask    0xffffe000
fsize   1024    shift   10      mask    0xfffffc00
        ^^^^
216 [13:35] root@tara:/src# ll
total 5
drwxr-xr-x   2 roberto  staff   512 Sep 26  1998 CVS/
                                ^^^
drwxr-xr-x   4 root     wheel   512 Jan 24  1999 obj/
drwxr-xr-x  48 roberto  staff  1024 Mar 11 00:51 ports/
drwxr-xr-x  21 roberto  staff   512 Jul 10 18:06 src/
drwxr-xr-x   2 root     staff  1024 Jul 26 22:53 world/

-- Ollivier ROBERT -=- FreeBSD: The Power to Serve! -=- roberto@keltia.freenix.fr FreeBSD keltia.freenix.fr 4.0-CURRENT #73: Sat Jul 31 15:36:05 CEST 1999 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 7:37:32 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 19A171577D for ; Thu, 12 Aug 1999 07:37:28 -0700 (PDT) (envelope-from zzhang@cs.binghamton.edu) Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id KAA05322; Thu, 12 Aug 1999 10:37:33 -0400 (EDT) Date: Thu, 12 Aug 1999 10:24:30 -0400 (EDT) From: Zhihui Zhang To: Ollivier Robert Cc: freebsd-fs@FreeBSD.ORG Subject: Re: Help with understanding file system performance In-Reply-To: <19990812133648.A64754@keltia.freenix.fr> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Thu, 12 Aug 1999, Ollivier Robert wrote:
> According to Poul-Henning Kamp:
> > Yes. The minimum directory size is the fragsize of the filesystem,
>
> I'm afraid it is not the case...
>
> 214 [13:35] root@tara:/src# df .
> Filesystem 1K-blocks Used Avail Capacity Mounted on
> /dev/da0s2d 1375362 602742 662592 48% /src
> 215 [13:35] root@tara:/src# dumpfs /dev/rda0s2d|more
> magic 11954 time Thu Aug 12 13:34:13 1999
> id [ 360d5f59 20984fb4 ]
> cylgrp dynamic inodes 4.4BSD
> nbfree 68880 ndir 22396 nifree 217288 nffree 29178
> ncg 44 ncyl 694 size 1419379 blocks 1375362
> bsize 8192 shift 13 mask 0xffffe000
> fsize 1024 shift 10 mask 0xfffffc00
>       ^^^^
> 216 [13:35] root@tara:/src# ll
> total 5
> drwxr-xr-x 2 roberto staff 512 Sep 26 1998 CVS/
>                            ^^^

The fsize is the number of bytes in a fragment. Even if your file is 1 byte, that file needs 1024 bytes to store. However, the byte count is still one byte. In your example, the byte count is 512 bytes. -Zhihui To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 7:48:48 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (wandering-wizard.cybercity.dk [212.242.41.238]) by hub.freebsd.org (Postfix) with ESMTP id A779414EDF for ; Thu, 12 Aug 1999 07:48:43 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id QAA01406; Thu, 12 Aug 1999 16:45:09 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Zhihui Zhang Cc: Ollivier Robert , freebsd-fs@FreeBSD.ORG Subject: Re: Help with understanding file system performance In-reply-to: Your message of "Thu, 12 Aug 1999 10:24:30 EDT." Date: Thu, 12 Aug 1999 16:45:09 +0200 Message-ID: <1404.934469109@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message , Zhihui Zhang writes:
>
>> According to Poul-Henning Kamp:
>> > Yes. The minimum directory size is the fragsize of the filesystem,
>>
>> I'm afraid it is not the case...
>>
>> 216 [13:35] root@tara:/src# ll
>> total 5
>> drwxr-xr-x 2 roberto staff 512 Sep 26 1998 CVS/
>>                            ^^^
>The fsize is the number of bytes in a fragment. Even if your file is 1
>byte, that file needs 1024 bytes to store. However, the byte count is
>still one byte. In your example, the byte count is 512 bytes.

Yeah, well, the real issue is whether the UFS implementation works on the 512-byte size or on the fragsize. -- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
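Both numbers being argued about here can be seen from stat(2): st_size is the directory's byte count (the 512 in the listing), while st_blocks shows what the filesystem actually allocated. A small illustration, with the path taken from the command line:

---
#include <sys/param.h>
#include <sys/stat.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2 || stat(argv[1], &st) < 0)
		return (1);
	/* st_blocks counts DEV_BSIZE (512-byte) sectors */
	printf("%s: size %ld bytes, allocated %ld bytes\n", argv[1],
	    (long)st.st_size, (long)st.st_blocks * DEV_BSIZE);
	return (0);
}
---

On the CVS/ directory above this would print a size of 512 but an allocation of 1024, matching Zhihui's point.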
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 16:14:57 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134]) by hub.freebsd.org (Postfix) with ESMTP id 4373114C1D for ; Thu, 12 Aug 1999 16:14:52 -0700 (PDT) (envelope-from tlambert@usr04.primenet.com) Received: (from daemon@localhost) by smtp04.primenet.com (8.9.3/8.9.3) id QAA03382; Thu, 12 Aug 1999 16:14:24 -0700 (MST) Received: from usr04.primenet.com(206.165.6.204) via SMTP by smtp04.primenet.com, id smtpdAAAvhaqvg; Thu Aug 12 16:14:16 1999 Received: (from tlambert@localhost) by usr04.primenet.com (8.8.5/8.8.5) id QAA23506; Thu, 12 Aug 1999 16:14:05 -0700 (MST) From: Terry Lambert Message-Id: <199908122314.QAA23506@usr04.primenet.com> Subject: Re: Help with understanding file system performance To: phk@critter.freebsd.dk (Poul-Henning Kamp) Date: Thu, 12 Aug 1999 23:14:05 +0000 (GMT) Cc: zzhang@cs.binghamton.edu, roberto@keltia.freenix.fr, freebsd-fs@FreeBSD.ORG In-Reply-To: <1404.934469109@critter.freebsd.dk> from "Poul-Henning Kamp" at Aug 12, 99 04:45:09 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Poul-Henning Kamp writes:
> Zhihui Zhang writes:
> >
> >> According to Poul-Henning Kamp:
> >> > Yes. The minimum directory size is the fragsize of the filesystem,
> >>
> >> I'm afraid it is not the case...
> >>
> >> 216 [13:35] root@tara:/src# ll
> >> total 5
> >> drwxr-xr-x 2 roberto staff 512 Sep 26 1998 CVS/
> >>                            ^^^
> >The fsize is the number of bytes in a fragment. Even if your file is 1
> >byte, that file needs 1024 bytes to store. However, the byte count is
> >still one byte. In your example, the byte count is 512 bytes.
>
> Yeah, well, the real issue is whether the UFS implementation works on
> the 512-byte size or on the fragsize.

Poul's right. More particularly, there are two concepts here:

1)	File system block size
2)	Directory entry block size

The directory entry block size is a physical disk block. This is intentional for the purposes of atomicity of directory entry block updates. In point of fact, the code is incapable of dealing with anything other than BLKATOFF()-type semantics. Directories are files. This is an implementation detail, and the wording of POSIX specifically distances itself from the concept that directories and files are the same primitive object. This is probably in an attempt to allow VMS, NT, and NetWare filesystems to claim POSIX compliance. The filesystem block allocation table in directories is unique, in that it is generally used as a convenience for locating physical blocks, rather than using the standard filesystem block access mechanisms, when reading or writing directories. There are a number of performance penalties for this, especially on large directories, where it is not possible to trigger sequential readahead through use of the getdents() system call sequentially accessing sequential 512b/physical_block_size extents. There also appears to be a misunderstanding about frags here:

> >> drwxr-xr-x 2 roberto staff 512 Sep 26 1998 CVS/
> >>                            ^^^
> >The fsize is the number of bytes in a fragment. Even if your file is 1
> >byte, that file needs 1024 bytes to store. However, the byte count is
> >still one byte. In your example, the byte count is 512 bytes.
The frag size is, by default, 1/8 of the filesystem block size. For a filesystem block size of 4096, the frag size is 512b, which is the physical block size on most media (e.g. most everything that you might have an FFS on, except Japanese magneto-optical and some Japanese Winchester disk drives). The frag size can be tuned down below this (i.e. 1/4, 1/2, 1). The only case where 1024 bytes of physical disk would be used is at a filesystem block size of 8192 (or greater), which, divided by 8, gives 1024b (or greater). In this case, the directory entry structure size is... still the physical device block size, or 512b. As an exercise for the reader, try implementing a directory entry block size in excess of 512b (e.g. 1024b, in an attempt to support both 8.3 names and 256 character Unicode names for files). The problem you will encounter is that the physical disk only guarantees atomicity at the block I/O level. Soft Updates allow this to work for file contents, but inodes are still 128 bytes (sub 1 physical device block) and directory entry blocks are still 512b (equal to or sub the physical device block size). There aren't really structures to allow for an encapsulated update of these objects to occur, to allow them to exceed the physical device block size, yet remain atomic. What happens at the inode data contents level is that new blocks are allocated, given the new content for the region, verified to have been written to disk, and then the direct block list in the inode, or the direct block list of an indirect block pointed to by the inode or by another indirect block, is updated. This means that if a crash occurs before the block list is modified, the old contents remain, in their entirety, and if a crash occurs after the block list is modified, then because the data was verified on disk before the update occurred, the new contents are there, in their entirety. This is called an encapsulated two stage commit, in database terms. For inodes, indirect blocks, and directory entry blocks, there is no two stage commit, because there is no indirection of their data contents. Hope this sets things straight in your mind (not you, Poul, I know you already understand it 8-)). Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.
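A userland analogue of the ordering Terry describes, with illustrative file names: the new data block is written and fsync()ed before the single-sector "block list" pointing at it is overwritten in place, so a crash leaves either the old contents or the new, never a mix.

---
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char *newdata = "new contents, on disk before the commit\n";
	const char ptr[] = "block.new\n";	/* fits in one sector */
	int fd;

	/* stage 1: write the new data block and wait for it */
	fd = open("block.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	write(fd, newdata, strlen(newdata));
	fsync(fd);			/* data verified on disk */
	close(fd);

	/* stage 2: overwrite the one-sector "block list" in place;
	 * a sub-sector write completes atomically on the disk, like
	 * rewriting an inode's direct block list */
	fd = open("pointer", O_WRONLY | O_CREAT, 0644);
	lseek(fd, 0, SEEK_SET);
	write(fd, ptr, sizeof(ptr) - 1);
	fsync(fd);
	close(fd);
	return (0);
}
---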
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 18:16:49 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 2E34F14C9E for ; Thu, 12 Aug 1999 18:16:41 -0700 (PDT) (envelope-from zzhang@cs.binghamton.edu) Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id VAA20795; Thu, 12 Aug 1999 21:16:16 -0400 (EDT) Date: Thu, 12 Aug 1999 21:02:32 -0400 (EDT) From: Zhihui Zhang To: Terry Lambert Cc: Poul-Henning Kamp , roberto@keltia.freenix.fr, freebsd-fs@FreeBSD.ORG Subject: Re: Help with understanding file system performance In-Reply-To: <199908122314.QAA23506@usr04.primenet.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Thu, 12 Aug 1999, Terry Lambert wrote: > The filesystem block allocation table in directories is unique, in > that it is generally used as a convenience for locating physical > blocks, rather than using the standard filesystem block access > mechanisms, when reading or writing directories. Directory files have the same on-disk structure as regular files. However, they can never have holes and they can only be extended at the end of the file in device block chunks. No directory entry can cross a device block boundary, which guarantees atomic updates. However, I do not know why you say the block map (direct and indirect blocks) of a directory is only used as a convenience. I mean there is a need to call VOP_BMAP() on a directory file. The routine ffs_blkatoff() calls bread(), which in turn calls VOP_BMAP(). The in-core inode does have several fields to facilitate the insertion of new directory entries. But we still need the block map (block allocation table). Directory files are also special in that we can not write into them with the write() system call as with normal files. They use a special routine to grow, i.e., ufs_direnter(). By the way, we can use the read() system call to read directory files as we do with normal files. > There are a number of performance penalties for this, especially > on large directories, where it is not possible to trigger sequential > readahead through use of the getdents() system call sequentially > accessing sequential 512b/physical_block_size extents. I do not understand this. The read-ahead mechanism should work on any files. I thought the reorganization of directory entries within a directory block when you delete an entry is an inefficiency. Does this issue have anything to do with the VMIO directory issue discussed earlier this year? > The frag size can be tuned down below this (i.e. 1/4, 1/2, 1). > > The only case where 1024 bytes of physical disk would be used is at > a filesystem block size of 8192 (or greater), which, divided by 8, > gives 1024b (or greater). I did not realize this before. The maximum ratio is 8. So if the filesystem block is 8192, the allocation unit (fragment size) cannot be 512 because 8192/512 > 8. > This is called an encapsulated two stage commit, in database terms. > > For inodes, indirect blocks, and directory entry blocks, there is > no two stage commit, because there is no indirection of their data > contents. I guess you mean that their data are not managed by any higher-level metadata which must be updated together. Thanks for your help. -Zhihui
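Since UFS of this era allows read() on directories, as Zhihui notes, the packed entries can be walked directly. struct direct and DIRBLKSIZ come from <ufs/ufs/dir.h>; note that the d_reclen chain never crosses a DIRBLKSIZ boundary, which is what makes single-block updates atomic. A sketch:

---
#include <sys/param.h>
#include <sys/types.h>
#include <ufs/ufs/dir.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	char blk[DIRBLKSIZ], *p;
	struct direct *dp;
	int fd;

	if ((fd = open(argc > 1 ? argv[1] : ".", O_RDONLY)) < 0)
		return (1);
	/* one DIRBLKSIZ (device block) at a time, via plain read() */
	while (read(fd, blk, sizeof(blk)) == (ssize_t)sizeof(blk)) {
		p = blk;
		while (p < blk + DIRBLKSIZ) {
			dp = (struct direct *)p;
			if (dp->d_reclen == 0)	/* corrupt block */
				break;
			if (dp->d_ino != 0)	/* skip deleted slots */
				printf("%.*s ino %lu\n",
				    (int)dp->d_namlen, dp->d_name,
				    (unsigned long)dp->d_ino);
			p += dp->d_reclen;
		}
	}
	close(fd);
	return (0);
}
---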
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 6:13:20 1999 Delivered-To: freebsd-fs@freebsd.org Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.26.10.9]) by hub.freebsd.org (Postfix) with ESMTP id 975C414C25 for ; Fri, 13 Aug 1999 06:13:12 -0700 (PDT) (envelope-from bde@godzilla.zeta.org.au) Received: (from bde@localhost) by godzilla.zeta.org.au (8.8.7/8.8.7) id XAA14647; Fri, 13 Aug 1999 23:13:21 +1000 Date: Fri, 13 Aug 1999 23:13:21 +1000 From: Bruce Evans Message-Id: <199908131313.XAA14647@godzilla.zeta.org.au> To: phk@critter.freebsd.dk, vernick@bell-labs.com Subject: Re: Help with understanding file system performance Cc: freebsd-fs@FreeBSD.ORG Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

>>3. On a per process basis, performance increases when the number of
>>files per directory increases/number of subdirs decreases. Is this
>>because there is a better chance the directory information about the
>>file could be in memory?
>
>Yes. The minimum directory size is the fragsize of the filesystem,
>filling the directories better means better performance.
>
>>Our goal, of course, is to maximize performance. So any help in the
>>tuning of our system (i.e. reading lots of ~15KB files) would be
>>appreciated

Try increasing nbuf. I think effective caching of directories still requires 1 buffer per directory.

>Try fiddling the newfs parameters. I see 17% speedup using:
>
> newfs -b 16384 -f 4096 -c 100

I see a 48% speedup using linux mkfs.ext2 -b 4096 $device $size_of_device_in_4k_units :->. This is despite (or because of) ext2fs's block allocator being broken (it essentially ignores cylinder groups). The following times are for `tar zxpf linux-2.2.9.tar.gz', unmount (to sync), and `tar cf /dev/null linux lost+found' on a new filesystem on a Quantum KA disk on an overclocked Celeron-366 system:

ffs-4096-512:
       41.82 real         3.24 user         3.34 sys
        3.05 real         0.00 user         0.07 sys
       16.53 real         0.08 user         1.34 sys
ffs-4096-1024:
       35.89 real         3.35 user         3.70 sys
        2.11 real         0.00 user         0.07 sys
       15.53 real         0.13 user         1.39 sys
ffs-4096-2048:
       29.32 real         3.24 user         4.36 sys
        1.17 real         0.00 user         0.07 sys
       12.20 real         0.14 user         1.42 sys
ffs-4096-4096:
       28.85 real         3.34 user         4.51 sys
        1.12 real         0.00 user         0.07 sys
       11.24 real         0.10 user         1.59 sys
ffs-8192-1024:
       33.39 real         3.26 user         5.44 sys
        2.94 real         0.00 user         0.07 sys
       13.40 real         0.12 user         1.18 sys
ffs-8192-2048:
       28.08 real         3.29 user         3.01 sys
        2.32 real         0.00 user         0.07 sys
       11.21 real         0.06 user         1.26 sys
ffs-8192-4096:
       25.05 real         3.27 user         2.99 sys
        1.87 real         0.00 user         0.07 sys
        9.17 real         0.09 user         1.21 sys
ffs-8192-8192:
       23.27 real         3.27 user         2.82 sys
        1.53 real         0.00 user         0.07 sys
        8.94 real         0.10 user         1.23 sys
ffs-16384-2048:
       28.22 real         3.43 user         4.78 sys
        2.52 real         0.00 user         0.07 sys
       12.01 real         0.10 user         1.55 sys
ffs-16384-4096:
       24.32 real         3.41 user         3.51 sys
        1.97 real         0.00 user         0.07 sys
       10.56 real         0.11 user         1.37 sys
ffs-16384-8192:
       23.63 real         3.33 user         3.35 sys
        2.35 real         0.00 user         0.07 sys
        8.66 real         0.09 user         1.15 sys
ffs-16384-16384:
       85.41 real         3.33 user         3.28 sys
        2.00 real         0.00 user         0.08 sys
        9.51 real         0.10 user         1.17 sys
ext2fs-1024-1024:
       36.33 real         3.33 user         3.67 sys
        1.42 real         0.00 user         0.07 sys
       14.49 real         0.10 user         2.28 sys
ext2fs-4096-4096:
       20.81 real         3.38 user         3.54 sys
        1.01 real         0.00 user         0.07 sys
        6.96 real         0.12 user         1.57 sys

Note the anomalously slow times for ffs-16384-16384.
I analyzed why ffs was slow and ext2fs was fast for the `tar cf' part a year or two ago. It was because ffs handles fragments poorly and needs many more small (but not small enough to be in the drive's cache) backwards seeks. Not using fragments reduced ext2fs's advantage significantly but not completely. ffs-4K-4K is only slightly faster than ffs-8K-1K now, presumably because drive caches are larger and command overheads are relatively higher (the KA acts like a slow SCSI drive in wanting a block size of at least 8K to keep up with the disk). This output was produced by the following program:

---
#!/bin/sh
for b in 4096 8192 16384
do
	for f in $(($b / 8)) $(($b / 4)) $(($b / 2)) $b
	do
		echo ffs-$b-$f: >>/tmp/ztimes
		newfs -b $b -f $f /dev/rwd2s2a
		mount /dev/wd2s2a /d
		cd /d
		sync
		time tar zxpf $loc/z/dist/*2.2.9.tar.gz 2>>/tmp/ztimes
		cd /tmp
		time umount /d 2>>/tmp/ztimes
		mount /dev/wd2s2a /d
		cd /d
		sync
		time tar cf /dev/null * 2>>/tmp/ztimes
		cd /tmp
		umount /d
	done
done
for b in 1024 4096
do
	for f in $b
	do
		echo ext2fs-$b-$f: >>/tmp/ztimes
		# linux mkfs.ext2 -b $b /dev/rwd2s2a $((4819437 / ($b / 512)))
		# fsck.ext2 /dev/wd2s2a
		mount -t ext2fs /dev/wd2s2a /d
		cd /d
		sync
		time tar zxpf $loc/z/dist/*2.2.9.tar.gz 2>>/tmp/ztimes
		cd /tmp
		time umount /d 2>>/tmp/ztimes
		mount -t ext2fs /dev/wd2s2a /d
		cd /d
		sync
		time tar cf /dev/null * 2>>/tmp/ztimes
		cd /tmp
		umount /d
	done
done
---

Bruce To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 6:25:17 1999 Delivered-To: freebsd-fs@freebsd.org Received: from gw-nl3.philips.com (gw-nl3.philips.com [192.68.44.35]) by hub.freebsd.org (Postfix) with ESMTP id A5BC714C25 for ; Fri, 13 Aug 1999 06:25:07 -0700 (PDT) (envelope-from Jos.Backus@nl.origin-it.com) Received: from smtprelay-nl1.philips.com (localhost.philips.com [127.0.0.1]) by gw-nl3.philips.com with ESMTP id PAA17537 for ; Fri, 13 Aug 1999 15:25:08 +0200 (MEST) (envelope-from Jos.Backus@nl.origin-it.com) Received: from smtprelay-eur1.philips.com(130.139.36.3) by gw-nl3.philips.com via mwrap (4.0a) id xma017532; Fri, 13 Aug 99 15:25:08 +0200 Received: from hal.mpn.cp.philips.com (hal.mpn.cp.philips.com [130.139.64.195]) by smtprelay-nl1.philips.com (8.9.3/8.8.5-1.2.2m-19990317) with SMTP id PAA25640 for ; Fri, 13 Aug 1999 15:25:07 +0200 (MET DST) Received: (qmail 13055 invoked by uid 666); 13 Aug 1999 13:25:29 -0000 Date: Fri, 13 Aug 1999 15:25:29 +0200 From: Jos Backus To: Bruce Evans Cc: phk@critter.freebsd.dk, vernick@bell-labs.com, freebsd-fs@FreeBSD.ORG Subject: Re: Help with understanding file system performance Message-ID: <19990813152529.G12312@hal.mpn.cp.philips.com> Reply-To: Jos Backus References: <199908131313.XAA14647@godzilla.zeta.org.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.6i In-Reply-To: <199908131313.XAA14647@godzilla.zeta.org.au>; from Bruce Evans on Fri, Aug 13, 1999 at 11:13:21PM +1000 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Fri, Aug 13, 1999 at 11:13:21PM +1000, Bruce Evans wrote:

[Poul-Henning wrote:]
> >Try fiddling the newfs parameters. I see 17% speedup using:
> >
> > newfs -b 16384 -f 4096 -c 100

Too bad tunefs doesn't have those options :-)

> ffs-4K-4K is only slightly faster than ffs-8K-1K now, presumably because
> drive caches are larger and command overheads are relatively higher (the KA
> acts like a slow SCSI drive in wanting a block size of at least 8K to keep
> up with the disk).
As an aside, AIX uses 4K blocks and doesn't support fragments. -- Jos Backus _/ _/_/_/ "Reliability means never _/ _/ _/ having to say you're sorry." _/ _/_/_/ -- D. J. Bernstein _/ _/ _/ _/ Jos.Backus@nl.origin-it.com _/_/ _/_/_/ use Std::Disclaimer; To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 6:40:11 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131]) by hub.freebsd.org (Postfix) with ESMTP id 0AE6A14E63 for ; Fri, 13 Aug 1999 06:40:03 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id PAA03326; Fri, 13 Aug 1999 15:39:25 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Jos Backus Cc: Bruce Evans , vernick@bell-labs.com, freebsd-fs@FreeBSD.ORG Subject: Re: Help with understand file system performance In-reply-to: Your message of "Fri, 13 Aug 1999 15:25:29 +0200." <19990813152529.G12312@hal.mpn.cp.philips.com> Date: Fri, 13 Aug 1999 15:39:25 +0200 Message-ID: <3324.934551565@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

In message <19990813152529.G12312@hal.mpn.cp.philips.com>, Jos Backus writes:
>On Fri, Aug 13, 1999 at 11:13:21PM +1000, Bruce Evans wrote:
>[Poul-Henning wrote:]
>> >Try fiddling the newfs parameters. I see 17% speedup using:
>> >
>> > newfs -b 16384 -f 4096 -c 100
>
>Too bad tunefs doesn't have those options :-)
>
>> ffs-4K-4K is only slightly faster than ffs-8K-1K now, presumably because
>> drive caches are larger and command overheads are relatively higher (the KA
>> acts like a slow SCSI drive in wanting a block size of at least 8K to keep
>> up with the disk).
>
>As an aside, AIX uses 4K blocks and doesn't support fragments.

AIX uses jfs, which is entirely different. It may be time to abandon the concept of fragments. "cylinders" should be taken out of "cylindergroups".

-- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
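[To put numbers on the fragments-vs-no-fragments question: for the ~15KB-average files in the workload that started this thread, internal fragmentation is easy to estimate. A self-contained C sketch (the file count and sizes are Michael's figures from earlier in the thread; the bsize/8 floor on the frag size is FFS's real limit, since a block's frag allocation map has only 8 bits):

#include <stdio.h>

int
main(void)
{
	long nfiles = 6400;		/* Michael's test set */
	long avgsize = 15 * 1024;	/* ~15KB average file */
	long bsize = 8192;		/* filesystem block size */
	long frags[2] = { 1024, 8192 };	/* 1024 = bsize/8, the FFS minimum; */
					/* frag == bsize means "no fragments" */
	int i;

	for (i = 0; i < 2; i++) {
		long frag = frags[i];
		long tail = avgsize % bsize;
		/* full blocks, plus the tail rounded up to whole frags */
		long alloc = (avgsize / bsize) * bsize +
		    ((tail + frag - 1) / frag) * frag;

		printf("frag %4ld: %5ld bytes allocated/file, %4ld KB wasted total\n",
		    frag, alloc, (alloc - avgsize) * nfiles / 1024);
	}
	return (0);
}

With 1K frags the 15KB files fit exactly; without fragments each file wastes about 1KB of tail, roughly 6MB over the 6400-file set -- real, but hardly fatal on a 1GB data disk, which is presumably why dropping fragments looks tempting.]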
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 11:55:39 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135]) by hub.freebsd.org (Postfix) with ESMTP id B7FF814FE5 for ; Fri, 13 Aug 1999 11:55:32 -0700 (PDT) (envelope-from tlambert@usr09.primenet.com) Received: (from daemon@localhost) by smtp05.primenet.com (8.9.1/8.9.1) id LAA517950; Fri, 13 Aug 1999 11:53:14 -0700 Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp05.primenet.com, id smtpduBppUa; Fri Aug 13 11:53:10 1999 Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5) id LAA22289; Fri, 13 Aug 1999 11:53:07 -0700 (MST) From: Terry Lambert Message-Id: <199908131853.LAA22289@usr09.primenet.com> Subject: Re: Help with understand file system performance To: Jos.Backus@nl.origin-it.com Date: Fri, 13 Aug 1999 18:53:06 +0000 (GMT) Cc: bde@zeta.org.au, phk@critter.freebsd.dk, vernick@bell-labs.com, freebsd-fs@FreeBSD.ORG In-Reply-To: <19990813152529.G12312@hal.mpn.cp.philips.com> from "Jos Backus" at Aug 13, 99 03:25:29 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

> On Fri, Aug 13, 1999 at 11:13:21PM +1000, Bruce Evans wrote:
> [Poul-Henning wrote:]
> > >Try fiddling the newfs parameters. I see 17% speedup using:
> > >
> > > newfs -b 16384 -f 4096 -c 100
>
> Too bad tunefs doesn't have those options :-)

These are not tunable options, they are initial layout options.

> > ffs-4K-4K is only slightly faster than ffs-8K-1K now, presumably because
> > drive caches are larger and command overheads are relatively higher (the KA
> > acts like a slow SCSI drive in wanting a block size of at least 8K to keep
> > up with the disk).
>
> As an aside, AIX uses 4K blocks and doesn't support fragments.

AIX uses JFS, which is a journaling file system. Journaling file systems can replay transactions forward, as well as rolling them backward (log structured FS's can only roll them backward). In addition, the very nature of a JFS is significantly different (e.g. to write one for FreeBSD, it would be necessary to cause VOP_ABORTOP to do what its name says it does, instead of freeing up cn_pnbuf allocations that the caller should be freeing up anyway). Likewise, you can't get rid of the concept of cylinder/cylindergroup without damaging the hashing function, which prevents fragmentation.

Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.
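[The forward/backward distinction Terry draws is easiest to see in code. A toy sketch only -- this is not JFS, and every name in it is invented for illustration -- of a journal record that supports replay in both directions:

#include <string.h>

#define	MD_SIZE	512			/* one metadata block */

struct logrec {
	unsigned long	txid;		/* owning transaction */
	unsigned long	blkno;		/* metadata block modified */
	char		old[MD_SIZE];	/* before-image: enables undo */
	char		new[MD_SIZE];	/* after-image: enables redo */
};

static void
redo(char *disk, const struct logrec *r)	/* roll forward */
{
	memcpy(disk + r->blkno * MD_SIZE, r->new, MD_SIZE);
}

static void
undo(char *disk, const struct logrec *r)	/* roll backward */
{
	memcpy(disk + r->blkno * MD_SIZE, r->old, MD_SIZE);
}

static int
committed(unsigned long txid)
{
	(void)txid;	/* stand-in: a real log looks for commit records */
	return (1);
}

void
recover(char *disk, const struct logrec *log, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (committed(log[i].txid))
			redo(disk, &log[i]);
		else
			undo(disk, &log[i]);
	}
}

Because each record carries both images, recovery can finish committed transactions forward or back out incomplete ones. A log-structured FS keeps no separate before-image (the log is the filesystem), so it can only discard the incomplete tail -- the one-way replay Terry mentions.]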
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 13:51:54 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp01.primenet.com (smtp01.primenet.com [206.165.6.131]) by hub.freebsd.org (Postfix) with ESMTP id 572E814EA3 for ; Fri, 13 Aug 1999 13:51:38 -0700 (PDT) (envelope-from tlambert@usr01.primenet.com) Received: (from daemon@localhost) by smtp01.primenet.com (8.8.8/8.8.8) id NAA21170; Fri, 13 Aug 1999 13:50:54 -0700 (MST) Received: from usr01.primenet.com(206.165.6.201) via SMTP by smtp01.primenet.com, id smtpd021051; Fri Aug 13 13:50:47 1999 Received: (from tlambert@localhost) by usr01.primenet.com (8.8.5/8.8.5) id NAA17293; Fri, 13 Aug 1999 13:50:39 -0700 (MST) From: Terry Lambert Message-Id: <199908132050.NAA17293@usr01.primenet.com> Subject: Re: Help with understand file system performance To: zzhang@cs.binghamton.edu (Zhihui Zhang) Date: Fri, 13 Aug 1999 20:50:39 +0000 (GMT) Cc: tlambert@primenet.com, phk@critter.freebsd.dk, roberto@keltia.freenix.fr, freebsd-fs@FreeBSD.ORG In-Reply-To: from "Zhihui Zhang" at Aug 12, 99 09:02:32 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

> On Thu, 12 Aug 1999, Terry Lambert wrote:
>
> > The filesystem block allocation table in directories is unique, in
> > that it is generally used as a convenience for locating physical
> > blocks, rather than using the standard filesystem block access
> > mechanisms, when reading or writing directories.
>
> Directory files have the same on-disk structure as regular files.

Yes. But they are not accessed internally as if they were regular files. The only operation which is treated as "regular" is extending (and as of 4.4BSD, truncating back) the block allocations in directories. The directory manipulation code treats it as a series of blocks, and translates from the "regular file" aspect into BLKATOFF().

> However, they can never have holes and they can only be incremented at the
> end of the file in device block chunks. No directory entry can cross the
> device block boundary to guarantee the atomic update.

Right. There is no such thing as a "sparse block allocation" in a directory, since BLKATOFF() assumes the existence of a block. Directory entries are physically prevented from crossing block boundaries in order to ensure atomic update. But this is an implementation detail, and it is not the only way one could ensure atomicity, so long as one were willing to reallocate (filesystem, not physical) blocks or frags in order to do the updates (i.e. you could arrange for a two stage commit; I did this in my Unicode FFS prototype, since even though a 256 character name would fit in 512b, there was no room left over for the metadata).

> However, I do not know why you say the block map (direct and indirect
> blocks) of a directory is only used as a convenience. I mean there is a
> need to call VOP_BMAP() on a directory file. The routine ffs_blkatoff()
> calls bread(), which in turn calls VOP_BMAP(). The in-core inode does have
> several fields to facilitate the insertion of new directory entries. But
> we still need the block map (block allocation table).

Directory manipulations access blocks directly. You've no doubt noticed that the vast majority of system calls do _not_ require VOP_BMAP() calls for copyin/out operations on VM objects backed by the filesystem.
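[An aside on the no-straddling invariant above: it is compact enough to state in code. A sketch only (DIRBLKSIZ matches the FFS constant, but this is an illustration, not the actual ufs/ufs_lookup.c logic):

#include <stdio.h>

#define	DIRBLKSIZ	512	/* device block: the atomic-write unit */

/* would an entry at [off, off+reclen) straddle a device block boundary? */
static int
entry_fits(unsigned off, unsigned reclen)
{
	return (off / DIRBLKSIZ == (off + reclen - 1) / DIRBLKSIZ);
}

int
main(void)
{
	printf("%d\n", entry_fits(500, 16));	/* 0: crosses the 512 line */
	printf("%d\n", entry_fits(480, 32));	/* 1: ends exactly at 512 */
	return (0);
}

Keeping every entry inside one device block is what lets a single sector write update a directory entry atomically, at the cost of the packing and compaction games ufs_direnter() has to play.]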
The need to call VOP_BMAP() is an artifact of treating the directories as a list of blocks, rather than treating them as files. The "convenience" aspect is that they are files, but they are not used as such, and it's just because it is convenient that files are used as the underlying abstraction: directories are not naturally represented as files, and in fact, trying to make them conform to the normal file behaviour would result in breakage of the atomicity guarantee.

> Directory files are also special in that we can not write into them
> with the write() system call as normal files. They use a special
> routine to grow, i.e., ufs_direnter(). By the way, we can use read()
> system call to read directory files as we do with normal files.

The lack of the ability to write was mirrored by a lack of ability to read, as well, until this was changed, intentionally. Likewise, there was no ability to mmap directories (read only, of course), until that, too, was changed. These are both optimizations to speed certain programs, and are really antithetical to POSIX.

In reality, if you have looked at the "cookie" code for VOP_READDIR() in NFS, FFS, and at least one other FS, you will see that the need for cookies is an artifact of the structure of the interface. An alternate interface would allow directory block abstraction separate from the externalization of directory entries. The structure that is returned by getdents() is actually only coincidentally (albeit intentionally so) the same as the FFS on disk structure. See the 4.3/4.4 compatibility translation code in the VOP_READDIR() in the FFS implementation.

The upshot of this is that the ability to read or mmap for read directories is actually a very bad thing, from an interface perspective, since it promotes the writing of code that depends on data format interfaces. This is similar to the use of the KVM as a data interface. It is only coincidental, based on implementation (unintentionally so, this time) that the POSIX access time updates for files and the access time of directories (as POSIX mandates for getdents() operations) happen to coincide. If you look at the cookie mess, and the NFS server code wire format translation mess, I'm sure you will agree. You only need to ask yourself "how could NFS handle a VOP_READDIR() that came from an underlying FS that could pack more entries in a block than could be represented in a block in the external 'coincidental' format?" to prove to yourself that this is broken.

> > There are a number of performance penalties for this, especially
> > on large directories, where it is not possible to trigger sequential
> > readahead through use of the getdents() system call sequentially
> > accessing sequential 512b/physical_block_size extents.
>
> I do not understand this. The read-ahead mechanism should work on any
> files. I thought the reorganization of directory entries within a directory
> block when you delete an entry is an inefficiency.
>
> Does this issue have anything to do with the VMIO directory issue
> discussed earlier this year?

No. It has to do with VOP_READDIR() not exhibiting behaviour which would trigger read-ahead, such as is triggered by READ, WRITE, GETPAGES, and PUTPAGES.

> > The frag size can be tuned down below this (i.e. 1/4, 1/2, 1).
> >
> > The only case where 1024 bytes of physical disk would be used is at
> > a filesystem block size of 8192 (or greater), which, divided by 8,
> > gives 1024b (or greater).
>
> I did not realize this before. The maximum ratio is 8.
> So if the filesystem block is 8192, the allocation unit (fragment size)
> can not be 512 because 8192/512 > 8.

Yes. There are only 8 bits available for representing frag allocations.

> > This is called an encapsulated two stage commit, in database terms.
> >
> > For inodes, indirect blocks, and directory entry blocks, there is
> > no two stage commit, because there is no indirection of their data
> > contents.
>
> I guess you mean that their data are not managed by any higher level
> metadata which must be updated together.

Yes. Despite the fact that "higher level" metadata exists, since the implementation detail is that they are stored using "files", the actual implementation does not take advantage of this, either for triggering read-ahead, or for encapsulated commits of directory modifications, or for clustering (which could only occur on a restore from an archive, given the incremental nature of directory entries), or for any of a dozen other speed enhancements which are applied to normal files. This means that directories are, by their nature, rather slow.

> Thanks for your help.
>
> -Zhihui

Any time. 8-). It's an interesting discussion to engage in; there are (not implemented in FreeBSD) interesting solutions to many of the performance issues that people raise against the FFS. The last time this issue came up that I remember had to do with depth-first creation and breadth-first traversal of the ports directory structure; I actually still maintain that this is a problem in the creation of the directory (i.e. the organization of the archive) more than it is a problem with the FS itself (a tool is only as good as the craftsman using it). If used properly, there really aren't a lot of performance problems that you can point to (sort of like cutting with vs. against the grain in a board).

I am becoming convinced that an intermediate abstraction is really what is called for, to turn the bottom end into what is, in effect, nothing more than a flat, numeric namespace on top of a variable granularity block store. A nice topic for much research... 8-).

Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 14:33:50 1999 Delivered-To: freebsd-fs@freebsd.org Received: from m4.c2.telstra-mm.net.au (m4.c2.telstra-mm.net.au [24.192.3.19]) by hub.freebsd.org (Postfix) with ESMTP id 3011914EAB for ; Fri, 13 Aug 1999 14:33:43 -0700 (PDT) (envelope-from a.reilly@lake.com.au) Received: from m5.c2.telstra-mm.net.au (m5.c2.telstra-mm.net.au [24.192.3.20]) by m4.c2.telstra-mm.net.au (8.8.6 (PHNE_14041)/8.8.6) with ESMTP id HAA27453 for ; Sat, 14 Aug 1999 07:33:48 +1000 (EST) X-BPC-Relay-Envelope-From: a.reilly@lake.com.au X-BPC-Relay-Envelope-To: X-BPC-Relay-Sender-Host: m5.c2.telstra-mm.net.au [24.192.3.20] X-BPC-Relay-Info: Message delivered directly.
Received: from areilly.bpc-users.org (CPE-24-192-49-170.nsw.bigpond.net.au [24.192.49.170]) by m5.c2.telstra-mm.net.au (8.8.6 (PHNE_14041)/8.8.6) with SMTP id HAA12262 for ; Sat, 14 Aug 1999 07:33:47 +1000 (EST) Received: (qmail 39702 invoked by uid 1000); 13 Aug 1999 21:33:46 -0000 From: "Andrew Reilly" Date: Sat, 14 Aug 1999 07:33:46 +1000 To: Terry Lambert Cc: Zhihui Zhang , phk@critter.freebsd.dk, roberto@keltia.freenix.fr, freebsd-fs@FreeBSD.ORG Subject: Re: Help with understand file system performance Message-ID: <19990814073346.A38606@gurney.reilly.home> References: <199908132050.NAA17293@usr01.primenet.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4i In-Reply-To: <199908132050.NAA17293@usr01.primenet.com>; from Terry Lambert on Fri, Aug 13, 1999 at 08:50:39PM +0000 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

On Fri, Aug 13, 1999 at 08:50:39PM +0000, Terry Lambert wrote:
> I am becoming convinced that an intermediate abstraction is really
> what is called for, to turn the bottom end into what is, in effect,
> nothing more than a flat, numeric namespace on top of a variable
> granularity block store. A nice topic for much research... 8-).

Isn't that what Andrew Tanenbaum had on Amoeba? Does anyone have any experience with that system? The numbers in his namespace were capabilities/crypto-cookies, if I remember rightly.

-- Andrew

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 18:53:32 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133]) by hub.freebsd.org (Postfix) with ESMTP id 04EB115039; Fri, 13 Aug 1999 18:52:53 -0700 (PDT) (envelope-from tlambert@usr04.primenet.com) Received: (from daemon@localhost) by smtp03.primenet.com (8.9.3/8.9.3) id SAA08447; Fri, 13 Aug 1999 18:50:58 -0700 (MST) Received: from usr04.primenet.com(206.165.6.204) via SMTP by smtp03.primenet.com, id smtpdAAAHkaqzq; Fri Aug 13 18:50:53 1999 Received: (from tlambert@localhost) by usr04.primenet.com (8.8.5/8.8.5) id SAA23891; Fri, 13 Aug 1999 18:50:48 -0700 (MST) From: Terry Lambert Message-Id: <199908140150.SAA23891@usr04.primenet.com> Subject: Re: BSD XFS Port & BSD VFS Rewrite To: Matthew.Alton@anheuser-busch.com (Alton Matthew) Date: Sat, 14 Aug 1999 01:50:47 +0000 (GMT) Cc: Hackers@FreeBSD.ORG, fs@FreeBSD.ORG In-Reply-To: <0740CBD1D149D31193EB0008C7C56836EB8AFC@STLABCEXG012> from "Alton, Matthew" at Aug 5, 99 06:02:47 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

> I am currently conducting a thorough study of the VFS subsystem
> in preparation for an all-out effort to port SGI's XFS filesystem to
> FreeBSD 4.x at such time as SGI gives up the code. Matt Dillon
> has written in hackers- that the VFS subsystem is presently not
> well understood by any of the active kernel code contributors and
> that it will be rewritten later this year. This is obviously of great
> concern to me in this port.

It is of great concern to me that a rewrite, apparently because of non-understanding, is taking place at all. I would suggest that anyone planning on this rewrite should talk, in depth, with John Heidemann prior to engaging in such activity. John is very approachable, and is a deep thinker.
Any rewrite that does not meet his original design goals for his stacking architecture is, I think, a Very Bad Idea(tm).

> I greatly appreciate all assistance in answering the following
> questions:
>
> 1) What are the perceived problems with the current VFS?
> 2) What options are available to us as remedies?
> 3) To what extent will existing FS code require revision in order
> to be useful after the rewrite?
> 4) Will Chapters 6,7,8 & 9 of "The Design and Implementation of
> the 4.4BSD Operating System" still pertain after the rewrite?
> 5) How important are questions 3 & 4 in the design of the new
> VFS?
>
> I believe that the VFS is conceptually sound and that the existing
> semantics should be strictly retained in the new code. Any new
> functionality should be added in the form of entirely new kernel
> routines and system calls, or possibly by such means as
> converting the existing routines to the vararg format &etc.

Here are some of the problems I'm aware of, and my suggested remedies:

1. The interface is not reflexive, with regard to cn_pnbuf.

Specifically, path buffers are allocated by the caller, but not freed by the caller, and various routines in each FS implementation are expected to deal with this. Each FS duplicates code, and such duplication is subject to error. Not to mention that it makes your kernel fat.

2. Advisory locks are hung off private backing objects.

Advisory locks are passed into VOP_ADVLOCK in each FS instance, and then each FS applies this by hanging the locks off a list on a private backing object. For FFS, this is the in core inode. A more correct approach would be to hang the lock off the vnode. This effectively obviates the need for having a VOP_ADVLOCK at all, except for the NFS client FS, which will need to propagate lock requests across the net. The most efficient mechanism for this would be to institute a pass/fail response for VOP_ADVLOCK calls, with a default of "pass", and an actual implementation of the operand only in the NFS client FS (a toy sketch of this arrangement is appended at the end of this message). Again, each FS must duplicate the advisory locking code, at present, and such duplication is subject to error.

3. Object locks are implemented locally in many FS's.

The VOP_LOCK interface is implemented via vop_stdlock() calls in many FS's. This is done using the "vfs_default" mechanism. In other FS's, it's implemented locally. The intent of the VOP_LOCK mechanism being implemented as a VOP at all was to allow it to be proxied to another machine over a network, using the original Heidemann design. This is also the reason for the use of descriptors for all VOP arguments, since they can be opaquely proxied to another machine via a general mechanism. Unlike NFS based network filesystems, this would allow you to add VOP's to both machines, without having to teach the transport about the new VOP for it to be usable remotely. Like the VOP_ADVLOCK, the need for VOP_LOCK is for proxy purposes, and it, too, should generate a pass/fail response, and be largely implemented in non-filesystem specific higher level code. Again, each FS which duplicates code for this function is subject to duplication errors.

4. The VOP_READDIR interface is irrational.

The VOP_READDIR interface returns its responses in "host canonical format" (struct dirent, in sys/dirent.h). Internally, FFS operates on "directory entry blocks" that contain exactly these structures (an intentional coincidence). The problem with this approach is that it makes the getdents system call sensitive to file systems for which some of the information returned (e.g.
d_fileno, d_reclen, d_type, d_namlen) are synthetic. What this means is that a single directory block of a native file system's directory implementation must be able to fit into the buffer passed to the getdirentries(2) system call, or a directory listing is not a valid snapshot of the current state of the directory. It also vastly complicates directory traversal restarts (hence the ncookies and a_cookies arguments, since the NFS server requires the ability to restart traversal mid-block, because the NFSv2 protocol returns directory entries one at a time). The "cookie" idea must be carried out faithfully, in an FS specific fashion, for each FS which is allowed to be NFS exported. This code duplication is subject to error, or worse, non-implementation due to its complexity. A more rational approach would be to split the operation into two separate VOP's: one to acquire a snapshot of a set of FS specific directory entries of an arbitrary size, and the second to extract entries into the user's buffer, in canonical format.

5. The idea of "root" vs. "non-root" mounts is inherently bad.

Right now, there are several operations, all wrapped into a single "mount" entry point. This is actually a partial transition to a more canonically correct implementation. The reason for the "root" vs. "non-root" knowledge in the code has to do with several logical operations:

1) "Mounting" the filesystem; that is, getting the vnode for the device to be mounted, and doing any FS specific operations necessary to cause the correct in-core context to be established.

2) Covering the vnode at the mount point. This operation updates the vnode of the mount point so that traversals of the mount point will get you the root directory of the FS that was mounted instead of the directory that is covered by the mount.

3) Saving the "last mounted on" information. This is a clerical detail. Read-only FS's, and some read-write FS's, do not implement this. It is mostly a nicety for tools that manipulate FFS directly.

4) Initialize the FS stat information. Part of the in-core data for any FS is the mnt_stat data, which is what comes back from a VFS_STATFS() call.

The first operation is invariant. It must be done for all FS's, whether they are "root" or "non-root". The second operation is specific to "non-root" FS's. It could be moved to common, higher level code -- specifically, it could be moved into the mount system call. The third operation is also specific to "non-root" FS's. It could be discarded, or it could be moved to a separate VFS operation, e.g. VFS_SETMNTINFO(). I would recommend moving it to a separate VFSOP, instead of discarding it. The reason for this is that an intelligent person could reasonably decide to add the setting of this data in newfs and tunefs, and do away with /etc/fstab. The fourth operation is invariant. It must be done for all FS's, whether they are "root" or "non-root".

We can now see that we have two discrete operations:

1) Placement of any FS, regardless of how it is intended to be used, into the list of mounted filesystems.

2) Mapping a filesystem from the list of mounted FS's into the directory hierarchy.

The job of the per FS mount code should be to take a mount structure, the vnode of a device, the FS specific arguments, the mount point credentials, and the process requesting the mount, and _only_ do #1 and #4.
The conversion of the root device into a vnode pointer, or a path to a device into a vnode pointer, is the job of upper level code -- specifically, the mount system call, and the common code for booting. This removes a large amount of complex code from each of the file systems, and centralizes the maintenance task into one set of code that either works for everyone, or no one (removing the duplication of code/introduction of errors issue). In addition, the lack of "root" specific code in many FS's VFS_MOUNT entry points is the reason that they can not be mounted as "/". This change would open it up, such that any FS that was supported by the kernel could be used as the root filesystem.

6. The "vfs_default" code damages stacking.

The intent of the stacking architecture was to have the default operation for any VOP unknown to an FS fall through to the lower level code, and fail if it was not implemented. The use of the "vfs_default" to make unimplemented VOP's fall through to code which implements function, while well intentioned, is misguided. Consider the case of a VOP proxy that proxies requests. These might be requests to another machine, as in the previous proxy example, or they might be requests to user space, to allow for easy development of new filesystem layers. In addition, in order to get a default operation to actually fail, you have to intentionally create a failing VOP for that particular FS. Finally, the paradigm can not support new VOP's without a kernel recompilation. This means that in order to add to the list of VOP's known to the system when you add a new FS, you don't merely have to reallocate the in-core copy of the vnodeop_desc to include a new (failing) member, you have to create a default behaviour for it, and modify the default operations table. In other words, it's not extensible, as it was architected to be.

7. The struct nameidata (namei.h) is broken in conception.

One issue that recurs frequently, and remains unaddressed, is the issue of namespace abstraction. This issue is nowhere more apparent than in the VFAT and NTFS filesystems, where there are two namespaces: one 8.3, and the second, 16 bit Unicode. The problem is one of coherency, and one of reference, and is not easily resolved in the context of the current nameidata structure. Both NTFS and the VFAT FS try to cover this issue, both with varying degrees of success. The problem is that there is no canonical format that the kernel can use to communicate namespace data to FS's. Unlike VOP_READDIR, which has the abstract (though ill-implemented) struct dirent, there is no abstract representation of the data in a pathname buffer, which would allow you to treat path components as opaque entities. One potential remedy for this situation would be to canonicalize any path into an ordered list of components. Ideally, this would be done in 16 bit Unicode (looking toward the future), but would minimally be separate components with length counts, to allow faster rejection of non-matching components and to avoid frequent recalculation of length.

8. The filesystems have knowledge of the name cache.

Entries into the name cache, and deletion of entries from the name cache, should be handled in FS independent code at a higher level. This can avoid expensive VFS_LOOKUP calls in many cases, and save marshalling arguments into and out of the descriptor structure, in addition to drastically reducing the function call overhead.
Someone recently profiling FreeBSD's FS to determine speed bottlenecks (I believe it was Mike Smith, attempting to optimize for a ZD Labs benchmark) found that FreeBSD spends much of its time in namei().

9. The implementation of namei() is POSIX non-compliant.

The implementation of namei() is by means of coroutine "recursion"; this is similar to the only recursion you can achieve in FORTRAN. The upshot of this is that the use of the "//" namespace escape allowed by POSIX can not be usefully implemented. This is because it is not possible to inherit a namespace escape deeper than a single path component for a stack of more than one layer in depth. This needs to be fixed, both for "natural" SMBFS support, and for other uses of the namespace escape (HTTP "tunnels", extended attribute and/or resource fork access in an OS/2 HPFS or Macintosh HFS implementation, etc.), including forward looking research. This is related to item 7.

10. Stacking is broken.

This is really an issue of not having a coherency protocol which can be applied between stacks of files. It is somewhat related to almost all of the above issues. The current thinking which has been forwarded by Matt and John is that a vnode should have an associated vm_object_t, and that coherency should be maintained that way. This thinking is flawed for a number of reasons:

a. The main utility of this would be for an MFS implementation. While a "fast MFS" is a laudable goal, it isn't sufficient to drive this.

b. A coherency protocol is required in any case, since a proxied VOP is not necessarily on the same machine or in the same VM space. This approach would disallow the possibility of a user space filesystem development framework.

c. There already exist aliases (VM implementation errors); intentionally adding aliases as an implementation detail will further obfuscate them. Minimally, the VM system should pass a full branch path analysis based test procedure before they are introduced. Even then, I would argue that it would open up a large complexity space that would prevent us from ever being sure about problem resolution again.

d. Filesystems which need to transform data can never operate correctly, since they need to make local copies of the transformed content. This includes cryptographic, character set translation, compression, and similar stacking layers.

Instead, I think the interface design issues (VOP_ADVLOCK, VOP_GETPAGES, VOP_PUTPAGES, VOP_READ, VOP_WRITE, et al.) that drive the desire to implement coherency in this fashion should be examined. I believe that an ideal solution would be to never have the pages replicated at more than a single vnode. This would likewise solve the coherency problem, without the additional complexity. The issue would devolve into locating the real backing object, and potentially, translating extents.

11. The function call "footprint" of filesystems is too large.

Attempt the following: Compile up all of the files which make up an individual filesystem. You can take all of the files for the ufs/ffs objects and the vnode_if.o from a compiled kernel for this exercise. Now link them. Ignore the missing "main"; how many undefined functions are there? The problem you are seeing is the incursion of the VM system, and sloppy programming practices, into each VFS implementation. This footprint impacts filesystem portability, and is one reason, among many (including some of the above), that VFS modules are no longer very portable between BSD flavors.
Minimally, the VFS incursions need to be macrotized, and not assume a unified VM and buffer cache (or a non-unified VM and buffer cache, as well, for that matter). This would improve portability considerably. In addition to this change, a function minimization effort should take place. If the underlying interface utilized by VFS layers was not the kernel (for local media FS's, like FFS or NTFS), but instead a variable granularity block store with a numeric namespace, then the "top" and "bottom" interfaces could be identical. For now, however, some work can be done (and should be done) to reduce the function call footprint. This is important work, which can only aid development of future work (such as a user space filesystem framework for use by developers and researchers). I hesitate to suggest this, but it might be reasonable to consider a struct containing externally referenced functions, which is registered into the FS via mount, and which is identical for all FS's. This would, likewise, promote the idea of a user space framework. Ideally, work would be done to port the Heidemann framework to Linux, so that their developers could be leveraged.

Some FFS-specific problems are:

1. The directory code in the UFS layer is intertwined with the filespace code.

Ideally, one would be able to mount a filesystem as a flat numeric namespace (see #7, above), and then mount the idea of directory management over top of that.

2. The quota subsystem is too tightly integrated.

Quotas should be an abstract stacking layer that can be applied to any FS, instead of an FFS specific monstrosity. The current quota system is also limited to 16 bits for a number of values which, in FreeBSD, can be greater than 16 bits (e.g. UID's). The current quota system is also broken for Y2038.

3. The filesystem itself is broken for Y2038.

The space which was historically reserved for the Y2038 fix (a 64 bit time_t) was absconded with for subsecond resolution. This change should be reverted, and fsck modified to re-zero the values, given a specific argument. The subsecond resolution doesn't really matter, but if it is seen as an issue which needs to be addressed, the only value which could reasonably require this is the modification time, and there is sufficient free space in the inode to be able to provide for this (there are 2x32 bit spares).

I have other suggestions, but the above covers the most obvious damage.

Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.
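[The sketch promised in item 2: what a pass/fail VOP_ADVLOCK might look like, with the lock list hung off the vnode itself. All names here (vn_advlock, vop_stdadvlock_check) are invented for illustration; this is a shape, not the 4.4BSD/FreeBSD interface:

struct vnode;
struct flock;

/*
 * Default per-FS hook: local filesystems just say "pass" and let the
 * generic code do the work, so none of them duplicate the list code.
 */
static int
vop_stdadvlock_check(struct vnode *vp, struct flock *fl, int op)
{
	(void)vp; (void)fl; (void)op;
	return (0);			/* pass */
}

/*
 * Generic layer: only the NFS client FS would override the hook, to
 * propagate the request to the server before the lock is recorded.
 */
int
vn_advlock(struct vnode *vp, struct flock *fl, int op)
{
	int error;

	error = vop_stdadvlock_check(vp, fl, op);	/* FS veto */
	if (error)
		return (error);
	/* record or clear the lock on a list hung off vp itself */
	return (0);
}

The per-FS part shrinks to a veto, so the duplicated list manipulation Terry complains about would live in exactly one place.]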
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 14 1:50: 0 1999 Delivered-To: freebsd-fs@freebsd.org Received: from gw-nl3.philips.com (gw-nl3.philips.com [192.68.44.35]) by hub.freebsd.org (Postfix) with ESMTP id DC4B914EDA for ; Sat, 14 Aug 1999 01:49:56 -0700 (PDT) (envelope-from Jos.Backus@nl.origin-it.com) Received: from smtprelay-nl1.philips.com (localhost.philips.com [127.0.0.1]) by gw-nl3.philips.com with ESMTP id KAA21578 for ; Sat, 14 Aug 1999 10:50:09 +0200 (MEST) (envelope-from Jos.Backus@nl.origin-it.com) Received: from smtprelay-eur1.philips.com(130.139.36.3) by gw-nl3.philips.com via mwrap (4.0a) id xma021571; Sat, 14 Aug 99 10:50:11 +0200 Received: from hal.mpn.cp.philips.com (hal.mpn.cp.philips.com [130.139.64.195]) by smtprelay-nl1.philips.com (8.9.3/8.8.5-1.2.2m-19990317) with SMTP id KAA25684 for ; Sat, 14 Aug 1999 10:50:06 +0200 (MET DST) Received: (qmail 28515 invoked by uid 666); 14 Aug 1999 08:50:29 -0000 Date: Sat, 14 Aug 1999 10:50:29 +0200 From: Jos Backus To: Terry Lambert Cc: bde@zeta.org.au, phk@critter.freebsd.dk, vernick@bell-labs.com, freebsd-fs@FreeBSD.ORG Subject: Re: Help with understand file system performance Message-ID: <19990814105029.A28461@hal.mpn.cp.philips.com> Reply-To: Jos Backus References: <19990813152529.G12312@hal.mpn.cp.philips.com> <199908131853.LAA22289@usr09.primenet.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.6i In-Reply-To: <199908131853.LAA22289@usr09.primenet.com>; from Terry Lambert on Fri, Aug 13, 1999 at 06:53:06PM +0000 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Fri, Aug 13, 1999 at 06:53:06PM +0000, Terry Lambert wrote: > These are not tunable options, they are initial layout options. I know, I was just thinking of Partition Magic which I had to use on my wife's computer the other night :) > AIX uses JFS, which is a journaling file system. A very different beast indeed (I used to admin some AIX boxes). This is what logfs was supposed to be like (correct me if I'm wrong). -- Jos Backus _/ _/_/_/ "Reliability means never _/ _/ _/ having to say you're sorry." _/ _/_/_/ -- D. J. Bernstein _/ _/ _/ _/ Jos.Backus@nl.origin-it.com _/_/ _/_/_/ use Std::Disclaimer; To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 14 4:44:41 1999 Delivered-To: freebsd-fs@freebsd.org Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.26.10.9]) by hub.freebsd.org (Postfix) with ESMTP id C3CB1154D8 for ; Sat, 14 Aug 1999 04:44:32 -0700 (PDT) (envelope-from bde@godzilla.zeta.org.au) Received: (from bde@localhost) by godzilla.zeta.org.au (8.8.7/8.8.7) id VAA00836 for freebsd-fs@freebsd.org; Sat, 14 Aug 1999 21:43:31 +1000 Date: Sat, 14 Aug 1999 21:43:31 +1000 From: Bruce Evans Message-Id: <199908141143.VAA00836@godzilla.zeta.org.au> To: freebsd-fs@freebsd.org Subject: better ffs-4096-512 ... ext2fs-4096-4096 benchmarks Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org This tests filesystems with various block sizes in a few simple ways. The filesystems are now 1/3 filled with interesting data (a copy of /home/ncvs which takes about 750000000 bytes of tar output). I'm mainly interested in the read benchmark (tar cf /dev/null ncvs) so I didn't test -async or soft updates. 
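[Bruce's `tarcp', used below, is described only as two tars in a pipe; his actual script isn't shown, so the following reconstruction is a guess at its shape. The -b 2048 blocking factor (2048 x 512 bytes = 1MB) matches the block size he mentions:

#!/bin/sh
# guessed shape of `tarcp dest dir': copy dir from $PWD into dest
# through a pipe, using old-style bundled tar flags
tar cbf 2048 - "$2" | (cd "$1" && tar xpbf 2048 -)

Invoked as `tarcp /d ncvs' from /home, as in the timings that follow.]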
`tarcp' is two tars in a pipe with a block size of 1MB (the details don't matter because the filesystems are on separate drives). The results are as I expected, except reading ext2fs-4096-4096 is now about 2.5 times faster than for the best ffs layout. The throughput of 7-8MB/sec is about 40% of the drive's throughput. This is surprisingly large for a filesystem with lots of small files (79772 files with average size 9500 bytes).

Bruce

ffs-4096-512:
fsck /dev/rwd2e: 12.56 real 2.59 user 0.18 sys
tarcp /d ncvs: 1509.69 real 4.76 user 74.37 sys
umount /d: 2.00 real 0.00 user 0.29 sys
fsck /dev/rwd2e: 52.68 real 3.24 user 0.93 sys
tar cf /dev/null ncvs: 364.42 real 1.73 user 22.31 sys
ffs-4096-1024:
fsck /dev/rwd2e: 6.52 real 1.60 user 0.08 sys
tarcp /d ncvs: 1489.87 real 4.64 user 72.98 sys
umount /d: 1.81 real 0.00 user 0.32 sys
fsck /dev/rwd2e: 45.52 real 2.22 user 0.81 sys
tar cf /dev/null ncvs: 327.91 real 1.76 user 21.89 sys
ffs-4096-2048:
fsck /dev/rwd2e: 3.97 real 1.12 user 0.01 sys
tarcp /d ncvs: 1411.08 real 4.31 user 71.62 sys
umount /d: 1.06 real 0.00 user 0.31 sys
fsck /dev/rwd2e: 38.90 real 1.54 user 0.90 sys
tar cf /dev/null ncvs: 290.50 real 1.89 user 21.17 sys
ffs-4096-4096:
fsck /dev/rwd2e: 2.87 real 0.83 user 0.03 sys
tarcp /d ncvs: 1421.98 real 4.62 user 72.29 sys
umount /d: 1.49 real 0.00 user 0.32 sys
fsck /dev/rwd2e: 40.72 real 1.31 user 0.80 sys
tar cf /dev/null ncvs: 283.48 real 1.86 user 21.93 sys
ffs-8192-1024:
fsck /dev/rwd2e: 5.93 real 1.27 user 0.13 sys
tarcp /d ncvs: 1444.24 real 5.07 user 148.23 sys
umount /d: 0.51 real 0.00 user 0.30 sys
fsck /dev/rwd2e: 38.46 real 1.93 user 0.84 sys
tar cf /dev/null ncvs: 348.17 real 1.84 user 46.74 sys
ffs-8192-2048:
fsck /dev/rwd2e: 3.90 real 0.80 user 0.04 sys
tarcp /d ncvs: 1404.55 real 5.26 user 130.78 sys
umount /d: 0.93 real 0.00 user 0.32 sys
fsck /dev/rwd2e: 35.04 real 1.43 user 0.71 sys
tar cf /dev/null ncvs: 308.19 real 1.83 user 40.08 sys
ffs-8192-4096:
fsck /dev/rwd2e: 2.68 real 0.52 user 0.05 sys
tarcp /d ncvs: 1379.23 real 5.41 user 123.14 sys
umount /d: 1.13 real 0.00 user 0.31 sys
fsck /dev/rwd2e: 34.02 real 1.14 user 0.70 sys
tar cf /dev/null ncvs: 260.32 real 1.59 user 20.33 sys
ffs-8192-8192: [deleted -- invalid due to insufficient inodes]
ffs-16384-2048:
fsck /dev/rwd2e: 3.78 real 0.69 user 0.03 sys
tarcp /d ncvs: 1379.81 real 5.71 user 128.53 sys
umount /d: 1.04 real 0.00 user 0.30 sys
fsck /dev/rwd2e: 31.27 real 1.31 user 0.70 sys
tar cf /dev/null ncvs: 294.19 real 2.18 user 34.96 sys
ffs-16384-4096:
fsck /dev/rwd2e: 2.46 real 0.41 user 0.01 sys
tarcp /d ncvs: 1359.52 real 5.40 user 121.19 sys
umount /d: 1.06 real 0.00 user 0.31 sys
fsck /dev/rwd2e: 30.66 real 0.98 user 0.72 sys
tar cf /dev/null ncvs: 272.84 real 1.86 user 32.19 sys
ffs-16384-8192: [deleted -- invalid due to insufficient inodes]
ffs-16384-16384: [deleted -- invalid due to insufficient inodes]
ext2fs-1024-1024:
fsck.ext2 /dev/wd2e: [deleted -- invalid due to missing -f]
tarcp /d ncvs: 1519.08 real 4.71 user 77.18 sys
umount /d: 3.73 real 0.00 user 0.32 sys
fsck.ext2 /dev/wd2e: [deleted -- invalid due to missing -f]
tar cf /dev/null ncvs: 231.99 real 2.13 user 33.79 sys
ext2fs-4096-4096:
fsck.ext2 /dev/wd2e: [deleted -- invalid due to missing -f]
tarcp /d ncvs: 1163.38 real 4.62 user 65.75 sys
umount /d: 1.71 real 0.00 user 0.33 sys
fsck.ext2 /dev/wd2e: [deleted -- invalid due to missing -f]
tar cf /dev/null ncvs: 101.68 real 1.81 user 23.96 sys

#!/bin/sh
for b in 4096 8192 16384
do
	for f in $(($b / 8)) $(($b / 4)) $(($b / 2)) $b
	do
		echo ffs-$b-$f: >>/tmp/ztimes
		newfs -b $b -f $f /dev/rwd2e
		echo -n "fsck /dev/rwd2e: " >>/tmp/ztimes
		sync
		time fsck /dev/rwd2e 2>>/tmp/ztimes
		mount /dev/wd2e /d
		cd /home
		echo -n "tarcp /d ncvs: " >>/tmp/ztimes
		sync
		time tarcp /d ncvs 2>>/tmp/ztimes
		echo -n "umount /d: " >>/tmp/ztimes
		time umount /d 2>>/tmp/ztimes
		echo -n "fsck /dev/rwd2e: " >>/tmp/ztimes
		sync
		time fsck /dev/rwd2e 2>>/tmp/ztimes
		mount /dev/wd2e /d
		cd /d
		echo -n "tar cf /dev/null ncvs: " >>/tmp/ztimes
		sync
		time tar cf /dev/null ncvs 2>>/tmp/ztimes
		cd /tmp
		umount /d
	done
done
for b in 1024 4096
do
	for f in $b
	do
		echo ext2fs-$b-$f: >>/tmp/ztimes
		# linux mkfs.ext2 -b $b /dev/rwd2e $((4754368 / ($b / 512)))
		sync
		echo -n "fsck.ext2 /dev/wd2e: " >>/tmp/ztimes
		time fsck.ext2 /dev/wd2e 2>>/tmp/ztimes
		mount -t ext2fs /dev/wd2e /d
		cd /home
		echo -n "tarcp /d ncvs: " >>/tmp/ztimes
		sync
		time tarcp /d ncvs 2>>/tmp/ztimes
		echo -n "umount /d: " >>/tmp/ztimes
		time umount /d 2>>/tmp/ztimes
		echo -n "fsck.ext2 /dev/wd2e: " >>/tmp/ztimes
		sync
		time fsck.ext2 /dev/wd2e 2>>/tmp/ztimes
		mount -t ext2fs /dev/wd2e /d
		cd /d
		echo -n "tar cf /dev/null ncvs: " >>/tmp/ztimes
		sync
		time tar cf /dev/null ncvs 2>>/tmp/ztimes
		cd /tmp
		umount /d
	done
done

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 14 5: 7:21 1999 Delivered-To: freebsd-fs@freebsd.org Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.26.10.9]) by hub.freebsd.org (Postfix) with ESMTP id CA9AE151FB for ; Sat, 14 Aug 1999 05:07:15 -0700 (PDT) (envelope-from bde@godzilla.zeta.org.au) Received: (from bde@localhost) by godzilla.zeta.org.au (8.8.7/8.8.7) id WAA02286; Sat, 14 Aug 1999 22:07:29 +1000 Date: Sat, 14 Aug 1999 22:07:29 +1000 From: Bruce Evans Message-Id: <199908141207.WAA02286@godzilla.zeta.org.au> To: bde@zeta.org.au, freebsd-fs@FreeBSD.ORG Subject: Re: better ffs-4096-512 ... ext2fs-4096-4096 benchmarks Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

>The results are as I expected, except reading ext2fs-4096-4096 is now
>about 2.5 times faster than for the best ffs layout. The throughput of

Actually, there are some much more surprising results:

>ffs-4096-512:
>fsck /dev/rwd2e: 12.56 real 2.59 user 0.18 sys
>tarcp /d ncvs: 1509.69 real 4.76 user 74.37 sys
                                       ^^^^^
>umount /d: 2.00 real 0.00 user 0.29 sys
>fsck /dev/rwd2e: 52.68 real 3.24 user 0.93 sys
>tar cf /dev/null ncvs: 364.42 real 1.73 user 22.31 sys
                                              ^^^^^

The critical system times are about 70 and 20 seconds for ffs-4096-any.

>ffs-8192-1024:
>fsck /dev/rwd2e: 5.93 real 1.27 user 0.13 sys
>tarcp /d ncvs: 1444.24 real 5.07 user 148.23 sys
                                       ^^^^^
>umount /d: 0.51 real 0.00 user 0.30 sys
>fsck /dev/rwd2e: 38.46 real 1.93 user 0.84 sys
>tar cf /dev/null ncvs: 348.17 real 1.84 user 46.74 sys
                                              ^^^^^

The critical system times are almost twice as large for ffs-8192-most. They should be smaller.

>ffs-8192-4096:
>fsck /dev/rwd2e: 2.68 real 0.52 user 0.05 sys
>tarcp /d ncvs: 1379.23 real 5.41 user 123.14 sys
>umount /d: 1.13 real 0.00 user 0.31 sys
>fsck /dev/rwd2e: 34.02 real 1.14 user 0.70 sys
>tar cf /dev/null ncvs: 260.32 real 1.59 user 20.33 sys
                                              ^^^^^

Here's one reasonable system time for ffs-8192-*. `tar cf /dev/null' of the original /home/ncvs takes 828.85 real, 1.89 user, 21.90 sys, so it accounts for about half of the real times for tarcp but not much of the system times.
Bruce

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 14 12:37:17 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131]) by hub.freebsd.org (Postfix) with ESMTP id CF61A14CB9 for ; Sat, 14 Aug 1999 12:37:09 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id VAA12287 for ; Sat, 14 Aug 1999 21:35:03 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: freebsd-fs@freebsd.org Subject: disk performance model Date: Sat, 14 Aug 1999 21:35:03 +0200 Message-ID: <12285.934659303@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

I have spent a few hours trying to figure out a model for the access time of one of my disks. (I wondered to what extent and in what way the length of a seek affected the time it took to perform it.) I set up a small program to read random places on the disk, timing each request, and logging the previous sector number, this sector number and the time it took. The program ran in user space so there is some finite overhead included in the below model for the userland to kernel transition. I collected about 100k samples for various transfer sizes. This information may or may not give any meaning/inspiration/insight in the current performance discussions...

The disk is in a PII/400MHz system:

ata1: master: setting up UDMA2 mode on PIIX4 chip OK
ad1: ATA-4 disk at ata1 as master
ad1: 17206MB (35239680 sectors), 34960 cyls, 16 heads, 63 S/T, 512 B/S
ad1: piomode=4, dmamode=2, udmamode=2
ad1: 16 secs/int, 31 depth queue, DMA mode

Model variables:
	size of transfer in sectors -> SZ
	number of sectors between last request and next one -> D

Model:
	if (D < 1.25e7)
		Tseek = D ** .42 * 8.05e-6 + .00175
	else
		Tseek = (D - 1.25e7) * 2.4e-10 + .00945
	Trotation = random(0 ... .0085)
	Taccess = Tseek + Trotation + 36e-9 * SZ

Comments: It should be noted that the place where this model has the worst fit is where it is most interesting: values of D < 1.5e6, where the model underpredicts by up to a millisecond. Above 1.5e6 the model predicts better than +/- 500usec. The disk has larger media transfer rate rimwards than hubwards, but this doesn't manifest itself in the data for SZ < 100. I have no idea what this means in terms of UFS/FFS parameters...

-- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
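[phk's model transcribes directly into code; this sketch only restates the formulas above (units are seconds, D is the seek distance in sectors, SZ the transfer size in sectors), with drand48() standing in for the uniform rotational delay. Compile with -lm:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double
tseek(double D)
{
	if (D < 1.25e7)
		return (pow(D, .42) * 8.05e-6 + .00175);
	return ((D - 1.25e7) * 2.4e-10 + .00945);
}

static double
taccess(double D, double SZ)
{
	double trot = drand48() * .0085;	/* random(0 ... .0085) */

	return (tseek(D) + trot + 36e-9 * SZ);
}

int
main(void)
{
	/* e.g. a 1e6-sector seek with a 16-sector transfer */
	printf("%.6f seconds\n", taccess(1e6, 16));
	return (0);
}

Remember phk's caveat that the fit is worst (underpredicting by up to a millisecond) exactly in the interesting region D < 1.5e6.]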
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 14 13:32:36 1999 Delivered-To: freebsd-fs@freebsd.org Received: from frmug.org (frmug-gw.frmug.org [193.56.58.252]) by hub.freebsd.org (Postfix) with ESMTP id 097F714D4E for ; Sat, 14 Aug 1999 13:32:29 -0700 (PDT) (envelope-from roberto@keltia.freenix.fr) Received: (from uucp@localhost) by frmug.org (8.9.1/frmug-2.3/nospam) with UUCP id WAA02272 for freebsd-fs@FreeBSD.ORG; Sat, 14 Aug 1999 22:32:28 +0200 (CEST) (envelope-from roberto@keltia.freenix.fr) Received: by keltia.freenix.fr (Postfix, from userid 101) id 54E32870B; Sat, 14 Aug 1999 19:44:22 +0200 (CEST) Date: Sat, 14 Aug 1999 19:44:22 +0200 From: Ollivier Robert To: freebsd-fs@FreeBSD.ORG Subject: Re: Help with understand file system performance Message-ID: <19990814194422.A13802@keltia.freenix.fr> Mail-Followup-To: freebsd-fs@FreeBSD.ORG References: <19990813152529.G12312@hal.mpn.cp.philips.com> <199908131853.LAA22289@usr09.primenet.com> <19990814105029.A28461@hal.mpn.cp.philips.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii User-Agent: Mutt/0.95.5i In-Reply-To: <19990814105029.A28461@hal.mpn.cp.philips.com>; from Jos Backus on Sat, Aug 14, 1999 at 10:50:29AM +0200 X-Operating-System: FreeBSD 4.0-CURRENT/ELF ctm#5543 AMD-K6 MMX @ 200 MHz Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org According to Jos Backus: > A very different beast indeed (I used to admin some AIX boxes). This is what > logfs was supposed to be like (correct me if I'm wrong). If you mean LFS then no. LFS is a log file system (the entire FS is a log) whereas JFS is a journalling FS (i.e. a FS with a journal). They're quite different beasts. -- Ollivier ROBERT -=- FreeBSD: The Power to Serve! 
-=- roberto@keltia.freenix.fr FreeBSD keltia.freenix.fr 4.0-CURRENT #73: Sat Jul 31 15:36:05 CEST 1999 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 14 14:57:55 1999 Delivered-To: freebsd-fs@freebsd.org Received: from gw-nl3.philips.com (gw-nl3.philips.com [192.68.44.35]) by hub.freebsd.org (Postfix) with ESMTP id 28B2B1525F for ; Sat, 14 Aug 1999 14:57:34 -0700 (PDT) (envelope-from Jos.Backus@nl.origin-it.com) Received: from smtprelay-nl1.philips.com (localhost.philips.com [127.0.0.1]) by gw-nl3.philips.com with ESMTP id XAA23267 for ; Sat, 14 Aug 1999 23:57:53 +0200 (MEST) (envelope-from Jos.Backus@nl.origin-it.com) Received: from smtprelay-eur1.philips.com(130.139.36.3) by gw-nl3.philips.com via mwrap (4.0a) id xma023264; Sat, 14 Aug 99 23:57:53 +0200 Received: from hal.mpn.cp.philips.com (hal.mpn.cp.philips.com [130.139.64.195]) by smtprelay-nl1.philips.com (8.9.3/8.8.5-1.2.2m-19990317) with SMTP id XAA09316 for ; Sat, 14 Aug 1999 23:57:53 +0200 (MET DST) Received: (qmail 36343 invoked by uid 666); 14 Aug 1999 21:58:16 -0000 Date: Sat, 14 Aug 1999 23:58:16 +0200 From: Jos Backus To: Ollivier Robert Cc: freebsd-fs@FreeBSD.ORG Subject: Re: Help with understand file system performance Message-ID: <19990814235816.A36250@hal.mpn.cp.philips.com> Reply-To: Jos Backus References: <19990813152529.G12312@hal.mpn.cp.philips.com> <199908131853.LAA22289@usr09.primenet.com> <19990814105029.A28461@hal.mpn.cp.philips.com> <19990814194422.A13802@keltia.freenix.fr> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.6i In-Reply-To: <19990814194422.A13802@keltia.freenix.fr>; from Ollivier Robert on Sat, Aug 14, 1999 at 07:44:22PM +0200 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Sat, Aug 14, 1999 at 07:44:22PM +0200, Ollivier Robert wrote: > If you mean LFS then no. LFS is a log file system (the entire FS is a log) > whereas JFS is a journalling FS (i.e. a FS with a journal). They're quite > different beasts. OK, point taken. As I understand it, jfs uses a log aka journal (usually hd8, type jfslog, in the std rootvg after installation) to prerecord metadata updates for the logical volumes in that volume group. I'm sure there's more to it but that's all I can remember now. Reading the paper by Margo Seltzer on LFS is still on my todo list. Cheers, -- Jos Backus _/ _/_/_/ "Reliability means never _/ _/ _/ having to say you're sorry." _/ _/_/_/ -- D. J. Bernstein _/ _/ _/ _/ Jos.Backus@nl.origin-it.com _/_/ _/_/_/ use Std::Disclaimer; To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message