From owner-freebsd-fs Mon Sep 30 04:51:38 1996
From: Clark Gaylord
Date: Mon, 30 Sep 1996 07:49:28 -0400 (EDT)
To: freebsd-questions@freebsd.org
Cc: freebsd-fs@freebsd.org
Subject: Status of mounting HPFS drives

Hello FreeBSD-land --

I have installed FreeBSD 2.1.5 and I am generally pleased with it. However, I have over 1GB in my OS/2 HPFS partitions that I would like to access. After searching the archives, I see that there is periodically some banter about this, but the issue has not been raised recently. Is there any way to do this, including using some Linux program (though I'd prefer real FreeBSD)? I would prefer to use FreeBSD rather than Linux, but this issue is important enough that I might have to switch. A substantial number of the OS/2 users I have known have either switched to Linux or run both; I think it would be valuable to the FreeBSD effort if it were also an option for these people.

I will gladly summarize to the list any private email that is valuable. Thank you.

Clark Gaylord
Blacksburg, VA
cgaylord@vt.edu

From owner-freebsd-fs Mon Sep 30 06:00:00 1996
From: Doug Rabson
Date: Mon, 30 Sep 1996 13:59:21 +0100 (BST)
To: fs@freebsd.org, lite2@freebsd.org
Subject: Lite2 filesystem code needs testing

I think that the Lite2 merge has reached a stage where it is stable enough for testing. The kernel boots (and shuts down!) cleanly, and all of the filesystems except msdosfs and devfs have been converted to the new regime. I have seen it panic a couple of times, but it is good enough to run xemacs and a kernel compile.

I just added some code to make it easier to move from a -current system to a lite2 system - add COMPAT_PRELITE2 to your kernel config. This is pretty minimal compatibility - just enough to get -current's getvfsent() and mount_nfs to work. One known problem: fsck on a dirty root filesystem seems to fail to remount the filesystem, causing annoying double reboots after a panic.

Two major work items need to be completed before this code can go into -current: make a LINT kernel compile with no warnings, and complete a 'make world'.

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426
From owner-freebsd-fs Mon Sep 30 06:26:39 1996
From: Poul-Henning Kamp
Date: Mon, 30 Sep 1996 15:25:47 +0200
To: Doug Rabson
Cc: fs@freebsd.org, lite2@freebsd.org
Subject: Re: Lite2 filesystem code needs testing

In message, Doug Rabson writes:

Cool.

Why don't you make a task-list, check it in, and people can grab tasks from there?

--
Poul-Henning Kamp | phk@FreeBSD.ORG FreeBSD Core-team.
http://www.freebsd.org/~phk | phk@login.dknet.dk Private mailbox.
whois: [PHK] | phk@ref.tfs.com TRW Financial Systems, Inc.
Future will arrive by its own means, progress not so.

From owner-freebsd-fs Mon Sep 30 06:39:15 1996
From: "John S. Dyson"
Date: Mon, 30 Sep 1996 08:37:50 -0500 (EST)
To: phk@critter.tfs.com (Poul-Henning Kamp)
Cc: dfr@render.com, fs@freebsd.org, lite2@freebsd.org
Subject: Re: Lite2 filesystem code needs testing

> Why don't you make a task-list, check it in, and people can grab tasks
> from there?

Good idea.

John

From owner-freebsd-fs Mon Sep 30 06:49:32 1996
From: Bruce Evans
Date: Mon, 30 Sep 1996 23:38:50 +1000
To: dfr@render.com, fs@freebsd.org, lite2@freebsd.org
Subject: Re: Lite2 filesystem code needs testing

>Two major work items need to be completed before this code can go into
>-current: make a LINT kernel compile with no warnings and complete a

That would be 5000 fewer lines of warnings than for -current itself :-]. gcc-2.7 emits about 4800 new ones.

Bruce
From owner-freebsd-fs Mon Sep 30 07:25:24 1996
From: Doug Rabson
Date: Mon, 30 Sep 1996 15:24:42 +0100 (BST)
To: Poul-Henning Kamp
Cc: fs@freebsd.org, lite2@freebsd.org
Subject: Re: Lite2 filesystem code needs testing

On Mon, 30 Sep 1996, Poul-Henning Kamp wrote:
> Why don't you make a task-list, check it in, and people can grab tasks
> from there?

I just committed a TODO list. Jeffrey, you might want to look at my list and add or remove stuff as appropriate.

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426

From owner-freebsd-fs Tue Oct 1 20:01:30 1996
From: Heo Sung-Gwan
Date: Wed, 2 Oct 1996 12:00:00 +0900 (KST)
To: freebsd-fs@FreeBSD.ORG
Subject: nbuf in buffer cache

Hi,

I am curious about the number of buffers (= nbuf) in the buffer cache. The variable nbuf is determined in i386/i386/machdep.c as follows:

    #ifdef NBUF
    int nbuf = NBUF;
    #else
    int nbuf = 0;
    #endif
    ...
    void
    cpu_startup()
    {
        ...
        if (nbuf == 0) {
            nbuf = 30;
            if (physmem > 1024)
                nbuf += min((physmem - 1024) / 12, 1024);
        }
        ...
    }

If NBUF is not defined and physical memory is less than 1024 pages (= 4Mbytes), then nbuf becomes 30; otherwise nbuf is 30 + min((physmem - 1024) / 12, 1024).

Why is the number of buffers calculated in this fashion? Do 30 buffers, 1024 pages, and division by 12 have special meaning? There is no comment in the source code.

In addition, if there are no user application processes, how many buffers are enough to run the system without degrading its performance? Only 30 buffers? Or as many as possible?

Please let me know.

--
Heo Sung-Gwan
Dept. of Computer Science, Sogang University, Seoul, Korea.
E-mail: heo@cslsun10.sogang.ac.kr
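A stand-alone sketch of the formula above, for reference (assuming the i386 page size of 4K; calc_nbuf and the sample sizes are illustrative only, not kernel code):

    /* Evaluate the cpu_startup() sizing rule for a few memory sizes. */
    #include <stdio.h>

    static int min(int a, int b) { return a < b ? a : b; }

    static int calc_nbuf(int physmem)       /* physmem is in 4K pages */
    {
        int nbuf = 30;
        if (physmem > 1024)
            nbuf += min((physmem - 1024) / 12, 1024);
        return nbuf;
    }

    int main(void)
    {
        int mb;
        for (mb = 4; mb <= 64; mb *= 2)
            printf("%2d MB -> nbuf = %d\n", mb,
                   calc_nbuf(mb * 256));    /* 256 pages per MB */
        /* prints 30, 115, 286, 627, 1054; the cap of 1054 is
           reached at 52 MB and above */
        return 0;
    }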
From owner-freebsd-fs Tue Oct 1 20:25:36 1996
From: "John S. Dyson"
Date: Tue, 1 Oct 1996 22:23:46 -0500 (EST)
To: heo@cslsun10.sogang.ac.kr (Heo Sung-Gwan)
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: nbuf in buffer cache

> Why is the number of buffers calculated in this fashion? Do 30 buffers,
> 1024 pages, and division by 12 have special meaning? There is no comment
> in the source code.

Experience shows that this is a good number. 30 buffers is a good minimum on a very small system. There have been problems in earlier code (and perhaps even -current) when running with fewer than 10 buffers.

> In addition, if there are no user application processes, how many buffers
> are enough to run the system without degrading its performance? Only 30
> buffers? Or as many as possible?

The performance on a small system is poor (IMO) anyway. Adding more buffers will take more memory from runnable processes. Generally, common wisdom and practice show that it is best to minimize paging. 30 buffers represent approx 240K (on a normally configured filesystem.) If there is more free memory, the system will store cached data in memory not associated with buffers; on a 4MB system this is uncommon, though. Unlike the other *BSDs, the buffer cache isn't the only place that I/O-cached data is stored. On FreeBSD the buffer cache is best thought of as a mapping cache, and also as the upper limit on dirty buffer space. Free memory is used for caching both file data and unused memory segments (.text, ...).

John

From owner-freebsd-fs Tue Oct 1 21:54:19 1996
From: Bruce Evans
Date: Wed, 2 Oct 1996 14:50:39 +1000
To: heo@cslsun10.sogang.ac.kr, toor@dyson.iquest.net
Cc: freebsd-fs@FreeBSD.org
Subject: Re: nbuf in buffer cache

>Experience shows that this is a good number. 30 buffers is a good minimum
>on a very small system. There have been problems in earlier code (and
>perhaps even -current) when running with fewer than 10 buffers.
>
>The performance on a small system is poor (IMO) anyway. Adding more
>buffers will take more memory from runnable processes. [...] 30 buffers
>represent approx 240K (on a normally configured filesystem.)

Experience showed that 240K is about right for a 2MB system running FreeBSD 1.x, but 30 buffers is far too small. For file systems with a block size of 512 (e.g. msdos floppies), it can cache a whole 15K. For normal ufs file systems with a fragment size of 1K, 1K fragments are common for directories.

>If there is more free memory, the system will store cached data in memory
>not associated with buffers. [...] On FreeBSD the buffer cache is best
>thought of as a mapping cache, and also as the upper limit on dirty
>buffer space.

Now 240K is probably too much for metadata alone, but 30 buffers is still too small. Metadata blocks are usually small, so 30 buffers usually limits the amount of metadata cached to much less than 240K.

Bruce

From owner-freebsd-fs Tue Oct 1 22:14:56 1996
From: "John S. Dyson"
Date: Wed, 2 Oct 1996 00:11:16 -0500 (EST)
To: bde@zeta.org.au (Bruce Evans)
Cc: heo@cslsun10.sogang.ac.kr, freebsd-fs@FreeBSD.org
Subject: Re: nbuf in buffer cache

> Experience showed that 240K is about right for a 2MB system running
> FreeBSD 1.x, but 30 buffers is far too small. [...]
> Now 240K is probably too much for metadata alone, but 30 buffers is
> still too small. Metadata blocks are usually small, so 30 buffers
> usually limits the amount of metadata cached to much less than 240K.

So, you would trade paging for file buffering? I don't think so. Firstly, the MSDOS filesystem is a degenerate case. Many programs have a very steep curve: if you are running low on memory, they will cause thrashing. DG and I found that it is very important to make sure that GCC can have as much memory as possible. If (and it is a very big if) there is free (spare) memory, the system will provide it in the form of the merged VM object cache. Note also that the system prefers to keep metadata in the cache, and to push file data to the VM objects. It is then the best of both worlds.

So, to be precise, limiting the number of buffers keeps the freedom maximized. The larger the number of buffers, the greater the chance that there will be too much wired memory for an application. I have found that the knee for gcc appears to be about 2M (plus or minus), and it is very sharp. If you restrict the amount of memory even by 100K-200K, compile times go through the roof.

Additionally, the issue of MSDOS having a very small cache size isn't valid; the cache is limited by the total amount of available memory.

John

From owner-freebsd-fs Tue Oct 1 23:25:09 1996
From: Bruce Evans
Date: Wed, 2 Oct 1996 16:18:46 +1000
To: bde@zeta.org.au, dyson@freebsd.org
Cc: freebsd-fs@freebsd.org, heo@cslsun10.sogang.ac.kr
Subject: Re: nbuf in buffer cache

>So, you would trade paging for file buffering? I don't think so.

No, but allocate enough buffers to hold the memory that you're willing to allocate for (non VM-object) buffering. nbuf = memory_allowed / DEV_BSIZE is too many for static allocation, so dynamic allocation is required. sizeof(struct buf) is now 212, so the worst case should only have nbuf = memory_allowed / (512 + 212). (struct buf is bloated. In my first implementation of buffering, for an 8-bit system, DEV_BSIZE was 256 and sizeof(struct buf) was 13, and I thought that the 5% overhead was high. Sigh.)

>So, to be precise, limiting the number of buffers keeps the freedom
>maximized. The larger the number of buffers, the greater the chance that
>there will be too much wired memory for an application.
Limiting the number of buffers instead of limiting the memory allocated for the buffers sometimes gives more freedom because less memory is allocated, but it is better to limit the amount allocated explicitly.

Bruce
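The arithmetic behind this suggestion, spelled out (the 240K budget is the figure from earlier in the thread; everything else follows from the sizes Bruce gives):

    /* Worst case: 512-byte (DEV_BSIZE) data blocks, each with a
       212-byte struct buf header. */
    #include <stdio.h>

    int main(void)
    {
        long memory_allowed = 240 * 1024L;          /* 240K budget */
        long nbuf = memory_allowed / (512 + 212);
        printf("nbuf = %ld\n", nbuf);               /* 339, vs. 30 today */
        return 0;
    }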
From owner-freebsd-fs Wed Oct 2 06:34:12 1996
From: "John S. Dyson"
Date: Wed, 2 Oct 1996 08:33:12 -0500 (EST)
To: bde@zeta.org.au (Bruce Evans)
Cc: dyson@FreeBSD.org, freebsd-fs@FreeBSD.org, heo@cslsun10.sogang.ac.kr
Subject: Re: nbuf in buffer cache

> No, but allocate enough buffers to hold the memory that you're willing
> to allocate for (non VM-object) buffering. [...] sizeof(struct buf) is
> now 212, so the worst case should only have nbuf = memory_allowed /
> (512 + 212). (struct buf is bloated. In my first implementation of
> buffering, for an 8-bit system, DEV_BSIZE was 256 and sizeof(struct buf)
> was 13, and I thought that the 5% overhead was high. Sigh.)

If you can figure out a way to shrink our current buffers to 13 bytes instead of just under 256, please do so. They are NOT easy to shrink. I think what the smaller buffer headers did must have been quite different from what we have now. Remember also that the amount of buffering space is not limited by the number of buffers!!! The buffers are now mostly for temporary mappings and pending writes. The only other required purpose for buffers is caching directories; there is a bias to keep the directories in the buffers.

> Limiting the number of buffers instead of limiting the memory allocated
> for the buffers sometimes gives more freedom because less memory is
> allocated, but it is better to limit the amount allocated explicitly.

The mechanism exists in our current vfs_bio to support that. In fact, if you notice, the amount of memory used by vfs_bio is limited to nbuf * 8K. If you have 16k buffers, it is still limited to nbuf * 8k, so the number of buffers (again, not the buffering space) is halved for larger buffers. You can re-tune those parameters for the small-block filesystems. Of course, such file systems encounter many other inefficiencies in normal operation also. (In other words, IMO, msdosfs as it is currently written is not very fast anyway.) Remember, my argument against excessive numbers of buffers is mostly for small systems (i.e. 4M); those systems are just not very effective at caching.

The case that I am most worried about is 4k/8k ufs systems (the ones most used). I do not think that wiring down large amounts of memory is a wise idea. If you are complaining about an excessive buffer header size, then there is an opportunity to work on it. (Actually, shouldn't an MSDOSFS use the cluster size instead of 512 anyway? We have no problem handling 32k buffers, if you need them (minor tunable).) There is another opportunity to work on solving the MSDOS problem and getting the best of both worlds (bigger buffer support for more wired-down caching, without taking excessive memory). Directories are still small, though, but we have different-sized buffers on UFS also.

I would suggest also, when/if you make a decision to change the way that buffer sizes/buffer memory is calculated, please consider the case of 8k UFS (the default). Also, I think that many of the small systems belong to "non-wealthy students" who would like to compile programs with gcc. It is already slow, and making it slower is not good.

John

From owner-freebsd-fs Wed Oct 2 07:18:26 1996
From: Heo Sung-Gwan
Date: Wed, 2 Oct 1996 23:17:03 +0900 (KST)
To: freebsd-hackers@FreeBSD.ORG
Cc: freebsd-fs@FreeBSD.ORG
Subject: vnode and cluster read-ahead

When a file is open several times simultaneously, cluster read-ahead in the buffer cache, which keeps its state in the vnode, seems to have a problem.

As a process A reads a file F *sequentially*, the read-ahead fields (v_maxra, v_ralen, etc.) of the vnode of F increase, and as a result the next cluster is read ahead. But when a process B opens F and reads it, the values of those fields are changed, so process A's read-ahead is disturbed whenever process B is rescheduled.

I think the fields for read-ahead should be in struct file rather than in the vnode. There exists one vnode for a file, but a file may be open several times.

What's your opinion, hackers?

--
Heo Sung-Gwan
Dept. of Computer Science, Sogang University, Seoul, Korea.
E-mail: heo@cslsun10.sogang.ac.kr
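A toy model of the interference being reported (v_maxra and v_ralen are the field names from the report; the update logic here is a simplification for illustration, not the real cluster_read()):

    /* Sequential-read state lives in the vnode, so every open of the
       file shares it. */
    struct vnode {
        long v_ralen;   /* length of the current read-ahead run */
        long v_maxra;   /* last block read ahead */
    };

    static void note_read(struct vnode *vp, long blkno)
    {
        if (blkno == vp->v_maxra + 1)
            vp->v_ralen++;      /* looks sequential: grow the run */
        else
            vp->v_ralen = 0;    /* looks random: reset the run,
                                   clobbering the other reader */
        vp->v_maxra = blkno;
    }

    int main(void)
    {
        struct vnode f = { 4, 99 };
        note_read(&f, 100);     /* process A, sequential: run grows    */
        note_read(&f, 5);       /* process B reads elsewhere: reset    */
        note_read(&f, 101);     /* A again: no longer looks sequential */
        return 0;
    }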
From owner-freebsd-fs Wed Oct 2 07:40:31 1996
From: John Dyson
Date: Wed, 2 Oct 1996 09:39:42 -0500 (EST)
To: heo@cslsun10.sogang.ac.kr (Heo Sung-Gwan)
Cc: freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

> When a file is open several times simultaneously, cluster read-ahead in
> the buffer cache, which keeps its state in the vnode, seems to have a
> problem.

You are right.

> I think the fields for read-ahead should be in struct file rather than
> in the vnode. There exists one vnode for a file, but a file may be open
> several times.

That is closer to correct. I am not sure that the struct file is correct either, but I think that you are on the right track.

John

From owner-freebsd-fs Wed Oct 2 07:51:28 1996
From: David Greenman
Date: Wed, 02 Oct 1996 07:52:03 -0700
To: Heo Sung-Gwan
Cc: freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

>As a process A reads a file F *sequentially*, the read-ahead fields
>(v_maxra, v_ralen, etc.) of the vnode of F increase, and as a result the
>next cluster is read ahead. But when a process B opens F and reads it,
>the values of those fields are changed, so process A's read-ahead is
>disturbed whenever process B is rescheduled.

First, this is a very rare situation that almost never occurs in practice.
If the file was just read, it will usually still be in the cache (assuming it's not too large), so all references will be satisfied out of the cache and the clustering policy won't matter. However, in the case of it not fitting in the cache, the system will not optimize for sequential reads because they are no longer entirely sequential... so I think the current algorithm is doing the right thing. The whole dynamic read-ahead mechanism is just an optimization in any case, and is new with 4.4BSD.

Even if we did want to change it, there really isn't a way to do what you're suggesting above - you don't have access to the file struct at the level where the clustering decision is made. You'd have to change the code to propagate clustering hints from the read/write system calls. Yuck.

-DG

David Greenman
Core-team/Principal Architect, The FreeBSD Project

From owner-freebsd-fs Wed Oct 2 07:58:26 1996
From: Poul-Henning Kamp
Date: Wed, 02 Oct 1996 16:57:24 +0200
To: dyson@freebsd.org
Cc: heo@cslsun10.sogang.ac.kr (Heo Sung-Gwan), freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

In message <199610021439.JAA00980@dyson.iquest.net>, John Dyson writes:

>> I think the fields for read-ahead should be in struct file rather than
>> in the vnode. There exists one vnode for a file, but a file may be open
>> several times.
>
>That is closer to correct. I am not sure that the struct file is correct
>either, but I think that you are on the right track.

No, I don't agree. Process B will most likely find all it needs in the buffer cache, and thus will not need read-ahead at all.

How to implement this is not clear to me, but I think the best way would be to calculate the parameters, and only if they extend the current read-ahead (v_maxra...) will they be employed. This would gracefully handle the case where process B overtakes process A in reading the file.

--
Poul-Henning Kamp | phk@FreeBSD.ORG FreeBSD Core-team.
http://www.freebsd.org/~phk | phk@login.dknet.dk Private mailbox.
whois: [PHK] | phk@ref.tfs.com TRW Financial Systems, Inc.
Future will arrive by its own means, progress not so.
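One way to read this suggestion, in the same toy model as the earlier note_read() sketch (the guard is the only new part, and it is an illustration, not Poul-Henning's code):

    /* Only let a read update the shared state if it would extend the
       current read-ahead; a reader that is behind leaves the leading
       reader's state alone, and one that overtakes it takes over. */
    static void note_read_guarded(struct vnode *vp, long blkno)
    {
        if (blkno <= vp->v_maxra)
            return;             /* already read ahead past here */
        note_read(vp, blkno);
    }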
From owner-freebsd-fs Wed Oct 2 08:21:05 1996
From: John Dyson
Date: Wed, 2 Oct 1996 10:20:48 -0500 (EST)
To: phk@critter.tfs.com (Poul-Henning Kamp)
Cc: dyson@freebsd.org, heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

> No, I don't agree. Process B will most likely find all it needs in the
> buffer cache, and thus will not need read-ahead at all.

I agree with the term "likely", but it is possible that two processes are not reading the entire file sequentially. Also, it is possible that the file size is much bigger than main memory, thereby busting the cache. Read-ahead is then the only performance improvement to be had. Nowadays, I think that drives actually have segmented read-ahead caches as well; we don't, though.

John
From owner-freebsd-fs Wed Oct 2 08:23:04 1996
From: John Dyson
Date: Wed, 2 Oct 1996 10:22:16 -0500 (EST)
To: dg@Root.COM
Cc: heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

> Even if we did want to change it, there really isn't a way to do what
> you're suggesting above - you don't have access to the file struct at
> the level where the clustering decision is made. You'd have to change
> the code to propagate clustering hints from the read/write system calls.
> Yuck.

I actually thought about doing it (and may have discussed it with you, David), and I think that was my conclusion also. The existing interfaces don't conveniently support it.

John

From owner-freebsd-fs Wed Oct 2 10:25:12 1996
From: Doug Rabson
Date: Wed, 2 Oct 1996 18:21:13 +0100 (BST)
To: dyson@freebsd.org
Cc: Poul-Henning Kamp, heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

On Wed, 2 Oct 1996, John Dyson wrote:
> I agree with the term "likely", but it is possible that two processes
> are not reading the entire file sequentially. Also, it is possible that
> the file size is much bigger than main memory, thereby busting the
> cache. Read-ahead is then the only performance improvement to be had.

You could maintain a number of 'pending readahead' structures indexed by vnode and block number. Each call to cluster_read would check for a pending readahead by hashing. For efficiency, keep a pointer to the last readahead structure used by cluster_read in the vnode, in place of the existing in-vnode readahead data. It should be no slower than the current system for single-process reads, and it saves 4 bytes per vnode :-).

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426
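A data-structure sketch of this scheme (all names invented for illustration; the 'last' hint parameter stands in for the one pointer Doug would keep in the vnode):

    /* Hypothetical 'pending readahead' records, hashed by
       (vnode, block number). */
    #define RA_HASH_SIZE 128

    struct vnode;                           /* opaque here */

    struct readahead {
        struct readahead *ra_next;          /* hash chain */
        struct vnode     *ra_vp;            /* which file */
        long              ra_blkno;         /* next block this run expects */
        long              ra_len;           /* current run length */
    };

    static struct readahead *ra_hash[RA_HASH_SIZE];

    static unsigned ra_hashfn(struct vnode *vp, long blkno)
    {
        return ((unsigned long)vp ^ (unsigned long)blkno) % RA_HASH_SIZE;
    }

    /* Find the run that a read of blkno would continue.  'last' is the
       per-vnode hint pointer; it short-circuits the common single-reader
       case so no hashing is done at all. */
    static struct readahead *
    ra_lookup(struct vnode *vp, long blkno, struct readahead *last)
    {
        struct readahead *ra;

        if (last != NULL && last->ra_vp == vp && last->ra_blkno == blkno)
            return last;
        for (ra = ra_hash[ra_hashfn(vp, blkno)]; ra != NULL; ra = ra->ra_next)
            if (ra->ra_vp == vp && ra->ra_blkno == blkno)
                break;
        return ra;
    }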
From owner-freebsd-fs Wed Oct 2 10:40:45 1996
From: John Dyson
Date: Wed, 2 Oct 1996 12:40:10 -0500 (EST)
To: dfr@render.com (Doug Rabson)
Cc: dyson@freebsd.org, phk@critter.tfs.com, heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

> You could maintain a number of 'pending readahead' structures indexed by
> vnode and block number. Each call to cluster_read would check for a
> pending readahead by hashing. [...]

Pretty cool idea. I am remembering now that this deficiency in our read-ahead code is well known. This might be something really good for 2.3 or 3.1 :-). (Unless someone else wants to implement it -- hint hint :-)).

John

From owner-freebsd-fs Wed Oct 2 14:59:37 1996
From: Terry Lambert
Date: Wed, 2 Oct 1996 14:57:30 -0700 (MST)
To: heo@cslsun10.sogang.ac.kr (Heo Sung-Gwan)
Cc: freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

> I think the fields for read-ahead should be in struct file rather than
> in the vnode. There exists one vnode for a file, but a file may be open
> several times.
>
> What's your opinion, hackers?

Matt Day noted this problem some time ago. The problem increases when you have multiple threads in a single process with conflicting access domains. One solution would be to "trust cache locality to work". This is not very satisfying for read-ahead.

Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present or previous employers.

From owner-freebsd-fs Thu Oct 3 01:02:32 1996
From: Mikael Karpberg
Date: Thu, 3 Oct 1996 10:05:45 +0200 (MET DST)
To: dyson@FreeBSD.org
Cc: bde@zeta.org.au, heo@cslsun10.sogang.ac.kr, freebsd-fs@FreeBSD.org
Subject: Re: nbuf in buffer cache

Hi!

> If (and it is a very big if) there is free (spare) memory, the system
> will provide it in the form of the merged VM object cache. Note also
> that the system prefers to keep metadata in the cache, and to push file
> data to the VM objects. It is then the best of both worlds.
> So, to be precise, limiting the number of buffers keeps the freedom
> maximized. The larger the number of buffers, the greater the chance
> that there will be too much wired memory for an application. [...]
>
> Additionally, the issue of MSDOS having a very small cache size isn't
> valid; the cache is limited by the total amount of available memory.

Umm... hold on a second, here... :-) I always thought Linux etc. used all free memory for disk caching, and that the BSDs used a formula (basically some percentage of the available memory) to determine the size of a static buffer, used as a disk cache. Now... it makes sense that this changes when you use a merged disk cache and VM system. Someone let me in on how things work? :-)

/Mikael

From owner-freebsd-fs Thu Oct 3 04:16:52 1996
From: Doug Rabson
Date: Thu, 3 Oct 1996 10:35:19 +0100 (BST)
To: dyson@freebsd.org
Cc: phk@critter.tfs.com, heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

On Wed, 2 Oct 1996, John Dyson wrote:
> Pretty cool idea. I am remembering now that this deficiency in our
> read-ahead code is well known. This might be something really good for
> 2.3 or 3.1 :-). (Unless someone else wants to implement it -- hint
> hint :-)).

On the subject of saving memory, I firmly believe that significant performance improvements can be made just by reducing the memory footprint of algorithms. In our 3D graphics work, we have found that making important data structures fit into cache lines (and using an aligning allocator to make sure that they start on cache line boundaries) can improve performance by as much as 20%.

When future processors from Intel have clock speeds of 400MHz and above but a 75MHz memory bus to the level 2 cache, this will become even more important - towards the end of '97 and beyond, the size of a piece of software and its memory usage patterns will dominate its performance profile. Maybe we need a 'Campaign for Small Software' :-).

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426
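A sketch of the aligning-allocator trick mentioned above (32-byte lines, as on the Pentium; the over-allocate-and-round-up approach and both function names are just one illustrative way to do it):

    #include <stdlib.h>

    #define CACHE_LINE 32

    /* Return CACHE_LINE-aligned memory; the original malloc() pointer
       is stashed just below the returned block so it can be freed. */
    static void *cache_aligned_alloc(size_t size)
    {
        char *raw, *p;

        raw = malloc(size + CACHE_LINE + sizeof(void *));
        if (raw == NULL)
            return NULL;
        p = raw + sizeof(void *);
        p += CACHE_LINE - ((unsigned long)p % CACHE_LINE);
        ((void **)p)[-1] = raw;             /* for cache_aligned_free() */
        return p;
    }

    static void cache_aligned_free(void *p)
    {
        free(((void **)p)[-1]);
    }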
From owner-freebsd-fs Thu Oct 3 06:14:21 1996
From: "John S. Dyson"
Date: Thu, 3 Oct 1996 08:12:30 -0500 (EST)
To: dfr@render.com (Doug Rabson)
Cc: dyson@freebsd.org, phk@critter.tfs.com, heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

> On the subject of saving memory, I firmly believe that significant
> performance improvements can be made just by reducing the memory
> footprint of algorithms. [...]

The pmap code is a perfect example of that. There are times when I have "improved" the code and noted a net slowdown, because it had grown. Soon, I intend to chop another 1-2k out of pmap.o. Smaller is definitely better sometimes.

John

From owner-freebsd-fs Thu Oct 3 06:20:43 1996
From: "John S. Dyson"
Date: Thu, 3 Oct 1996 08:20:17 -0500 (EST)
To: karpen@ocean.campus.luth.se (Mikael Karpberg)
Cc: dyson@FreeBSD.org, bde@zeta.org.au, heo@cslsun10.sogang.ac.kr, freebsd-fs@FreeBSD.org
Subject: Re: nbuf in buffer cache

> Umm... hold on a second, here... :-) I always thought Linux etc. used
> all free memory for disk caching, and that the BSDs used a formula
> (basically some percentage of the available memory) to determine the
> size of a static buffer, used as a disk cache. [...] Someone let me in
> on how things work? :-)

FreeBSD uses all of available memory for disk cache (it has actually had a true merged VM/buffer cache longer than Linux). Linux has used a dynamic buffer cache for a long time, though (which is technically different).
The only type of data that must be in a buffer is directory info. I am about ready to consider 2x-3x the number of buffers and changing a few tunables so that the cache will not take any more space. Since buffers only take 200 or so bytes apiece, it will not hurt (much) to increase the number of buffers even on a small system. The performance won't go down as long as I change the formula so that the memory limit isn't 8K * nbuf, but 2-3K * nbuf.

John

From owner-freebsd-fs Thu Oct 3 06:57:33 1996
From: Doug Rabson
Date: Thu, 3 Oct 1996 14:54:56 +0100 (BST)
To: dyson@freebsd.org
Cc: phk@critter.tfs.com, heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

On Thu, 3 Oct 1996, John S. Dyson wrote:
> The pmap code is a perfect example of that. There are times when I have
> "improved" the code and noted a net slowdown, because it had grown.
> Soon, I intend to chop another 1-2k out of pmap.o. Smaller is
> definitely better sometimes.

You may find that increasing the size of struct pv_entry to 32 bytes and arranging get_pv_entry to return new pv_entries on 32-byte boundaries will improve performance for operations that traverse pmaps which contain a large number of entries. Making structures like this fit cleanly into cache lines reduces the average number of cache misses needed to access a large quantity of data.

If, in addition, you arrange those functions to access the struct pv_entry sequentially from start to end, you will benefit from the fact that the 8 words of a cache line are read sequentially after a cache miss by the Pentium and are available for use by instructions as soon as they are read, i.e. you can use the first couple of words in the cache line while the processor reads the rest. Looking at pmap_remove_entry(), it seems to do this already, but you can only benefit from it if the structure starts on a cache line boundary.

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426
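What "increasing struct pv_entry to 32 bytes" might look like; the field names here are placeholders (the real structure lives in the pmap code), and only the padding idiom is the point:

    /* Allocate pv_entries in 32-byte slots so that, combined with an
       aligning allocator, no entry ever straddles two cache lines. */
    struct pv_entry {
        struct pv_entry *pv_next;
        void            *pv_pmap;
        unsigned long    pv_va;
    };

    union pv_entry_slot {
        struct pv_entry pv;
        char            pad[32];            /* one Pentium cache line */
    };

    /* Fails to compile if pv_entry ever outgrows one line. */
    typedef char pv_slot_check[sizeof(union pv_entry_slot) == 32 ? 1 : -1];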
From owner-freebsd-fs Thu Oct 3 07:10:49 1996
From: Doug Rabson
Date: Thu, 3 Oct 1996 15:06:09 +0100 (BST)
To: dyson@freebsd.org
Cc: Mikael Karpberg, bde@zeta.org.au, heo@cslsun10.sogang.ac.kr, freebsd-fs@freebsd.org
Subject: Re: nbuf in buffer cache

On Thu, 3 Oct 1996, John S. Dyson wrote:
> FreeBSD uses all of available memory for disk cache (it has actually
> had a true merged VM/buffer cache longer than Linux). [...] I am about
> ready to consider 2x-3x the number of buffers and changing a few
> tunables so that the cache will not take any more space.

Having more buffers would improve performance for NFSv3, since data which has been written to the server but not yet committed is held in specially marked dirty buffers. Having a limited supply of buffers forces the system to commit data earlier, which involves another client-server round trip and a possible wait for the server's sync operation.

It would be nice if, instead of marking the buffer for a later commit, the underlying pages could be marked instead. This would be tricky to fit into the existing vnode system though.

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426
From owner-freebsd-fs Thu Oct 3 08:36:11 1996
Return-Path: owner-fs
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id IAA03308 for fs-outgoing; Thu, 3 Oct 1996 08:36:11 -0700 (PDT)
Received: from ccs.sogang.ac.kr (ccs.sogang.ac.kr [163.239.1.1]) by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id IAA03278; Thu, 3 Oct 1996 08:36:03 -0700 (PDT)
Received: from cslsun10.sogang.ac.kr by ccs.sogang.ac.kr (8.8.0/Sogang) id AAA04190; Fri, 4 Oct 1996 00:31:37 +0900 (KST)
Received: from localhost by cslsun10.sogang.ac.kr (4.1/SMI-4.1) id AA06595; Fri, 4 Oct 96 00:34:41 KST
Date: Fri, 4 Oct 1996 00:34:40 +0900 (KST)
From: Heo Sung-Gwan
X-Sender: heo@cslsun10
To: freebsd-hackers@FreeBSD.ORG
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: vnode and cluster read-ahead
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-fs@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

John Dyson writes:

>>
>> You could maintain a number of 'pending readahead' structures indexed
>> by vnode and block number.  Each call to cluster_read would check for a
>> pending readahead by hashing.  For efficiency, keep a pointer to the
>> last readahead structure used by cluster_read in the vnode in place of
>> the existing in-vnode readahead data.  Should be no slower than the
>> current system for single process reads and it saves 4 bytes per
>> vnode :-).
>
> Pretty cool idea.  I am remembering now that this deficiency in our
> read ahead code is well known.  This might be something really good for
> 2.3 or 3.1 :-).  (Unless someone else wants to implement it -- hint
> hint :-)).

I suggest a new idea.  The fields for read-ahead (maxra, lenra, etc.)
would live in the file structure, or in another structure (e.g. Doug
Rabson's readahead structure) pointed to by a new field in the file
structure.  The vnode would also have a new field containing a pointer
to the file structure.  This vnode field would be filled in on every
read system call with the pointer to the file structure, at vn_read()
in kern/vfs_vnops.c.  Then the file structure could be accessed through
the vnode in cluster_read.

Because system calls are nonpreemptive, the pointer to the file
structure in the vnode would not change until the current read system
call is finished.

This method would remove the hashing on vnode and block number.

Is it really possible?

--
Heo Sung-Gwan
Dept. of Computer Science, Sogang University, Seoul, Korea.
E-mail: heo@cslsun10.sogang.ac.kr
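In outline, the proposal amounts to something like the following sketch;
all structure and field names beyond file and vnode are hypothetical:

/*
 * Sketch of Heo's suggestion (hypothetical names).  The read-ahead
 * state moves from the vnode into the open file, and the vnode only
 * carries a transient pointer to the file of the read in progress.
 */
#include <sys/types.h>

struct rainfo {
	daddr_t	ra_lastblk;	/* last block read through this file */
	int	ra_lenra;	/* length of current read-ahead run */
	int	ra_maxra;	/* clamp on the run length */
};

struct file {
	/* ... existing fields ... */
	struct rainfo	f_ra;	/* per-open-file read-ahead state */
};

struct vnode {
	/* ... existing fields ... */
	struct file	*v_rdfp; /* file of the read in progress */
};

/*
 * vn_read() would stash the file pointer before calling down, so that
 * cluster_read() can reach the per-file state via the vnode:
 *
 *	vp->v_rdfp = fp;
 *	error = VOP_READ(vp, uio, ioflag, cred);
 */

As Doug points out in the follow-up, though, not every vnode consumer
(the NFS server, exec, coredumps) has a struct file to hang this on.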
From owner-freebsd-fs Thu Oct 3 09:07:29 1996
Return-Path: owner-fs
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id JAA05343 for fs-outgoing; Thu, 3 Oct 1996 09:07:29 -0700 (PDT)
Received: from dyson.iquest.net (dyson.iquest.net [198.70.144.127]) by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id JAA05325; Thu, 3 Oct 1996 09:07:23 -0700 (PDT)
Received: (from root@localhost) by dyson.iquest.net (8.7.5/8.6.9) id LAA00894; Thu, 3 Oct 1996 11:07:01 -0500 (EST)
From: John Dyson
Message-Id: <199610031607.LAA00894@dyson.iquest.net>
Subject: Re: nbuf in buffer cache
To: dfr@render.com (Doug Rabson)
Date: Thu, 3 Oct 1996 11:07:01 -0500 (EST)
Cc: dyson@freebsd.org, karpen@ocean.campus.luth.se, bde@zeta.org.au, heo@cslsun10.sogang.ac.kr, freebsd-fs@freebsd.org
In-Reply-To: from "Doug Rabson" at Oct 3, 96 03:06:09 pm
Reply-To: dyson@freebsd.org
X-Mailer: ELM [version 2.4 PL24 ME8]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-fs@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

>
> It would be nice if, instead of marking the buffer for a later commit,
> the underlying pages could be marked instead.  This would be tricky to
> fit into the existing vnode system though.
>
We can do that in the current vfs_bio, modulo some bugs.  I probably
won't get to it until the NEXT big release -- my 2.2/3.0 plate is so
full that it is spilling over (mixed metaphor, I think :-)).

John

From owner-freebsd-fs Thu Oct 3 10:10:56 1996
Return-Path: owner-fs
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id KAA08601 for fs-outgoing; Thu, 3 Oct 1996 10:10:56 -0700 (PDT)
Received: from minnow.render.com (render.demon.co.uk [158.152.30.118]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id KAA08588; Thu, 3 Oct 1996 10:10:40 -0700 (PDT)
Received: from minnow.render.com (minnow.render.com [193.195.178.1]) by minnow.render.com (8.6.12/8.6.9) with SMTP id SAA26561; Thu, 3 Oct 1996 18:10:27 +0100
Date: Thu, 3 Oct 1996 18:10:24 +0100 (BST)
From: Doug Rabson
To: dyson@freebsd.org
cc: karpen@ocean.campus.luth.se, bde@zeta.org.au, heo@cslsun10.sogang.ac.kr, freebsd-fs@freebsd.org
Subject: Re: nbuf in buffer cache
In-Reply-To: <199610031607.LAA00894@dyson.iquest.net>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-fs@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

On Thu, 3 Oct 1996, John Dyson wrote:

> >
> > It would be nice if, instead of marking the buffer for a later
> > commit, the underlying pages could be marked instead.  This would be
> > tricky to fit into the existing vnode system though.
> >
> We can do that in the current vfs_bio, modulo some bugs.  I probably
> won't get to it until the NEXT big release -- my 2.2/3.0 plate is so
> full that it is spilling over (mixed metaphor, I think :-)).

There's no rush.  The performance within the buffer metaphor is fine
most of the time.

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com  Phone: +44 171 734 3761  FAX: +44 171 734 6426
From owner-freebsd-fs Thu Oct 3 10:22:54 1996
Return-Path: owner-fs
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id KAA09285 for fs-outgoing; Thu, 3 Oct 1996 10:22:54 -0700 (PDT)
Received: from minnow.render.com (render.demon.co.uk [158.152.30.118]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id KAA09101; Thu, 3 Oct 1996 10:20:04 -0700 (PDT)
Received: from minnow.render.com (minnow.render.com [193.195.178.1]) by minnow.render.com (8.6.12/8.6.9) with SMTP id SAA26614; Thu, 3 Oct 1996 18:18:41 +0100
Date: Thu, 3 Oct 1996 18:18:38 +0100 (BST)
From: Doug Rabson
To: Heo Sung-Gwan
cc: freebsd-hackers@FreeBSD.org, freebsd-fs@FreeBSD.org
Subject: Re: vnode and cluster read-ahead
In-Reply-To:
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-fs@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

On Fri, 4 Oct 1996, Heo Sung-Gwan wrote:

> John Dyson writes:
> >>
> >> You could maintain a number of 'pending readahead' structures
> >> indexed by vnode and block number.  Each call to cluster_read would
> >> check for a pending readahead by hashing.  For efficiency, keep a
> >> pointer to the last readahead structure used by cluster_read in the
> >> vnode in place of the existing in-vnode readahead data.  Should be no
> >> slower than the current system for single process reads and it saves
> >> 4 bytes per vnode :-).
> >
> > Pretty cool idea.  I am remembering now that this deficiency in our
> > read ahead code is well known.  This might be something really good
> > for 2.3 or 3.1 :-).  (Unless someone else wants to implement it --
> > hint hint :-)).
>
> I suggest a new idea.  The fields for read-ahead (maxra, lenra, etc.)
> would live in the file structure, or in another structure (e.g. Doug
> Rabson's readahead structure) pointed to by a new field in the file
> structure.  The vnode would also have a new field containing a pointer
> to the file structure.  This vnode field would be filled in on every
> read system call with the pointer to the file structure, at vn_read()
> in kern/vfs_vnops.c.  Then the file structure could be accessed through
> the vnode in cluster_read.

Not all the vnodes in the system are associated with file structures.
The NFS server uses vnodes directly, along with some other oddities like
exec and coredumps.  If we optimise cluster_read for normal open files,
we should try to avoid pessimising it for other vnode users in the
system.

> Because system calls are nonpreemptive, the pointer to the file
> structure in the vnode would not change until the current read system
> call is finished.

I have vain hopes of a future kernel which is multithreaded, and
introducing a new complication to that is not a good idea IMHO.  In
addition, multiple userland threads could fool a system where readaheads
were calculated per-open-file.

> This method would remove the hashing on vnode and block number.

For the common single-reader case, the vnode would cache a pointer to
the readahead structure, avoiding the hash.  The hash would be a simple
O(1) operation anyway for the multiple reader case and so should not be
a real performance problem.

> Is it really possible?

A friend of mine always used to answer, 'Anything is possible; after
all, it's only software' to that question :-).

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com  Phone: +44 171 734 3761  FAX: +44 171 734 6426
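The fast-path/slow-path split Doug describes could look something like
this; every name here is hypothetical, not the actual cluster_read code:

/*
 * Sketch of a hashed pending-readahead lookup with a per-vnode cache
 * of the last hit (hypothetical names).  Invalidating the cached
 * pointer when a rahead is freed is omitted from this sketch.
 */
#include <sys/types.h>
#include <stddef.h>

#define RA_HASHSZ 64			/* power of two */

struct vnode;

struct rahead {
	struct rahead	*ra_next;	/* hash chain */
	struct vnode	*ra_vp;
	daddr_t		ra_nextblk;	/* block expected next */
	int		ra_run;		/* current run length */
};

struct vnode {
	/* ... existing fields ... */
	struct rahead	*ra_cache;	/* last readahead used here */
};

static struct rahead *ra_hash[RA_HASHSZ];

#define RA_HASH(vp, blk) \
	((((unsigned long)(vp) >> 4) + (unsigned long)(blk)) & (RA_HASHSZ - 1))

static struct rahead *
ra_lookup(struct vnode *vp, daddr_t blk)
{
	struct rahead *ra;

	/* Single-reader fast path: no hashing at all. */
	if (vp->ra_cache != NULL && vp->ra_cache->ra_vp == vp &&
	    vp->ra_cache->ra_nextblk == blk)
		return (vp->ra_cache);

	/* Multiple readers: expected O(1) chain walk. */
	for (ra = ra_hash[RA_HASH(vp, blk)]; ra != NULL; ra = ra->ra_next)
		if (ra->ra_vp == vp && ra->ra_nextblk == blk)
			return (vp->ra_cache = ra);
	return (NULL);
}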
From owner-freebsd-fs Thu Oct 3 14:20:45 1996
Return-Path: owner-fs
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id OAA25382 for fs-outgoing; Thu, 3 Oct 1996 14:20:45 -0700 (PDT)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id OAA25372; Thu, 3 Oct 1996 14:20:41 -0700 (PDT)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id OAA06789; Thu, 3 Oct 1996 14:18:58 -0700
From: Terry Lambert
Message-Id: <199610032118.OAA06789@phaeton.artisoft.com>
Subject: Re: vnode and cluster read-ahead
To: dfr@render.com (Doug Rabson)
Date: Thu, 3 Oct 1996 14:18:57 -0700 (MST)
Cc: heo@cslsun10.sogang.ac.kr, freebsd-hackers@FreeBSD.org, freebsd-fs@FreeBSD.org
In-Reply-To: from "Doug Rabson" at Oct 3, 96 06:18:38 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-fs@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

> > I suggest a new idea.  The fields for read-ahead (maxra, lenra, etc.)
> > would live in the file structure, or in another structure (e.g. Doug
> > Rabson's readahead structure) pointed to by a new field in the file
> > structure.  The vnode would also have a new field containing a
> > pointer to the file structure.  This vnode field would be filled in
> > on every read system call with the pointer to the file structure, at
> > vn_read() in kern/vfs_vnops.c.  Then the file structure could be
> > accessed through the vnode in cluster_read.
>
> Not all the vnodes in the system are associated with file structures.
> The NFS server uses vnodes directly, along with some other oddities
> like exec and coredumps.  If we optimise cluster_read for normal open
> files, we should try to avoid pessimising it for other vnode users in
> the system.

To deal with this, you would have to add a "read ahead hints parameter"
to the thing, and for NFS, pass one that will result in no change in the
algorithm.  This might be a good thing, but it would require minor
changes to huge amounts of kernel code.

In addition, it is not clear that the reverse mapping could be
successful; you could change a vnode pointer on call down, but it would
mean that you had destroyed call reentrancy for the interface, since
reentering on the same vnode would potentially blow the same field
before the downcall code could use it.  Again, moving to a parameter
instead of a vnode encoding would fix this, at possibly unacceptable
cost.

> I have vain hopes of a future kernel which is multithreaded, and
> introducing a new complication to that is not a good idea IMHO.  In
> addition, multiple userland threads could fool a system where
> readaheads were calculated per-open-file.

I agree.  In addition, moving to an async call gate to implement
threading, where you make the same call through a different trap entry
point, and potentially blocking operations automagically generate an
async context record plus a context switch, would definitely tickle
this problem.

> > This method would remove the hashing on vnode and block number.
>
> For the common single-reader case, the vnode would cache a pointer to
> the readahead structure, avoiding the hash.  The hash would be a
> simple O(1) operation anyway for the multiple reader case and so
> should not be a real performance problem.

I agree again.
Either you trust cache locality to work, or we might as well throw out
all caching, since we should measure all algorithms by the same
yardstick.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
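Terry's hints-parameter idea would amount to threading one extra
argument through the read path, along these lines; the signature and
names are hypothetical, not the real cluster_read interface:

/*
 * Sketch of a read-ahead hints parameter (hypothetical interface).
 * A NULL hint means "caller keeps no per-file read-ahead state", so
 * the NFS server, exec, and coredump paths pass NULL and see no
 * change in behaviour.
 */
#include <sys/types.h>
#include <stddef.h>

struct rahint {
	daddr_t	rh_nextblk;	/* block the caller expects to read next */
	int	rh_run;		/* current sequential run length */
};

struct vnode;
struct buf;

int	cluster_read(struct vnode *vp, off_t filesize, daddr_t lblkno,
	    long size, struct rahint *hint, struct buf **bpp);

/*
 * vn_read() would pass &fp->f_rahint for an ordinary open file; every
 * other caller passes NULL -- which is what makes the change "minor"
 * but spread across huge amounts of kernel code.
 */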