From owner-freebsd-fs Mon Sep 30 04:51:38 1996
From: Clark Gaylord
Date: Mon, 30 Sep 1996 07:49:28 -0400 (EDT)
To: freebsd-questions@freebsd.org
Cc: freebsd-fs@freebsd.org
Subject: Status of mounting HPFS drives

Hello FreeBSD-land --

I have installed FreeBSD 2.1.5 and I am generally pleased with it. However, I have over 1GB in my OS/2 HPFS partitions that I would like to access. After searching the archives, I see that there is periodically some banter about this, but the issue has not been raised recently. Is there any way to do this, including using some Linux program (though I'd prefer real FreeBSD)? I would prefer to use FreeBSD rather than Linux, but this issue is important enough that I might have to switch. A substantial number of the OS/2 users I have known have either switched to Linux or run both; I think it would be valuable to the FreeBSD effort if it were also an option for these people.

I will gladly summarize to the list any private email that is valuable. Thank you.

Clark Gaylord
Blacksburg, VA
cgaylord@vt.edu

From owner-freebsd-fs Mon Sep 30 06:00:00 1996
From: Doug Rabson
Date: Mon, 30 Sep 1996 13:59:21 +0100 (BST)
To: fs@freebsd.org, lite2@freebsd.org
Subject: Lite2 filesystem code needs testing

I think that the Lite2 merge has reached a stage where it is stable enough for testing. The kernel boots (and shuts down!) cleanly, and all of the filesystems except msdosfs and devfs have been converted to the new regime. I have seen it panic a couple of times, but it is good enough to run xemacs and a kernel compile.

I just added some code to make it easier to move from a -current system to a lite2 system - add COMPAT_PRELITE2 to your kernel config. This is pretty minimal compatibility - just enough to get -current's getvfsent() and mount_nfs to work. One known problem: fsck on a dirty root filesystem seems to fail to remount the filesystem, causing annoying double reboots after a panic.

Two major work items need to be completed before this code can go into -current: make a LINT kernel compile with no warnings, and complete a 'make world'.

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426
From owner-freebsd-fs Mon Sep 30 06:26:39 1996
From: Poul-Henning Kamp
Date: Mon, 30 Sep 1996 15:25:47 +0200
To: Doug Rabson
Cc: fs@freebsd.org, lite2@freebsd.org
Subject: Re: Lite2 filesystem code needs testing

In message, Doug Rabson writes:

Cool.

Why don't you make a task-list, check it in, and people can grab tasks from there?

--
Poul-Henning Kamp | phk@FreeBSD.ORG FreeBSD Core-team.
http://www.freebsd.org/~phk | phk@login.dknet.dk Private mailbox.
whois: [PHK] | phk@ref.tfs.com TRW Financial Systems, Inc.
Future will arrive by its own means, progress not so.

From owner-freebsd-fs Mon Sep 30 06:39:15 1996
From: "John S. Dyson"
Date: Mon, 30 Sep 1996 08:37:50 -0500 (EST)
To: phk@critter.tfs.com (Poul-Henning Kamp)
Cc: dfr@render.com, fs@freebsd.org, lite2@freebsd.org
Subject: Re: Lite2 filesystem code needs testing

> Why don't you make a task-list, check it in, and people can grab tasks
> from there?

Good idea.

John

From owner-freebsd-fs Mon Sep 30 06:49:32 1996
From: Bruce Evans
Date: Mon, 30 Sep 1996 23:38:50 +1000
To: dfr@render.com, fs@freebsd.org, lite2@freebsd.org
Subject: Re: Lite2 filesystem code needs testing

>Two major work items need to be completed before this code can go into
>-current: make a LINT kernel compile with no warnings and complete a

That would be 5000 fewer lines of warnings than for -current itself :-]. gcc-2.7 emits about 4800 new ones.

Bruce
From owner-freebsd-fs Mon Sep 30 07:25:24 1996
From: Doug Rabson
Date: Mon, 30 Sep 1996 15:24:42 +0100 (BST)
To: Poul-Henning Kamp
Cc: fs@freebsd.org, lite2@freebsd.org
Subject: Re: Lite2 filesystem code needs testing

On Mon, 30 Sep 1996, Poul-Henning Kamp wrote:
> Why don't you make a task-list, check it in, and people can grab tasks
> from there?

I just committed a TODO list. Jeffrey, you might want to look at my list and add or remove stuff as appropriate.

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426

From owner-freebsd-fs Tue Oct 1 20:01:30 1996
From: Heo Sung-Gwan
Date: Wed, 2 Oct 1996 12:00:00 +0900 (KST)
To: freebsd-fs@FreeBSD.ORG
Subject: nbuf in buffer cache

Hi,

I am curious about the number of buffers (= nbuf) in the buffer cache. The variable nbuf is determined in i386/i386/machdep.c as follows:

    #ifdef NBUF
    int nbuf = NBUF;
    #else
    int nbuf = 0;
    #endif
    ...
    void
    cpu_startup()
    {
        ...
        if (nbuf == 0) {
            nbuf = 30;
            if (physmem > 1024)
                nbuf += min((physmem - 1024) / 12, 1024);
        }
        ...
    }

If NBUF is not defined and physical memory is less than 1024 pages (= 4Mbytes), then nbuf becomes 30; otherwise nbuf is 30 + min((physmem - 1024) / 12, 1024).

Why is the number of buffers calculated in this fashion? Do 30 buffers, 1024 pages, and division by 12 have special meaning? There is no comment in the source code.

In addition, if there are no user application processes, how many buffers are enough to run the system without degrading its performance? Only 30 buffers? Or as many as possible?

Please let me know.

--
Heo Sung-Gwan
Dept. of Computer Science, Sogang University, Seoul, Korea.
E-mail: heo@cslsun10.sogang.ac.kr
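A stand-alone sketch of the formula above, for reference (assuming the i386 page size of 4K; calc_nbuf and the sample sizes are illustrative only, not kernel code):

    /* Evaluate the cpu_startup() sizing rule for a few memory sizes. */
    #include <stdio.h>

    static int min(int a, int b) { return a < b ? a : b; }

    static int calc_nbuf(int physmem)       /* physmem is in 4K pages */
    {
        int nbuf = 30;
        if (physmem > 1024)
            nbuf += min((physmem - 1024) / 12, 1024);
        return nbuf;
    }

    int main(void)
    {
        int mb;
        for (mb = 4; mb <= 64; mb *= 2)
            printf("%2d MB -> nbuf = %d\n", mb,
                   calc_nbuf(mb * 256));    /* 256 pages per MB */
        /* prints 30, 115, 286, 627, 1054; the cap of 1054 is
           reached at 52 MB and above */
        return 0;
    }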
From owner-freebsd-fs Tue Oct 1 20:25:36 1996
From: "John S. Dyson"
Date: Tue, 1 Oct 1996 22:23:46 -0500 (EST)
To: heo@cslsun10.sogang.ac.kr (Heo Sung-Gwan)
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: nbuf in buffer cache

> Why is the number of buffers calculated in this fashion? Do 30 buffers,
> 1024 pages, and division by 12 have special meaning? There is no comment
> in the source code.

Experience shows that this is a good number. 30 buffers is a good minimum on a very small system. There have been problems in earlier code (and perhaps even -current) when running with fewer than 10 buffers.

> In addition, if there are no user application processes, how many buffers
> are enough to run the system without degrading its performance? Only 30
> buffers? Or as many as possible?

The performance on a small system is poor (IMO) anyway. Adding more buffers will take more memory from runnable processes. Generally, common wisdom and practice show that it is best to minimize paging. 30 buffers represent approx 240K (on a normally configured filesystem.) If there is more free memory, the system will store cached data in memory not associated with buffers; on a 4MB system this is uncommon, though. Unlike the other *BSDs, the buffer cache isn't the only place that I/O-cached data is stored. On FreeBSD the buffer cache is best thought of as a mapping cache, and also as the upper limit on dirty buffer space. Free memory is used for caching both file data and unused memory segments (.text, ...).

John

From owner-freebsd-fs Tue Oct 1 21:54:19 1996
From: Bruce Evans
Date: Wed, 2 Oct 1996 14:50:39 +1000
To: heo@cslsun10.sogang.ac.kr, toor@dyson.iquest.net
Cc: freebsd-fs@FreeBSD.org
Subject: Re: nbuf in buffer cache

>Experience shows that this is a good number. 30 buffers is a good minimum
>on a very small system. There have been problems in earlier code (and
>perhaps even -current) when running with fewer than 10 buffers.
>
>The performance on a small system is poor (IMO) anyway. Adding more
>buffers will take more memory from runnable processes. [...] 30 buffers
>represent approx 240K (on a normally configured filesystem.)

Experience showed that 240K is about right for a 2MB system running FreeBSD 1.x, but 30 buffers is far too small. For file systems with a block size of 512 (e.g. msdos floppies), it can cache a whole 15K. For normal ufs file systems with a fragment size of 1K, 1K fragments are common for directories.

>If there is more free memory, the system will store cached data in memory
>not associated with buffers. [...] On FreeBSD the buffer cache is best
>thought of as a mapping cache, and also as the upper limit on dirty
>buffer space.

Now 240K is probably too much for metadata alone, but 30 buffers is still too small. Metadata blocks are usually small, so 30 buffers usually limits the amount of metadata cached to much less than 240K.

Bruce

From owner-freebsd-fs Tue Oct 1 22:14:56 1996
From: "John S. Dyson"
Date: Wed, 2 Oct 1996 00:11:16 -0500 (EST)
To: bde@zeta.org.au (Bruce Evans)
Cc: heo@cslsun10.sogang.ac.kr, freebsd-fs@FreeBSD.org
Subject: Re: nbuf in buffer cache

> Experience showed that 240K is about right for a 2MB system running
> FreeBSD 1.x, but 30 buffers is far too small. [...]
> Now 240K is probably too much for metadata alone, but 30 buffers is
> still too small. Metadata blocks are usually small, so 30 buffers
> usually limits the amount of metadata cached to much less than 240K.

So, you would trade paging for file buffering? I don't think so. Firstly, the MSDOS filesystem is a degenerate case. Many programs have a very steep curve: if you are running low on memory, they will cause thrashing. DG and I found that it is very important to make sure that GCC can have as much memory as possible. If (and it is a very big if) there is free (spare) memory, the system will provide it in the form of the merged VM object cache. Note also that the system prefers to keep metadata in the cache, and to push file data to the VM objects. It is then the best of both worlds.

So, to be precise, limiting the number of buffers keeps the freedom maximized. The larger the number of buffers, the greater the chance that there will be too much wired memory for an application. I have found that the knee for gcc appears to be about 2M (plus or minus), and it is very sharp. If you restrict the amount of memory even by 100K-200K, compile times go through the roof.

Additionally, the issue of MSDOS having a very small cache size isn't valid; the cache is limited by the total amount of available memory.

John

From owner-freebsd-fs Tue Oct 1 23:25:09 1996
From: Bruce Evans
Date: Wed, 2 Oct 1996 16:18:46 +1000
To: bde@zeta.org.au, dyson@freebsd.org
Cc: freebsd-fs@freebsd.org, heo@cslsun10.sogang.ac.kr
Subject: Re: nbuf in buffer cache

>So, you would trade paging for file buffering? I don't think so.

No, but allocate enough buffers to hold the memory that you're willing to allocate for (non VM-object) buffering. nbuf = memory_allowed / DEV_BSIZE is too many for static allocation, so dynamic allocation is required. sizeof(struct buf) is now 212, so the worst case should only have nbuf = memory_allowed / (512 + 212). (struct buf is bloated. In my first implementation of buffering, for an 8-bit system, DEV_BSIZE was 256 and sizeof(struct buf) was 13, and I thought that the 5% overhead was high. Sigh.)

>So, to be precise, limiting the number of buffers keeps the freedom
>maximized. The larger the number of buffers, the greater the chance that
>there will be too much wired memory for an application.
Limiting the number of buffers instead of limiting the memory allocated for the buffers sometimes gives more freedom because less memory is allocated, but it is better to limit the amount allocated explicitly.

Bruce
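The arithmetic behind this suggestion, spelled out (the 240K budget is the figure from earlier in the thread; everything else follows from the sizes Bruce gives):

    /* Worst case: 512-byte (DEV_BSIZE) data blocks, each with a
       212-byte struct buf header. */
    #include <stdio.h>

    int main(void)
    {
        long memory_allowed = 240 * 1024L;          /* 240K budget */
        long nbuf = memory_allowed / (512 + 212);
        printf("nbuf = %ld\n", nbuf);               /* 339, vs. 30 today */
        return 0;
    }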
From owner-freebsd-fs Wed Oct 2 06:34:12 1996
From: "John S. Dyson"
Date: Wed, 2 Oct 1996 08:33:12 -0500 (EST)
To: bde@zeta.org.au (Bruce Evans)
Cc: dyson@FreeBSD.org, freebsd-fs@FreeBSD.org, heo@cslsun10.sogang.ac.kr
Subject: Re: nbuf in buffer cache

> No, but allocate enough buffers to hold the memory that you're willing
> to allocate for (non VM-object) buffering. [...] sizeof(struct buf) is
> now 212, so the worst case should only have nbuf = memory_allowed /
> (512 + 212). (struct buf is bloated. In my first implementation of
> buffering, for an 8-bit system, DEV_BSIZE was 256 and sizeof(struct buf)
> was 13, and I thought that the 5% overhead was high. Sigh.)

If you can figure out a way to shrink our current buffers to 13 bytes instead of just under 256, please do so. They are NOT easy to shrink. I think what the smaller buffer headers did must have been quite different from what we have now. Remember also that the amount of buffering space is not limited by the number of buffers!!! The buffers are now mostly for temporary mappings and pending writes. The only other required purpose for buffers is caching directories; there is a bias to keep the directories in the buffers.

> Limiting the number of buffers instead of limiting the memory allocated
> for the buffers sometimes gives more freedom because less memory is
> allocated, but it is better to limit the amount allocated explicitly.

The mechanism exists in our current vfs_bio to support that. In fact, if you notice, the amount of memory used by vfs_bio is limited to nbuf * 8K. If you have 16k buffers, it is still limited to nbuf * 8k, so the number of buffers (again, not the buffering space) is halved for larger buffers. You can re-tune those parameters for the small-block filesystems. Of course, such file systems encounter many other inefficiencies in normal operation also. (In other words, IMO, msdosfs as it is currently written is not very fast anyway.) Remember, my argument against excessive numbers of buffers is mostly for small systems (i.e. 4M); those systems are just not very effective at caching.

The case that I am most worried about is 4k/8k ufs systems (the ones most used). I do not think that wiring down large amounts of memory is a wise idea. If you are complaining about an excessive buffer header size, then there is an opportunity to work on it. (Actually, shouldn't an MSDOSFS use the cluster size instead of 512 anyway? We have no problem handling 32k buffers, if you need them (minor tunable).) There is another opportunity to work on solving the MSDOS problem and getting the best of both worlds (bigger buffer support for more wired-down caching, without taking excessive memory). Directories are still small, though, but we have different-sized buffers on UFS also.

I would suggest also, when/if you make a decision to change the way that buffer sizes/buffer memory is calculated, please consider the case of 8k UFS (the default). Also, I think that many of the small systems belong to "non-wealthy students" who would like to compile programs with gcc. It is already slow, and making it slower is not good.

John

From owner-freebsd-fs Wed Oct 2 07:18:26 1996
From: Heo Sung-Gwan
Date: Wed, 2 Oct 1996 23:17:03 +0900 (KST)
To: freebsd-hackers@FreeBSD.ORG
Cc: freebsd-fs@FreeBSD.ORG
Subject: vnode and cluster read-ahead

When a file is open several times simultaneously, cluster read-ahead in the buffer cache, which keeps its state in the vnode, seems to have a problem.

As a process A reads a file F *sequentially*, the read-ahead fields (v_maxra, v_ralen, etc.) of the vnode of F increase, and as a result the next cluster is read ahead. But when a process B opens F and reads it, the values of those fields are changed, so process A's read-ahead is disturbed whenever process B is rescheduled.

I think the fields for read-ahead should be in struct file rather than in the vnode. There exists one vnode for a file, but a file may be open several times.

What's your opinion, hackers?

--
Heo Sung-Gwan
Dept. of Computer Science, Sogang University, Seoul, Korea.
E-mail: heo@cslsun10.sogang.ac.kr
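A toy model of the interference being reported (v_maxra and v_ralen are the field names from the report; the update logic here is a simplification for illustration, not the real cluster_read()):

    /* Sequential-read state lives in the vnode, so every open of the
       file shares it. */
    struct vnode {
        long v_ralen;   /* length of the current read-ahead run */
        long v_maxra;   /* last block read ahead */
    };

    static void note_read(struct vnode *vp, long blkno)
    {
        if (blkno == vp->v_maxra + 1)
            vp->v_ralen++;      /* looks sequential: grow the run */
        else
            vp->v_ralen = 0;    /* looks random: reset the run,
                                   clobbering the other reader */
        vp->v_maxra = blkno;
    }

    int main(void)
    {
        struct vnode f = { 4, 99 };
        note_read(&f, 100);     /* process A, sequential: run grows    */
        note_read(&f, 5);       /* process B reads elsewhere: reset    */
        note_read(&f, 101);     /* A again: no longer looks sequential */
        return 0;
    }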
From owner-freebsd-fs Wed Oct 2 07:40:31 1996
From: John Dyson
Date: Wed, 2 Oct 1996 09:39:42 -0500 (EST)
To: heo@cslsun10.sogang.ac.kr (Heo Sung-Gwan)
Cc: freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

> When a file is open several times simultaneously, cluster read-ahead in
> the buffer cache, which keeps its state in the vnode, seems to have a
> problem.

You are right.

> I think the fields for read-ahead should be in struct file rather than
> in the vnode. There exists one vnode for a file, but a file may be open
> several times.

That is closer to correct. I am not sure that the struct file is correct either, but I think that you are on the right track.

John

From owner-freebsd-fs Wed Oct 2 07:51:28 1996
From: David Greenman
Date: Wed, 02 Oct 1996 07:52:03 -0700
To: Heo Sung-Gwan
Cc: freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

>As a process A reads a file F *sequentially*, the read-ahead fields
>(v_maxra, v_ralen, etc.) of the vnode of F increase, and as a result the
>next cluster is read ahead. But when a process B opens F and reads it,
>the values of those fields are changed, so process A's read-ahead is
>disturbed whenever process B is rescheduled.

First, this is a very rare situation that almost never occurs in practice.
If the file was just read, it will usually still be in the cache (assuming it's not too large), so all references will be satisfied out of the cache and the clustering policy won't matter. However, in the case of it not fitting in the cache, the system will not optimize for sequential reads because they are no longer entirely sequential... so I think the current algorithm is doing the right thing. The whole dynamic read-ahead mechanism is just an optimization in any case, and is new with 4.4BSD.

Even if we did want to change it, there really isn't a way to do what you're suggesting above - you don't have access to the file struct at the level where the clustering decision is made. You'd have to change the code to propagate clustering hints from the read/write system calls. Yuck.

-DG

David Greenman
Core-team/Principal Architect, The FreeBSD Project

From owner-freebsd-fs Wed Oct 2 07:58:26 1996
From: Poul-Henning Kamp
Date: Wed, 02 Oct 1996 16:57:24 +0200
To: dyson@freebsd.org
Cc: heo@cslsun10.sogang.ac.kr (Heo Sung-Gwan), freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

In message <199610021439.JAA00980@dyson.iquest.net>, John Dyson writes:

>> I think the fields for read-ahead should be in struct file rather than
>> in the vnode. There exists one vnode for a file, but a file may be open
>> several times.
>
>That is closer to correct. I am not sure that the struct file is correct
>either, but I think that you are on the right track.

No, I don't agree. Process B will most likely find all it needs in the buffer cache, and thus will not need read-ahead at all.

How to implement this is not clear to me, but I think the best way would be to calculate the parameters, and only if they extend the current read-ahead (v_maxra...) will they be employed. This would gracefully handle the case where process B overtakes process A in reading the file.

--
Poul-Henning Kamp | phk@FreeBSD.ORG FreeBSD Core-team.
http://www.freebsd.org/~phk | phk@login.dknet.dk Private mailbox.
whois: [PHK] | phk@ref.tfs.com TRW Financial Systems, Inc.
Future will arrive by its own means, progress not so.
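One way to read this suggestion, in the same toy model as the earlier note_read() sketch (the guard is the only new part, and it is an illustration, not Poul-Henning's code):

    /* Only let a read update the shared state if it would extend the
       current read-ahead; a reader that is behind leaves the leading
       reader's state alone, and one that overtakes it takes over. */
    static void note_read_guarded(struct vnode *vp, long blkno)
    {
        if (blkno <= vp->v_maxra)
            return;             /* already read ahead past here */
        note_read(vp, blkno);
    }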
From owner-freebsd-fs Wed Oct 2 08:21:05 1996
From: John Dyson
Date: Wed, 2 Oct 1996 10:20:48 -0500 (EST)
To: phk@critter.tfs.com (Poul-Henning Kamp)
Cc: dyson@freebsd.org, heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

> No, I don't agree. Process B will most likely find all it needs in the
> buffer cache, and thus will not need read-ahead at all.

I agree with the term "likely", but it is possible that two processes are not reading the entire file sequentially. Also, it is possible that the file size is much bigger than main memory, thereby busting the cache. Read-ahead is then the only performance improvement to be had. Nowadays, I think that drives actually have segmented read-ahead caches as well; we don't, though.

John
From owner-freebsd-fs Wed Oct 2 08:23:04 1996
From: John Dyson
Date: Wed, 2 Oct 1996 10:22:16 -0500 (EST)
To: dg@Root.COM
Cc: heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

> Even if we did want to change it, there really isn't a way to do what
> you're suggesting above - you don't have access to the file struct at
> the level where the clustering decision is made. You'd have to change
> the code to propagate clustering hints from the read/write system calls.
> Yuck.

I actually thought about doing it (and may have discussed it with you, David), and I think that was my conclusion also. The existing interfaces don't conveniently support it.

John

From owner-freebsd-fs Wed Oct 2 10:25:12 1996
From: Doug Rabson
Date: Wed, 2 Oct 1996 18:21:13 +0100 (BST)
To: dyson@freebsd.org
Cc: Poul-Henning Kamp, heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

On Wed, 2 Oct 1996, John Dyson wrote:
> I agree with the term "likely", but it is possible that two processes
> are not reading the entire file sequentially. Also, it is possible that
> the file size is much bigger than main memory, thereby busting the
> cache. Read-ahead is then the only performance improvement to be had.

You could maintain a number of 'pending readahead' structures indexed by vnode and block number. Each call to cluster_read would check for a pending readahead by hashing. For efficiency, keep a pointer to the last readahead structure used by cluster_read in the vnode, in place of the existing in-vnode readahead data. It should be no slower than the current system for single-process reads, and it saves 4 bytes per vnode :-).

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426
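A data-structure sketch of this scheme (all names invented for illustration; the 'last' hint parameter stands in for the one pointer Doug would keep in the vnode):

    /* Hypothetical 'pending readahead' records, hashed by
       (vnode, block number). */
    #define RA_HASH_SIZE 128

    struct vnode;                           /* opaque here */

    struct readahead {
        struct readahead *ra_next;          /* hash chain */
        struct vnode     *ra_vp;            /* which file */
        long              ra_blkno;         /* next block this run expects */
        long              ra_len;           /* current run length */
    };

    static struct readahead *ra_hash[RA_HASH_SIZE];

    static unsigned ra_hashfn(struct vnode *vp, long blkno)
    {
        return ((unsigned long)vp ^ (unsigned long)blkno) % RA_HASH_SIZE;
    }

    /* Find the run that a read of blkno would continue.  'last' is the
       per-vnode hint pointer; it short-circuits the common single-reader
       case so no hashing is done at all. */
    static struct readahead *
    ra_lookup(struct vnode *vp, long blkno, struct readahead *last)
    {
        struct readahead *ra;

        if (last != NULL && last->ra_vp == vp && last->ra_blkno == blkno)
            return last;
        for (ra = ra_hash[ra_hashfn(vp, blkno)]; ra != NULL; ra = ra->ra_next)
            if (ra->ra_vp == vp && ra->ra_blkno == blkno)
                break;
        return ra;
    }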
From owner-freebsd-fs Wed Oct 2 10:40:45 1996
From: John Dyson
Date: Wed, 2 Oct 1996 12:40:10 -0500 (EST)
To: dfr@render.com (Doug Rabson)
Cc: dyson@freebsd.org, phk@critter.tfs.com, heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

> You could maintain a number of 'pending readahead' structures indexed by
> vnode and block number. Each call to cluster_read would check for a
> pending readahead by hashing. [...]

Pretty cool idea. I am remembering now that this deficiency in our read-ahead code is well known. This might be something really good for 2.3 or 3.1 :-). (Unless someone else wants to implement it -- hint hint :-)).

John

From owner-freebsd-fs Wed Oct 2 14:59:37 1996
From: Terry Lambert
Date: Wed, 2 Oct 1996 14:57:30 -0700 (MST)
To: heo@cslsun10.sogang.ac.kr (Heo Sung-Gwan)
Cc: freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

> I think the fields for read-ahead should be in struct file rather than
> in the vnode. There exists one vnode for a file, but a file may be open
> several times.
>
> What's your opinion, hackers?

Matt Day noted this problem some time ago. The problem increases when you have multiple threads in a single process with conflicting access domains. One solution would be to "trust cache locality to work". This is not very satisfying for read-ahead.

Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present or previous employers.

From owner-freebsd-fs Thu Oct 3 01:02:32 1996
From: Mikael Karpberg
Date: Thu, 3 Oct 1996 10:05:45 +0200 (MET DST)
To: dyson@FreeBSD.org
Cc: bde@zeta.org.au, heo@cslsun10.sogang.ac.kr, freebsd-fs@FreeBSD.org
Subject: Re: nbuf in buffer cache

Hi!

> If (and it is a very big if) there is free (spare) memory, the system
> will provide it in the form of the merged VM object cache. Note also
> that the system prefers to keep metadata in the cache, and to push file
> data to the VM objects. It is then the best of both worlds.
> So, to be precise, limiting the number of buffers keeps the freedom
> maximized. The larger the number of buffers, the greater the chance
> that there will be too much wired memory for an application. [...]
>
> Additionally, the issue of MSDOS having a very small cache size isn't
> valid; the cache is limited by the total amount of available memory.

Umm... hold on a second, here... :-) I always thought Linux etc. used all free memory for disk caching, and that the BSDs used a formula (basically some percentage of the available memory) to determine the size of a static buffer, used as a disk cache. Now... it makes sense that this changes when you use a merged disk cache and VM system. Someone let me in on how things work? :-)

/Mikael

From owner-freebsd-fs Thu Oct 3 04:16:52 1996
From: Doug Rabson
Date: Thu, 3 Oct 1996 10:35:19 +0100 (BST)
To: dyson@freebsd.org
Cc: phk@critter.tfs.com, heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

On Wed, 2 Oct 1996, John Dyson wrote:
> Pretty cool idea. I am remembering now that this deficiency in our
> read-ahead code is well known. This might be something really good for
> 2.3 or 3.1 :-). (Unless someone else wants to implement it -- hint
> hint :-)).

On the subject of saving memory, I firmly believe that significant performance improvements can be made just by reducing the memory footprint of algorithms. In our 3D graphics work, we have found that making important data structures fit into cache lines (and using an aligning allocator to make sure that they start on cache line boundaries) can improve performance by as much as 20%.

When future processors from Intel have clock speeds of 400MHz and above but a 75MHz memory bus to the level 2 cache, this will become even more important - towards the end of '97 and beyond, the size of a piece of software and its memory usage patterns will dominate its performance profile. Maybe we need a 'Campaign for Small Software' :-).

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426
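A sketch of the aligning-allocator trick mentioned above (32-byte lines, as on the Pentium; the over-allocate-and-round-up approach and both function names are just one illustrative way to do it):

    #include <stdlib.h>

    #define CACHE_LINE 32

    /* Return CACHE_LINE-aligned memory; the original malloc() pointer
       is stashed just below the returned block so it can be freed. */
    static void *cache_aligned_alloc(size_t size)
    {
        char *raw, *p;

        raw = malloc(size + CACHE_LINE + sizeof(void *));
        if (raw == NULL)
            return NULL;
        p = raw + sizeof(void *);
        p += CACHE_LINE - ((unsigned long)p % CACHE_LINE);
        ((void **)p)[-1] = raw;             /* for cache_aligned_free() */
        return p;
    }

    static void cache_aligned_free(void *p)
    {
        free(((void **)p)[-1]);
    }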
From owner-freebsd-fs Thu Oct 3 06:14:21 1996
From: "John S. Dyson"
Date: Thu, 3 Oct 1996 08:12:30 -0500 (EST)
To: dfr@render.com (Doug Rabson)
Cc: dyson@freebsd.org, phk@critter.tfs.com, heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

> On the subject of saving memory, I firmly believe that significant
> performance improvements can be made just by reducing the memory
> footprint of algorithms. [...]

The pmap code is a perfect example of that. There are times when I have "improved" the code and noted a net slowdown, because it had grown. Soon, I intend to chop another 1-2k out of pmap.o. Smaller is definitely better sometimes.

John

From owner-freebsd-fs Thu Oct 3 06:20:43 1996
From: "John S. Dyson"
Date: Thu, 3 Oct 1996 08:20:17 -0500 (EST)
To: karpen@ocean.campus.luth.se (Mikael Karpberg)
Cc: dyson@FreeBSD.org, bde@zeta.org.au, heo@cslsun10.sogang.ac.kr, freebsd-fs@FreeBSD.org
Subject: Re: nbuf in buffer cache

> Umm... hold on a second, here... :-) I always thought Linux etc. used
> all free memory for disk caching, and that the BSDs used a formula
> (basically some percentage of the available memory) to determine the
> size of a static buffer, used as a disk cache. [...] Someone let me in
> on how things work? :-)

FreeBSD uses all of available memory for disk cache (it has actually had a true merged VM/buffer cache longer than Linux). Linux has used a dynamic buffer cache for a long time, though (which is technically different).
The only type of data that must be in a buffer is directory info. I am about ready to consider 2x-3x the number of buffers and changing a few tunables so that the cache will not take any more space. Since buffers only take 200 or so bytes apiece, it will not hurt (much) to increase the number of buffers even on a small system. The performance won't go down as long as I change the formula so that the memory limit isn't 8K * nbuf, but 2-3K * nbuf.

John

From owner-freebsd-fs Thu Oct 3 06:57:33 1996
From: Doug Rabson
Date: Thu, 3 Oct 1996 14:54:56 +0100 (BST)
To: dyson@freebsd.org
Cc: phk@critter.tfs.com, heo@cslsun10.sogang.ac.kr, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: vnode and cluster read-ahead

On Thu, 3 Oct 1996, John S. Dyson wrote:
> The pmap code is a perfect example of that. There are times when I have
> "improved" the code and noted a net slowdown, because it had grown.
> Soon, I intend to chop another 1-2k out of pmap.o. Smaller is
> definitely better sometimes.

You may find that increasing the size of struct pv_entry to 32 bytes and arranging get_pv_entry to return new pv_entries on 32-byte boundaries will improve performance for operations that traverse pmaps which contain a large number of entries. Making structures like this fit cleanly into cache lines reduces the average number of cache misses needed to access a large quantity of data.

If, in addition, you arrange those functions to access the struct pv_entry sequentially from start to end, you will benefit from the fact that the 8 words of a cache line are read sequentially after a cache miss by the Pentium and are available for use by instructions as soon as they are read, i.e. you can use the first couple of words in the cache line while the processor reads the rest. Looking at pmap_remove_entry(), it seems to do this already, but you can only benefit from it if the structure starts on a cache line boundary.

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426
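What "increasing struct pv_entry to 32 bytes" might look like; the field names here are placeholders (the real structure lives in the pmap code), and only the padding idiom is the point:

    /* Allocate pv_entries in 32-byte slots so that, combined with an
       aligning allocator, no entry ever straddles two cache lines. */
    struct pv_entry {
        struct pv_entry *pv_next;
        void            *pv_pmap;
        unsigned long    pv_va;
    };

    union pv_entry_slot {
        struct pv_entry pv;
        char            pad[32];            /* one Pentium cache line */
    };

    /* Fails to compile if pv_entry ever outgrows one line. */
    typedef char pv_slot_check[sizeof(union pv_entry_slot) == 32 ? 1 : -1];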
From owner-freebsd-fs Thu Oct 3 07:10:49 1996
From: Doug Rabson
Date: Thu, 3 Oct 1996 15:06:09 +0100 (BST)
To: dyson@freebsd.org
Cc: Mikael Karpberg, bde@zeta.org.au, heo@cslsun10.sogang.ac.kr, freebsd-fs@freebsd.org
Subject: Re: nbuf in buffer cache

On Thu, 3 Oct 1996, John S. Dyson wrote:
> FreeBSD uses all of available memory for disk cache (it has actually
> had a true merged VM/buffer cache longer than Linux). [...] I am about
> ready to consider 2x-3x the number of buffers and changing a few
> tunables so that the cache will not take any more space.

Having more buffers would improve performance for NFSv3, since data which has been written to the server but not yet committed is held in specially marked dirty buffers. Having a limited supply of buffers forces the system to commit data earlier, which involves another client-server round trip and a possible wait for the server's sync operation.

It would be nice if, instead of marking the buffer for a later commit, the underlying pages could be marked instead. This would be tricky to fit into the existing vnode system though.

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426
From owner-freebsd-fs Thu Oct 3 08:36:11 1996
Return-Path: owner-fs
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id IAA03308 for fs-outgoing; Thu, 3 Oct 1996 08:36:11 -0700 (PDT)
Received: from ccs.sogang.ac.kr (ccs.sogang.ac.kr [163.239.1.1]) by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id IAA03278; Thu, 3 Oct 1996 08:36:03 -0700 (PDT)
Received: from cslsun10.sogang.ac.kr by ccs.sogang.ac.kr (8.8.0/Sogang) id AAA04190; Fri, 4 Oct 1996 00:31:37 +0900 (KST)
Received: from localhost by cslsun10.sogang.ac.kr (4.1/SMI-4.1) id AA06595; Fri, 4 Oct 96 00:34:41 KST
Date: Fri, 4 Oct 1996 00:34:40 +0900 (KST)
From: Heo Sung-Gwan
X-Sender: heo@cslsun10
To: freebsd-hackers@FreeBSD.ORG
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: vnode and cluster read-ahead
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-fs@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

John Dyson writes:

>>
>> You could maintain a number of 'pending readahead' structures indexed
>> by vnode and block number.  Each call to cluster_read would check for a
>> pending readahead by hashing.  For efficiency, keep a pointer to the
>> last readahead structure used by cluster_read in the vnode in place of
>> the existing in-vnode readahead data.  Should be no slower than the
>> current system for single process reads and it saves 4 bytes per
>> vnode :-).
>
> Pretty cool idea.  I am remembering now that this deficiency in our
> read ahead code is well known.  This might be something really good for
> 2.3 or 3.1 :-).  (Unless someone else wants to implement it -- hint
> hint :-)).

I suggest a new idea.  The fields for read-ahead (maxra, lenra, etc.)
would live in the file structure, or in another structure (e.g. Doug
Rabson's readahead structure) pointed to by a new field in the file
structure.  The vnode would also have a new field containing a pointer
to the file structure.  This vnode field would be filled in on every
read system call with the pointer to the file structure, at vn_read()
in kern/vfs_vnops.c.  Then the file structure could be accessed through
the vnode in cluster_read.

Because system calls are nonpreemptive, the pointer to the file
structure in the vnode would not change until the current read system
call is finished.

This method would remove the hashing on vnode and block number.

Is it really possible?

--
Heo Sung-Gwan
Dept. of Computer Science, Sogang University, Seoul, Korea.
E-mail: heo@cslsun10.sogang.ac.kr
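In outline, the proposal amounts to something like the following sketch;
all structure and field names beyond file and vnode are hypothetical:

/*
 * Sketch of Heo's suggestion (hypothetical names).  The read-ahead
 * state moves from the vnode into the open file, and the vnode only
 * carries a transient pointer to the file of the read in progress.
 */
#include <sys/types.h>

struct rainfo {
	daddr_t	ra_lastblk;	/* last block read through this file */
	int	ra_lenra;	/* length of current read-ahead run */
	int	ra_maxra;	/* clamp on the run length */
};

struct file {
	/* ... existing fields ... */
	struct rainfo	f_ra;	/* per-open-file read-ahead state */
};

struct vnode {
	/* ... existing fields ... */
	struct file	*v_rdfp; /* file of the read in progress */
};

/*
 * vn_read() would stash the file pointer before calling down, so that
 * cluster_read() can reach the per-file state via the vnode:
 *
 *	vp->v_rdfp = fp;
 *	error = VOP_READ(vp, uio, ioflag, cred);
 */

As Doug points out in the follow-up, though, not every vnode consumer
(the NFS server, exec, coredumps) has a struct file to hang this on.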
From owner-freebsd-fs Thu Oct 3 09:07:29 1996
Return-Path: owner-fs
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id JAA05343 for fs-outgoing; Thu, 3 Oct 1996 09:07:29 -0700 (PDT)
Received: from dyson.iquest.net (dyson.iquest.net [198.70.144.127]) by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id JAA05325; Thu, 3 Oct 1996 09:07:23 -0700 (PDT)
Received: (from root@localhost) by dyson.iquest.net (8.7.5/8.6.9) id LAA00894; Thu, 3 Oct 1996 11:07:01 -0500 (EST)
From: John Dyson
Message-Id: <199610031607.LAA00894@dyson.iquest.net>
Subject: Re: nbuf in buffer cache
To: dfr@render.com (Doug Rabson)
Date: Thu, 3 Oct 1996 11:07:01 -0500 (EST)
Cc: dyson@freebsd.org, karpen@ocean.campus.luth.se, bde@zeta.org.au, heo@cslsun10.sogang.ac.kr, freebsd-fs@freebsd.org
In-Reply-To: from "Doug Rabson" at Oct 3, 96 03:06:09 pm
Reply-To: dyson@freebsd.org
X-Mailer: ELM [version 2.4 PL24 ME8]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-fs@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

>
> It would be nice if, instead of marking the buffer for a later commit,
> the underlying pages could be marked instead.  This would be tricky to
> fit into the existing vnode system though.
>
We can do that in the current vfs_bio, modulo some bugs.  I probably
won't get to it until the NEXT big release -- my 2.2/3.0 plate is so
full that it is spilling over (mixed metaphor, I think :-)).

John

From owner-freebsd-fs Thu Oct 3 10:10:56 1996
Return-Path: owner-fs
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id KAA08601 for fs-outgoing; Thu, 3 Oct 1996 10:10:56 -0700 (PDT)
Received: from minnow.render.com (render.demon.co.uk [158.152.30.118]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id KAA08588; Thu, 3 Oct 1996 10:10:40 -0700 (PDT)
Received: from minnow.render.com (minnow.render.com [193.195.178.1]) by minnow.render.com (8.6.12/8.6.9) with SMTP id SAA26561; Thu, 3 Oct 1996 18:10:27 +0100
Date: Thu, 3 Oct 1996 18:10:24 +0100 (BST)
From: Doug Rabson
To: dyson@freebsd.org
cc: karpen@ocean.campus.luth.se, bde@zeta.org.au, heo@cslsun10.sogang.ac.kr, freebsd-fs@freebsd.org
Subject: Re: nbuf in buffer cache
In-Reply-To: <199610031607.LAA00894@dyson.iquest.net>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-fs@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

On Thu, 3 Oct 1996, John Dyson wrote:

> >
> > It would be nice if, instead of marking the buffer for a later
> > commit, the underlying pages could be marked instead.  This would be
> > tricky to fit into the existing vnode system though.
> >
> We can do that in the current vfs_bio, modulo some bugs.  I probably
> won't get to it until the NEXT big release -- my 2.2/3.0 plate is so
> full that it is spilling over (mixed metaphor, I think :-)).

There's no rush.  The performance within the buffer metaphor is fine
most of the time.

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com  Phone: +44 171 734 3761  FAX: +44 171 734 6426
From owner-freebsd-fs Thu Oct 3 10:22:54 1996
Return-Path: owner-fs
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id KAA09285 for fs-outgoing; Thu, 3 Oct 1996 10:22:54 -0700 (PDT)
Received: from minnow.render.com (render.demon.co.uk [158.152.30.118]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id KAA09101; Thu, 3 Oct 1996 10:20:04 -0700 (PDT)
Received: from minnow.render.com (minnow.render.com [193.195.178.1]) by minnow.render.com (8.6.12/8.6.9) with SMTP id SAA26614; Thu, 3 Oct 1996 18:18:41 +0100
Date: Thu, 3 Oct 1996 18:18:38 +0100 (BST)
From: Doug Rabson
To: Heo Sung-Gwan
cc: freebsd-hackers@FreeBSD.org, freebsd-fs@FreeBSD.org
Subject: Re: vnode and cluster read-ahead
In-Reply-To:
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-fs@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

On Fri, 4 Oct 1996, Heo Sung-Gwan wrote:

> John Dyson writes:
> >>
> >> You could maintain a number of 'pending readahead' structures
> >> indexed by vnode and block number.  Each call to cluster_read would
> >> check for a pending readahead by hashing.  For efficiency, keep a
> >> pointer to the last readahead structure used by cluster_read in the
> >> vnode in place of the existing in-vnode readahead data.  Should be no
> >> slower than the current system for single process reads and it saves
> >> 4 bytes per vnode :-).
> >
> > Pretty cool idea.  I am remembering now that this deficiency in our
> > read ahead code is well known.  This might be something really good
> > for 2.3 or 3.1 :-).  (Unless someone else wants to implement it --
> > hint hint :-)).
>
> I suggest a new idea.  The fields for read-ahead (maxra, lenra, etc.)
> would live in the file structure, or in another structure (e.g. Doug
> Rabson's readahead structure) pointed to by a new field in the file
> structure.  The vnode would also have a new field containing a pointer
> to the file structure.  This vnode field would be filled in on every
> read system call with the pointer to the file structure, at vn_read()
> in kern/vfs_vnops.c.  Then the file structure could be accessed through
> the vnode in cluster_read.

Not all the vnodes in the system are associated with file structures.
The NFS server uses vnodes directly, along with some other oddities like
exec and coredumps.  If we optimise cluster_read for normal open files,
we should try to avoid pessimising it for other vnode users in the
system.

> Because system calls are nonpreemptive, the pointer to the file
> structure in the vnode would not change until the current read system
> call is finished.

I have vain hopes of a future kernel which is multithreaded, and
introducing a new complication to that is not a good idea IMHO.  In
addition, multiple userland threads could fool a system where readaheads
were calculated per-open-file.

> This method would remove the hashing on vnode and block number.

For the common single-reader case, the vnode would cache a pointer to
the readahead structure, avoiding the hash.  The hash would be a simple
O(1) operation anyway for the multiple reader case and so should not be
a real performance problem.

> Is it really possible?

A friend of mine always used to answer, 'Anything is possible; after
all, it's only software' to that question :-).

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail: dfr@render.com  Phone: +44 171 734 3761  FAX: +44 171 734 6426
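The fast-path/slow-path split Doug describes could look something like
this; every name here is hypothetical, not the actual cluster_read code:

/*
 * Sketch of a hashed pending-readahead lookup with a per-vnode cache
 * of the last hit (hypothetical names).  Invalidating the cached
 * pointer when a rahead is freed is omitted from this sketch.
 */
#include <sys/types.h>
#include <stddef.h>

#define RA_HASHSZ 64			/* power of two */

struct vnode;

struct rahead {
	struct rahead	*ra_next;	/* hash chain */
	struct vnode	*ra_vp;
	daddr_t		ra_nextblk;	/* block expected next */
	int		ra_run;		/* current run length */
};

struct vnode {
	/* ... existing fields ... */
	struct rahead	*ra_cache;	/* last readahead used here */
};

static struct rahead *ra_hash[RA_HASHSZ];

#define RA_HASH(vp, blk) \
	((((unsigned long)(vp) >> 4) + (unsigned long)(blk)) & (RA_HASHSZ - 1))

static struct rahead *
ra_lookup(struct vnode *vp, daddr_t blk)
{
	struct rahead *ra;

	/* Single-reader fast path: no hashing at all. */
	if (vp->ra_cache != NULL && vp->ra_cache->ra_vp == vp &&
	    vp->ra_cache->ra_nextblk == blk)
		return (vp->ra_cache);

	/* Multiple readers: expected O(1) chain walk. */
	for (ra = ra_hash[RA_HASH(vp, blk)]; ra != NULL; ra = ra->ra_next)
		if (ra->ra_vp == vp && ra->ra_nextblk == blk)
			return (vp->ra_cache = ra);
	return (NULL);
}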
From owner-freebsd-fs Thu Oct 3 14:20:45 1996
Return-Path: owner-fs
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id OAA25382 for fs-outgoing; Thu, 3 Oct 1996 14:20:45 -0700 (PDT)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id OAA25372; Thu, 3 Oct 1996 14:20:41 -0700 (PDT)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id OAA06789; Thu, 3 Oct 1996 14:18:58 -0700
From: Terry Lambert
Message-Id: <199610032118.OAA06789@phaeton.artisoft.com>
Subject: Re: vnode and cluster read-ahead
To: dfr@render.com (Doug Rabson)
Date: Thu, 3 Oct 1996 14:18:57 -0700 (MST)
Cc: heo@cslsun10.sogang.ac.kr, freebsd-hackers@FreeBSD.org, freebsd-fs@FreeBSD.org
In-Reply-To: from "Doug Rabson" at Oct 3, 96 06:18:38 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-fs@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

> > I suggest a new idea.  The fields for read-ahead (maxra, lenra, etc.)
> > would live in the file structure, or in another structure (e.g. Doug
> > Rabson's readahead structure) pointed to by a new field in the file
> > structure.  The vnode would also have a new field containing a
> > pointer to the file structure.  This vnode field would be filled in
> > on every read system call with the pointer to the file structure, at
> > vn_read() in kern/vfs_vnops.c.  Then the file structure could be
> > accessed through the vnode in cluster_read.
>
> Not all the vnodes in the system are associated with file structures.
> The NFS server uses vnodes directly, along with some other oddities
> like exec and coredumps.  If we optimise cluster_read for normal open
> files, we should try to avoid pessimising it for other vnode users in
> the system.

To deal with this, you would have to add a "read ahead hints parameter"
to the thing, and for NFS, pass one that will result in no change in the
algorithm.  This might be a good thing, but it would require minor
changes to huge amounts of kernel code.

In addition, it is not clear that the reverse mapping could be
successful; you could change a vnode pointer on call down, but it would
mean that you had destroyed call reentrancy for the interface, since
reentering on the same vnode would potentially blow the same field
before the downcall code could use it.  Again, moving to a parameter
instead of a vnode encoding would fix this, at possibly unacceptable
cost.

> I have vain hopes of a future kernel which is multithreaded, and
> introducing a new complication to that is not a good idea IMHO.  In
> addition, multiple userland threads could fool a system where
> readaheads were calculated per-open-file.

I agree.  In addition, moving to an async call gate to implement
threading, where you make the same call through a different trap entry
point, and potentially blocking operations automagically generate an
async context record plus a context switch, would definitely tickle
this problem.

> > This method would remove the hashing on vnode and block number.
>
> For the common single-reader case, the vnode would cache a pointer to
> the readahead structure, avoiding the hash.  The hash would be a
> simple O(1) operation anyway for the multiple reader case and so
> should not be a real performance problem.

I agree again.
Either you trust cache locality to work, or we might as well throw out
all caching, since we should measure all algorithms by the same
yardstick.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
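Terry's hints-parameter idea would amount to threading one extra
argument through the read path, along these lines; the signature and
names are hypothetical, not the real cluster_read interface:

/*
 * Sketch of a read-ahead hints parameter (hypothetical interface).
 * A NULL hint means "caller keeps no per-file read-ahead state", so
 * the NFS server, exec, and coredump paths pass NULL and see no
 * change in behaviour.
 */
#include <sys/types.h>
#include <stddef.h>

struct rahint {
	daddr_t	rh_nextblk;	/* block the caller expects to read next */
	int	rh_run;		/* current sequential run length */
};

struct vnode;
struct buf;

int	cluster_read(struct vnode *vp, off_t filesize, daddr_t lblkno,
	    long size, struct rahint *hint, struct buf **bpp);

/*
 * vn_read() would pass &fp->f_rahint for an ordinary open file; every
 * other caller passes NULL -- which is what makes the change "minor"
 * but spread across huge amounts of kernel code.
 */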