From owner-freebsd-fs Sun Aug 8 10:55:50 1999 Delivered-To: freebsd-fs@freebsd.org Received: from florence.pavilion.net (florence.pavilion.net [194.242.128.25]) by hub.freebsd.org (Postfix) with ESMTP id B4604150DA; Sun, 8 Aug 1999 10:55:36 -0700 (PDT) (envelope-from joe@florence.pavilion.net) Received: (from joe@localhost) by florence.pavilion.net (8.9.3/8.8.8) id SAA01243; Sun, 8 Aug 1999 18:51:12 +0100 (BST) (envelope-from joe) Date: Sun, 8 Aug 1999 18:51:12 +0100 From: Josef Karthauser To: hackers@freebsd.org, fs@freebsd.org Subject: Disk label recovery - request for suggestions. Message-ID: <19990808185112.A99557@pavilion.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4i X-NCC-RegID: uk.pavilion Organisation: Pavilion Internet plc, 24 The Old Steine, Brighton, BN1 1EL, England Phone: +44-845-333-5000 Fax: +44-845-333-5001 Mobile: +44-403-596893 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org A few weeks ago I had a problem with a missing partition table and disklabel. Niall Smart forwarded me a small C program for scanning a drive for superblocks and rewriting a disklabel table. I'd like to do some work on integrating this into FreeBSD because it seems too useful to leave out. At the very least it could be a stand-alone tool that works on UFS slices; that'd be easy. What I'm wondering, though, is whether it should be an extension to the disklabel program. If so, what extra work is required to make it work with non UFS file systems - is 'disklabel' used on non UFS fs's? Joe -- Josef Karthauser FreeBSD: How many times have you booted today? Technical Manager Viagra for your server (http://www.uk.freebsd.org) Pavilion Internet plc. [joe@pavilion.net, joe@uk.freebsd.org, joe@tao.org.uk] To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Aug 10 12:43: 0 1999 Delivered-To: freebsd-fs@freebsd.org Received: from crufty.research.bell-labs.com (crufty.research.bell-labs.com [204.178.16.49]) by hub.freebsd.org (Postfix) with SMTP id 3EA0914EE5 for ; Tue, 10 Aug 1999 12:42:51 -0700 (PDT) (envelope-from vernick@bell-labs.com) Received: from bronx.dnrc.bell-labs.com ([135.180.160.8]) by crufty; Tue Aug 10 15:40:49 EDT 1999 Received: from bell-labs.com (shortstop [135.180.181.58]) by bronx.dnrc.bell-labs.com (8.9.3/8.9.3) with ESMTP id PAA10504 for ; Tue, 10 Aug 1999 15:41:19 -0400 (EDT) Message-ID: <37B07E3D.16F2B334@bell-labs.com> Date: Tue, 10 Aug 1999 15:32:13 -0400 From: Michael Vernick X-Mailer: Mozilla 4.5 [en] (WinNT; U) X-Accept-Language: en MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Help with understanding file system performance Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Greetings, It's been a few years since I've hacked with FreeBSD, but I'm back and I need some help deciphering some of the file system performance numbers that I'm currently getting. This has probably been discussed before, but I haven't found any good related material. The machine is a P166 w/ 32MB RAM and two 1GB SCSI disks (one for OS and one for Data) running FreeBSD 3.2-RELEASE. The Kernel configuration uses all defaults. My experiment consists of the following two steps: 1.
Create a directory structure of files (depending on certain parameters like height and width of structure) where the files are randomly (uniform distribution) chosen to be between 10KB and 20KB. The total number of files is around 6400 for a total size of about 100MB. 2. Then a reader program is run that randomly reads a subset (3200) of the files. The reader program can have from 1 to 8 processes (fork() is used to create each process). Each process simply uses 'rand()' to get a random file, opens the file ('open()'), reads the file in its entirety using 1 'read(sizeOfFile)' call, then closes the file. Each experiment is run 8 times (varying the number of processes from 1-8) on each different directory structure. The structures, in a nutshell, can be deep (lots of subdirs with few files per directory) or wide (few subdirs and lots of files per directory). Both a single file system and two file systems on the same physical disk are compared. The performance metric is simply bytes/sec read. My results show that: 1. Performance degrades significantly (15-20%) when going from 1 to 2 processes then slowly increases as more processes are run. The same performance is achieved when running a single reader vs. running 8 readers. This happens for each type of directory structure. Is this because of the overhead of directory operations and context switches? I would have hoped to get more parallelism with more processes (i.e. keep the disk at fuller saturation because of Tagged Queuing) but the results don't show that. 2. Performance degrades about 15% for the 1 process experiment when the files are split across 2 file systems vs. a single file system. This one has me somewhat perplexed. Is it because there is more directory information thrashing from disk to memory? 3. On a per process basis, performance increases when the number of files per directory increases/number of subdirs decreases. Is this because there is a better chance the directory information about the file could be in memory? In general, my conjecture is that the more directory information that can be stored in memory, the better, thus leaving all disk activity for retrieving the actual files. Are there kernel parameters which configure how much memory is allocated to directory information (metadata) vs. actual file data? Our goal, of course, is to maximize performance. So any help in the tuning of our system (i.e. reading lots of ~15KB files) would be appreciated. I've started to look through the kernel source code to figure out what is going on, but it isn't easy. There is lots of indirection via function pointers. I've also just started looking through the 4.4BSD OS Design book. Is there any FreeBSD documentation about the file system code? I really didn't see anything in the handbook. Thanks for any help. It's good to be back. Michael Vernick, Ph.D. Multimedia Applications Research Lucent Bell Labs
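A minimal sketch in C of the reader program described above, for readers who want to reproduce the experiment. The data/file.N naming scheme, the per-process read count, and the omission of timing code are assumptions of this sketch; the actual harness was not posted.

---
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NFILES	6400		/* files created in step 1 */
#define NREADS	3200		/* subset read in step 2 */
#define NPROCS	4		/* vary 1..8 per experiment */

int
main(void)
{
	char path[64], buf[20 * 1024];	/* max file size is 20KB */
	int i, n, fd;

	for (i = 0; i < NPROCS; i++) {
		if (fork() == 0) {
			srand(getpid());
			for (n = 0; n < NREADS / NPROCS; n++) {
				/* assumed naming scheme, not Michael's */
				snprintf(path, sizeof(path),
				    "data/file.%d", rand() % NFILES);
				if ((fd = open(path, O_RDONLY)) < 0)
					continue;
				/* one read() for the whole file */
				(void)read(fd, buf, sizeof(buf));
				close(fd);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)		/* reap all readers */
		;
	return (0);
}
---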
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Aug 10 23:50:18 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131]) by hub.freebsd.org (Postfix) with ESMTP id 0B39714DE9 for ; Tue, 10 Aug 1999 23:50:09 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id IAA32011; Wed, 11 Aug 1999 08:48:39 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Michael Vernick Cc: freebsd-fs@FreeBSD.ORG Subject: Re: Help with understanding file system performance In-reply-to: Your message of "Tue, 10 Aug 1999 15:32:13 EDT." <37B07E3D.16F2B334@bell-labs.com> Date: Wed, 11 Aug 1999 08:48:38 +0200 Message-ID: <32009.934354118@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message <37B07E3D.16F2B334@bell-labs.com>, Michael Vernick writes: >My results show that: > >1. Performance degrades significantly (15-20%) when going from 1 to 2 >processes then slowly increases as more processes are run. The same >performance is achieved when running a single reader vs. running 8 >readers. This happens for each type of directory structure. That is a good sign: It means that you don't have to do unnatural things to your application to get full throughput out of our file system. >2. Performance degrades about 15% for the 1 process experiment when the >files are split across 2 file systems vs. a single file system. This >one has me somewhat perplexed. Is it because there is more directory >information thrashing from disk to memory? That sounds weird... Do you have twice as many directories this way? Or are the two filesystems on the same physical disk? If so, you are seeking much more. >3. On a per process basis, performance increases when the number of >files per directory increases/number of subdirs decreases. Is this >because there is a better chance the directory information about the >file could be in memory? Yes. The minimum directory size is the fragsize of the filesystem, filling the directories better means better performance. >Our goal, of course, is to maximize performance. So any help in the >tuning of our system (i.e. reading lots of ~15KB files) would be >appreciated Try fiddling the newfs parameters. I see 17% speedup using: newfs -b 16384 -f 4096 -c 100 Try to fill your directories so they are just below the fragment size of the filesystem (i.e. <1024 bytes with no newfs options, <4096 bytes with the above options). -- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far! To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 6:38:29 1999 Delivered-To: freebsd-fs@freebsd.org Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31]) by hub.freebsd.org (Postfix) with ESMTP id D252014D2E; Wed, 11 Aug 1999 06:38:23 -0700 (PDT) (envelope-from des@flood.ping.uio.no) Received: (from des@localhost) by flood.ping.uio.no (8.9.3/8.9.3) id PAA12263; Wed, 11 Aug 1999 15:38:06 +0200 (CEST) (envelope-from des) To: Josef Karthauser Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Disk label recovery - request for suggestions.
References: <19990808185112.A99557@pavilion.net> From: Dag-Erling Smorgrav Date: 11 Aug 1999 15:38:05 +0200 In-Reply-To: Josef Karthauser's message of "Sun, 8 Aug 1999 18:51:12 +0100" Message-ID: Lines: 10 X-Mailer: Gnus v5.5/Emacs 19.34 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Josef Karthauser writes: > If so, what extra work is required to make it work with non UFS file > systems - is 'disklabel' used on non UFS fs's? Disklabel doesn't work at the fs level, it works at the slice level - dividing slices into partitions, in which you can create file systems. DES -- Dag-Erling Smorgrav - des@flood.ping.uio.no To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 9:16:31 1999 Delivered-To: freebsd-fs@freebsd.org Received: from florence.pavilion.net (florence.pavilion.net [194.242.128.25]) by hub.freebsd.org (Postfix) with ESMTP id AB48A14EE4; Wed, 11 Aug 1999 09:16:23 -0700 (PDT) (envelope-from joe@florence.pavilion.net) Received: (from joe@localhost) by florence.pavilion.net (8.9.3/8.8.8) id RAA13400; Wed, 11 Aug 1999 17:15:14 +0100 (BST) (envelope-from joe) Date: Wed, 11 Aug 1999 17:15:14 +0100 From: Josef Karthauser To: Dag-Erling Smorgrav Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Disk label recovery - request for suggestions. Message-ID: <19990811171514.X88035@pavilion.net> References: <19990808185112.A99557@pavilion.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4i In-Reply-To: ; from Dag-Erling Smorgrav on Wed, Aug 11, 1999 at 03:38:05PM +0200 X-NCC-RegID: uk.pavilion Organisation: Pavilion Internet plc, 24 The Old Steine, Brighton, BN1 1EL, England Phone: +44-845-333-5000 Fax: +44-845-333-5001 Mobile: +44-403-596893 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, Aug 11, 1999 at 03:38:05PM +0200, Dag-Erling Smorgrav wrote: > Josef Karthauser writes: > > If so, what extra work is required to make it work with non UFS file > > systems - is 'disklabel' used on non UFS fs's? > > Disklabel doesn't work at the fs level, it works at the slice level - > dividing slices into partitions, in which you can create file systems. Ahha - of course. Ok, let me re-phrase the question then. By looking at the contents of the superblocks on a UFS file system it's possible to reconstruct a disklabel for a slice. Is this trick possible with other kinds of file systems too? (Does it even make sense to ask that question?). Should this recovery functionality be part of an already existing tool, like disklabel, or should it be a completely new tool? Opinions? Would it be possible to tag swap partitions with an equivalent of a superblock to make their recognition easier under failure conditions? Joe -- Josef Karthauser FreeBSD: How many times have you booted today? Technical Manager Viagra for your server (http://www.uk.freebsd.org) Pavilion Internet plc. 
[joe@pavilion.net, joe@uk.freebsd.org, joe@tao.org.uk] To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 9:23:43 1999 Delivered-To: freebsd-fs@freebsd.org Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31]) by hub.freebsd.org (Postfix) with ESMTP id 206E515531; Wed, 11 Aug 1999 09:23:33 -0700 (PDT) (envelope-from des@flood.ping.uio.no) Received: (from des@localhost) by flood.ping.uio.no (8.9.3/8.9.3) id SAA13015; Wed, 11 Aug 1999 18:23:25 +0200 (CEST) (envelope-from des) To: Josef Karthauser Cc: Dag-Erling Smorgrav , hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Disk label recovery - request for suggestions. References: <19990808185112.A99557@pavilion.net> <19990811171514.X88035@pavilion.net> From: Dag-Erling Smorgrav Date: 11 Aug 1999 18:23:24 +0200 In-Reply-To: Josef Karthauser's message of "Wed, 11 Aug 1999 17:15:14 +0100" Message-ID: Lines: 22 X-Mailer: Gnus v5.5/Emacs 19.34 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Josef Karthauser writes: > Ahha - of course. Ok, let me re-phrase the question then. By looking > at the contents of the superblocks on a UFS file system it's possible to > reconstruct a disklabel for a slice. Well, it's possible to reconstruct the label information for *that particular UFS file system*, since if you know the location of the superblock (or one of its backup copies), you can determine the offset and size of the FS. It won't tell you anything about *other* partitions though. > Is this trick possible with other > kinds of file systems too? That's totally dependent on the particular file system. For instance, a swap partition contains no metadata (that I know of), so all you can do is deduce its size and position from the sizes and positions of surrounding partitions, and of the slice they're in. DES -- Dag-Erling Smorgrav - des@flood.ping.uio.no To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 9:35:52 1999 Delivered-To: freebsd-fs@freebsd.org Received: from florence.pavilion.net (florence.pavilion.net [194.242.128.25]) by hub.freebsd.org (Postfix) with ESMTP id 0745815581; Wed, 11 Aug 1999 09:35:44 -0700 (PDT) (envelope-from joe@florence.pavilion.net) Received: (from joe@localhost) by florence.pavilion.net (8.9.3/8.8.8) id RAA16474; Wed, 11 Aug 1999 17:35:35 +0100 (BST) (envelope-from joe) Date: Wed, 11 Aug 1999 17:35:35 +0100 From: Josef Karthauser To: Dag-Erling Smorgrav Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Disk label recovery - request for suggestions. Message-ID: <19990811173535.Y88035@pavilion.net> References: <19990808185112.A99557@pavilion.net> <19990811171514.X88035@pavilion.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4i In-Reply-To: ; from Dag-Erling Smorgrav on Wed, Aug 11, 1999 at 06:23:24PM +0200 X-NCC-RegID: uk.pavilion Organisation: Pavilion Internet plc, 24 The Old Steine, Brighton, BN1 1EL, England Phone: +44-845-333-5000 Fax: +44-845-333-5001 Mobile: +44-403-596893 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, Aug 11, 1999 at 06:23:24PM +0200, Dag-Erling Smorgrav wrote: > Josef Karthauser writes: > > Ahha - of course. Ok, let me re-phrase the question then. By looking > > at the contents of the superblocks on a UFS file system it's possible to > > reconstruct a disklabel for a slice.
> > Well, it's possible to reconstruct the label information for *that > particular UFS file system*, since if you know the location of the > superblock (or one of its backup copies), you can determine the offset > and size of the FS. It won't tell you anything about *other* > partitions though. That's ok, because each slice has its _own_ label. If the BIOS partition table loses its mind that's a little more work :). > > Is this trick possible with other kinds of file systems too? > > That's totally dependent on the particular file system. For instance, > a swap partition contains no metadata (that I know of), so all you can > do is deduce its size and position from the sizes and positions of > surrounding partitions, and of the slice they're in. > What are the implications of adding a metadata structure to a swap partition? (It only needs a block :). [Although thinking out loud, it's complicated because there's no 'newfs' process that touches the partition; on the other hand, the size of the partition is known at swap-mounting time, so the metadata could be written at that point.] Joe -- Josef Karthauser FreeBSD: How many times have you booted today? Technical Manager Viagra for your server (http://www.uk.freebsd.org) Pavilion Internet plc. [joe@pavilion.net, joe@uk.freebsd.org, joe@tao.org.uk] To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 9:47:23 1999 Delivered-To: freebsd-fs@freebsd.org Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31]) by hub.freebsd.org (Postfix) with ESMTP id 0352014C59; Wed, 11 Aug 1999 09:47:17 -0700 (PDT) (envelope-from des@flood.ping.uio.no) Received: (from des@localhost) by flood.ping.uio.no (8.9.3/8.9.3) id SAA13180; Wed, 11 Aug 1999 18:46:51 +0200 (CEST) (envelope-from des) To: Josef Karthauser Cc: Dag-Erling Smorgrav , hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Disk label recovery - request for suggestions. References: <19990808185112.A99557@pavilion.net> <19990811171514.X88035@pavilion.net> <19990811173535.Y88035@pavilion.net> From: Dag-Erling Smorgrav Date: 11 Aug 1999 18:46:51 +0200 In-Reply-To: Josef Karthauser's message of "Wed, 11 Aug 1999 17:35:35 +0100" Message-ID: Lines: 19 X-Mailer: Gnus v5.5/Emacs 19.34 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Josef Karthauser writes: > On Wed, Aug 11, 1999 at 06:23:24PM +0200, Dag-Erling Smorgrav wrote: > > Josef Karthauser writes: > > > Ahha - of course. Ok, let me re-phrase the question then. By looking > > > at the contents of the superblocks on a UFS file system it's possible to > > > reconstruct a disklabel for a slice. > > Well, it's possible to reconstruct the label information for *that > > particular UFS file system*, since if you know the location of the > > superblock (or one of its backup copies), you can determine the offset > > and size of the FS. It won't tell you anything about *other* > > partitions though. > That's ok, because each slice has its _own_ label. If the BIOS partition > table loses its mind that's a little more work :). You're confusing partitions and slices. DES -- Dag-Erling Smorgrav - des@flood.ping.uio.no
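The superblock scan being discussed can be sketched roughly as follows for a FreeBSD of this vintage. SBOFF, SBSIZE, FS_MAGIC, and struct fs come from <ufs/ffs/fs.h>; the device name and the sector-at-a-time scan are illustrative assumptions, not Niall Smart's actual program.

---
#include <sys/param.h>
#include <sys/types.h>
#include <ufs/ffs/fs.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	char buf[SBSIZE];
	struct fs *fs = (struct fs *)buf;
	off_t off;
	int fd;

	if ((fd = open("/dev/rda0s1", O_RDONLY)) < 0)	/* example slice */
		return (1);
	for (off = 0; ; off += DEV_BSIZE) {
		if (lseek(fd, off, SEEK_SET) == -1 ||
		    read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			break;
		if (fs->fs_magic != FS_MAGIC)
			continue;
		/* a superblock normally sits SBOFF bytes into its partition */
		printf("superblock at sector %ld: partition at sector %ld, "
		    "%d frags of %d bytes\n",
		    (long)(off / DEV_BSIZE),
		    (long)((off - SBOFF) / DEV_BSIZE),
		    fs->fs_size, fs->fs_fsize);
	}
	close(fd);
	return (0);
}
---

Each hit pins down the offset and size of one UFS partition and, as Warner notes later in the thread, strongly hints at where the next one begins.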
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 10: 1:12 1999 Delivered-To: freebsd-fs@freebsd.org Received: from florence.pavilion.net (florence.pavilion.net [194.242.128.25]) by hub.freebsd.org (Postfix) with ESMTP id BC313155A1; Wed, 11 Aug 1999 10:01:01 -0700 (PDT) (envelope-from joe@florence.pavilion.net) Received: (from joe@localhost) by florence.pavilion.net (8.9.3/8.8.8) id SAA19912; Wed, 11 Aug 1999 18:00:48 +0100 (BST) (envelope-from joe) Date: Wed, 11 Aug 1999 18:00:48 +0100 From: Josef Karthauser To: Dag-Erling Smorgrav Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Disk label recovery - request for suggestions. Message-ID: <19990811180048.Z88035@pavilion.net> References: <19990808185112.A99557@pavilion.net> <19990811171514.X88035@pavilion.net> <19990811173535.Y88035@pavilion.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4i In-Reply-To: ; from Dag-Erling Smorgrav on Wed, Aug 11, 1999 at 06:46:51PM +0200 X-NCC-RegID: uk.pavilion Organisation: Pavilion Internet plc, 24 The Old Steine, Brighton, BN1 1EL, England Phone: +44-845-333-5000 Fax: +44-845-333-5001 Mobile: +44-403-596893 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, Aug 11, 1999 at 06:46:51PM +0200, Dag-Erling Smorgrav wrote: > Josef Karthauser writes: > > On Wed, Aug 11, 1999 at 06:23:24PM +0200, Dag-Erling Smorgrav wrote: > > > Josef Karthauser writes: > > > > Ahha - of course. Ok, let me re-phrase the question then. By looking > > > > at the contents of the superblocks on a UFS file system it's possible to > > > > reconstruct a disklabel for a slice. > > > Well, it's possible to reconstruct the label information for *that > > > particular UFS file system*, since if you know the location of the > > > superblock (or one of its backup copies), you can determine the offset > > > and size of the FS. It won't tell you anything about *other* > > > partitions though. > > That's ok, because each slice has its _own_ label. If the BIOS partition > > table loses its mind that's a little more work :). > > You're confusing partitions and slices. I don't think so - PCs have a partition table. We FreeBSDers call these partitions 'slices', and subdivide these into FreeBSD partitions. Each slice (PC partition) has a disklabel which denotes where the FreeBSD partitions live on the slice. I see what you were saying above now. I agree that the superblock for a UFS file system won't tell anything about other UFS partitions, but a block-by-block search of the whole slice will identify potential superblocks that will. Joe -- Josef Karthauser FreeBSD: How many times have you booted today? Technical Manager Viagra for your server (http://www.uk.freebsd.org) Pavilion Internet plc.
[joe@pavilion.net, joe@uk.freebsd.org, joe@tao.org.uk] To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 10:33:39 1999 Delivered-To: freebsd-fs@freebsd.org Received: from rover.village.org (rover.village.org [204.144.255.49]) by hub.freebsd.org (Postfix) with ESMTP id AEA5115776; Wed, 11 Aug 1999 10:33:19 -0700 (PDT) (envelope-from imp@harmony.village.org) Received: from harmony.village.org (harmony.village.org [10.0.0.6]) by rover.village.org (8.9.3/8.9.3) with ESMTP id LAA20642; Wed, 11 Aug 1999 11:33:05 -0600 (MDT) (envelope-from imp@harmony.village.org) Received: from harmony.village.org (localhost.village.org [127.0.0.1]) by harmony.village.org (8.9.3/8.8.3) with ESMTP id LAA18169; Wed, 11 Aug 1999 11:33:30 -0600 (MDT) Message-Id: <199908111733.LAA18169@harmony.village.org> To: Dag-Erling Smorgrav Subject: Re: Disk label recovery - request for suggestions. Cc: Josef Karthauser , hackers@FreeBSD.ORG, fs@FreeBSD.ORG In-reply-to: Your message of "11 Aug 1999 18:23:24 +0200." References: <19990808185112.A99557@pavilion.net> <19990811171514.X88035@pavilion.net> Date: Wed, 11 Aug 1999 11:33:30 -0600 From: Warner Losh Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message Dag-Erling Smorgrav writes: : superblock (or one of its backup copies), you can determine the offset : and size of the FS. It won't tell you anything about *other* : partitions though. It will give a fairly strong hint, however. If you know what is taken up by this partition, you can remove it from the pool of available space and guess with a relatively high degree of accuracy that the next partition begins where this one ends. : > Is this trick possible with other : > kinds of file systems too? : : That's totally dependent on the particular file system. For instance, : a swap partition contains no metadata (that I know of), so all you can : do is deduce it's size and position from the sizes and positions of : surrounding partitions, and of the slice they're in. Yes. This is true.... That's one of the problems of my disklabel reconstruction program that tries to run fast... It slows way down when it hits the swap area... 
Warner To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 14:48:19 1999 Delivered-To: freebsd-fs@freebsd.org Received: from gatewaya.anheuser-busch.com (gatewaya.anheuser-busch.com [151.145.250.252]) by hub.freebsd.org (Postfix) with SMTP id 61BFE14E31; Wed, 11 Aug 1999 14:48:01 -0700 (PDT) (envelope-from Matthew.Alton@anheuser-busch.com) Received: by gatewaya.anheuser-busch.com; id QAA24660; Wed, 11 Aug 1999 16:49:19 -0500 Received: from stlexggtw002-pozzoli.fw-users.busch.com(151.145.101.130) by gatewaya.anheuser-busch.com via smap (V5.0) id xma024583; Wed, 11 Aug 99 16:48:57 -0500 Received: from stlabcexg006.anheuser-busch.com ([151.145.101.161]) by 151.145.101.130 (Norton AntiVirus for Internet Email Gateways 1.0) ; Wed, 11 Aug 1999 21:46:50 0000 (GMT) Received: by stlabcexg006.anheuser-busch.com with Internet Mail Service (5.5.2448.0) id ; Wed, 11 Aug 1999 16:46:33 -0500 Message-ID: <0740CBD1D149D31193EB0008C7C56836EB8B05@STLABCEXG012> From: "Alton, Matthew" To: "'Hackers@FreeBSD.ORG'" , "'fs@FreeBSD.ORG'" Subject: BSD-XFS Update Date: Wed, 11 Aug 1999 16:46:46 -0500 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2448.0) Content-Type: text/plain Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org SGI has released a portion of the XFS source code under the GPL: http://oss.sgi.com/projects/xfs/download/ the source file is xfs_log.tar.gz. Of greater interest at this stage are the documents in: http://oss.sgi.com/projects/xfs/design_docs/ I am currently researching methods for implementing the 64-bit syscalls stat64(), fstat64(), lseek64() &etc. delineated in the SGI design doc _64 Bit File Access_ by Adam Sweeney. The BSD-XFS port will be made available as a patch to the RELEASE FreeBSD kernels. Matthew Alton Computer Services - UNIX Systems Administration (314)632-6644 matthew.alton@anheuser-busch.com alton@plantnet.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 15:36:38 1999 Delivered-To: freebsd-fs@freebsd.org Received: from alpo.whistle.com (alpo.whistle.com [207.76.204.38]) by hub.freebsd.org (Postfix) with ESMTP id A844C14DEE; Wed, 11 Aug 1999 15:36:33 -0700 (PDT) (envelope-from julian@whistle.com) Received: from current1.whistle.com (current1.whistle.com [207.76.205.22]) by alpo.whistle.com (8.9.1a/8.9.1) with SMTP id PAA71891; Wed, 11 Aug 1999 15:34:57 -0700 (PDT) Date: Wed, 11 Aug 1999 15:35:31 -0700 (PDT) From: Julian Elischer To: "Alton, Matthew" Cc: "'Hackers@FreeBSD.ORG'" , "'fs@FreeBSD.ORG'" Subject: Re: BSD-XFS Update In-Reply-To: <0740CBD1D149D31193EB0008C7C56836EB8B05@STLABCEXG012> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org stat, fstat, lseek are all already 64 bits in freebsd..... On Wed, 11 Aug 1999, Alton, Matthew wrote: > SGI has released a portion of the XFS source code under the GPL: > > http://oss.sgi.com/projects/xfs/download/ > > the source file is xfs_log.tar.gz. > > Of greater interest at this stage are the documents in: > > http://oss.sgi.com/projects/xfs/design_docs/ > > I am currently researching methods for implementing the 64-bit > syscalls stat64(), fstat64(), lseek64() &etc. delineated in the > SGI design doc _64 Bit File Access_ by Adam Sweeney. > > The BSD-XFS port will be made available as a patch to the RELEASE > FreeBSD kernels. 
> > > Matthew Alton > Computer Services - UNIX Systems Administration > (314)632-6644 matthew.alton@anheuser-busch.com > alton@plantnet.com > > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-hackers" in the body of the message > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 16: 4:21 1999 Delivered-To: freebsd-fs@freebsd.org Received: from gatewaya.anheuser-busch.com (gatewaya.anheuser-busch.com [151.145.250.252]) by hub.freebsd.org (Postfix) with SMTP id 65C5615636; Wed, 11 Aug 1999 16:04:06 -0700 (PDT) (envelope-from Matthew.Alton@anheuser-busch.com) Received: by gatewaya.anheuser-busch.com; id SAA01734; Wed, 11 Aug 1999 18:05:51 -0500 Received: from stlexggtw002-pozzoli.fw-users.busch.com(151.145.101.130) by gatewaya.anheuser-busch.com via smap (V5.0) id xma001698; Wed, 11 Aug 99 18:05:44 -0500 Received: from stlabcexg006.anheuser-busch.com ([151.145.101.161]) by 151.145.101.130 (Norton AntiVirus for Internet Email Gateways 1.0) ; Wed, 11 Aug 1999 23:03:37 0000 (GMT) Received: by stlabcexg006.anheuser-busch.com with Internet Mail Service (5.5.2448.0) id ; Wed, 11 Aug 1999 18:03:20 -0500 Message-ID: <0740CBD1D149D31193EB0008C7C56836EB8B06@STLABCEXG012> From: "Alton, Matthew" To: "'Julian Elischer'" Cc: "'Hackers@FreeBSD.ORG'" , "'fs@FreeBSD.ORG'" Subject: RE: BSD-XFS Update Date: Wed, 11 Aug 1999 18:03:33 -0500 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2448.0) Content-Type: text/plain Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Quite so. Thank you. I initially only looked at things like: 19 COMPAT POSIX { long lseek(int fd, long offset, int whence); } from /usr/src/sys/kern/syscalls.master and assumed a 32-bit long int. The easy way to deal with this is to change the calls in the XFS code. The syscall part is mostly done. > -----Original Message----- > From: Julian Elischer [SMTP:julian@whistle.com] > Sent: Wednesday, August 11, 1999 5:36 PM > To: Alton, Matthew > Cc: 'Hackers@FreeBSD.ORG'; 'fs@FreeBSD.ORG' > Subject: Re: BSD-XFS Update > > stat, fstat, lseek are all already 64 bits in freebsd..... > > > On Wed, 11 Aug 1999, Alton, Matthew wrote: > > > SGI has released a portion of the XFS source code under the GPL: > > > > http://oss.sgi.com/projects/xfs/download/ > > > > the source file is xfs_log.tar.gz. > > > > Of greater interest at this stage are the documents in: > > > > http://oss.sgi.com/projects/xfs/design_docs/ > > > > I am currently researching methods for implementing the 64-bit > > syscalls stat64(), fstat64(), lseek64() &etc. delineated in the > > SGI design doc _64 Bit File Access_ by Adam Sweeney. > > > > The BSD-XFS port will be made available as a patch to the RELEASE > > FreeBSD kernels. 
> > > > Matthew Alton > > Computer Services - UNIX Systems Administration > > (314)632-6644 matthew.alton@anheuser-busch.com > > alton@plantnet.com > > > > > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > > with "unsubscribe freebsd-hackers" in the body of the message > > > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 11 19:38:28 1999 Delivered-To: freebsd-fs@freebsd.org Received: from lestat.nas.nasa.gov (lestat.nas.nasa.gov [129.99.33.127]) by hub.freebsd.org (Postfix) with ESMTP id 0686414CFE; Wed, 11 Aug 1999 19:38:20 -0700 (PDT) (envelope-from thorpej@lestat.nas.nasa.gov) Received: from lestat (localhost [127.0.0.1]) by lestat.nas.nasa.gov (8.8.8/8.6.12) with ESMTP id TAA00416; Wed, 11 Aug 1999 19:37:23 -0700 (PDT) Message-Id: <199908120237.TAA00416@lestat.nas.nasa.gov> To: "Alton, Matthew" Cc: "'Hackers@FreeBSD.ORG'" , "'fs@FreeBSD.ORG'" Subject: Re: BSD-XFS Update Reply-To: Jason Thorpe From: Jason Thorpe Date: Wed, 11 Aug 1999 19:37:22 -0700 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, 11 Aug 1999 16:46:46 -0500 "Alton, Matthew" wrote: > I am currently researching methods for implementing the 64-bit > syscalls stat64(), fstat64(), lseek64() &etc. delineated in the > SGI design doc _64 Bit File Access_ by Adam Sweeney. ...which, of course, is completely unnecessary, as systems derived from 4.4BSD have always had 64-bit file offsets. -- Jason R. Thorpe To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 0:14: 7 1999 Delivered-To: freebsd-fs@freebsd.org Received: from raditex.se (gandalf.raditex.se [192.5.36.18]) by hub.freebsd.org (Postfix) with ESMTP id A9FFF14D43 for ; Thu, 12 Aug 1999 00:14:01 -0700 (PDT) (envelope-from ps@raditex.se) Received: (from ps@localhost) by raditex.se (8.9.3/8.9.3) id JAA14266 for freebsd-fs@freebsd.org; Thu, 12 Aug 1999 09:11:56 +0200 (CEST) (envelope-from ps) Date: Tue, 10 Aug 1999 16:32:45 +0200 From: Patrik Sundberg To: freebsd-fs@freebsd.org Subject: mfs and imagefile (/usr/src/sbin/newfs/mkfs.c) Message-ID: <19990810163402.B10448@radiac.sickla.raditex.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.6i Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Hi, I hope this goes to the right forum; otherwise I apologize. I have been trying to set up a FreeBSD box to avoid disc-writes. This led me to use mfs-filesystems for things like /var. In the process of doing this I wanted to initialize a mfs-fs from an image-file. I thought the -F option was the way to go, but after testing a bit and reading the source it seems like when using -F one always gets an empty filesystem - it doesn't care about the contents of the file given. We asked Andrzej Bialecki (picobsd) about it and he too thought the -F flag was the way to accomplish this, but later came to the same conclusion as we did.
The relevant source code (mkfs.c):

	if(filename) {
		unsigned char buf[BUFSIZ];
		unsigned long l,l1;
		fd = open(filename,O_RDWR|O_TRUNC|O_CREAT,0644);
		if(fd < 0)
			err(12, "%s", filename);
		for(l=0;l< fssize * sectorsize;l += l1) {
			l1 = fssize * sectorsize;
			if (BUFSIZ < l1)
				l1 = BUFSIZ;
			if (l1 != write(fd,buf,l1))
				err(12, "%s", filename);
		}
		membase = mmap(0, fssize * sectorsize,
		    PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
		if(membase == MAP_FAILED)
			err(12, "mmap");
		close(fd);
	} else {

It truncates the file to size 0 and then writes an uninitialized buffer to it until it is of the correct size(?). Is there any reason for not having the possibility to use the contents of the file to initialize the fs? Maybe we could have a flag which specifies the behaviour of -F? -- Patrik Sundberg - email: ps@raditex.se - PGP: finger ps@raditex.se ---> telefon: 08-636 59 39 - mobiltelefon: 070-760 22 40 <--- To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 3:47:27 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (wandering-wizard.cybercity.dk [212.242.41.238]) by hub.freebsd.org (Postfix) with ESMTP id A495914FDF for ; Thu, 12 Aug 1999 03:47:24 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id LAA00331; Thu, 12 Aug 1999 11:05:34 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Patrik Sundberg Cc: freebsd-fs@FreeBSD.ORG Subject: Re: mfs and imagefile (/usr/src/sbin/newfs/mkfs.c) In-reply-to: Your message of "Tue, 10 Aug 1999 16:32:45 +0200." <19990810163402.B10448@radiac.sickla.raditex.se> Date: Thu, 12 Aug 1999 11:05:33 +0200 Message-ID: <329.934448733@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org >We asked Andrzej Bialecki (picobsd) about it and he too thought the -F flag was >the way to accomplish this, but later came to the same conclusion as we did. The -F flag was added because we didn't have a working vn(4) at the time and Jordan and I were sick and tired of make release falling over because of sick floppy disks. It works the opposite of what you want: it preserves the contents of the MFS after you unmount it. Feel free to add code for what you suggest, but make it read the input from a file descriptor so that I can gunzip < mfs.image.gz | mount_mfs ... -- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
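The change Poul-Henning asks for might look roughly like this. preload_image() is a hypothetical helper, not committed mkfs.c code; called on the mmap'ed region with fd 0, it makes the gunzip pipeline above work, and a short image simply leaves the tail of the MFS zero-filled.

---
#include <sys/types.h>
#include <err.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy a filesystem image from a descriptor
 * (e.g. stdin) into the freshly mmap'ed MFS region.  A short image
 * leaves the rest of the region zero-filled.
 */
static void
preload_image(int fd, char *membase, size_t total)
{
	size_t done = 0;
	ssize_t n;

	while (done < total) {
		n = read(fd, membase + done, total - done);
		if (n < 0)
			err(12, "reading image");
		if (n == 0)
			break;		/* end of image */
		done += (size_t)n;
	}
}
---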
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 7:31:18 1999 Delivered-To: freebsd-fs@freebsd.org Received: from frmug.org (frmug-gw.frmug.org [193.56.58.252]) by hub.freebsd.org (Postfix) with ESMTP id 2BD601577D for ; Thu, 12 Aug 1999 07:31:09 -0700 (PDT) (envelope-from roberto@keltia.freenix.fr) Received: (from uucp@localhost) by frmug.org (8.9.1/frmug-2.3/nospam) with UUCP id QAA20199 for freebsd-fs@FreeBSD.ORG; Thu, 12 Aug 1999 16:31:16 +0200 (CEST) (envelope-from roberto@keltia.freenix.fr) Received: by keltia.freenix.fr (Postfix, from userid 101) id CBD3A870B; Thu, 12 Aug 1999 13:36:48 +0200 (CEST) Date: Thu, 12 Aug 1999 13:36:48 +0200 From: Ollivier Robert To: freebsd-fs@FreeBSD.ORG Subject: Re: Help with understanding file system performance Message-ID: <19990812133648.A64754@keltia.freenix.fr> Mail-Followup-To: freebsd-fs@FreeBSD.ORG References: <37B07E3D.16F2B334@bell-labs.com> <32009.934354118@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii User-Agent: Mutt/0.95.5i In-Reply-To: <32009.934354118@critter.freebsd.dk>; from Poul-Henning Kamp on Wed, Aug 11, 1999 at 08:48:38AM +0200 X-Operating-System: FreeBSD 4.0-CURRENT/ELF ctm#5543 AMD-K6 MMX @ 200 MHz Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org According to Poul-Henning Kamp:
> Yes. The minimum directory size is the fragsize of the filesystem,

I'm afraid it is not the case...

214 [13:35] root@tara:/src# df .
Filesystem  1K-blocks     Used    Avail Capacity  Mounted on
/dev/da0s2d   1375362   602742   662592    48%    /src
215 [13:35] root@tara:/src# dumpfs /dev/rda0s2d|more
magic   11954   time    Thu Aug 12 13:34:13 1999
id      [ 360d5f59 20984fb4 ]
cylgrp  dynamic inodes  4.4BSD
nbfree  68880   ndir    22396   nifree  217288  nffree  29178
ncg     44      ncyl    694     size    1419379 blocks  1375362
bsize   8192    shift   13      mask    0xffffe000
fsize   1024    shift   10      mask    0xfffffc00
        ^^^^
216 [13:35] root@tara:/src# ll
total 5
drwxr-xr-x   2 roberto  staff   512 Sep 26  1998 CVS/
                                ^^^
drwxr-xr-x   4 root     wheel   512 Jan 24  1999 obj/
drwxr-xr-x  48 roberto  staff  1024 Mar 11 00:51 ports/
drwxr-xr-x  21 roberto  staff   512 Jul 10 18:06 src/
drwxr-xr-x   2 root     staff  1024 Jul 26 22:53 world/

-- Ollivier ROBERT -=- FreeBSD: The Power to Serve! -=- roberto@keltia.freenix.fr FreeBSD keltia.freenix.fr 4.0-CURRENT #73: Sat Jul 31 15:36:05 CEST 1999 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 7:37:32 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 19A171577D for ; Thu, 12 Aug 1999 07:37:28 -0700 (PDT) (envelope-from zzhang@cs.binghamton.edu) Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id KAA05322; Thu, 12 Aug 1999 10:37:33 -0400 (EDT) Date: Thu, 12 Aug 1999 10:24:30 -0400 (EDT) From: Zhihui Zhang To: Ollivier Robert Cc: freebsd-fs@FreeBSD.ORG Subject: Re: Help with understanding file system performance In-Reply-To: <19990812133648.A64754@keltia.freenix.fr> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Thu, 12 Aug 1999, Ollivier Robert wrote:
> According to Poul-Henning Kamp:
> > Yes. The minimum directory size is the fragsize of the filesystem,
>
> I'm afraid it is not the case...
>
> 214 [13:35] root@tara:/src# df .
> Filesystem 1K-blocks Used Avail Capacity Mounted on
> /dev/da0s2d 1375362 602742 662592 48% /src
> 215 [13:35] root@tara:/src# dumpfs /dev/rda0s2d|more
> magic 11954 time Thu Aug 12 13:34:13 1999
> id [ 360d5f59 20984fb4 ]
> cylgrp dynamic inodes 4.4BSD
> nbfree 68880 ndir 22396 nifree 217288 nffree 29178
> ncg 44 ncyl 694 size 1419379 blocks 1375362
> bsize 8192 shift 13 mask 0xffffe000
> fsize 1024 shift 10 mask 0xfffffc00
>       ^^^^
> 216 [13:35] root@tara:/src# ll
> total 5
> drwxr-xr-x 2 roberto staff 512 Sep 26 1998 CVS/
>                            ^^^

The fsize is the number of bytes in a fragment. Even if your file is 1 byte, that file needs 1024 bytes to store. However, the byte count is still one byte. In your example, the byte count is 512 bytes. -Zhihui To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 7:48:48 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (wandering-wizard.cybercity.dk [212.242.41.238]) by hub.freebsd.org (Postfix) with ESMTP id A779414EDF for ; Thu, 12 Aug 1999 07:48:43 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id QAA01406; Thu, 12 Aug 1999 16:45:09 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Zhihui Zhang Cc: Ollivier Robert , freebsd-fs@FreeBSD.ORG Subject: Re: Help with understanding file system performance In-reply-to: Your message of "Thu, 12 Aug 1999 10:24:30 EDT." Date: Thu, 12 Aug 1999 16:45:09 +0200 Message-ID: <1404.934469109@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message , Zhihui Zhang writes:
>
>> According to Poul-Henning Kamp:
>> > Yes. The minimum directory size is the fragsize of the filesystem,
>>
>> I'm afraid it is not the case...
>>
>> 216 [13:35] root@tara:/src# ll
>> total 5
>> drwxr-xr-x 2 roberto staff 512 Sep 26 1998 CVS/
>>                            ^^^
>The fsize is the number of bytes in a fragment. Even if your file is 1
>byte, that file needs 1024 bytes to store. However, the byte count is
>still one byte. In your example, the byte count is 512 bytes.

Yeah, well, the real issue is whether the UFS implementation works on the 512-byte size or on the fragsize. -- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
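Both numbers being argued about here can be seen from stat(2): st_size is the directory's byte count (the 512 in the listing), while st_blocks shows what the filesystem actually allocated. A small illustration, with the path taken from the command line:

---
#include <sys/param.h>
#include <sys/stat.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2 || stat(argv[1], &st) < 0)
		return (1);
	/* st_blocks counts DEV_BSIZE (512-byte) sectors */
	printf("%s: size %ld bytes, allocated %ld bytes\n", argv[1],
	    (long)st.st_size, (long)st.st_blocks * DEV_BSIZE);
	return (0);
}
---

On the CVS/ directory above this would print a size of 512 but an allocation of 1024, matching Zhihui's point.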
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 16:14:57 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134]) by hub.freebsd.org (Postfix) with ESMTP id 4373114C1D for ; Thu, 12 Aug 1999 16:14:52 -0700 (PDT) (envelope-from tlambert@usr04.primenet.com) Received: (from daemon@localhost) by smtp04.primenet.com (8.9.3/8.9.3) id QAA03382; Thu, 12 Aug 1999 16:14:24 -0700 (MST) Received: from usr04.primenet.com(206.165.6.204) via SMTP by smtp04.primenet.com, id smtpdAAAvhaqvg; Thu Aug 12 16:14:16 1999 Received: (from tlambert@localhost) by usr04.primenet.com (8.8.5/8.8.5) id QAA23506; Thu, 12 Aug 1999 16:14:05 -0700 (MST) From: Terry Lambert Message-Id: <199908122314.QAA23506@usr04.primenet.com> Subject: Re: Help with understanding file system performance To: phk@critter.freebsd.dk (Poul-Henning Kamp) Date: Thu, 12 Aug 1999 23:14:05 +0000 (GMT) Cc: zzhang@cs.binghamton.edu, roberto@keltia.freenix.fr, freebsd-fs@FreeBSD.ORG In-Reply-To: <1404.934469109@critter.freebsd.dk> from "Poul-Henning Kamp" at Aug 12, 99 04:45:09 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Poul-Henning Kamp writes:
> Zhihui Zhang writes:
> >
> >> According to Poul-Henning Kamp:
> >> > Yes. The minimum directory size is the fragsize of the filesystem,
> >>
> >> I'm afraid it is not the case...
> >>
> >> 216 [13:35] root@tara:/src# ll
> >> total 5
> >> drwxr-xr-x 2 roberto staff 512 Sep 26 1998 CVS/
> >>                            ^^^
> >The fsize is the number of bytes in a fragment. Even if your file is 1
> >byte, that file needs 1024 bytes to store. However, the byte count is
> >still one byte. In your example, the byte count is 512 bytes.
>
> Yeah, well, the real issue is whether the UFS implementation works on
> the 512-byte size or on the fragsize.

Poul's right. More particularly, there are two concepts here:

1)	File system block size
2)	Directory entry block size

The directory entry block size is a physical disk block. This is intentional for the purposes of atomicity of directory entry block updates. In point of fact, the code is incapable of dealing with anything other than BLKATOFF()-type semantics. Directories are files. This is an implementation detail, and the wording of POSIX specifically distances itself from the concept that directories and files are the same primitive object. This is probably in an attempt to allow VMS, NT, and NetWare filesystems to claim POSIX compliance. The filesystem block allocation table in directories is unique, in that it is generally used as a convenience for locating physical blocks, rather than using the standard filesystem block access mechanisms, when reading or writing directories. There are a number of performance penalties for this, especially on large directories, where it is not possible to trigger sequential readahead through use of the getdents() system call sequentially accessing sequential 512b/physical_block_size extents. There also appears to be a misunderstanding about frags here:

> >> drwxr-xr-x 2 roberto staff 512 Sep 26 1998 CVS/
> >>                            ^^^
> >The fsize is the number of bytes in a fragment. Even if your file is 1
> >byte, that file needs 1024 bytes to store. However, the byte count is
> >still one byte. In your example, the byte count is 512 bytes.
The frag size is, by default, 1/8 of the filesystem block size. For a filesystem block size of 4096, the frag size is 512b, which is the physical block size on most media (e.g. most everything that you might have an FFS on, except Japanese magneto-optical and some Japanese Winchester disk drives). The frag size can be tuned down below this (i.e. 1/4, 1/2, 1). The only case where 1024 bytes of physical disk would be used is at a filesystem block size of 8192 (or greater), which, divided by 8, gives 1024b (or greater). In this case, the directory entry structure size is... still the physical device block size, or 512b. As an exercise for the reader, try implementing a directory entry block size in excess of 512b (e.g. 1024b, in an attempt to support both 8.3 names and 256 character Unicode names for files). The problem you will encounter is that the physical disk only guarantees atomicity at the block I/O level. Soft Updates allow this to work for file contents, but inodes are still 128 bytes (sub 1 physical device block) and directory entry blocks are still 512b (equal to or sub the physical device block size). There aren't really structures to allow for an encapsulated update of these objects to occur, to allow them to exceed the physical device block size, yet remain atomic. What happens at the inode data contents level is that new blocks are allocated, given the new content for the region, verified to have been written to disk, and then the direct block list in the inode, or the direct block list of an indirect block pointed to by the inode or by another indirect block, is updated. This means that if a crash occurs before the block list is modified, the old contents remain, in their entirety, and if a crash occurs after the block list is modified, then because the data was verified on disk before the update occurred, the new contents are there, in their entirety. This is called an encapsulated two stage commit, in database terms. For inodes, indirect blocks, and directory entry blocks, there is no two stage commit, because there is no indirection of their data contents. Hope this sets things straight in your mind (not you, Poul, I know you already understand it 8-)). Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.
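A userland analogue of the ordering Terry describes, with illustrative file names: the new data block is written and fsync()ed before the single-sector "block list" pointing at it is overwritten in place, so a crash leaves either the old contents or the new, never a mix.

---
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char *newdata = "new contents, on disk before the commit\n";
	const char ptr[] = "block.new\n";	/* fits in one sector */
	int fd;

	/* stage 1: write the new data block and wait for it */
	fd = open("block.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	write(fd, newdata, strlen(newdata));
	fsync(fd);			/* data verified on disk */
	close(fd);

	/* stage 2: overwrite the one-sector "block list" in place;
	 * a sub-sector write completes atomically on the disk, like
	 * rewriting an inode's direct block list */
	fd = open("pointer", O_WRONLY | O_CREAT, 0644);
	lseek(fd, 0, SEEK_SET);
	write(fd, ptr, sizeof(ptr) - 1);
	fsync(fd);
	close(fd);
	return (0);
}
---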
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 12 18:16:49 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 2E34F14C9E for ; Thu, 12 Aug 1999 18:16:41 -0700 (PDT) (envelope-from zzhang@cs.binghamton.edu) Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id VAA20795; Thu, 12 Aug 1999 21:16:16 -0400 (EDT) Date: Thu, 12 Aug 1999 21:02:32 -0400 (EDT) From: Zhihui Zhang To: Terry Lambert Cc: Poul-Henning Kamp , roberto@keltia.freenix.fr, freebsd-fs@FreeBSD.ORG Subject: Re: Help with understanding file system performance In-Reply-To: <199908122314.QAA23506@usr04.primenet.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Thu, 12 Aug 1999, Terry Lambert wrote: > The filesystem block allocation table in directories is unique, in > that it is generally used as a convenience for locating physical > blocks, rather than using the standard filesystem block access > mechanisms, when reading or writing directories. Directory files have the same on-disk structure as regular files. However, they can never have holes and they can only be extended at the end of the file in device block chunks. No directory entry can cross a device block boundary, which guarantees atomic updates. However, I do not know why you say the block map (direct and indirect blocks) of a directory is only used as a convenience. I mean there is a need to call VOP_BMAP() on a directory file. The routine ffs_blkatoff() calls bread(), which in turn calls VOP_BMAP(). The in-core inode does have several fields to facilitate the insertion of new directory entries. But we still need the block map (block allocation table). Directory files are also special in that we can not write into them with the write() system call as with normal files. They use a special routine to grow, i.e., ufs_direnter(). By the way, we can use the read() system call to read directory files as we do with normal files. > There are a number of performance penalties for this, especially > on large directories, where it is not possible to trigger sequential > readahead through use of the getdents() system call sequentially > accessing sequential 512b/physical_block_size extents. I do not understand this. The read-ahead mechanism should work on any files. I thought the reorganization of directory entries within a directory block when you delete an entry is an inefficiency. Does this issue have anything to do with the VMIO directory issue discussed earlier this year? > The frag size can be tuned down below this (i.e. 1/4, 1/2, 1). > > The only case where 1024 bytes of physical disk would be used is at > a filesystem block size of 8192 (or greater), which, divided by 8, > gives 1024b (or greater). I did not realize this before. The maximum ratio is 8. So if the filesystem block is 8192, the allocation unit (fragment size) cannot be 512 because 8192/512 > 8. > This is called an encapsulated two stage commit, in database terms. > > For inodes, indirect blocks, and directory entry blocks, there is > no two stage commit, because there is no indirection of their data > contents. I guess you mean that their data are not managed by any higher-level metadata which must be updated together. Thanks for your help. -Zhihui
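Since UFS of this era allows read() on directories, as Zhihui notes, the packed entries can be walked directly. struct direct and DIRBLKSIZ come from <ufs/ufs/dir.h>; note that the d_reclen chain never crosses a DIRBLKSIZ boundary, which is what makes single-block updates atomic. A sketch:

---
#include <sys/param.h>
#include <sys/types.h>
#include <ufs/ufs/dir.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	char blk[DIRBLKSIZ], *p;
	struct direct *dp;
	int fd;

	if ((fd = open(argc > 1 ? argv[1] : ".", O_RDONLY)) < 0)
		return (1);
	/* one DIRBLKSIZ (device block) at a time, via plain read() */
	while (read(fd, blk, sizeof(blk)) == (ssize_t)sizeof(blk)) {
		p = blk;
		while (p < blk + DIRBLKSIZ) {
			dp = (struct direct *)p;
			if (dp->d_reclen == 0)	/* corrupt block */
				break;
			if (dp->d_ino != 0)	/* skip deleted slots */
				printf("%.*s ino %lu\n",
				    (int)dp->d_namlen, dp->d_name,
				    (unsigned long)dp->d_ino);
			p += dp->d_reclen;
		}
	}
	close(fd);
	return (0);
}
---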
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 6:13:20 1999 Delivered-To: freebsd-fs@freebsd.org Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.26.10.9]) by hub.freebsd.org (Postfix) with ESMTP id 975C414C25 for ; Fri, 13 Aug 1999 06:13:12 -0700 (PDT) (envelope-from bde@godzilla.zeta.org.au) Received: (from bde@localhost) by godzilla.zeta.org.au (8.8.7/8.8.7) id XAA14647; Fri, 13 Aug 1999 23:13:21 +1000 Date: Fri, 13 Aug 1999 23:13:21 +1000 From: Bruce Evans Message-Id: <199908131313.XAA14647@godzilla.zeta.org.au> To: phk@critter.freebsd.dk, vernick@bell-labs.com Subject: Re: Help with understanding file system performance Cc: freebsd-fs@FreeBSD.ORG Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

>>3. On a per process basis, performance increases when the number of
>>files per directory increases/number of subdirs decreases. Is this
>>because there is a better chance the directory information about the
>>file could be in memory?
>
>Yes. The minimum directory size is the fragsize of the filesystem,
>filling the directories better means better performance.
>
>>Our goal, of course, is to maximize performance. So any help in the
>>tuning of our system (i.e. reading lots of ~15KB files) would be
>>appreciated

Try increasing nbuf. I think effective caching of directories still requires 1 buffer per directory.

>Try fiddling the newfs parameters. I see 17% speedup using:
>
> newfs -b 16384 -f 4096 -c 100

I see a 48% speedup using linux mkfs.ext2 -b 4096 $device $size_of_device_in_4k_units :->. This is despite (or because of) ext2fs's block allocator being broken (it essentially ignores cylinder groups). The following times are for `tar zxpf linux-2.2.9.tar.gz', unmount (to sync), and `tar cf /dev/null linux lost+found' on a new filesystem on a Quantum KA disk on an overclocked Celeron-366 system:

ffs-4096-512:
       41.82 real         3.24 user         3.34 sys
        3.05 real         0.00 user         0.07 sys
       16.53 real         0.08 user         1.34 sys
ffs-4096-1024:
       35.89 real         3.35 user         3.70 sys
        2.11 real         0.00 user         0.07 sys
       15.53 real         0.13 user         1.39 sys
ffs-4096-2048:
       29.32 real         3.24 user         4.36 sys
        1.17 real         0.00 user         0.07 sys
       12.20 real         0.14 user         1.42 sys
ffs-4096-4096:
       28.85 real         3.34 user         4.51 sys
        1.12 real         0.00 user         0.07 sys
       11.24 real         0.10 user         1.59 sys
ffs-8192-1024:
       33.39 real         3.26 user         5.44 sys
        2.94 real         0.00 user         0.07 sys
       13.40 real         0.12 user         1.18 sys
ffs-8192-2048:
       28.08 real         3.29 user         3.01 sys
        2.32 real         0.00 user         0.07 sys
       11.21 real         0.06 user         1.26 sys
ffs-8192-4096:
       25.05 real         3.27 user         2.99 sys
        1.87 real         0.00 user         0.07 sys
        9.17 real         0.09 user         1.21 sys
ffs-8192-8192:
       23.27 real         3.27 user         2.82 sys
        1.53 real         0.00 user         0.07 sys
        8.94 real         0.10 user         1.23 sys
ffs-16384-2048:
       28.22 real         3.43 user         4.78 sys
        2.52 real         0.00 user         0.07 sys
       12.01 real         0.10 user         1.55 sys
ffs-16384-4096:
       24.32 real         3.41 user         3.51 sys
        1.97 real         0.00 user         0.07 sys
       10.56 real         0.11 user         1.37 sys
ffs-16384-8192:
       23.63 real         3.33 user         3.35 sys
        2.35 real         0.00 user         0.07 sys
        8.66 real         0.09 user         1.15 sys
ffs-16384-16384:
       85.41 real         3.33 user         3.28 sys
        2.00 real         0.00 user         0.08 sys
        9.51 real         0.10 user         1.17 sys
ext2fs-1024-1024:
       36.33 real         3.33 user         3.67 sys
        1.42 real         0.00 user         0.07 sys
       14.49 real         0.10 user         2.28 sys
ext2fs-4096-4096:
       20.81 real         3.38 user         3.54 sys
        1.01 real         0.00 user         0.07 sys
        6.96 real         0.12 user         1.57 sys

Note the anomalously slow times for ffs-16384-16384.
I analyzed why ffs was slow and ext2fs was fast for the `tar cf' part a year or two ago. It was because ffs handles fragments poorly and needs many more small (but not small enough to be in the drive's cache) backwards seeks. Not using fragments reduced ext2fs's advantage significantly but not completely. ffs-4K-4K is only slightly faster than ffs-8K-1K now, presumably because drive caches are larger and command overheads are relatively higher (the KA acts like a slow SCSI drive in wanting a block size of at least 8K to keep up with the disk). This output was produced by the following program:

---
#!/bin/sh
for b in 4096 8192 16384
do
	for f in $(($b / 8)) $(($b / 4)) $(($b / 2)) $b
	do
		echo ffs-$b-$f: >>/tmp/ztimes
		newfs -b $b -f $f /dev/rwd2s2a
		mount /dev/wd2s2a /d
		cd /d
		sync
		time tar zxpf $loc/z/dist/*2.2.9.tar.gz 2>>/tmp/ztimes
		cd /tmp
		time umount /d 2>>/tmp/ztimes
		mount /dev/wd2s2a /d
		cd /d
		sync
		time tar cf /dev/null * 2>>/tmp/ztimes
		cd /tmp
		umount /d
	done
done
for b in 1024 4096
do
	for f in $b
	do
		echo ext2fs-$b-$f: >>/tmp/ztimes
		# linux mkfs.ext2 -b $b /dev/rwd2s2a $((4819437 / ($b / 512)))
		# fsck.ext2 /dev/wd2s2a
		mount -t ext2fs /dev/wd2s2a /d
		cd /d
		sync
		time tar zxpf $loc/z/dist/*2.2.9.tar.gz 2>>/tmp/ztimes
		cd /tmp
		time umount /d 2>>/tmp/ztimes
		mount -t ext2fs /dev/wd2s2a /d
		cd /d
		sync
		time tar cf /dev/null * 2>>/tmp/ztimes
		cd /tmp
		umount /d
	done
done
---

Bruce To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 6:25:17 1999 Delivered-To: freebsd-fs@freebsd.org Received: from gw-nl3.philips.com (gw-nl3.philips.com [192.68.44.35]) by hub.freebsd.org (Postfix) with ESMTP id A5BC714C25 for ; Fri, 13 Aug 1999 06:25:07 -0700 (PDT) (envelope-from Jos.Backus@nl.origin-it.com) Received: from smtprelay-nl1.philips.com (localhost.philips.com [127.0.0.1]) by gw-nl3.philips.com with ESMTP id PAA17537 for ; Fri, 13 Aug 1999 15:25:08 +0200 (MEST) (envelope-from Jos.Backus@nl.origin-it.com) Received: from smtprelay-eur1.philips.com(130.139.36.3) by gw-nl3.philips.com via mwrap (4.0a) id xma017532; Fri, 13 Aug 99 15:25:08 +0200 Received: from hal.mpn.cp.philips.com (hal.mpn.cp.philips.com [130.139.64.195]) by smtprelay-nl1.philips.com (8.9.3/8.8.5-1.2.2m-19990317) with SMTP id PAA25640 for ; Fri, 13 Aug 1999 15:25:07 +0200 (MET DST) Received: (qmail 13055 invoked by uid 666); 13 Aug 1999 13:25:29 -0000 Date: Fri, 13 Aug 1999 15:25:29 +0200 From: Jos Backus To: Bruce Evans Cc: phk@critter.freebsd.dk, vernick@bell-labs.com, freebsd-fs@FreeBSD.ORG Subject: Re: Help with understanding file system performance Message-ID: <19990813152529.G12312@hal.mpn.cp.philips.com> Reply-To: Jos Backus References: <199908131313.XAA14647@godzilla.zeta.org.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.6i In-Reply-To: <199908131313.XAA14647@godzilla.zeta.org.au>; from Bruce Evans on Fri, Aug 13, 1999 at 11:13:21PM +1000 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Fri, Aug 13, 1999 at 11:13:21PM +1000, Bruce Evans wrote:

[Poul-Henning wrote:]
> >Try fiddling the newfs parameters. I see 17% speedup using:
> >
> > newfs -b 16384 -f 4096 -c 100

Too bad tunefs doesn't have those options :-)

> ffs-4K-4K is only slightly faster than ffs-8K-1K now, presumably because
> drive caches are larger and command overheads are relatively higher (the KA
> acts like a slow SCSI drive in wanting a block size of at least 8K to keep
> up with the disk).
As an aside, AIX uses 4K blocks and doesn't support fragments. -- Jos Backus _/ _/_/_/ "Reliability means never _/ _/ _/ having to say you're sorry." _/ _/_/_/ -- D. J. Bernstein _/ _/ _/ _/ Jos.Backus@nl.origin-it.com _/_/ _/_/_/ use Std::Disclaimer; To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 6:40:11 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131]) by hub.freebsd.org (Postfix) with ESMTP id 0AE6A14E63 for ; Fri, 13 Aug 1999 06:40:03 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id PAA03326; Fri, 13 Aug 1999 15:39:25 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Jos Backus Cc: Bruce Evans , vernick@bell-labs.com, freebsd-fs@FreeBSD.ORG Subject: Re: Help with understand file system performance In-reply-to: Your message of "Fri, 13 Aug 1999 15:25:29 +0200." <19990813152529.G12312@hal.mpn.cp.philips.com> Date: Fri, 13 Aug 1999 15:39:25 +0200 Message-ID: <3324.934551565@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

In message <19990813152529.G12312@hal.mpn.cp.philips.com>, Jos Backus writes:
>On Fri, Aug 13, 1999 at 11:13:21PM +1000, Bruce Evans wrote:
>[Poul-Henning wrote:]
>> >Try fiddling the newfs parameters. I see 17% speedup using:
>> >
>> > newfs -b 16384 -f 4096 -c 100
>
>Too bad tunefs doesn't have those options :-)
>
>> ffs-4K-4K is only slightly faster than ffs-8K-1K now, presumably because
>> drive caches are larger and command overheads are relatively higher (the KA
>> acts like a slow SCSI drive in wanting a block size of at least 8K to keep
>> up with the disk).
>
>As an aside, AIX uses 4K blocks and doesn't support fragments.

AIX uses jfs, which is entirely different. It may be time to abandon the concept of fragments. "cylinders" should be taken out of "cylindergroups".

-- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
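[To put numbers on the fragments-vs-no-fragments question: for the ~15KB-average files in the workload that started this thread, internal fragmentation is easy to estimate. A self-contained C sketch (the file count and sizes are Michael's figures from earlier in the thread; the bsize/8 floor on the frag size is FFS's real limit, since a block's frag allocation map has only 8 bits):

#include <stdio.h>

int
main(void)
{
	long nfiles = 6400;		/* Michael's test set */
	long avgsize = 15 * 1024;	/* ~15KB average file */
	long bsize = 8192;		/* filesystem block size */
	long frags[2] = { 1024, 8192 };	/* 1024 = bsize/8, the FFS minimum; */
					/* frag == bsize means "no fragments" */
	int i;

	for (i = 0; i < 2; i++) {
		long frag = frags[i];
		long tail = avgsize % bsize;
		/* full blocks, plus the tail rounded up to whole frags */
		long alloc = (avgsize / bsize) * bsize +
		    ((tail + frag - 1) / frag) * frag;

		printf("frag %4ld: %5ld bytes allocated/file, %4ld KB wasted total\n",
		    frag, alloc, (alloc - avgsize) * nfiles / 1024);
	}
	return (0);
}

With 1K frags the 15KB files fit exactly; without fragments each file wastes about 1KB of tail, roughly 6MB over the 6400-file set -- real, but hardly fatal on a 1GB data disk, which is presumably why dropping fragments looks tempting.]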
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 11:55:39 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135]) by hub.freebsd.org (Postfix) with ESMTP id B7FF814FE5 for ; Fri, 13 Aug 1999 11:55:32 -0700 (PDT) (envelope-from tlambert@usr09.primenet.com) Received: (from daemon@localhost) by smtp05.primenet.com (8.9.1/8.9.1) id LAA517950; Fri, 13 Aug 1999 11:53:14 -0700 Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp05.primenet.com, id smtpduBppUa; Fri Aug 13 11:53:10 1999 Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5) id LAA22289; Fri, 13 Aug 1999 11:53:07 -0700 (MST) From: Terry Lambert Message-Id: <199908131853.LAA22289@usr09.primenet.com> Subject: Re: Help with understand file system performance To: Jos.Backus@nl.origin-it.com Date: Fri, 13 Aug 1999 18:53:06 +0000 (GMT) Cc: bde@zeta.org.au, phk@critter.freebsd.dk, vernick@bell-labs.com, freebsd-fs@FreeBSD.ORG In-Reply-To: <19990813152529.G12312@hal.mpn.cp.philips.com> from "Jos Backus" at Aug 13, 99 03:25:29 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

> On Fri, Aug 13, 1999 at 11:13:21PM +1000, Bruce Evans wrote:
> [Poul-Henning wrote:]
> > >Try fiddling the newfs parameters. I see 17% speedup using:
> > >
> > > newfs -b 16384 -f 4096 -c 100
>
> Too bad tunefs doesn't have those options :-)

These are not tunable options, they are initial layout options.

> > ffs-4K-4K is only slightly faster than ffs-8K-1K now, presumably because
> > drive caches are larger and command overheads are relatively higher (the KA
> > acts like a slow SCSI drive in wanting a block size of at least 8K to keep
> > up with the disk).
>
> As an aside, AIX uses 4K blocks and doesn't support fragments.

AIX uses JFS, which is a journaling file system. Journaling file systems can replay transactions forward, as well as rolling them backward (log structured FS's can only roll them backward). In addition, the very nature of a JFS is significantly different (e.g. to write one for FreeBSD, it would be necessary to cause VOP_ABORTOP to do what its name says it does, instead of freeing up cn_pnbuf allocations that the caller should be freeing up anyway). Likewise, you can't get rid of the concept of cylinder/cylindergroup without damaging the hashing function, which prevents fragmentation.

Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.
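[The forward/backward distinction Terry draws is easiest to see in code. A toy sketch only -- this is not JFS, and every name in it is invented for illustration -- of a journal record that supports replay in both directions:

#include <string.h>

#define	MD_SIZE	512			/* one metadata block */

struct logrec {
	unsigned long	txid;		/* owning transaction */
	unsigned long	blkno;		/* metadata block modified */
	char		old[MD_SIZE];	/* before-image: enables undo */
	char		new[MD_SIZE];	/* after-image: enables redo */
};

static void
redo(char *disk, const struct logrec *r)	/* roll forward */
{
	memcpy(disk + r->blkno * MD_SIZE, r->new, MD_SIZE);
}

static void
undo(char *disk, const struct logrec *r)	/* roll backward */
{
	memcpy(disk + r->blkno * MD_SIZE, r->old, MD_SIZE);
}

static int
committed(unsigned long txid)
{
	(void)txid;	/* stand-in: a real log looks for commit records */
	return (1);
}

void
recover(char *disk, const struct logrec *log, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (committed(log[i].txid))
			redo(disk, &log[i]);
		else
			undo(disk, &log[i]);
	}
}

Because each record carries both images, recovery can finish committed transactions forward or back out incomplete ones. A log-structured FS keeps no separate before-image (the log is the filesystem), so it can only discard the incomplete tail -- the one-way replay Terry mentions.]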
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 13:51:54 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp01.primenet.com (smtp01.primenet.com [206.165.6.131]) by hub.freebsd.org (Postfix) with ESMTP id 572E814EA3 for ; Fri, 13 Aug 1999 13:51:38 -0700 (PDT) (envelope-from tlambert@usr01.primenet.com) Received: (from daemon@localhost) by smtp01.primenet.com (8.8.8/8.8.8) id NAA21170; Fri, 13 Aug 1999 13:50:54 -0700 (MST) Received: from usr01.primenet.com(206.165.6.201) via SMTP by smtp01.primenet.com, id smtpd021051; Fri Aug 13 13:50:47 1999 Received: (from tlambert@localhost) by usr01.primenet.com (8.8.5/8.8.5) id NAA17293; Fri, 13 Aug 1999 13:50:39 -0700 (MST) From: Terry Lambert Message-Id: <199908132050.NAA17293@usr01.primenet.com> Subject: Re: Help with understand file system performance To: zzhang@cs.binghamton.edu (Zhihui Zhang) Date: Fri, 13 Aug 1999 20:50:39 +0000 (GMT) Cc: tlambert@primenet.com, phk@critter.freebsd.dk, roberto@keltia.freenix.fr, freebsd-fs@FreeBSD.ORG In-Reply-To: from "Zhihui Zhang" at Aug 12, 99 09:02:32 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

> On Thu, 12 Aug 1999, Terry Lambert wrote:
>
> > The filesystem block allocation table in directories is unique, in
> > that it is generally used as a convenience for locating physical
> > blocks, rather than using the standard filesystem block access
> > mechanisms, when reading or writing directories.
>
> Directory files have the same on-disk structure as regular files.

Yes. But they are not accessed internally as if they were regular files. The only operation which is treated as "regular" is extending (and as of 4.4BSD, truncating back) the block allocations in directories. The directory manipulation code treats it as a series of blocks, and translates from the "regular file" aspect into BLKATOFF().

> However, they can never have holes and they can only be incremented at the
> end of the file in device block chunks. No directory entry can cross the
> device block boundary to guarantee the atomic update.

Right. There is no such thing as a "sparse block allocation" in a directory, since BLKATOFF() assumes the existence of a block. Directory entries are physically prevented from crossing block boundaries in order to ensure atomic update. But this is an implementation detail, and it is not the only way one could ensure atomicity, so long as one were willing to reallocate (filesystem, not physical) blocks or frags in order to do the updates (i.e. you could arrange for a two stage commit; I did this in my Unicode FFS prototype, since even though a 256 character name would fit in 512b, there was no room left over for the metadata).

> However, I do not know why you say the block map (direct and indirect
> blocks) of a directory is only used as a convenience. I mean there is a
> need to call VOP_BMAP() on a directory file. The routine ffs_blkatoff()
> calls bread(), which in turn calls VOP_BMAP(). The in-core inode does have
> several fields to facilitate the insertion of new directory entries. But
> we still need the block map (block allocation table).

Directory manipulations access blocks directly. You've no doubt noticed that the vast majority of system calls do _not_ require VOP_BMAP() calls for copyin/out operations on VM objects backed by the filesystem.
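[An aside on the no-straddling invariant above: it is compact enough to state in code. A sketch only (DIRBLKSIZ matches the FFS constant, but this is an illustration, not the actual ufs/ufs_lookup.c logic):

#include <stdio.h>

#define	DIRBLKSIZ	512	/* device block: the atomic-write unit */

/* would an entry at [off, off+reclen) straddle a device block boundary? */
static int
entry_fits(unsigned off, unsigned reclen)
{
	return (off / DIRBLKSIZ == (off + reclen - 1) / DIRBLKSIZ);
}

int
main(void)
{
	printf("%d\n", entry_fits(500, 16));	/* 0: crosses the 512 line */
	printf("%d\n", entry_fits(480, 32));	/* 1: ends exactly at 512 */
	return (0);
}

Keeping every entry inside one device block is what lets a single sector write update a directory entry atomically, at the cost of the packing and compaction games ufs_direnter() has to play.]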
The need to call VOP_BMAP() is an artifact of treating the directories as a list of blocks, rather than treating them as files. The "convenience" aspect is that they are files, but they are not used as such, and it's just because it is convenient that files are used as the underlying abstraction: directories are not naturally represented as files, and in fact, trying to make them conform to the normal file behaviour would result in breakage of the atomicity guarantee.

> Directory files are also special in that we can not write into them
> with the write() system call as normal files. They use a special
> routine to grow, i.e., ufs_direnter(). By the way, we can use read()
> system call to read directory files as we do with normal files.

The lack of the ability to write was mirrored by a lack of ability to read, as well, until this was changed, intentionally. Likewise, there was no ability to mmap directories (read only, of course), until that, too, was changed. These are both optimizations to speed certain programs, and are really antithetical to POSIX.

In reality, if you have looked at the "cookie" code for VOP_READDIR() in NFS, FFS, and at least one other FS, you will see that the need for cookies is an artifact of the structure of the interface. An alternate interface would allow directory block abstraction separate from the externalization of directory entries. The structure that is returned by getdents() is actually only coincidentally (albeit intentionally so) the same as the FFS on disk structure. See the 4.3/4.4 compatibility translation code in the VOP_READDIR() in the FFS implementation.

The upshot of this is that the ability to read or mmap for read directories is actually a very bad thing, from an interface perspective, since it promotes the writing of code that depends on data format interfaces. This is similar to the use of the KVM as a data interface. It is only coincidental, based on implementation (unintentionally so, this time) that the POSIX access time updates for files and the access time of directories (as POSIX mandates for getdents() operations) happen to coincide. If you look at the cookie mess, and the NFS server code wire format translation mess, I'm sure you will agree. You only need to ask yourself "how could NFS handle a VOP_READDIR() that came from an underlying FS that could pack more entries in a block than could be represented in a block in the external 'coincidental' format?" to prove to yourself that this is broken.

> > There are a number of performance penalties for this, especially
> > on large directories, where it is not possible to trigger sequential
> > readahead through use of the getdents() system call sequentially
> > accessing sequential 512b/physical_block_size extents.
>
> I do not understand this. The read-ahead mechanism should work on any
> files. I thought the reorganization of directory entries within a directory
> block when you delete an entry is an inefficiency.
>
> Does this issue have anything to do with the VMIO directory issue
> discussed earlier this year?

No. It has to do with VOP_READDIR() not exhibiting behaviour which would trigger read-ahead, such as is triggered by READ, WRITE, GETPAGES, and PUTPAGES.

> > The frag size can be tuned down below this (i.e. 1/4, 1/2, 1).
> >
> > The only case where 1024 bytes of physical disk would be used is at
> > a filesystem block size of 8192 (or greater), which, divided by 8,
> > gives 1024b (or greater).
>
> I did not realize this before. The maximum ratio is 8.
> So if the filesystem block is 8192, the allocation unit (fragment size)
> can not be 512 because 8192/512 > 8.

Yes. There are only 8 bits available for representing frag allocations.

> > This is called an encapsulated two stage commit, in database terms.
> >
> > For inodes, indirect blocks, and directory entry blocks, there is
> > no two stage commit, because there is no indirection of their data
> > contents.
>
> I guess you mean that their data are not managed by any higher level
> metadata which must be updated together.

Yes. Despite the fact that "higher level" metadata exists, since the implementation detail is that they are stored using "files", the actual implementation does not take advantage of this, either for triggering read-ahead, or for encapsulated commits of directory modifications, or for clustering (which could only occur on a restore from an archive, given the incremental nature of directory entries), or for any of a dozen other speed enhancements which are applied to normal files. This means that directories are, by their nature, rather slow.

> Thanks for your help.
>
> -Zhihui

Any time. 8-). It's an interesting discussion to engage in; there are (not implemented in FreeBSD) interesting solutions to many of the performance issues that people raise against the FFS. The last time this issue came up that I remember had to do with depth-first creation and breadth-first traversal of the ports directory structure; I actually still maintain that this is a problem in the creation of the directory (i.e. the organization of the archive) more than it is a problem with the FS itself (a tool is only as good as the craftsman using it). If used properly, there really aren't a lot of performance problems that you can point to (sort of like cutting with vs. against the grain in a board).

I am becoming convinced that an intermediate abstraction is really what is called for, to turn the bottom end into what is, in effect, nothing more than a flat, numeric namespace on top of a variable granularity block store. A nice topic for much research... 8-).

Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 14:33:50 1999 Delivered-To: freebsd-fs@freebsd.org Received: from m4.c2.telstra-mm.net.au (m4.c2.telstra-mm.net.au [24.192.3.19]) by hub.freebsd.org (Postfix) with ESMTP id 3011914EAB for ; Fri, 13 Aug 1999 14:33:43 -0700 (PDT) (envelope-from a.reilly@lake.com.au) Received: from m5.c2.telstra-mm.net.au (m5.c2.telstra-mm.net.au [24.192.3.20]) by m4.c2.telstra-mm.net.au (8.8.6 (PHNE_14041)/8.8.6) with ESMTP id HAA27453 for ; Sat, 14 Aug 1999 07:33:48 +1000 (EST) X-BPC-Relay-Envelope-From: a.reilly@lake.com.au X-BPC-Relay-Envelope-To: X-BPC-Relay-Sender-Host: m5.c2.telstra-mm.net.au [24.192.3.20] X-BPC-Relay-Info: Message delivered directly.
Received: from areilly.bpc-users.org (CPE-24-192-49-170.nsw.bigpond.net.au [24.192.49.170]) by m5.c2.telstra-mm.net.au (8.8.6 (PHNE_14041)/8.8.6) with SMTP id HAA12262 for ; Sat, 14 Aug 1999 07:33:47 +1000 (EST) Received: (qmail 39702 invoked by uid 1000); 13 Aug 1999 21:33:46 -0000 From: "Andrew Reilly" Date: Sat, 14 Aug 1999 07:33:46 +1000 To: Terry Lambert Cc: Zhihui Zhang , phk@critter.freebsd.dk, roberto@keltia.freenix.fr, freebsd-fs@FreeBSD.ORG Subject: Re: Help with understand file system performance Message-ID: <19990814073346.A38606@gurney.reilly.home> References: <199908132050.NAA17293@usr01.primenet.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4i In-Reply-To: <199908132050.NAA17293@usr01.primenet.com>; from Terry Lambert on Fri, Aug 13, 1999 at 08:50:39PM +0000 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

On Fri, Aug 13, 1999 at 08:50:39PM +0000, Terry Lambert wrote:
> I am becoming convinced that an intermediate abstraction is really
> what is called for, to turn the bottom end into what is, in effect,
> nothing more than a flat, numeric namespace on top of a variable
> granularity block store. A nice topic for much research... 8-).

Isn't that what Andrew Tanenbaum had on Amoeba? Does anyone have any experience with that system? The numbers in his namespace were capabilities/crypto-cookies, if I remember rightly.

-- Andrew

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 13 18:53:32 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133]) by hub.freebsd.org (Postfix) with ESMTP id 04EB115039; Fri, 13 Aug 1999 18:52:53 -0700 (PDT) (envelope-from tlambert@usr04.primenet.com) Received: (from daemon@localhost) by smtp03.primenet.com (8.9.3/8.9.3) id SAA08447; Fri, 13 Aug 1999 18:50:58 -0700 (MST) Received: from usr04.primenet.com(206.165.6.204) via SMTP by smtp03.primenet.com, id smtpdAAAHkaqzq; Fri Aug 13 18:50:53 1999 Received: (from tlambert@localhost) by usr04.primenet.com (8.8.5/8.8.5) id SAA23891; Fri, 13 Aug 1999 18:50:48 -0700 (MST) From: Terry Lambert Message-Id: <199908140150.SAA23891@usr04.primenet.com> Subject: Re: BSD XFS Port & BSD VFS Rewrite To: Matthew.Alton@anheuser-busch.com (Alton Matthew) Date: Sat, 14 Aug 1999 01:50:47 +0000 (GMT) Cc: Hackers@FreeBSD.ORG, fs@FreeBSD.ORG In-Reply-To: <0740CBD1D149D31193EB0008C7C56836EB8AFC@STLABCEXG012> from "Alton, Matthew" at Aug 5, 99 06:02:47 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

> I am currently conducting a thorough study of the VFS subsystem
> in preparation for an all-out effort to port SGI's XFS filesystem to
> FreeBSD 4.x at such time as SGI gives up the code. Matt Dillon
> has written in hackers- that the VFS subsystem is presently not
> well understood by any of the active kernel code contributors and
> that it will be rewritten later this year. This is obviously of great
> concern to me in this port.

It is of great concern to me that a rewrite, apparently because of non-understanding, is taking place at all. I would suggest that anyone planning on this rewrite should talk, in depth, with John Heidemann prior to engaging in such activity. John is very approachable, and is a deep thinker.
Any rewrite that does not meet his original design goals for his stacking architecture is, I think, a Very Bad Idea(tm).

> I greatly appreciate all assistance in answering the following
> questions:
>
> 1) What are the perceived problems with the current VFS?
> 2) What options are available to us as remedies?
> 3) To what extent will existing FS code require revision in order
> to be useful after the rewrite?
> 4) Will Chapters 6,7,8 & 9 of "The Design and Implementation of
> the 4.4BSD Operating System" still pertain after the rewrite?
> 5) How important are questions 3 & 4 in the design of the new
> VFS?
>
> I believe that the VFS is conceptually sound and that the existing
> semantics should be strictly retained in the new code. Any new
> functionality should be added in the form of entirely new kernel
> routines and system calls, or possibly by such means as
> converting the existing routines to the vararg format &etc.

Here are some of the problems I'm aware of, and my suggested remedies:

1. The interface is not reflexive, with regard to cn_pnbuf.

Specifically, path buffers are allocated by the caller, but not freed by the caller, and various routines in each FS implementation are expected to deal with this. Each FS duplicates code, and such duplication is subject to error. Not to mention that it makes your kernel fat.

2. Advisory locks are hung off private backing objects.

Advisory locks are passed into VOP_ADVLOCK in each FS instance, and then each FS applies this by hanging the locks off a list on a private backing object. For FFS, this is the in core inode. A more correct approach would be to hang the lock off the vnode. This effectively obviates the need for having a VOP_ADVLOCK at all, except for the NFS client FS, which will need to propagate lock requests across the net. The most efficient mechanism for this would be to institute a pass/fail response for VOP_ADVLOCK calls, with a default of "pass", and an actual implementation of the operand only in the NFS client FS (a toy sketch of this arrangement is appended at the end of this message). Again, each FS must duplicate the advisory locking code, at present, and such duplication is subject to error.

3. Object locks are implemented locally in many FS's.

The VOP_LOCK interface is implemented via vop_stdlock() calls in many FS's. This is done using the "vfs_default" mechanism. In other FS's, it's implemented locally. The intent of the VOP_LOCK mechanism being implemented as a VOP at all was to allow it to be proxied to another machine over a network, using the original Heidemann design. This is also the reason for the use of descriptors for all VOP arguments, since they can be opaquely proxied to another machine via a general mechanism. Unlike NFS based network filesystems, this would allow you to add VOP's to both machines, without having to teach the transport about the new VOP for it to be usable remotely. Like the VOP_ADVLOCK, the need for VOP_LOCK is for proxy purposes, and it, too, should generate a pass/fail response, and be largely implemented in non-filesystem specific higher level code. Again, each FS which duplicates code for this function is subject to duplication errors.

4. The VOP_READDIR interface is irrational.

The VOP_READDIR interface returns its responses in "host canonical format" (struct dirent, in sys/dirent.h). Internally, FFS operates on "directory entry blocks" that contain exactly these structures (an intentional coincidence). The problem with this approach is that it makes the getdents system call sensitive to file systems for which some of the information returned (e.g.
d_fileno, d_reclen, d_type, d_namlen) are synthetic. What this means is that a single directory block of a native file system's directory implementation must be able to fit into the buffer passed to the getdirentries(2) system call, or a directory listing is not a valid snapshot of the current state of the directory. It also vastly complicates directory traversal restarts (hence the ncookies and a_cookies arguments, since the NFS server requires the ability to restart traversal mid-block, because the NFSv2 protocol returns directory entries one at a time). The "cookie" idea must be carried out faithfully, in an FS specific fashion, for each FS which is allowed to be NFS exported. This code duplication is subject to error, or worse, non-implementation due to its complexity. A more rational approach would be to split the operation into two separate VOP's: one to acquire a snapshot of a set of FS specific directory entries of an arbitrary size, and the second to extract entries into the user's buffer, in canonical format.

5. The idea of "root" vs. "non-root" mounts is inherently bad.

Right now, there are several operations, all wrapped into a single "mount" entry point. This is actually a partial transition to a more canonically correct implementation. The reason for the "root" vs. "non-root" knowledge in the code has to do with several logical operations:

1) "Mounting" the filesystem; that is, getting the vnode for the device to be mounted, and doing any FS specific operations necessary to cause the correct in-core context to be established.

2) Covering the vnode at the mount point. This operation updates the vnode of the mount point so that traversals of the mount point will get you the root directory of the FS that was mounted instead of the directory that is covered by the mount.

3) Saving the "last mounted on" information. This is a clerical detail. Read-only FS's, and some read-write FS's, do not implement this. It is mostly a nicety for tools that manipulate FFS directly.

4) Initialize the FS stat information. Part of the in-core data for any FS is the mnt_stat data, which is what comes back from a VFS_STATFS() call.

The first operation is invariant. It must be done for all FS's, whether they are "root" or "non-root". The second operation is specific to "non-root" FS's. It could be moved to common, higher level code -- specifically, it could be moved into the mount system call. The third operation is also specific to "non-root" FS's. It could be discarded, or it could be moved to a separate VFS operation, e.g. VFS_SETMNTINFO(). I would recommend moving it to a separate VFSOP, instead of discarding it. The reason for this is that an intelligent person could reasonably decide to add the setting of this data in newfs and tunefs, and do away with /etc/fstab. The fourth operation is invariant. It must be done for all FS's, whether they are "root" or "non-root".

We can now see that we have two discrete operations:

1) Placement of any FS, regardless of how it is intended to be used, into the list of mounted filesystems.

2) Mapping a filesystem from the list of mounted FS's into the directory hierarchy.

The job of the per FS mount code should be to take a mount structure, the vnode of a device, the FS specific arguments, the mount point credentials, and the process requesting the mount, and _only_ do #1 and #4.
The conversion of the root device into a vnode pointer, or a path to a device into a vnode pointer, is the job of upper level code -- specifically, the mount system call, and the common code for booting. This removes a large amount of complex code from each of the file systems, and centralizes the maintenance task into one set of code that either works for everyone, or no one (removing the duplication of code/introduction of errors issue). In addition, the lack of "root" specific code in many FS's VFS_MOUNT entry points is the reason that they can not be mounted as "/". This change would open it up, such that any FS that was supported by the kernel could be used as the root filesystem.

6. The "vfs_default" code damages stacking.

The intent of the stacking architecture was to have the default operation for any VOP unknown to an FS fall through to the lower level code, and fail if it was not implemented. The use of the "vfs_default" to make unimplemented VOP's fall through to code which implements function, while well intentioned, is misguided. Consider the case of a VOP proxy that proxies requests. These might be requests to another machine, as in the previous proxy example, or they might be requests to user space, to allow for easy development of new filesystem layers. In addition, in order to get a default operation to actually fail, you have to intentionally create a failing VOP for that particular FS. Finally, the paradigm can not support new VOP's without a kernel recompilation. This means that in order to add to the list of VOP's known to the system when you add a new FS, you don't merely have to reallocate the in-core copy of the vnodeop_desc to include a new (failing) member, you have to create a default behaviour for it, and modify the default operations table. In other words, it's not extensible, as it was architected to be.

7. The struct nameidata (namei.h) is broken in conception.

One issue that recurs frequently, and remains unaddressed, is the issue of namespace abstraction. This issue is nowhere more apparent than in the VFAT and NTFS filesystems, where there are two namespaces: one 8.3, and the second, 16 bit Unicode. The problem is one of coherency, and one of reference, and is not easily resolved in the context of the current nameidata structure. Both NTFS and the VFAT FS try to cover this issue, both with varying degrees of success. The problem is that there is no canonical format that the kernel can use to communicate namespace data to FS's. Unlike VOP_READDIR, which has the abstract (though ill-implemented) struct dirent, there is no abstract representation of the data in a pathname buffer, which would allow you to treat path components as opaque entities. One potential remedy for this situation would be to canonicalize any path into an ordered list of components. Ideally, this would be done in 16 bit Unicode (looking toward the future), but would minimally be separate components with length counts, to allow faster rejection of non-matching components and to avoid frequent recalculation of length.

8. The filesystems have knowledge of the name cache.

Entries into the name cache, and deletion of entries from the name cache, should be handled in FS independent code at a higher level. This can avoid expensive VFS_LOOKUP calls in many cases, and save marshalling arguments into and out of the descriptor structure, in addition to drastically reducing the function call overhead.
Someone recently profiling FreeBSD's FS to determine speed bottlenecks (I believe it was Mike Smith, attempting to optimize for a ZD Labs benchmark) found that FreeBSD spends much of its time in namei().

9. The implementation of namei() is POSIX non-compliant.

The implementation of namei() is by means of coroutine "recursion"; this is similar to the only recursion you can achieve in FORTRAN. The upshot of this is that the use of the "//" namespace escape allowed by POSIX can not be usefully implemented. This is because it is not possible to inherit a namespace escape deeper than a single path component for a stack of more than one layer in depth. This needs to be fixed, both for "natural" SMBFS support, and for other uses of the namespace escape (HTTP "tunnels", extended attribute and/or resource fork access in an OS/2 HPFS or Macintosh HFS implementation, etc.), including forward looking research. This is related to item 7.

10. Stacking is broken.

This is really an issue of not having a coherency protocol which can be applied between stacks of files. It is somewhat related to almost all of the above issues. The current thinking which has been forwarded by Matt and John is that a vnode should have an associated vm_object_t, and that coherency should be maintained that way. This thinking is flawed for a number of reasons:

a. The main utility of this would be for an MFS implementation. While a "fast MFS" is a laudable goal, it isn't sufficient to drive this.

b. A coherency protocol is required in any case, since a proxied VOP is not necessarily on the same machine or in the same VM space. This approach would disallow the possibility of a user space filesystem development framework.

c. There already exist aliases (VM implementation errors); intentionally adding aliases as an implementation detail will further obfuscate them. Minimally, the VM system should pass a full branch path analysis based test procedure before they are introduced. Even then, I would argue that it would open up a large complexity space that would prevent us from ever being sure about problem resolution again.

d. Filesystems which need to transform data can never operate correctly, since they need to make local copies of the transformed content. This includes cryptographic, character set translation, compression, and similar stacking layers.

Instead, I think the interface design issues (VOP_ADVLOCK, VOP_GETPAGES, VOP_PUTPAGES, VOP_READ, VOP_WRITE, et al.) that drive the desire to implement coherency in this fashion should be examined. I believe that an ideal solution would be to never have the pages replicated at more than a single vnode. This would likewise solve the coherency problem, without the additional complexity. The issue would devolve into locating the real backing object, and potentially, translating extents.

11. The function call "footprint" of filesystems is too large.

Attempt the following: Compile up all of the files which make up an individual filesystem. You can take all of the files for the ufs/ffs objects and the vnode_if.o from a compiled kernel for this exercise. Now link them. Ignore the missing "main"; how many undefined functions are there? The problem you are seeing is the incursion of the VM system, and sloppy programming practices, into each VFS implementation. This footprint impacts filesystem portability, and is one reason, among many (including some of the above), that VFS modules are no longer very portable between BSD flavors.
Minimally, the VFS incursions need to be macrotized, and not assume a unified VM and buffer cache (or a non-unified VM and buffer cache, as well, for that matter). This would improve portability considerably. In addition to this change, a function minimization effort should take place. If the underlying interface utilized by VFS layers was not the kernel (for local media FS's, like FFS or NTFS), but instead a variable granularity block store with a numeric namespace, then the "top" and "bottom" interfaces could be identical. For now, however, some work can be done (and should be done) to reduce the function call footprint. This is important work, which can only aid development of future work (such as a user space filesystem framework for use by developers and researchers). I hesitate to suggest this, but it might be reasonable to consider a struct containing externally referenced functions, which is registered into the FS via mount, and which is identical for all FS's. This would, likewise, promote the idea of a user space framework. Ideally, work would be done to port the Heidemann framework to Linux, so that their developers could be leveraged.

Some FFS-specific problems are:

1. The directory code in the UFS layer is intertwined with the filespace code.

Ideally, one would be able to mount a filesystem as a flat numeric namespace (see #7, above), and then mount the idea of directory management over top of that.

2. The quota subsystem is too tightly integrated.

Quotas should be an abstract stacking layer that can be applied to any FS, instead of an FFS specific monstrosity. The current quota system is also limited to 16 bits for a number of values which, in FreeBSD, can be greater than 16 bits (e.g. UID's). The current quota system is also broken for Y2038.

3. The filesystem itself is broken for Y2038.

The space which was historically reserved for the Y2038 fix (a 64 bit time_t) was absconded with for subsecond resolution. This change should be reverted, and fsck modified to re-zero the values, given a specific argument. The subsecond resolution doesn't really matter, but if it is seen as an issue which needs to be addressed, the only value which could reasonably require this is the modification time, and there is sufficient free space in the inode to be able to provide for this (there are 2x32 bit spares).

I have other suggestions, but the above covers the most obvious damage.

Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.
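[The sketch promised in item 2: what a pass/fail VOP_ADVLOCK might look like, with the lock list hung off the vnode itself. All names here (vn_advlock, vop_stdadvlock_check) are invented for illustration; this is a shape, not the 4.4BSD/FreeBSD interface:

struct vnode;
struct flock;

/*
 * Default per-FS hook: local filesystems just say "pass" and let the
 * generic code do the work, so none of them duplicate the list code.
 */
static int
vop_stdadvlock_check(struct vnode *vp, struct flock *fl, int op)
{
	(void)vp; (void)fl; (void)op;
	return (0);			/* pass */
}

/*
 * Generic layer: only the NFS client FS would override the hook, to
 * propagate the request to the server before the lock is recorded.
 */
int
vn_advlock(struct vnode *vp, struct flock *fl, int op)
{
	int error;

	error = vop_stdadvlock_check(vp, fl, op);	/* FS veto */
	if (error)
		return (error);
	/* record or clear the lock on a list hung off vp itself */
	return (0);
}

The per-FS part shrinks to a veto, so the duplicated list manipulation Terry complains about would live in exactly one place.]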
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 14 1:50: 0 1999 Delivered-To: freebsd-fs@freebsd.org Received: from gw-nl3.philips.com (gw-nl3.philips.com [192.68.44.35]) by hub.freebsd.org (Postfix) with ESMTP id DC4B914EDA for ; Sat, 14 Aug 1999 01:49:56 -0700 (PDT) (envelope-from Jos.Backus@nl.origin-it.com) Received: from smtprelay-nl1.philips.com (localhost.philips.com [127.0.0.1]) by gw-nl3.philips.com with ESMTP id KAA21578 for ; Sat, 14 Aug 1999 10:50:09 +0200 (MEST) (envelope-from Jos.Backus@nl.origin-it.com) Received: from smtprelay-eur1.philips.com(130.139.36.3) by gw-nl3.philips.com via mwrap (4.0a) id xma021571; Sat, 14 Aug 99 10:50:11 +0200 Received: from hal.mpn.cp.philips.com (hal.mpn.cp.philips.com [130.139.64.195]) by smtprelay-nl1.philips.com (8.9.3/8.8.5-1.2.2m-19990317) with SMTP id KAA25684 for ; Sat, 14 Aug 1999 10:50:06 +0200 (MET DST) Received: (qmail 28515 invoked by uid 666); 14 Aug 1999 08:50:29 -0000 Date: Sat, 14 Aug 1999 10:50:29 +0200 From: Jos Backus To: Terry Lambert Cc: bde@zeta.org.au, phk@critter.freebsd.dk, vernick@bell-labs.com, freebsd-fs@FreeBSD.ORG Subject: Re: Help with understand file system performance Message-ID: <19990814105029.A28461@hal.mpn.cp.philips.com> Reply-To: Jos Backus References: <19990813152529.G12312@hal.mpn.cp.philips.com> <199908131853.LAA22289@usr09.primenet.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.6i In-Reply-To: <199908131853.LAA22289@usr09.primenet.com>; from Terry Lambert on Fri, Aug 13, 1999 at 06:53:06PM +0000 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Fri, Aug 13, 1999 at 06:53:06PM +0000, Terry Lambert wrote: > These are not tunable options, they are initial layout options. I know, I was just thinking of Partition Magic which I had to use on my wife's computer the other night :) > AIX uses JFS, which is a journaling file system. A very different beast indeed (I used to admin some AIX boxes). This is what logfs was supposed to be like (correct me if I'm wrong). -- Jos Backus _/ _/_/_/ "Reliability means never _/ _/ _/ having to say you're sorry." _/ _/_/_/ -- D. J. Bernstein _/ _/ _/ _/ Jos.Backus@nl.origin-it.com _/_/ _/_/_/ use Std::Disclaimer; To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 14 4:44:41 1999 Delivered-To: freebsd-fs@freebsd.org Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.26.10.9]) by hub.freebsd.org (Postfix) with ESMTP id C3CB1154D8 for ; Sat, 14 Aug 1999 04:44:32 -0700 (PDT) (envelope-from bde@godzilla.zeta.org.au) Received: (from bde@localhost) by godzilla.zeta.org.au (8.8.7/8.8.7) id VAA00836 for freebsd-fs@freebsd.org; Sat, 14 Aug 1999 21:43:31 +1000 Date: Sat, 14 Aug 1999 21:43:31 +1000 From: Bruce Evans Message-Id: <199908141143.VAA00836@godzilla.zeta.org.au> To: freebsd-fs@freebsd.org Subject: better ffs-4096-512 ... ext2fs-4096-4096 benchmarks Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org This tests filesystems with various block sizes in a few simple ways. The filesystems are now 1/3 filled with interesting data (a copy of /home/ncvs which takes about 750000000 bytes of tar output). I'm mainly interested in the read benchmark (tar cf /dev/null ncvs) so I didn't test -async or soft updates. 
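[Bruce's `tarcp', used below, is described only as two tars in a pipe; his actual script isn't shown, so the following reconstruction is a guess at its shape. The -b 2048 blocking factor (2048 x 512 bytes = 1MB) matches the block size he mentions:

#!/bin/sh
# guessed shape of `tarcp dest dir': copy dir from $PWD into dest
# through a pipe, using old-style bundled tar flags
tar cbf 2048 - "$2" | (cd "$1" && tar xpbf 2048 -)

Invoked as `tarcp /d ncvs' from /home, as in the timings that follow.]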
`tarcp' is two tars in a pipe with a block size of 1MB (the details don't matter because the filesystems are on separate drives). The results are as I expected, except reading ext2fs-4096-4096 is now about 2.5 times faster than for the best ffs layout. The throughput of 7-8MB/sec is about 40% of the drive's throughput. This is surprisingly large for a filesystem with lots of small files (79772 files with average size 9500 bytes).

Bruce

ffs-4096-512:
fsck /dev/rwd2e: 12.56 real 2.59 user 0.18 sys
tarcp /d ncvs: 1509.69 real 4.76 user 74.37 sys
umount /d: 2.00 real 0.00 user 0.29 sys
fsck /dev/rwd2e: 52.68 real 3.24 user 0.93 sys
tar cf /dev/null ncvs: 364.42 real 1.73 user 22.31 sys
ffs-4096-1024:
fsck /dev/rwd2e: 6.52 real 1.60 user 0.08 sys
tarcp /d ncvs: 1489.87 real 4.64 user 72.98 sys
umount /d: 1.81 real 0.00 user 0.32 sys
fsck /dev/rwd2e: 45.52 real 2.22 user 0.81 sys
tar cf /dev/null ncvs: 327.91 real 1.76 user 21.89 sys
ffs-4096-2048:
fsck /dev/rwd2e: 3.97 real 1.12 user 0.01 sys
tarcp /d ncvs: 1411.08 real 4.31 user 71.62 sys
umount /d: 1.06 real 0.00 user 0.31 sys
fsck /dev/rwd2e: 38.90 real 1.54 user 0.90 sys
tar cf /dev/null ncvs: 290.50 real 1.89 user 21.17 sys
ffs-4096-4096:
fsck /dev/rwd2e: 2.87 real 0.83 user 0.03 sys
tarcp /d ncvs: 1421.98 real 4.62 user 72.29 sys
umount /d: 1.49 real 0.00 user 0.32 sys
fsck /dev/rwd2e: 40.72 real 1.31 user 0.80 sys
tar cf /dev/null ncvs: 283.48 real 1.86 user 21.93 sys
ffs-8192-1024:
fsck /dev/rwd2e: 5.93 real 1.27 user 0.13 sys
tarcp /d ncvs: 1444.24 real 5.07 user 148.23 sys
umount /d: 0.51 real 0.00 user 0.30 sys
fsck /dev/rwd2e: 38.46 real 1.93 user 0.84 sys
tar cf /dev/null ncvs: 348.17 real 1.84 user 46.74 sys
ffs-8192-2048:
fsck /dev/rwd2e: 3.90 real 0.80 user 0.04 sys
tarcp /d ncvs: 1404.55 real 5.26 user 130.78 sys
umount /d: 0.93 real 0.00 user 0.32 sys
fsck /dev/rwd2e: 35.04 real 1.43 user 0.71 sys
tar cf /dev/null ncvs: 308.19 real 1.83 user 40.08 sys
ffs-8192-4096:
fsck /dev/rwd2e: 2.68 real 0.52 user 0.05 sys
tarcp /d ncvs: 1379.23 real 5.41 user 123.14 sys
umount /d: 1.13 real 0.00 user 0.31 sys
fsck /dev/rwd2e: 34.02 real 1.14 user 0.70 sys
tar cf /dev/null ncvs: 260.32 real 1.59 user 20.33 sys
ffs-8192-8192: [deleted -- invalid due to insufficient inodes]
ffs-16384-2048:
fsck /dev/rwd2e: 3.78 real 0.69 user 0.03 sys
tarcp /d ncvs: 1379.81 real 5.71 user 128.53 sys
umount /d: 1.04 real 0.00 user 0.30 sys
fsck /dev/rwd2e: 31.27 real 1.31 user 0.70 sys
tar cf /dev/null ncvs: 294.19 real 2.18 user 34.96 sys
ffs-16384-4096:
fsck /dev/rwd2e: 2.46 real 0.41 user 0.01 sys
tarcp /d ncvs: 1359.52 real 5.40 user 121.19 sys
umount /d: 1.06 real 0.00 user 0.31 sys
fsck /dev/rwd2e: 30.66 real 0.98 user 0.72 sys
tar cf /dev/null ncvs: 272.84 real 1.86 user 32.19 sys
ffs-16384-8192: [deleted -- invalid due to insufficient inodes]
ffs-16384-16384: [deleted -- invalid due to insufficient inodes]
ext2fs-1024-1024:
fsck.ext2 /dev/wd2e: [deleted -- invalid due to missing -f]
tarcp /d ncvs: 1519.08 real 4.71 user 77.18 sys
umount /d: 3.73 real 0.00 user 0.32 sys
fsck.ext2 /dev/wd2e: [deleted -- invalid due to missing -f]
tar cf /dev/null ncvs: 231.99 real 2.13 user 33.79 sys
ext2fs-4096-4096:
fsck.ext2 /dev/wd2e: [deleted -- invalid due to missing -f]
tarcp /d ncvs: 1163.38 real 4.62 user 65.75 sys
umount /d: 1.71 real 0.00 user 0.33 sys
fsck.ext2 /dev/wd2e: [deleted -- invalid due to missing -f]
tar cf /dev/null ncvs: 101.68 real 1.81 user 23.96 sys

#!/bin/sh
for b in 4096 8192 16384
do
	for f in $(($b / 8)) $(($b / 4)) $(($b / 2)) $b
	do
		echo ffs-$b-$f: >>/tmp/ztimes
		newfs -b $b -f $f /dev/rwd2e
		echo -n "fsck /dev/rwd2e: " >>/tmp/ztimes
		sync
		time fsck /dev/rwd2e 2>>/tmp/ztimes
		mount /dev/wd2e /d
		cd /home
		echo -n "tarcp /d ncvs: " >>/tmp/ztimes
		sync
		time tarcp /d ncvs 2>>/tmp/ztimes
		echo -n "umount /d: " >>/tmp/ztimes
		time umount /d 2>>/tmp/ztimes
		echo -n "fsck /dev/rwd2e: " >>/tmp/ztimes
		sync
		time fsck /dev/rwd2e 2>>/tmp/ztimes
		mount /dev/wd2e /d
		cd /d
		echo -n "tar cf /dev/null ncvs: " >>/tmp/ztimes
		sync
		time tar cf /dev/null ncvs 2>>/tmp/ztimes
		cd /tmp
		umount /d
	done
done
for b in 1024 4096
do
	for f in $b
	do
		echo ext2fs-$b-$f: >>/tmp/ztimes
		# linux mkfs.ext2 -b $b /dev/rwd2e $((4754368 / ($b / 512)))
		sync
		echo -n "fsck.ext2 /dev/wd2e: " >>/tmp/ztimes
		time fsck.ext2 /dev/wd2e 2>>/tmp/ztimes
		mount -t ext2fs /dev/wd2e /d
		cd /home
		echo -n "tarcp /d ncvs: " >>/tmp/ztimes
		sync
		time tarcp /d ncvs 2>>/tmp/ztimes
		echo -n "umount /d: " >>/tmp/ztimes
		time umount /d 2>>/tmp/ztimes
		echo -n "fsck.ext2 /dev/wd2e: " >>/tmp/ztimes
		sync
		time fsck.ext2 /dev/wd2e 2>>/tmp/ztimes
		mount -t ext2fs /dev/wd2e /d
		cd /d
		echo -n "tar cf /dev/null ncvs: " >>/tmp/ztimes
		sync
		time tar cf /dev/null ncvs 2>>/tmp/ztimes
		cd /tmp
		umount /d
	done
done

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 14 5: 7:21 1999 Delivered-To: freebsd-fs@freebsd.org Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.26.10.9]) by hub.freebsd.org (Postfix) with ESMTP id CA9AE151FB for ; Sat, 14 Aug 1999 05:07:15 -0700 (PDT) (envelope-from bde@godzilla.zeta.org.au) Received: (from bde@localhost) by godzilla.zeta.org.au (8.8.7/8.8.7) id WAA02286; Sat, 14 Aug 1999 22:07:29 +1000 Date: Sat, 14 Aug 1999 22:07:29 +1000 From: Bruce Evans Message-Id: <199908141207.WAA02286@godzilla.zeta.org.au> To: bde@zeta.org.au, freebsd-fs@FreeBSD.ORG Subject: Re: better ffs-4096-512 ... ext2fs-4096-4096 benchmarks Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

>The results are as I expected, except reading ext2fs-4096-4096 is now
>about 2.5 times faster than for the best ffs layout. The throughput of

Actually, there are some much more surprising results:

>ffs-4096-512:
>fsck /dev/rwd2e: 12.56 real 2.59 user 0.18 sys
>tarcp /d ncvs: 1509.69 real 4.76 user 74.37 sys
                                       ^^^^^
>umount /d: 2.00 real 0.00 user 0.29 sys
>fsck /dev/rwd2e: 52.68 real 3.24 user 0.93 sys
>tar cf /dev/null ncvs: 364.42 real 1.73 user 22.31 sys
                                              ^^^^^

The critical system times are about 70 and 20 seconds for ffs-4096-any.

>ffs-8192-1024:
>fsck /dev/rwd2e: 5.93 real 1.27 user 0.13 sys
>tarcp /d ncvs: 1444.24 real 5.07 user 148.23 sys
                                       ^^^^^
>umount /d: 0.51 real 0.00 user 0.30 sys
>fsck /dev/rwd2e: 38.46 real 1.93 user 0.84 sys
>tar cf /dev/null ncvs: 348.17 real 1.84 user 46.74 sys
                                              ^^^^^

The critical system times are almost twice as large for ffs-8192-most. They should be smaller.

>ffs-8192-4096:
>fsck /dev/rwd2e: 2.68 real 0.52 user 0.05 sys
>tarcp /d ncvs: 1379.23 real 5.41 user 123.14 sys
>umount /d: 1.13 real 0.00 user 0.31 sys
>fsck /dev/rwd2e: 34.02 real 1.14 user 0.70 sys
>tar cf /dev/null ncvs: 260.32 real 1.59 user 20.33 sys
                                              ^^^^^

Here's one reasonable system time for ffs-8192-*. `tar cf /dev/null' of the original /home/ncvs takes 828.85 real, 1.89 user, 21.90 sys, so it accounts for about half of the real times for tarcp but not much of the system times.
Bruce

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 14 12:37:17 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131]) by hub.freebsd.org (Postfix) with ESMTP id CF61A14CB9 for ; Sat, 14 Aug 1999 12:37:09 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id VAA12287 for ; Sat, 14 Aug 1999 21:35:03 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: freebsd-fs@freebsd.org Subject: disk performance model Date: Sat, 14 Aug 1999 21:35:03 +0200 Message-ID: <12285.934659303@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org

I have spent a few hours trying to figure out a model for the access time of one of my disks. (I wondered to what extent and in what way the length of a seek affected the time it took to perform it.) I set up a small program to read random places on the disk, timing each request, and logging the previous sector number, this sector number and the time it took. The program ran in user space so there is some finite overhead included in the below model for the userland to kernel transition. I collected about 100k samples for various transfer sizes. This information may or may not give any meaning/inspiration/insight in the current performance discussions...

The disk is in a PII/400MHz system:

ata1: master: setting up UDMA2 mode on PIIX4 chip OK
ad1: ATA-4 disk at ata1 as master
ad1: 17206MB (35239680 sectors), 34960 cyls, 16 heads, 63 S/T, 512 B/S
ad1: piomode=4, dmamode=2, udmamode=2
ad1: 16 secs/int, 31 depth queue, DMA mode

Model variables:
	size of transfer in sectors -> SZ
	number of sectors between last request and next one -> D

Model:
	if (D < 1.25e7)
		Tseek = D ** .42 * 8.05e-6 + .00175
	else
		Tseek = (D - 1.25e7) * 2.4e-10 + .00945
	Trotation = random(0 ... .0085)
	Taccess = Tseek + Trotation + 36e-9 * SZ

Comments: It should be noted that the place where this model has the worst fit is where it is most interesting: values of D < 1.5e6, where the model underpredicts by up to a millisecond. Above 1.5e6 the model predicts better than +/- 500usec. The disk has larger media transfer rate rimwards than hubwards, but this doesn't manifest itself in the data for SZ < 100. I have no idea what this means in terms of UFS/FFS parameters...

-- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
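[phk's model transcribes directly into code; this sketch only restates the formulas above (units are seconds, D is the seek distance in sectors, SZ the transfer size in sectors), with drand48() standing in for the uniform rotational delay. Compile with -lm:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double
tseek(double D)
{
	if (D < 1.25e7)
		return (pow(D, .42) * 8.05e-6 + .00175);
	return ((D - 1.25e7) * 2.4e-10 + .00945);
}

static double
taccess(double D, double SZ)
{
	double trot = drand48() * .0085;	/* random(0 ... .0085) */

	return (tseek(D) + trot + 36e-9 * SZ);
}

int
main(void)
{
	/* e.g. a 1e6-sector seek with a 16-sector transfer */
	printf("%.6f seconds\n", taccess(1e6, 16));
	return (0);
}

Remember phk's caveat that the fit is worst (underpredicting by up to a millisecond) exactly in the interesting region D < 1.5e6.]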
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 14 13:32:36 1999 Delivered-To: freebsd-fs@freebsd.org Received: from frmug.org (frmug-gw.frmug.org [193.56.58.252]) by hub.freebsd.org (Postfix) with ESMTP id 097F714D4E for ; Sat, 14 Aug 1999 13:32:29 -0700 (PDT) (envelope-from roberto@keltia.freenix.fr) Received: (from uucp@localhost) by frmug.org (8.9.1/frmug-2.3/nospam) with UUCP id WAA02272 for freebsd-fs@FreeBSD.ORG; Sat, 14 Aug 1999 22:32:28 +0200 (CEST) (envelope-from roberto@keltia.freenix.fr) Received: by keltia.freenix.fr (Postfix, from userid 101) id 54E32870B; Sat, 14 Aug 1999 19:44:22 +0200 (CEST) Date: Sat, 14 Aug 1999 19:44:22 +0200 From: Ollivier Robert To: freebsd-fs@FreeBSD.ORG Subject: Re: Help with understand file system performance Message-ID: <19990814194422.A13802@keltia.freenix.fr> Mail-Followup-To: freebsd-fs@FreeBSD.ORG References: <19990813152529.G12312@hal.mpn.cp.philips.com> <199908131853.LAA22289@usr09.primenet.com> <19990814105029.A28461@hal.mpn.cp.philips.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii User-Agent: Mutt/0.95.5i In-Reply-To: <19990814105029.A28461@hal.mpn.cp.philips.com>; from Jos Backus on Sat, Aug 14, 1999 at 10:50:29AM +0200 X-Operating-System: FreeBSD 4.0-CURRENT/ELF ctm#5543 AMD-K6 MMX @ 200 MHz Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org According to Jos Backus: > A very different beast indeed (I used to admin some AIX boxes). This is what > logfs was supposed to be like (correct me if I'm wrong). If you mean LFS then no. LFS is a log file system (the entire FS is a log) whereas JFS is a journalling FS (i.e. a FS with a journal). They're quite different beasts. -- Ollivier ROBERT -=- FreeBSD: The Power to Serve! 
-=- roberto@keltia.freenix.fr FreeBSD keltia.freenix.fr 4.0-CURRENT #73: Sat Jul 31 15:36:05 CEST 1999 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 14 14:57:55 1999 Delivered-To: freebsd-fs@freebsd.org Received: from gw-nl3.philips.com (gw-nl3.philips.com [192.68.44.35]) by hub.freebsd.org (Postfix) with ESMTP id 28B2B1525F for ; Sat, 14 Aug 1999 14:57:34 -0700 (PDT) (envelope-from Jos.Backus@nl.origin-it.com) Received: from smtprelay-nl1.philips.com (localhost.philips.com [127.0.0.1]) by gw-nl3.philips.com with ESMTP id XAA23267 for ; Sat, 14 Aug 1999 23:57:53 +0200 (MEST) (envelope-from Jos.Backus@nl.origin-it.com) Received: from smtprelay-eur1.philips.com(130.139.36.3) by gw-nl3.philips.com via mwrap (4.0a) id xma023264; Sat, 14 Aug 99 23:57:53 +0200 Received: from hal.mpn.cp.philips.com (hal.mpn.cp.philips.com [130.139.64.195]) by smtprelay-nl1.philips.com (8.9.3/8.8.5-1.2.2m-19990317) with SMTP id XAA09316 for ; Sat, 14 Aug 1999 23:57:53 +0200 (MET DST) Received: (qmail 36343 invoked by uid 666); 14 Aug 1999 21:58:16 -0000 Date: Sat, 14 Aug 1999 23:58:16 +0200 From: Jos Backus To: Ollivier Robert Cc: freebsd-fs@FreeBSD.ORG Subject: Re: Help with understand file system performance Message-ID: <19990814235816.A36250@hal.mpn.cp.philips.com> Reply-To: Jos Backus References: <19990813152529.G12312@hal.mpn.cp.philips.com> <199908131853.LAA22289@usr09.primenet.com> <19990814105029.A28461@hal.mpn.cp.philips.com> <19990814194422.A13802@keltia.freenix.fr> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.6i In-Reply-To: <19990814194422.A13802@keltia.freenix.fr>; from Ollivier Robert on Sat, Aug 14, 1999 at 07:44:22PM +0200 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Sat, Aug 14, 1999 at 07:44:22PM +0200, Ollivier Robert wrote: > If you mean LFS then no. LFS is a log file system (the entire FS is a log) > whereas JFS is a journalling FS (i.e. a FS with a journal). They're quite > different beasts. OK, point taken. As I understand it, jfs uses a log aka journal (usually hd8, type jfslog, in the std rootvg after installation) to prerecord metadata updates for the logical volumes in that volume group. I'm sure there's more to it but that's all I can remember now. Reading the paper by Margo Seltzer on LFS is still on my todo list. Cheers, -- Jos Backus _/ _/_/_/ "Reliability means never _/ _/ _/ having to say you're sorry." _/ _/_/_/ -- D. J. Bernstein _/ _/ _/ _/ Jos.Backus@nl.origin-it.com _/_/ _/_/_/ use Std::Disclaimer; To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message