From owner-freebsd-fs Mon Dec 17 14:51:15 2001 Delivered-To: freebsd-fs@freebsd.org Received: from priv-edtnes09-hme0.telusplanet.net (mtaout.telus.net [199.185.220.235]) by hub.freebsd.org (Postfix) with ESMTP id 1B65837B50B; Mon, 17 Dec 2001 14:51:06 -0800 (PST) Received: from fireball ([209.52.193.31]) by priv-edtnes09-hme0.telusplanet.net (InterMail vM.5.01.04.01 201-253-122-122-101-20011014) with SMTP id <20011217225103.FRAL28264.priv-edtnes09-hme0.telusplanet.net@fireball>; Mon, 17 Dec 2001 15:51:03 -0700 Message-ID: <001301c1874d$50ae0d20$02000003@tornado> From: "Dave Reyenga" To: , Cc: Subject: Instead of JFS, why not a whole new FS? Date: Mon, 17 Dec 2001 22:50:45 -0000 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.50.4807.1700 X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4807.1700 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org How about writing a new filesystem based on UFS? This would save all of the hassle that JFS would bring: licensing, porting time, etc. Of course, it would likely bust any compatibility desired. What I'm thinking is a filesystem that takes the current UFS and improves upon it. It could support larger partitions, more partitions in a slice, and perhaps a "Journal" partition (like the current "swap" partition) among other new features. What do others have to say about this? Are there any major flaws in my idea? It just seems to me that this would cut a lot of hassle. Those are just my $0.02. I know I've said it before, but I wasn't nearly as clear last time. -Craig To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 17 15: 0:39 2001 Delivered-To: freebsd-fs@freebsd.org Received: from rwcrmhc53.attbi.com (rwcrmhc53.attbi.com [204.127.198.39]) by hub.freebsd.org (Postfix) with ESMTP id EE01B37B632; Mon, 17 Dec 2001 15:00:16 -0800 (PST) Received: from InterJet.elischer.org ([12.232.206.8]) by rwcrmhc53.attbi.com (InterMail vM.4.01.03.27 201-229-121-127-20010626) with ESMTP id <20011217230016.REBT10701.rwcrmhc53.attbi.com@InterJet.elischer.org>; Mon, 17 Dec 2001 23:00:16 +0000 Received: from localhost (localhost.elischer.org [127.0.0.1]) by InterJet.elischer.org (8.9.1a/8.9.1) with ESMTP id OAA36312; Mon, 17 Dec 2001 14:55:18 -0800 (PST) Date: Mon, 17 Dec 2001 14:55:17 -0800 (PST) From: Julian Elischer To: Dave Reyenga Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, hiten@uk.FreeBSD.org Subject: Re: Instead of JFS, why not a whole new FS? In-Reply-To: <001301c1874d$50ae0d20$02000003@tornado> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org It is possible that Kirk may be thinking about doing this. He mumbled something about a new FS a while ago but it wasn't clear whether he was thinking of doing it, or he was just saying "someone will eventually do it". On Mon, 17 Dec 2001, Dave Reyenga wrote: > How about writing a new filesystem based on UFS? This would save all of the > hassle that JFS would bring: licensing, porting time, etc. Of course, it > would likely bust any compatibility desired. 
> > What I'm thinking is a filesystem that takes the current UFS and improves > upon it. It could support larger partitions, more partitions in a slice, and > perhaps a "Journal" partition (like the current "swap" partition) among > other new features. > > What do others have to say about this? Are there any major flaws in my idea? > It just seems to me that this would cut a lot of hassle. > > Those are just my $0.02. I know I've said it before, but I wasn't nearly as > clear last time. > > -Craig > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-hackers" in the body of the message > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 17 15: 5: 5 2001 Delivered-To: freebsd-fs@freebsd.org Received: from web21107.mail.yahoo.com (web21107.mail.yahoo.com [216.136.227.109]) by hub.freebsd.org (Postfix) with SMTP id 7DF5C37B41E for ; Mon, 17 Dec 2001 15:04:19 -0800 (PST) Message-ID: <20011217230419.68884.qmail@web21107.mail.yahoo.com> Received: from [62.254.0.5] by web21107.mail.yahoo.com via HTTP; Mon, 17 Dec 2001 15:04:19 PST Date: Mon, 17 Dec 2001 15:04:19 -0800 (PST) From: Hiten Pandya Subject: Re: Instead of JFS, why not a whole new FS? To: dreyenga@telus.net Cc: freebsd-fs@freebsd.org, hackers@freebsd.org In-Reply-To: <001301c1874d$50ae0d20$02000003@tornado> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org --- Dave Reyenga wrote: > How about writing a new filesystem based on UFS? > This would save all of the > hassle that JFS would bring: licensing, porting > time, etc. Of course, it > would likely bust any compatibility desired. hi, first of all, a project called UFS2 has been started by Kirk McKusick on improving the existing UFS file system and improving 'softupdates' and other stuff in this file system. > What I'm thinking is a filesystem that takes the > current UFS and improves > upon it. It could support larger partitions, more > partitions in a slice, and > perhaps a "Journal" partition (like the current > "swap" partition) among > other new features. I dont know that this could be possible of having a 'Journal' partition, though I may be wrong. > What do others have to say about this? Are there any > major flaws in my idea? > It just seems to me that this would cut a lot of > hassle. One flaw in your idea is, that it would literally take longer to make this kind of file system on our current UFS source base. The reason is due to the code maturity level that UFS has reached of around 20 years. I think porting JFS will take less time than upgrading the current UFS, which as a matter of fact has already been started by Kirk McKusick himself. Regarding 'hassle', for me; nothing is a hassle as long as it can be acheived. If you are really interested in upgrading the current UFS, it would be good if you got in touch with Kirk McKusick himself. regards, =Hiten = ===== =Hiten = __________________________________________________ Do You Yahoo!? Check out Yahoo! Shopping and Yahoo! Auctions for all of your unique holiday gifts! 
Buy at http://shopping.yahoo.com or bid at http://auctions.yahoo.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 17 16: 8:24 2001 Delivered-To: freebsd-fs@freebsd.org Received: from monorchid.lemis.com (monorchid.lemis.com [192.109.197.75]) by hub.freebsd.org (Postfix) with ESMTP id 69B6937B41A; Mon, 17 Dec 2001 16:08:11 -0800 (PST) Received: by monorchid.lemis.com (Postfix, from userid 1004) id 14C4A786E3; Tue, 18 Dec 2001 10:38:09 +1030 (CST) Date: Tue, 18 Dec 2001 10:38:09 +1030 From: Greg Lehey To: Dave Reyenga Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, hiten@uk.FreeBSD.org Subject: Re: Instead of JFS, why not a whole new FS? Message-ID: <20011218103809.V14500@monorchid.lemis.com> References: <001301c1874d$50ae0d20$02000003@tornado> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <001301c1874d$50ae0d20$02000003@tornado> User-Agent: Mutt/1.3.23i Organization: The FreeBSD Project Phone: +61-8-8388-8286 Fax: +61-8-8388-8725 Mobile: +61-418-838-708 WWW-Home-Page: http://www.FreeBSD.org/ X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF 13 24 52 F8 6D A4 95 EF Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Monday, 17 December 2001 at 22:50:45 -0000, Dave Reyenga wrote: > How about writing a new filesystem based on UFS? If it's based on UFS, it's not a new file system. > This would save all of the hassle that JFS would bring: licensing, > porting time, etc. There are no hassles with licensing. You'd be balancing porting time against writing time. Guess which would take longer. > What I'm thinking is a filesystem that takes the current UFS and > improves upon it. It could support larger partitions, That's relatively trivial. The big issue is compatibility. > more partitions in a slice, That's relatively trivial. The big issue is compatibility. > and perhaps a "Journal" partition (like the current "swap" > partition) Well, I don't think the journal would be like swap. > among other new features. That's pretty much what IBM did. They called the result JFS. Greg -- See complete headers for address and phone numbers To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Dec 17 18:29:52 2001 Delivered-To: freebsd-fs@freebsd.org Received: from avocet.prod.itd.earthlink.net (avocet.mail.pas.earthlink.net [207.217.120.50]) by hub.freebsd.org (Postfix) with ESMTP id 985C037B41A; Mon, 17 Dec 2001 18:29:47 -0800 (PST) Received: from pool0289.cvx40-bradley.dialup.earthlink.net ([216.244.43.34] helo=mindspring.com) by avocet.prod.itd.earthlink.net with esmtp (Exim 3.33 #1) id 16GA0n-0002CE-00; Mon, 17 Dec 2001 18:29:46 -0800 Message-ID: <3C1EAA1A.CA49932@mindspring.com> Date: Mon, 17 Dec 2001 18:29:46 -0800 From: Terry Lambert X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony} (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Dave Reyenga Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, hiten@uk.FreeBSD.org Subject: Re: Instead of JFS, why not a whole new FS? 
References: <001301c1874d$50ae0d20$02000003@tornado> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Dave Reyenga wrote: > > How about writing a new filesystem based on UFS? This would save all of the > hassle that JFS would bring: licensing, porting time, etc. Of course, it > would likely bust any compatibility desired. > > What I'm thinking is a filesystem that takes the current UFS and improves > upon it. It could support larger partitions, more partitions in a slice, and > perhaps a "Journal" partition (like the current "swap" partition) among > other new features. > > What do others have to say about this? Are there any major flaws in my idea? > It just seems to me that this would cut a lot of hassle. Any FS that shares code with an existing FS will not flush out the full list of problems associated with writing a new FS in the context of a FreeBSD system. For that reason, any UFS based system, including but not limited to FFS, LFS, EXT2FS, etc., is probably not a good example to use for an educational project. -- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Dec 18 8:13:13 2001 Delivered-To: freebsd-fs@freebsd.org Received: from patan.sun.com (patan.Sun.COM [192.18.98.43]) by hub.freebsd.org (Postfix) with ESMTP id 5B3AB37B405; Tue, 18 Dec 2001 08:13:07 -0800 (PST) Received: from canadamail1.Canada.Sun.COM ([129.155.5.100]) by patan.sun.com (8.9.3+Sun/8.9.3) with ESMTP id JAA02709; Tue, 18 Dec 2001 09:12:48 -0700 (MST) Received: from opcom-mail.canada.sun.com (scot.Canada.Sun.COM [129.155.8.107]) by canadamail1.Canada.Sun.COM (8.9.3+Sun/8.9.3/ENSMAIL,v2.1p1) with ESMTP id LAA01699; Tue, 18 Dec 2001 11:13:05 -0500 (EST) Received: from zonzorp.canada.sun.com (zonzorp.Canada.Sun.COM [129.155.6.21]) by opcom-mail.canada.sun.com (8.9.1b+Sun/8.9.1) with ESMTP id LAA12127; Tue, 18 Dec 2001 11:12:40 -0500 (EST) Received: from zonzorp (oz@localhost) by zonzorp.canada.sun.com (8.9.3+Sun/8.9.3) with ESMTP id LAA26047; Tue, 18 Dec 2001 11:11:01 -0500 (EST) Message-Id: <200112181611.LAA26047@zonzorp.canada.sun.com> X-Mailer: exmh version 2.5 07/13/2001 with nmh-1.0 To: Terry Lambert Cc: freebsd-fs@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG Subject: Re: Instead of JFS, why not a whole new FS? In-Reply-To: Message from Terry Lambert of "Mon, 17 Dec 2001 18:29:46 PST." <3C1EAA1A.CA49932@mindspring.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Tue, 18 Dec 2001 11:11:00 -0500 From: "ozan s. yigit" Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org > Any FS that shares code with an existing FS will not flush out > the full list of problems associated with writing a new FS in > the context of a FreeBSD system. how about an implementation of plan9's kfs? it is fairly simple, with dentries similar to unix inodes, eg. typedef struct { char name[NAMELEN]; short uid; short gid; ushort mode; short wuid; Qid qid; long size; long dblock[NDBLOCK]; /* 6 */ long iblock; long diblock; long atime; long mtime; } Dentry; and perhaps would make a good educational implementation. sources for plan9's own is in plan9/sys/src/cmd/disk, if one needs to take a look. 
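As an aside on how a Dentry like this resolves a file offset to a disk block, here is a minimal, self-contained C sketch of the classic bmap-style lookup (this is not kfs code: the struct is pared down to the block fields quoted above, NDBLOCK is taken as 6 per the comment, the block size, the int32_t types and the read_block() helper are all made-up placeholders):

    #include <stdint.h>

    #define BLOCKSIZE 1024                          /* placeholder block size */
    #define NDBLOCK   6                             /* direct blocks, as in the Dentry above */
    #define NINDIR    (BLOCKSIZE / sizeof(int32_t)) /* block numbers per indirect block */

    /* Pared-down stand-in for the kfs Dentry: only the fields bmap needs. */
    struct dentry_blocks {
            int32_t dblock[NDBLOCK];        /* direct block addresses */
            int32_t iblock;                 /* single-indirect block address */
            int32_t diblock;                /* double-indirect block address */
    };

    /* Reads one block of block numbers from the disk; supplied by the caller. */
    extern int read_block(int32_t addr, int32_t *buf);

    /*
     * Translate a byte offset within a file into a disk block address.
     * Returns 0 for a hole, an error, or an offset past the double-indirect range.
     */
    int32_t
    bmap(const struct dentry_blocks *d, uint64_t offset)
    {
            int32_t buf[NINDIR];
            uint64_t lbn = offset / BLOCKSIZE;      /* logical block number */

            if (lbn < NDBLOCK)                      /* direct blocks */
                    return d->dblock[lbn];

            lbn -= NDBLOCK;
            if (lbn < NINDIR) {                     /* single indirect */
                    if (d->iblock == 0 || read_block(d->iblock, buf) != 0)
                            return 0;
                    return buf[lbn];
            }

            lbn -= NINDIR;
            if (lbn < (uint64_t)NINDIR * NINDIR) {  /* double indirect */
                    if (d->diblock == 0 || read_block(d->diblock, buf) != 0)
                            return 0;
                    if (buf[lbn / NINDIR] == 0 || read_block(buf[lbn / NINDIR], buf) != 0)
                            return 0;
                    return buf[lbn % NINDIR];
            }
            return 0;
    }

The simplicity of that mapping is much of what makes a kfs-style server attractive as a teaching filesystem.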
the document "the plan9 file server" by thompson gives some detail. oz --- ozan s. yigit staff engineer, sun microsystems/es http://www.cs.yorku.ca/~oz ozan.yigit@sun.com || +1 [905] 415 2878 --- narrowness of imagination leads to narrowness of experience. [corollary to rob] To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Dec 18 11:18:52 2001 Delivered-To: freebsd-fs@freebsd.org Received: from elvis.mu.org (elvis.mu.org [216.33.66.196]) by hub.freebsd.org (Postfix) with ESMTP id 5195D37B405; Tue, 18 Dec 2001 11:18:50 -0800 (PST) Received: by elvis.mu.org (Postfix, from userid 1192) id DA9FE81E0C; Tue, 18 Dec 2001 13:18:44 -0600 (CST) Date: Tue, 18 Dec 2001 13:18:44 -0600 From: Alfred Perlstein To: Kirk McKusick Cc: fs@freebsd.org Subject: fast fsck for snapshots Message-ID: <20011218131844.E59831@elvis.mu.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org In theory if one were to periodically check a running filesystem's inodes for softdeps then update the superblock to point out the oldest file with pending softdeps at startup one would only have to scan all the inodes with mtimes > superblock update time. Then one should be able to free the blocks not claimed by those inodes. Wouldn't this signifigantly cut down on the amount of time required to fsck the snapshot? I think one of the problems is that inodes are "scrubbed" when flushed to disk as deleted files, one would have to write out the mtime so that fsck could pick up recently deleted files. Does FFS depend on the indirect blocks being "scrubbed" as well? Good idea, or am I just too cafinated at the moment? :) -- -Alfred Perlstein [alfred@freebsd.org] 'Instead of asking why a piece of software is using "1970s technology," start asking why software is ignoring 30 years of accumulated wisdom.' http://www.morons.org/rants/gpl-harmful.php3 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Dec 18 19:28: 5 2001 Delivered-To: freebsd-fs@freebsd.org Received: from omta02.mta.everyone.net (sitemail2.everyone.net [216.200.145.36]) by hub.freebsd.org (Postfix) with ESMTP id 4E0F137B405 for ; Tue, 18 Dec 2001 19:27:59 -0800 (PST) Received: from sitemail.everyone.net (reports [216.200.145.62]) by omta02.mta.everyone.net (Postfix) with ESMTP id 3A05F1C4F15 for ; Tue, 18 Dec 2001 19:27:59 -0800 (PST) Received: by sitemail.everyone.net (Postfix, from userid 99) id 23F5136F9; Tue, 18 Dec 2001 19:27:59 -0800 (PST) Content-Type: text/plain Content-Disposition: inline Content-Transfer-Encoding: 7bit Mime-Version: 1.0 X-Mailer: MIME-tools 5.41 (Entity 5.404) Date: Tue, 18 Dec 2001 19:27:59 -0800 (PST) From: Rohit Grover To: freebsd-fs@freebsd.org Subject: upper limit on # of vnops? Reply-To: rohit@gojuryu.com X-Originating-Ip: [65.194.57.194] Message-Id: <20011219032759.23F5136F9@sitemail.everyone.net> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Hello, I am using Freebsd4.3-RELEASE and wish to add a few vnode ops. Is there an upper limit on the number of vnode ops supported by the VFS layer in Freebsd? 
I am having some trouble going beyond a certain small number of new operations. Any help would be appreciated. rohit. _____________________________________________________________ http://www.gojuryu.com . What Karate Do was meant to be. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Dec 18 22:23:42 2001 Delivered-To: freebsd-fs@freebsd.org Received: from falcon.prod.itd.earthlink.net (falcon.mail.pas.earthlink.net [207.217.120.74]) by hub.freebsd.org (Postfix) with ESMTP id 36D0C37B405 for ; Tue, 18 Dec 2001 22:23:38 -0800 (PST) Received: from pool0514.cvx21-bradley.dialup.earthlink.net ([209.179.194.4] helo=mindspring.com) by falcon.prod.itd.earthlink.net with esmtp (Exim 3.33 #1) id 16Ga8e-0005TP-00; Tue, 18 Dec 2001 22:23:36 -0800 Message-ID: <3C203267.43543107@mindspring.com> Date: Tue, 18 Dec 2001 22:23:35 -0800 From: Terry Lambert X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony} (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: rohit@gojuryu.com Cc: freebsd-fs@freebsd.org Subject: Re: upper limit on # of vnops? References: <20011219032759.23F5136F9@sitemail.everyone.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Rohit Grover wrote: > I am using Freebsd4.3-RELEASE and wish to add a few vnode ops. > Is there an upper limit on the number of vnode ops supported > by the VFS layer in Freebsd? No. But there are a number of artificial constraints on when and how they may be added. > I am having some trouble going beyond a certain small number of > new operations. Any help would be appreciated. Most likely, you do not need to add operations, and you should be hooking your changes into fcntl(), etc.. In the unlikely event that you need to add some ops, you should be aware of the artificial limitations: 1) When the vnode_if.h and vnode_if.c code is generated from /sys/kern/vnode_if.src by /sys/kern/vnode_if.pl, the number of NOPs permitted is fixed, by virtue of the fixed size VOP descriptor array. 2) When the VFS system is first initialized, it takes an existing filesystem instance, and refactors it in order to get the total number of VOPs. This is arguably more correct than what it did perviously (counted the VOPs in the FFS code, for a mandatory instance of FFS), but the limit is real, and can't be exceeded. Basically, this adds some recompilation requirements that are much less obvious than they should be, if you are using modification of /sys/kern/vnode_if.src to add the new VOPs. The best suggestion, if you are using this method, is to delete and recreate the compilation files, rather than expecting the dependencies to work if you add VOPs. 3) You can not add VOPs to the table at run time. The best you can currently do is to replace placeholder VOPs with new VOPs. If you have placeholder VOPs, and you do this (see the end of the VOP descriptor array in the generated vnode_if.c in the kernel compilation directory), you are limited to the number of placeholders that exist. If you look at the system call extension code, you will see that it has this same limitation. 4) If you want to correct this, you will need to refactor all existing FS instances when you add a VOP (or VOPs). 
To do this, you will need to recreate the instance structures for the existing FS instances, and you will need to replace/extend the existing VOP list, as it is in vnode_if.c. The vnode_if.h changes, which provide the wrappers, are less important (you can manually add those to only the code that uses them). The main thing you will have to do is ensure that all references to the generated list are by pointer, then reallocate and copy the list, and then add your VOPs to the end of the list, following extension. Since VOP calls are made through this list, you will need to take the FS instance structures, which are allocated at mount time, and reallocate them, copying the old contents in and maintaining defaults.

5) Because of PHK's "default vops" stuff, you will need to refactor the instances as well, rather than simply copying them, so that the correct defaults are maintained; in the original design there was no such thing as "default vops", and such refactoring would not have been necessary (though you would still have to reallocate and do the prefix copy of the previous VOPs if the VOP vector list changed; but the default of "not supported" would have been correct, particularly for intermediate stacking layers, where it would become a "pass through").

6) If you intend to support stacking, you will have to refactor the stacks as well. This may be tricky. The correct thing to do when creating a stack is to push all NOP layers down in the instance version, which would (effectively) cut the intermediate layer transitions out of the assembled call graph. Effectively, this means that when you add VOPs, particularly VOPs for which there are non-pass-through defaults (another thing that interferes with stacking, since in the original design all defaults were pass-through), you will need to reconstruct the list.

7) For most of the above reasons, when you are adding VOPs at runtime you will want to completely refactor all existing mount instances, such as they are. (I say it this way because, though it is unlikely, if you were using one of the proxy layers -- either network or user space -- that UCLA CS students built in John Heidemann's classes, you would find it impossible, since you cannot control the defaults on the other side of the proxy. Consider a proxy from a local consumer that knows about the new VOP, to a remote stacking layer that doesn't, back to a local media FS that does: you want the VOP to go all the way through and back without harm, but it is out of range of the descriptor list on the remote node because of the "default vops" handling.)

All in all, it would be much, much easier for you if you did one of:

A) Use fcntl() in the FS instead, and don't invent new VOPs.

OR:

B) Add the VOPs to /sys/kern/vnode_if.src and totally recreate the compilation directory, so that your VOPs are known a priori to the system.
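For route B, an entry in /sys/kern/vnode_if.src has roughly the shape below; the op name vop_myop and its argument list are invented purely for illustration, and the exact syntax of the #% locking-annotation comment should be checked against the real file:

    #
    #% myop         vp      L L L
    #
    vop_myop {
            IN struct vnode *vp;
            IN int flags;
            IN struct ucred *cred;
            IN struct proc *p;
    };

The generator then emits a VOP_MYOP() wrapper in vnode_if.h and a slot in the descriptor array in vnode_if.c, which is why blowing away and regenerating the compilation directory is the reliable way to get every filesystem's op vector rebuilt against the new table size.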
-- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Dec 19 11:45:36 2001 Delivered-To: freebsd-fs@freebsd.org Received: from repulse.cnchost.com (repulse.concentric.net [207.155.248.4]) by hub.freebsd.org (Postfix) with ESMTP id AE3A837B419; Wed, 19 Dec 2001 11:45:28 -0800 (PST) Received: from bitblocks.com (adsl-209-204-185-216.sonic.net [209.204.185.216]) by repulse.cnchost.com id OAA04975; Wed, 19 Dec 2001 14:45:20 -0500 (EST) [ConcentricHost SMTP Relay 1.14] Message-ID: <200112191945.OAA04975@repulse.cnchost.com> To: Terry Lambert Cc: Andrea Campi , freebsd-arch@FreeBSD.ORG, freebsd-fs@freebsd.org Reply-To: freebsd-fs@freebsd.org Subject: Re: Real world Root Resizing (was Re: Proposed auto-sizing patch ... In-reply-to: Your message of "Wed, 12 Dec 2001 10:36:19 PST." <3C17A3A3.A439BE21@mindspring.com> Date: Wed, 19 Dec 2001 11:45:21 -0800 From: Bakul Shah Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org [sorry, I should have sent my original message to -fs instead of -arch] Andrea Campi wrote: > #include > > I was able to simple boot to single user and growfs my / without any magic. > I *might* have changed it to read-only just for safety but I don't think so You are a smarter person than I! I believed the growfs man page (it only works on unmounted file system) but should've realized it would work on a readonly mount provided you reboot right after. But I admit, I didn't trust growfs to be bug free which is why I first made a mirror copy of the root partition. Terry Lambert writes: > You could imagine a brute force tool to do this: back up to tape, > newfs, and restore from tape. You can tar cf to another filesystem and tar xf for the special case of a small root filesystem. > A better tool would allow you to defragment an existing FS, or even > run in the background at boot, and defragment only if necessary (some > inequality threshold on per cylinder group fill amounts, perhaps). > > An even better tool might allow you to "defragment" a large disk, at > the same time declaring the end of that disk "off limits". Doing > that would let you actually free up cylinder groups at the end of a > disk -- and shrink partitions, as well as expand them. I wonder if one can devise a syscall interface to do this safely without requiring detailed knowledge of the FS layout and replicating a lot of FS code in user mode. * For shrinking a partition you need a syscall to limit disk block allocation. Something like int fs_alloc(const char* mountpoint, size_t offset, size_t limit); This would do all allocation the [offset..limit) range until the next call. Even if you grew a file outside this range, the new blocks will be allocated here. A filesystem that does not implement this functionality returns ENOSYS. offset and limit are in disk blocksize unit but may need to be rounded up to some FS specific parameter (such as cylinder group size for FFS). * For defragmenting you need a way to move file data. Something like int frealloc(fd, offset, count, addr) offset & count must be multiples of disk block size. addr is a hint as to where these blocks should be moved. The call fails if the suggested new blocks are in use. The FS code atomically (at syscall level) moves specified blocks to the new area. * You also need to be able to get to various freelists. 
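Written out as a header, the interface proposed above might look like the following. This is purely a sketch of the proposal: neither syscall exists anywhere, the argument types for frealloc() are guesses since the message leaves them implicit, and the error conventions are assumed.

    /* fs_resize.h -- sketch of the proposed allocation-control interface */
    #include <sys/types.h>

    /*
     * Restrict all future block allocation on the filesystem mounted at
     * `mountpoint' to the disk-block range [offset, limit).  The range may
     * be rounded up to an FS-specific boundary (e.g. a cylinder group for
     * FFS).  Returns 0 on success, -1 with errno set (ENOSYS if the
     * filesystem does not implement it).
     */
    int fs_alloc(const char *mountpoint, size_t offset, size_t limit);

    /*
     * Move `count' blocks of the file open on `fd', starting at file offset
     * `offset', to new disk blocks at (or near) disk address `addr'.
     * offset and count must be multiples of the disk block size; the call
     * fails if the suggested destination blocks are already in use, and the
     * move is atomic at the syscall level.
     */
    int frealloc(int fd, off_t offset, off_t count, daddr_t addr);

    /*
     * Example: evacuating the tail of a filesystem before shrinking it.
     *
     *   if (fs_alloc("/vol/spare", 0, new_size_in_blocks) == -1)
     *           err(1, "fs_alloc");
     *   ...walk the filesystem, and for each file with blocks past the
     *   cut-off call frealloc(fd, off, len, hint) to pull them forward...
     */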
I can't see how defragmentation can be done without some knowledge of FS layout, but perhaps most of the details can be abstracted out well enough that the same interface can be used for different FSes.

You would run this on a quiescent system, but there is no need to unmount the FS or even bring the system down to single user.

Placement of files can also be changed once you have this interface. One idea is to sample file access time. Files that get read frequently can be moved to reduce seek time. Files with similar access time can be clustered, and so on. What would be better than sampling atime is keeping read stats in each inode: each time a file is read and the atime is to be updated, increment a small counter (but make it `stick' when it reaches max). This counter is zeroed when the stats are gathered by a user program. I am not holding my breath though.

Comments?

-- bakul

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message

From owner-freebsd-fs Wed Dec 19 14:26: 2 2001
Delivered-To: freebsd-fs@freebsd.org
Received: from omta02.mta.everyone.net (sitemail2.everyone.net [216.200.145.36]) by hub.freebsd.org (Postfix) with ESMTP id 4158337B623 for ; Wed, 19 Dec 2001 14:25:39 -0800 (PST)
Received: from sitemail.everyone.net (reports [216.200.145.62]) by omta02.mta.everyone.net (Postfix) with ESMTP id C80B51C379C for ; Wed, 19 Dec 2001 14:25:38 -0800 (PST)
Received: by sitemail.everyone.net (Postfix, from userid 99) id ACBCD36F9; Wed, 19 Dec 2001 14:25:38 -0800 (PST)
Content-Type: text/plain
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0
X-Mailer: MIME-tools 5.41 (Entity 5.404)
Date: Wed, 19 Dec 2001 14:25:38 -0800 (PST)
From: Rohit Grover
To: freebsd-fs@freebsd.org
Subject: Re: upper limit on # of vnops?
Reply-To: rohit@gojuryu.com
X-Originating-Ip: [65.194.57.194]
Message-Id: <20011219222538.ACBCD36F9@sitemail.everyone.net>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.org

> 3) You can not add VOPs to the table at run time. The best you can currently do is to replace placeholder VOPs with new VOPs. If you have placeholder VOPs, and you do this (see the end of the VOP descriptor array in the generated vnode_if.c in the kernel compilation directory), you are limited to the number of placeholders that exist. If you look at the system call extension code, you will see that it has this same limitation.

I wasn't aware of this constraint until now. I was trying to add vnode_ops using a loadable module. You're right, FreeBSD 4.3-RELEASE doesn't support dynamic addition of vnode ops. The following code (taken from vfs_opv_recalc()) proves the point.

    ....
    for (i = 0; i < vnodeopv_num; i++) {
            opv = vnodeopv_descs[i];
            opv_desc_vector_p = opv->opv_desc_vector_p;
            if (*opv_desc_vector_p)
                    FREE(*opv_desc_vector_p, M_VNODE);
            MALLOC(*opv_desc_vector_p, vop_t **,
                vfs_opv_numops * sizeof(vop_t *), M_VNODE, M_WAITOK);
    ....

I also found out that the reason I was able to add a few vops until now was that the MALLOC (in vfs_opv_recalc() above) was reallocating the memory freed by FREE(). This was made possible by the fact that vfs_opv_numops was still under a power of two. As soon as I added the 64th vop_t, the vop vectors for all currently active vnodes were freed in vfs_opv_recalc() and the system panicked in a weird place.

thanks for your help, Terry.

rohit.
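A small aside on why the breakage only shows up when the vector crosses a size boundary: the old kernel malloc rounds small requests up to power-of-two buckets, so a FREE() followed by a slightly larger MALLOC() can hand back the very same memory until the rounded size finally grows, at which point the vector really moves and anything still holding the old pointer blows up. A userland toy that mimics that rounding (illustrative only; the real allocator's bucket sizes and reuse behaviour differ):

    #include <stdio.h>

    /* Round a request up to the next power-of-two "bucket", in the spirit
     * of the old kernel malloc's handling of small allocations. */
    static size_t
    bucket_size(size_t n)
    {
            size_t b = 16;                  /* assumed smallest bucket */

            while (b < n)
                    b <<= 1;
            return b;
    }

    int
    main(void)
    {
            size_t ptrsize = sizeof(void *);  /* 4 bytes on 2001-era i386 */
            size_t nops;

            /* The first vector size that needs a bigger bucket is where the
             * reallocated vector genuinely changes address -- and where any
             * vnode still pointing at the freed vector causes a panic. */
            for (nops = 60; nops <= 70; nops++)
                    printf("%zu ops -> %zu bytes -> bucket %zu\n",
                        nops, nops * ptrsize, bucket_size(nops * ptrsize));
            return 0;
    }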
_____________________________________________________________ http://www.gojuryu.com . What Karate Do was meant to be. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Dec 19 17:24:47 2001 Delivered-To: freebsd-fs@freebsd.org Received: from mail.cablespeed.com (mail.cablespeed.com [206.112.192.76]) by hub.freebsd.org (Postfix) with SMTP id D189837B417 for ; Wed, 19 Dec 2001 17:24:39 -0800 (PST) Received: (qmail 24330 invoked by uid 0); 20 Dec 2001 01:24:39 -0000 Received: from unknown (HELO cablespeed.com) (216.45.72.227) by mail.cablespeed.com with SMTP; 20 Dec 2001 01:24:39 -0000 Message-ID: <3C213DD6.3CAD0C3C@cablespeed.com> Date: Wed, 19 Dec 2001 20:24:38 -0500 From: Chuck McCrobie X-Mailer: Mozilla 4.72 [en] (X11; I; FreeBSD 4.4-STABLE i386) X-Accept-Language: en MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Re: Real world Root Resizing (was Re: Proposed auto-sizing patch ... References: <200112191945.OAA04975@repulse.cnchost.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org Bakul Shah wrote: > > I wonder if one can devise a syscall interface to do this > safely without requiring detailed knowledge of the FS layout > and replicating a lot of FS code in user mode. > > * For shrinking a partition you need a syscall to limit > disk block allocation. Something like > > int fs_alloc(const char* mountpoint, size_t offset, size_t limit); > > This would do all allocation the [offset..limit) range > until the next call. Even if you grew a file outside this > range, the new blocks will be allocated here. A filesystem > that does not implement this functionality returns ENOSYS. > offset and limit are in disk blocksize unit but may need to > be rounded up to some FS specific parameter (such as > cylinder group size for FFS). > > * For defragmenting you need a way to move file data. > Something like > > int frealloc(fd, offset, count, addr) > > offset & count must be multiples of disk block size. > addr is a hint as to where these blocks should be moved. > The call fails if the suggested new blocks are in use. > > The FS code atomically (at syscall level) moves specified > blocks to the new area. > Windows 2000 provides a "MOVE FILE DATA" IOCTL to the file system. The file system is supposed to move the referenced file data to the specified location. The location is specified by disk lbn. The "MOVE FILE DATA" may specify a location which is now occupied (but wasn't before). The file system is supposed to ignore the request in that case. > * You also need to be able to get to various freelists. > Windows 2000 also provides a "GET SPACE BITMAP" IOCTL to the file system. The file system is supposed to return an up-to-date bitmap describing the allocation of space in the partition. > I can't see how defragmentation can be done without some > knowledge of FS layout but perhaps most of the details can be > abstracted out well enough that the same interface can be > used for different FSes. > I guess making a file physically contiguous might be a good start. I think the FFS cluster code attempts to keep files contiguous... Perhaps extracting out or exposing generic logic for the FFS code would work. Would it be possible to also move around inodes? 
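For reference, the two Windows 2000 IOCTLs described above are reachable from user space roughly as follows. This is a from-memory sketch of the documented defragmentation interface; the control codes and structures exist, but field details should be checked against the Platform SDK headers before relying on them.

    #include <windows.h>
    #include <winioctl.h>

    /* Ask the volume for its cluster allocation bitmap ("GET SPACE BITMAP"). */
    static BOOL
    get_bitmap(HANDLE hVolume, STARTING_LCN_INPUT_BUFFER *in,
        VOLUME_BITMAP_BUFFER *out, DWORD outlen)
    {
            DWORD got;

            return DeviceIoControl(hVolume, FSCTL_GET_VOLUME_BITMAP,
                in, sizeof *in, out, outlen, &got, NULL);
    }

    /* Move `clusters' clusters of hFile, starting at virtual cluster `vcn',
     * to the free logical cluster `lcn' ("MOVE FILE DATA").  The filesystem
     * refuses the request if the destination is no longer free. */
    static BOOL
    move_extent(HANDLE hVolume, HANDLE hFile, LONGLONG vcn, LONGLONG lcn,
        DWORD clusters)
    {
            MOVE_FILE_DATA mfd;
            DWORD got;

            mfd.FileHandle = hFile;
            mfd.StartingVcn.QuadPart = vcn;
            mfd.StartingLcn.QuadPart = lcn;
            mfd.ClusterCount = clusters;
            return DeviceIoControl(hVolume, FSCTL_MOVE_FILE,
                &mfd, sizeof mfd, NULL, 0, &got, NULL);
    }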
My understanding of the idea behind "dir pref" is to keep inodes of files in the same directory contiguous. Do other pieces (NFS?) keep track of inodes by their location (or does inode number imply location?). That is, does moving a inode from one location to another break things higher up? > You would run this on a quiescent system but there is no need > to unmount the FS or even bring the system down to single > user. > > Placement of files can also be changed once you have this > interface. One idea is to sample file access time. Files > that gets read frequently can be moved to reduce seek time. > Files with similar access time can be clustered and so on. > What would be better than sampling atime is keeping read > stats in each inode: each time a file is read and the atime > is to be updated, increment a small counter (but make it > `stick' when it reaches max). This counter is zeroed when > the stats are gathered by a user program. I am not holding > my breath though. > > Comments? > > -- bakul > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-fs" in the body of the message -- -- To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Dec 22 3:33:14 2001 Delivered-To: freebsd-fs@freebsd.org Received: from elvis.mu.org (elvis.mu.org [216.33.66.196]) by hub.freebsd.org (Postfix) with ESMTP id DAE8537B417; Sat, 22 Dec 2001 03:33:11 -0800 (PST) Received: by elvis.mu.org (Postfix, from userid 1192) id 6BA5081E0C; Sat, 22 Dec 2001 05:33:06 -0600 (CST) Date: Sat, 22 Dec 2001 05:33:06 -0600 From: Alfred Perlstein To: mckusick@freebsd.org Cc: fs@freebsd.org Subject: fsck and predictive readahead? Message-ID: <20011222053306.Y48837@elvis.mu.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org I'm wondering if fsck uses any sort of tricks to do read-ahead to prefect data for pass1 and pass2. If not does anyone thing it might speed things up? We could use a reasonably simple child process (or team of them) to read into anonymous mmap areas shared between the master and child to do this. Any ideas, any hints on where the code would fit best? -- -Alfred Perlstein [alfred@freebsd.org] 'Instead of asking why a piece of software is using "1970s technology," start asking why software is ignoring 30 years of accumulated wisdom.' http://www.morons.org/rants/gpl-harmful.php3 To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Dec 22 5:53:15 2001 Delivered-To: freebsd-fs@freebsd.org Received: from salmon.maths.tcd.ie (salmon.maths.tcd.ie [134.226.81.11]) by hub.freebsd.org (Postfix) with SMTP id C1B5537B405; Sat, 22 Dec 2001 05:53:12 -0800 (PST) Received: from walton.maths.tcd.ie by salmon.maths.tcd.ie with SMTP id ; 22 Dec 2001 13:53:11 +0000 (GMT) To: Alfred Perlstein Cc: mckusick@freebsd.org, fs@freebsd.org Subject: Re: fsck and predictive readahead? In-Reply-To: Your message of "Sat, 22 Dec 2001 05:33:06 CST." 
<20011222053306.Y48837@elvis.mu.org> Date: Sat, 22 Dec 2001 13:53:11 +0000 From: Ian Dowse Message-ID: <200112221353.aa41047@salmon.maths.tcd.ie> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org In message <20011222053306.Y48837@elvis.mu.org>, Alfred Perlstein writes: >I'm wondering if fsck uses any sort of tricks to do read-ahead >to prefect data for pass1 and pass2. > >If not does anyone thing it might speed things up? I've wondered about this also. Since fsck spends virtually all of its time waiting for disk reads, doing most kinds of speculative disk reads would only slow things down. However, there is some potential for re-ordering the reads to reduce seeking and to allow data to be read in larger chunks. Pass 1 involves quite a lot of disk seeking because it goes off and retrieves all indirection blocks (blocks of block numbers) for any inodes that have them. Otherwise pass 1 would be a simple linear scan through all inodes. It would be possible to defer the reading of indirection blocks and then read them in order (having 2nd- and 3rd-level indirection blocks complicates this). I think I tried a simple form of this a few years ago, but the speedup was only marginal. I believe I also tried changing fsck's bread() to read larger blocks when contiguous reads were detected, again with no significant improvements. For pass 2, the directories are sorted by the block number of their first block, so there is very little seeking. Some speed improvement might be possible by doing a larger read when a few directory blocks are close together on the disk. An interesting exercise would be to modify fsck to print out a list of the offset and length for every disk read it performs. Then sort that list, coalesce contiguous reads, and see how long it takes the disk to read the new list as compared to the original. Such perfect sorting is obviously not feasable in practice, but it would give some idea of the potential for improvements. Ian To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Dec 22 13: 8:26 2001 Delivered-To: freebsd-fs@freebsd.org Received: from elvis.mu.org (elvis.mu.org [216.33.66.196]) by hub.freebsd.org (Postfix) with ESMTP id 4E0A337B41A; Sat, 22 Dec 2001 13:08:20 -0800 (PST) Received: by elvis.mu.org (Postfix, from userid 1192) id C590B81E0C; Sat, 22 Dec 2001 15:08:14 -0600 (CST) Date: Sat, 22 Dec 2001 15:08:14 -0600 From: Alfred Perlstein To: Ian Dowse Cc: mckusick@freebsd.org, fs@freebsd.org Subject: Re: fsck and predictive readahead? Message-ID: <20011222150814.Z48837@elvis.mu.org> References: <20011222053306.Y48837@elvis.mu.org> <200112221353.aa41047@salmon.maths.tcd.ie> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <200112221353.aa41047@salmon.maths.tcd.ie>; from iedowse@maths.tcd.ie on Sat, Dec 22, 2001 at 01:53:11PM +0000 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org * Ian Dowse [011222 07:53] wrote: > In message <20011222053306.Y48837@elvis.mu.org>, Alfred Perlstein writes: > >I'm wondering if fsck uses any sort of tricks to do read-ahead > >to prefect data for pass1 and pass2. > > > >If not does anyone thing it might speed things up? 
> 
> I've wondered about this also. Since fsck spends virtually all of its time waiting for disk reads, doing most kinds of speculative disk reads would only slow things down. However, there is some potential for re-ordering the reads to reduce seeking and to allow data to be read in larger chunks.
> 
> Pass 1 involves quite a lot of disk seeking because it goes off and retrieves all indirection blocks (blocks of block numbers) for any inodes that have them. Otherwise pass 1 would be a simple linear scan through all inodes. It would be possible to defer the reading of indirection blocks and then read them in order (having 2nd- and 3rd-level indirection blocks complicates this). I think I tried a simple form of this a few years ago, but the speedup was only marginal. I believe I also tried changing fsck's bread() to read larger blocks when contiguous reads were detected, again with no significant improvements.
> 
> For pass 2, the directories are sorted by the block number of their first block, so there is very little seeking. Some speed improvement might be possible by doing a larger read when a few directory blocks are close together on the disk.
> 
> An interesting exercise would be to modify fsck to print out a list of the offset and length for every disk read it performs. Then sort that list, coalesce contiguous reads, and see how long it takes the disk to read the new list as compared to the original. Such perfect sorting is obviously not feasable in practice, but it would give some idea of the potential for improvements.

The problem you didn't address with all these changes was stalls due to disk IO.

/usr/src/sbin/fsck_ffs # time ./fsck_ffs -d -n /vol/spare
** /dev/ad0s1g (NO WRITE)
** Last Mounted on /vol/spare
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
303580 files, 12865338 used, 9183427 free (51827 frags, 1141450 blocks, 0.2% fragmentation)
./fsck_ffs -d -n /vol/spare  24.50s user 4.72s system 19% cpu 2:30.73 total

No matter how you order the IO, fsck is going to have to wait for read(2) to return. If we can offload that waiting to a child process we may be able to fix this.

Is there any detailed commenting on the sources available? They are quite readable, but still very terse. A more in-depth explanation of each function would really help. Do you know of a paper or manpage, or do you have the time to sprinkle some commentary into the code?

-- 
-Alfred Perlstein [alfred@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology," start asking why software is ignoring 30 years of accumulated wisdom.' http://www.morons.org/rants/gpl-harmful.php3

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message
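Following up on the child-process idea above, a minimal sketch of the plumbing: a hypothetical standalone program, not actual fsck code, with one anonymous shared buffer, one outstanding request, and two pipes for signalling. A real version would keep several requests (or several children) in flight.

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define PREFETCH_BUF (1024 * 1024)

    struct req {                    /* one read-ahead request */
            off_t   offset;
            size_t  len;            /* <= PREFETCH_BUF */
    };

    int
    main(int argc, char **argv)
    {
            int fd, req_pipe[2], ack_pipe[2];
            char *buf, done;
            struct req r;

            if (argc != 2)
                    errx(1, "usage: prefetch <device>");
            if ((fd = open(argv[1], O_RDONLY)) == -1)
                    err(1, "open");

            /* Anonymous shared mapping: the child's pread() lands the data
             * where the parent can use it directly. */
            buf = mmap(NULL, PREFETCH_BUF, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_SHARED, -1, 0);
            if (buf == MAP_FAILED)
                    err(1, "mmap");
            if (pipe(req_pipe) == -1 || pipe(ack_pipe) == -1)
                    err(1, "pipe");

            switch (fork()) {
            case -1:
                    err(1, "fork");
            case 0:                                 /* child: the reader */
                    close(req_pipe[1]);
                    close(ack_pipe[0]);
                    while (read(req_pipe[0], &r, sizeof r) == sizeof r) {
                            if (pread(fd, buf, r.len, r.offset) == -1)
                                    _exit(1);
                            done = 1;
                            write(ack_pipe[1], &done, 1);
                    }
                    _exit(0);
            default:                                /* parent: fsck proper */
                    close(req_pipe[0]);
                    close(ack_pipe[1]);
                    /* Hand the child the next range we know we will need,
                     * go do CPU work on the previous chunk, then wait for
                     * the ack and consume buf. */
                    r.offset = 0;
                    r.len = 65536;
                    write(req_pipe[1], &r, sizeof r);
                    /* ... useful work here ... */
                    read(ack_pipe[0], &done, 1);
                    /* buf now holds the prefetched data */
                    break;
            }
            return 0;
    }

The point of the shared MAP_ANON mapping is that only the child ever blocks in pread(); the parent stalls just on the short ack read, and only once it has run out of work to overlap.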