From owner-freebsd-fs@FreeBSD.ORG Wed Apr 23 10:12:35 2003
From: Marko Zec <zec@tel.fer.hr>
To: Ian Dowse, Terry Lambert
Cc: freebsd-fs@FreeBSD.ORG, David Schultz, freebsd-stable@FreeBSD.ORG, Kirk McKusick
Date: Wed, 23 Apr 2003 19:12:12 +0200
User-Agent: KMail/1.5
References: <200304200730.aa34354@salmon.maths.tcd.ie>
In-Reply-To: <200304200730.aa34354@salmon.maths.tcd.ie>
Message-Id: <200304231912.12333.zec@tel.fer.hr>
Subject: Re: PATCH: Forcible delaying of UFS (soft)updates

On Sunday 20 April 2003 08:30, Ian Dowse wrote:
> In message <3EA03FF1.280B6810@mindspring.com>, Terry Lambert writes:
> > David Schultz wrote:
> > > As for the ATA delayed write feature, I don't believe it will
> > > guarantee consistency.
> >
> > It doesn't. I checked, after voicing my suspicions of it.
>
> Yes, write ordering and hence FS consistency is not guaranteed; my
> original point was just that the situation regarding FS consistency
> with ATA delayed writes is not significantly worse than with the
> default behaviour of having ATA write caching enabled.
> In fact, if the OS is modified to perform writes in batches, then the
> two cases are almost identical: in one case the disk collects a batch
> of writes, possibly reorders them, and writes them out in one burst;
> in the other case the OS sends a burst of writes, and the disk
> possibly reorders them and writes them out. For reference I've
> included below what IBM say about the delayed write feature in their
> disk documentation.
>
> BTW, to answer a point Marko mentioned, I don't consider the delayed
> write behaviour to be nearly as bad as a null fsync(), because you
> are very unlikely to completely lose a file that has been modified,
> saved and then fsync()'d. If the write/rename/fsync all happen while
> the disk is spun down, then the old version of the file is still
> intact on the media if the power fails. With a null fsync(), there
> can be a considerable window where the disk contains just a
> zero-length file.
>
> I completely accept that there is more flexibility on the OS side to
> control which writes get delayed and by how much, and that an
> OS-side implementation would be extremely useful. However, I think
> it would require further work to develop a good implementation. For
> example, the current proposed patch effectively assumes that there
> is only one disk in the system, since `stratcalls' is a global
> variable (e.g., I believe that reading from an ATA flash device
> would trigger a flush to any real ATA disks in the system). It would
> also be useful if the solution were not specific to ATA devices and
> had per-device control over the behaviour.
>
> I guess my point of view is more that doing this right on the OS
> side is hard, and ATA delayed write is an unobtrusive, neat feature
> that does mostly the right thing at the cost of only a marginal
> increase in the risk of data loss for typical uses.
Despite having been in favor of OS-controlled delayed syncing from the moment I posted my initial patch, the more I now think about the advantages of the ATA firmware-controlled delayed writing approach, the more I like it. Still, after all the discussion, I do not want to claim that the OS-controlled model has now become ultimately bad. I simply have to agree with Ian that in order to improve the original patch from proof-of-concept level to production quality for a broad range of hardware configurations and application scenarios, the patch would have to pollute many more chunks of code scattered all around the source tree. In contrast, the ATA-controlled delaying approach limits the changes to the ATA driver alone, while accomplishing nearly the same, or completely equivalent, functionality.

The only thing that worries me about the ATA firmware-controlled delaying approach is the moment of system shutdown. If the admin forgets to disable write delaying, will the firmware force a flush of the dirty sectors cached in RAM to the disk before poweroff occurs?
Cheers,

Marko

From owner-freebsd-fs@FreeBSD.ORG Wed Apr 23 20:33:57 2003
From: Allan Fields <afields@afields.ca>
To: Wout Mertens
Cc: freebsd-fs@freebsd.org, fist@ground.cs.columbia.edu
Date: Wed, 23 Apr 2003 23:33:53 -0400
References: <20030424001211.GB15070@gatekeeper.gatekeeper>
Message-ID: <20030424033353.GA4596@afields.ca>
User-Agent: Mutt/1.4i
Subject: Re: [FiST] Re: Overlayfs for FiST?

Hi,

In my opinion... (O.K. I'll take a dive into this.)

This all sounds a lot like (Free)BSD's unionfs. I have tried using
unionfs for various tasks, including some related to security. It
works quite well, but I noticed there are definitely complicated
cases where a complex hierarchy of overlays would be required for
overlays to be practical to use. Unionfs seems a successful proof of
concept that demonstrates how overlays can work in simple situations,
and how namespaces can be joined using vnode stacking.
Doing overlays in FiST would be great: hopefully then, as the BSD
templates mature further, an overlayfs could be used in multi-platform
environments, much as cryptfs can be. My concern would be that code
quality on any given target platform not be sacrificed for greater
portability. FiST can likely produce very efficient code if the ports
are closely integrated with the source bases of the target operating
systems.

I've been trying the FreeBSD templates on and off, and it is very good
to see the progress thus far. I spent a few nights 1-2 months ago
trying to get the makefiles to integrate with the BSD makefiles under
5.0-RELEASE. Some progress, but nothing usable yet. The 4.x-RELEASE
templates are better off since the last update.

Another issue with overlays is implementing the mechanisms to migrate
changes between different working copies, or layers. At that point it
would even seem to be related to revisioning. Here, authority and
immutability would also seem applicable, in that the filestore could,
depending on a trust assessment (for instance by ACL), assign
different authority to an update.

Some of my ideas:

- Union mounts should have a filtering mask, which determines which
  layer changes are effected on. Parameters could be anything FiST
  potentially allows: filenames, userid, timestamp, size, accesses,
  ACLs, etc.

- Union mounts should allow 3 or more filesystems to be stacked
  (1 2 3 ..., coalesced into one namespace), where, for instance, they
  can occupy different sections of the namespace, and should allow
  dynamic reconfiguration of the stack after the initial union mount
  point is set. This would imply that there can be a relative weight
  to a store, both in retrieval and storage. (Here is where fan-outs
  would come into play, I would imagine. With the introduction in
  FreeBSD of GEOM, which I haven't had the chance to fully explore,
  things are getting more advanced, at least at the device level.)
Let me know if these are the types of things you had in mind.

If these concepts are one and the same from the standpoint of theory,
I'm not certain which terminology I would prefer. The term "union"
seems to suggest that the namespaces are being brought together to
create a combined system composed of numerous member filesystems,
while the term "overlay" seems to suggest that sections of the
filesystem are being ignored, and emphasizes the intersections in, and
overriding factors of, the combined system instead. Perhaps both
apply. One thing is for sure: this problem goes deeper than simple
overlays. When you explore the roots of the issue, it is about more
than combining two sources. Not to diminish the role of the join in
namespace.

On Wed, Apr 23, 2003 at 23:50:33 +0200, Wout Mertens wrote:
> Hi Erez,
>
> On Wed, 23 Apr 2003, Erez Zadok wrote:
>
> > I'll CC the fist list (where this message is suitable) for other
> > people's comments.
> >
> > In message , Wout Mertens writes:
> > > Hi there,
> > >
> > > I'm trying to boot a Linux 2.4 thin client from a readonly NFS
> > > root, and I keep going back to my childhood dream: a filesystem
> > > that you can overlay over another filesystem and that keeps the
> > > changes you make to it.

The idea of using NFS for both the base image and the overlay is a
good example of the types of applications possible. To what extent
this intersects with existing network filesystems might be of
interest.

> > > The idea would be that the filesystem would keep track of
> > > additions, renames, deletions, permissions and so forth, but not
> > > touch the filesystem below it. If this is then done with a tmpfs
> > > backing store, you get a nonpersistent fs.
> > >
> > > Right now I solve the problem by copying all files to tmpfs, but
> > > this is wasteful.
> > >
> > > So I was wondering if you have implementation hints, maybe you
> > > considered the same things, or you have a half-finished .fist
> > > file lying around...
> > >
> > > Thanks!
> > >
> > > Wout.
> >
> > So if I understand you right, you want the f/s to read from one
> > source, but when writing, it should write to another location,
> > right?
> >
> > Do you just want to keep the latest update to files that have been
> > modified, or a historical detailed log of all activity (perhaps
> > one that can be rolled back)? The latter of course is more
> > complex.
> >
> > Once a file is modified and written, what happens if you try to
> > re-read it? Do you get the original unmodified version, or the one
> > just written? The latter is a special case of a write-through
> > cachefs (such as Solaris's), but one which doesn't write through
> > any changes.
>
> I want to be able to perform all file operations on files in a
> certain filesystem, where the changes are kept somewhere else. In my
> specific case, I start fresh and I'll throw away the changes
> afterwards, but they could also be kept. An activity log is not
> necessary in my case.
>
> It is related to a write-through cachefs, but an important
> difference is that deletions, attr changes, etc. should also be
> handled. cachefs is much simpler to implement, I would think.
>
> Maybe I'm being too complicated and the best way would be to just
> keep the block-level changes on the raw device but not apply them,
> but then that wouldn't work for NFS, my goal filesystem, which has
> no raw device.

Makes me think of revisioning in the filesystem. Even if you didn't
"commit" changes to a file, they could still exist under a different
revision name. There has been, and will be, much conversation on this
topic (inevitably), as it becomes apparent at various stages that it
was both a good thing and a bad thing that a VMS-like model wasn't
adopted.

> Besides, then you wouldn't be able to see what the changes were,
> useful for sandboxed stuff. (Although you would need
> per-user-visible mounts as well then.)

One less exciting application is the idea of using multiple layers of
storage to maintain working copies of data.
For instance: using an overlay for special-purpose attribute/meta-data
directories and files, to avoid filesystem pollution. It's frustrating
to have to clean up after tools that leave their messy attribute
directories all over a filesystem. The only other ways to eliminate
them are to set very restrictive permissions and risk breaking
something, use a special copy of the data in an isolated location (a
sandbox), or spend the time fixing each potential offender. Luckily,
build tools and source repositories avoid this type of problem where
possible by placing their working files in an object tree in the
first place.

The concept of an attribute itself is, to me, somewhat risky, since
if abused, it's no longer an attribute. What constitutes meta-data
anyway? I don't accept the "dot directories are out, so don't worry
about them" dictum. It's been abused, so that .dir has lost its
meaning almost entirely: just look at your home directory to find out
why. I have, for example:

  .kde/ .w3m/ .netscape/ .procmail/ .emacs/

and almost none of this is "special data"! alias ls='ls -a'

Under FreeBSD, for instance, netatalk is one package that spews
random directories to back proprietary Mac attributes, and assumes
that since the attributes are in .thing directories, it's OK to add
them at any point.

> > It seems to me that one way or another, you'll be needing a
> > fan-out stackable file system: one that can have one branch that
> > it treats as read-only (say, nfs), and another branch that's a
> > writable directory (perhaps even a local disk based f/s).
>
> I agree. The hard part is deciding on a nice way of keeping the
> changes. Possibly something with a subdirectory per type of change,
> with regular files replacing/adding to the original filesystem just
> being in the corresponding directory on the backing store.
>
> Original
> |--a/
> |  `--b.txt
> `--c/
>    |--d/
>    `--e.txt
>
> Backing Store
> |--a/
> |  |--b.txt (newer file)
> |  `--f.txt
> |--c/
> |  `--.deletions/
> |     |--d/
> |     `--e.txt (0-length file)
> `--g.txt

Reminds me of "white-out" entries in BSD. There is a good paper on the
union fs by Jan-Simon Pendry, and of course McKusick's book covers
this as well.

> Overlay
> |--a/
> |  |--b.txt (newer file)
> |  `--f.txt
> |--c/
> `--g.txt

> > True fan-out file system support in FiST has been on my todo list
> > for several years. It's not an easy task: many OS design
> > assumptions are easily broken and have to be addressed. We
> > recently did a prototype two-branch read-only unionfs (fan-out) in
> > Linux 2.4; we hope to polish it up, add full write support,
> > multiple branches, and more, then make it available by summer's
> > end.
>
> Looking forward to that :)
>
> Cheers,
>
> Wout.
> _______________________________________________
> FiST mailing list
> FiST@lists.cs.columbia.edu
> http://lists.cs.columbia.edu/mailman/listinfo/fist

Allan Fields

From owner-freebsd-fs@FreeBSD.ORG Thu Apr 24 03:40:45 2003
From: Tim Robbins <tim@robbins.dropbear.id.au>
To: freebsd-fs@freebsd.org
Date: Thu, 24 Apr 2003 20:40:39 +1000
Message-ID:
<20030424204039.A66020@dilbert.robbins.dropbear.id.au>
User-Agent: Mutt/1.2.5.1i
Subject: quot(8) on UFS2

Someone mentioned on one of the lists that the quot(8) utility doesn't
work properly on UFS2 filesystems. I think this is because the
fs_sblockloc test in quot() is incorrect (fs_sblockloc holds the
superblock's byte offset, but the old code compared it against a
fragment count). Are there any objections to this patch? I'd like to
commit it before 5.1 is released, because UFS2 has been made the
default FS.

--- //depot/user/tjr/freebsd-tjr/src/usr.sbin/quot/quot.c	2003/04/21 22:23:11
+++ //depot/user/tjr/freebsd-tjr/src/usr.sbin/quot/quot.c	2003/04/24 03:34:52
@@ -563,7 +563,7 @@
 	fs = (struct fs *)superblock;
 	if ((fs->fs_magic == FS_UFS1_MAGIC ||
 	    (fs->fs_magic == FS_UFS2_MAGIC &&
-	    fs->fs_sblockloc == numfrags(fs, sblock_try[i]))) &&
+	    fs->fs_sblockloc == sblock_try[i])) &&
 	    fs->fs_bsize <= MAXBSIZE &&
 	    fs->fs_bsize >= sizeof(struct fs))
 		break;

Tim

From owner-freebsd-fs@FreeBSD.ORG Fri Apr 25 13:12:29 2003
From: Erez Zadok <ezk@fsl.cs.sunysb.edu>
To: Allan Fields
Cc: freebsd-fs@freebsd.org, Wout Mertens, fist@cs.columbia.edu
Date: Fri, 25 Apr 2003 16:11:46 -0400
Message-Id: <200304252011.h3PKBkUQ011637@agora.fsl.cs.sunysb.edu>
In-reply-to: Your message of "Wed, 23 Apr 2003 23:33:53 EDT." <20030424033353.GA4596@afields.ca>
Subject: Re: [FiST] Re: Overlayfs for FiST?

In message <20030424033353.GA4596@afields.ca>, Allan Fields writes:
> Hi,
>
> In my opinion... (O.K. I'll take a dive into this.)
>
> This all sounds a lot like (Free)BSD's unionfs. I have tried using
> unionfs for various tasks, including some related to security. It

Functionally, yes. But IMHO, the FreeBSD unionfs made the mistake of
using a single-stack design for the union. It doesn't scale well, and
it gives you only limited ability to control the unioning-related
algorithms: you have an implied linear search list. I think that a
true fan-out stacking template will be a lot more flexible: a file
system ->method() would be able to _directly_ access any of the nodes
immediately below it, in any order.

Once such an infrastructure is in place, you find that the various
file systems that can be created are permutations of a number of
policies that one might decide on:

- if, on error at branch N, you move on to search branch N+1, you've
  got yourself a failover file system.

- if, upon lookup, you select one of N branches by a given algorithm
  (say, random), you get a load-balancing file system.

- if, upon file-system-modifying operations, you schedule the event
  to proceed onto all N branches, you have a replication file system.
- if, to process certain file system methods (esp. readdir), you
  invoke them on all N branches, you get a unioning file system.

The beauty of this is that I believe the fan-out template
infrastructure can remain pretty fixed (modulo getting it to work in
the first place :-), and the types of file systems will be determined
by (implementation) policy decisions, not necessarily hard-coded into
the fan-out template.

Erez.