From owner-freebsd-fs Mon Aug 16 5:30:43 1999
Message-Id: <8133266FE373D11190CD00805FA768BF02EE9D26@SHRCMSG1>
From: Stephen Byan
To: "'Terry Lambert'", zzhang@cs.binghamton.edu
Cc: phk@critter.freebsd.dk, roberto@keltia.freenix.fr, freebsd-fs@FreeBSD.ORG
Subject: RE: Help with understand file system performance
Date: Mon, 16 Aug 1999 05:29:23 -0700

Terry Lambert wrote:

>I am becoming convinced that an intermediate abstraction is really
>what is called for, to turn the bottom end into what is, in effect,
>nothing more than a flat, numeric namespace on top of a variable
>granularity block store. A nice topic for much research... 8-).

There's an effort to create such a beast as part of CMU's Network
Attached Secure Disk research, to develop and implement it as a disk
drive interface as part of NSIC's Network Attached Storage Device
working group, and then to standardize it through ANSI T10 as a SCSI-4
command set.

If the file system development community has something to say to the
drive vendors, now is the time to do it. Personally, I'd be vocal about
atomicity requirements.
FWIW, the next NSIC NASD public meeting is tomorrow, Aug 17, at the
Clarion Hotel in Millbrae, CA (i.e. at the San Francisco airport).

Regards,
-Steve

Steve Byan
Design Engineer
Quantum Corporation
MS 1-3/E23
333 South Street
Shrewsbury, MA 01545
voice: (508) 770-3414
fax: (508) 770-2604

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message

From owner-freebsd-fs Mon Aug 16 13:50:28 1999
Date: Mon, 16 Aug 1999 13:48:16 -0700 (PDT)
From: Bill Studenmund
Reply-To: Bill Studenmund
To: Terry Lambert
Cc: Alton Matthew, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
In-Reply-To: <199908140150.SAA23891@usr04.primenet.com>

On Sat, 14 Aug 1999, Terry Lambert wrote:

> > I am currently conducting a thorough study of the VFS subsystem
> > in preparation for an all-out effort to port SGI's XFS filesystem to
> > FreeBSD 4.x at such time as SGI gives up the code. Matt Dillon
> > has written in -hackers that the VFS subsystem is presently not
> > well understood by any of the active kernel code contributors and
> > that it will be rewritten later this year. This is obviously of great
> > concern to me in this port.
>
> It is of great concern to me that a rewrite, apparently because of
> non-understanding, is taking place at all.

That concerns me too. Many aspects of the 4.4 vnode interface were there
for specific reasons.
Even if they were hack solutions, to re-write them because of a lack of
understanding is dangerous, as the new code will likely run into the
same problems as before. :-)

Also, it behooves all the *BSD's to not get too divergent. Sharing code
between us all helps all. Given that I'm working on the kernel side of a
data migration file system using NetBSD, I can assure you there are
things which FreeBSD would get access to more easily the more similar
the two VFS interfaces are. :-)

> I would suggest that anyone planning on this rewrite should talk,
> in depth, with John Heidemann prior to engaging in such activity.
> John is very approachable, and is a deep thinker. Any rewrite
> that does not meet his original design goals for his stacking
> architecture is, I think, a Very Bad Idea(tm).
>
> > I greatly appreciate all assistance in answering the following
> > questions:
> >
> > 1) What are the perceived problems with the current VFS?
> > 2) What options are available to us as remedies?
> > 3) To what extent will existing FS code require revision in order
> >    to be useful after the rewrite?
> > 4) Will Chapters 6, 7, 8 & 9 of "The Design and Implementation of
> >    the 4.4BSD Operating System" still pertain after the rewrite?
> > 5) How important are questions 3 & 4 in the design of the new
> >    VFS?
> >
> > I believe that the VFS is conceptually sound and that the existing
> > semantics should be strictly retained in the new code. Any new
> > functionality should be added in the form of entirely new kernel
> > routines and system calls, or possibly by such means as
> > converting the existing routines to the vararg format &etc.
>
> Here are some of the problems I'm aware of, and my suggested remedies:
>
> 1. The interface is not reflexive, with regard to cn_pnbuf.
>
>    Specifically, path buffers are allocated by the caller, but
>    not freed by the caller, and various routines in each FS
>    implementation are expected to deal with this.
>
>    Each FS duplicates code, and such duplication is subject
>    to error. Not to mention that it makes your kernel fat.

Yep, that's not good.

> 2. Advisory locks are hung off private backing objects.
>
>    Advisory locks are passed into VOP_ADVLOCK in each FS
>    instance, and then each FS applies this by hanging the
>    locks off a list on a private backing object. For FFS,
>    this is the in core inode.
>
>    A more correct approach would be to hang the lock off the
>    vnode. This effectively obviates the need for having a
>    VOP_ADVLOCK at all, except for the NFS client FS, which
>    will need to propagate lock requests across the net. The
>    most efficient mechanism for this would be to institute
>    a pass/fail response for VOP_ADVLOCK calls, with a default
>    of "pass", and an actual implementation of the operation only
>    in the NFS client FS.

I agree that it's better for all fs's to share this functionality as
much as possible.

I'd vote against your implementation suggestion for VOP_ADVLOCK on an
efficiency concern. If we actually make a VOP call, that should be the
end of the story. I.e. either add a vnode flag to indicate
pass/fail-ness, or add a genfs/std call to handle the problem.

I'd actually vote for the latter. Hang the byte-range locking off of the
vnode, and add a genfs_advlock() or vop_stdadvlock() routine (depending
on OS flavor) to handle the call. That way all fs's can share code, and
the callers need only call VOP_ADVLOCK() - no other logic.

NetBSD actually needs this to get unionfs to work. Do you want to talk
privately about it?

>    Again, each FS must duplicate the advisory locking code,
>    at present, and such duplication is subject to error.

Agreed.

> 3. Object locks are implemented locally in many FS's.
>
>    The VOP_LOCK interface is implemented via vop_stdlock()
>    calls in many FS's. This is done using the "vfs_default"
>    mechanism. In other FS's, it's implemented locally.
>
>    The intent of the VOP_LOCK mechanism being implemented
>    as a VOP at all was to allow it to be proxied to another
>    machine over a network, using the original Heidemann
>    design. This is also the reason for the use of descriptors
>    for all VOP arguments, since they can be opaquely proxied to
>    another machine via a general mechanism. Unlike NFS based
>    network filesystems, this would allow you to add VOP's to
>    both machines, without having to teach the transport about
>    the new VOP for it to be usable remotely.

Just for a point of comparison, I recently got almost all the NetBSD
fs's to use common code. After our -Lite2 merge, all fs's were either
calling the lock manager, or using genfs_nolock() (a version for
non-locking fs's). Now there's a struct lock * and struct lock in struct
vnode. The fs exports its locking behavior via the struct lock *. For
most fs's, the struct lock * points to the struct lock, and genfs_lock()
feeds that to the lock manager.

But we've kept the ability to do something different (like call over the
network) alive. If the struct lock * is NULL, you have to call VOP_LOCK
on that fs. Note that this difference only matters for layered fs's -
everything else should be calling VOP_LOCK() and letting the dispatch
code figure out the right thing to do.

>    Like the VOP_ADVLOCK, the need for VOP_LOCK is for proxy
>    purposes, and it, too, should generate a pass/fail response,
>    and be largely implemented in non-filesystem specific
>    higher level code.

To an extent, that's what the exported struct lock * does, though the
only clients are layered filesystems. Everyone else calls VOP_LOCK. :-)

>    Again, each FS which duplicates code for this function is
>    subject to duplication errors.

Agreed.

> 4. The VOP_READDIR interface is irrational.
>
>    The VOP_READDIR interface returns its responses in "host
>    canonical format" (struct dirent, in sys/dirent.h).
>    Internally, FFS operates on "directory entry blocks" that
>    contain exactly these structures (an intentional coincidence).
>
>    The problem with this approach is that it makes the getdents
>    system call sensitive to file systems for which some of the
>    information returned (e.g. d_fileno, d_reclen, d_type, d_namlen)
>    is synthetic. What this means is that a single directory block
>    of a native file system's directory implementation must be able
>    to fit into the buffer passed to the getdirentries(2) system
>    call, or a directory listing is not a valid snapshot of the
>    current state of the directory.
>
>    It also vastly complicates directory traversal restarts (hence
>    the ncookies and a_cookies arguments, since the NFS server
>    requires the ability to restart traversal, mid-block, since
>    the NFSv2 protocol returns directory entries one at a time).
>
>    The "cookie" idea must be carried out faithfully, in an FS
>    specific fashion, for each FS which is allowed to be NFS
>    exported. This code duplication is subject to error, or
>    worse, non-implementation due to its complexity.
>
>    A more rational approach would be to split the operation
>    into two separate VOP's: one to acquire a snapshot of a set
>    of FS specific directory entries of an arbitrary size, and
>    the second to extract entries into the user's buffer, in
>    canonical format.

Sounds interesting...

> 5. The idea of "root" vs. "non-root" mounts is inherently bad.
>
>    Right now, there are several operations, all wrapped into
>    a single "mount" entry point. This is actually a partial
>    transition to a more canonically correct implementation.
>
>    The reason for the "root" vs. "non-root" knowledge in the
>    code has to do with several logical operations:
>
>    1) "Mounting" the filesystem; that is, getting the
>       vnode for the device to be mounted, and doing any
>       FS specific operations necessary to cause the
>       correct in-core context to be established.
>
>    2) Covering the vnode at the mount point.
>
>       This operation updates the vnode of the mount
>       point so that traversals of the mount point will
>       get you the root directory of the FS that was
>       mounted instead of the directory that is covered
>       by the mount.
>
>    3) Saving the "last mounted on" information.
>
>       This is a clerical detail. Read-only FS's, and
>       some read-write FS's, do not implement this. It
>       is mostly a nicety for tools that manipulate FFS
>       directly.
>
>    4) Initialize the FS stat information.
>
>       Part of the in-core data for any FS is the mnt_stat
>       data, which is what comes back from a VFS_STATFS()
>       call

You forgot:

     5) Update export lists

        If you call the mount routine with no device name
        (args.fspec == 0) and with MNT_UPDATE, you get
        routed to the vfs_export routine.

>    The first operation is invariant. It must be done for all
>    FS's, whether they are "root" or "non-root".
>
>    The second operation is specific to "non-root" FS's. It
>    could be moved to common, higher level code -- specifically,
>    it could be moved into the mount system call.

I thought it was? Admittedly the only reference code I have is the ntfs
code in the NetBSD kernel. But given how full of #ifdef (__FreeBSD__)'s
it is, I thought it'd be an ok reference.

>    The third operation is also specific to "non-root" FS's. It
>    could be discarded, or it could be moved to a separate VFS
>    operation, e.g. VFS_SETMNTINFO(). I would recommend moving
>    it to a separate VFSOP, instead of discarding it. The reason
>    for this is that an intelligent person could reasonably decide
>    to add the setting of this data in newfs and tunefs, and do
>    away with /etc/fstab.
>
>    The fourth operation is invariant. It must be done for all
>    FS's, whether they are "root" or "non-root".

For comparison, NetBSD has a mount entry point, and a mountroot entry
point. But all the other ick is there too.
>    We can now see that we have two discrete operations:
>
>    1) Placement of any FS, regardless of how it is intended
>       to be used, into the list of mounted filesystems.
>
>    2) Mapping a filesystem from the list of mounted FS's
>       into the directory hierarchy.

     3) Updating export information.

>    The job of the per FS mount code should be to take a mount
>    structure, the vnode of a device, the FS specific arguments,
>    the mount point credentials, and the process requesting the
>    mount, and _only_ do #1 and #4.
>
>    The conversion of the root device into a vnode pointer, or
>    a path to a device into a vnode pointer, is the job of upper
>    level code -- specifically, the mount system call, and the
>    common code for booting.

My one concern about this is you've assumed that the user is mounting a
device onto a filesystem. Layered filesystems won't do that. nullfs,
umapfs, and unionfs will want a directory. The hierarchical storage
system I'm working on will want a file. kernfs, procfs, and an fs which
I haven't checked into the NetBSD tree don't really need the extra
parameter. Supporting all these different cases would be a hassle for
upstream code.

>    This removes a large amount of complex code from each of
>    the file systems, and centralizes the maintenance task into
>    one set of code that either works for everyone, or no one
>    (removing the duplication of code/introduction of errors
>    issue).

Might I suggest a common library of routines which different mount
routines can call? That way we'd get code sharing while letting the fs
make decisions about what it expects of the input arguments.

I've been looking forward to ripping the export updating out of the
mount call. It'd be nice if we could rototill both FreeBSD & NetBSD's
mount interfaces the same way at the same time. :-)

>    In addition, the lack of "root" specific code in many FS's
>    VFS_MOUNT entry points is the reason that they can not be
>    mounted as "/".
>    This change would open it up, such that any
>    FS that was supported by the kernel could be used as the
>    root filesystem.
>
> 6. The "vfs_default" code damages stacking
>
>    The intent of the stacking architecture was to have the
>    default operation for any VOP unknown to an FS fall through
>    to the lower level code, and fail if it was not implemented.
>
>    The use of the "vfs_default" to make unimplemented VOP's
>    fall through to code which implements the function, while well
>    intentioned, is misguided.
>
>    Consider the case of a VOP proxy that proxies requests. These
>    might be requests to another machine, as in the previous
>    proxy example, or they might be requests to user space, to
>    allow for easy development of new filesystem layers.
>
>    In addition, in order to get a default operation to actually
>    fail, you have to intentionally create a failing VOP for that
>    particular FS.
>
>    Finally, the paradigm can not support new VOP's without a
>    kernel recompilation. This means that in order to add to
>    the list of VOP's known to the system when you add a new FS,
>    you don't merely have to reallocate the in-core copy of the
>    vnodeop_desc to include a new (failing) member, you have to
>    create a default behaviour for it, and modify the default
>    operations table. In other words, it's not extensible, as
>    it was architected to be.

This problem is FreeBSD-specific. Your analysis seems sound.

> 7. The struct nameidata (namei.h) is broken in conception.
>
>    One issue that recurs frequently, and remains unaddressed,
>    is the issue of namespace abstraction.
>
>    This issue is nowhere more apparent than in the VFAT and NTFS
>    filesystems, where there are two namespaces: one 8.3, and the
>    second, 16 bit Unicode.
>
>    The problem is one of coherency, and one of reference, and
>    is not easily resolved in the context of the current nameidata
>    structure. Both NTFS and the VFAT FS try to cover this issue,
>    both with varying degrees of success.
>
>    The problem is that there is no canonical format that the
>    kernel can use to communicate namespace data to FS's. Unlike
>    VOP_READDIR, which has the abstract (though ill-implemented)
>    struct dirent, there is no abstract representation of the
>    data in a pathname buffer, which would allow you to treat
>    path components as opaque entities.
>
>    One potential remedy for this situation would be to canonicalize
>    any path into an ordered list of components. Ideally, this
>    would be done in 16 bit Unicode (looking toward the future),
>    but would minimally be separate components with length counts
>    to allow faster rejection of non-matching components, and
>    to avoid frequent recalculation of length.

NetBSD's name cache is a bit different from FreeBSD's, and might win
here. We have just VOP_LOOKUP, which calls the cache lookup routine,
rather than both a VOP_LOOKUP and a VOP_CACHEDLOOKUP.

Jaromir Dolecek has been discussing adding a canonicalized component
name to the cache entries. That way the VOP_LOOKUP routine gets called,
canonicalizes the name as it sees fit (say making it all upper case) if
it chooses to, and hands off to the cache lookup routine. The advantage
is that each fs can choose its own canonicalization, if it wants to. For
instance, ffs won't do anything (it's case sensitive), while other
case-insensitive fs's will do different things.

> 8. The filesystems have knowledge of the name cache.
>
>    Entries into the name cache, and deletion of entries from
>    the name cache, should be handled in FS independent code
>    at a higher level. This can avoid expensive VOP_LOOKUP calls
>    in many cases, and save marshalling arguments into and out of
>    the descriptor structure, in addition to drastically reducing
>    the function call overhead.
>
>    Someone recently profiling FreeBSD's FS to determine speed
>    bottlenecks (I believe it was Mike Smith, attempting to
>    optimize for a ZD Labs benchmark) found that FreeBSD spends
>    much of its time in namei().
I'm interested in what you suggest, because I'd expect all *BSD's could
use a more efficient namei. But I'm concerned that pushing too much into
upper-level routines would remove the fs's ability to make policy
decisions.

> 9. The implementation of namei() is POSIX non-compliant
>
>    The implementation of namei() is by means of coroutine
>    "recursion"; this is similar to the only recursion you can
>    achieve in FORTRAN.
>
>    The upshot of this is that the use of the "//" namespace
>    escape allowed by POSIX can not be usefully implemented.
>    This is because it is not possible to inherit a namespace
>    escape deeper than a single path component for a stack of
>    more than one layer in depth.
>
>    This needs to be fixed, both for "natural" SMBFS support,
>    and for other uses of the namespace escape (HTTP "tunnels",
>    extended attribute and/or resource fork access in an OS/2
>    HPFS or Macintosh HFS implementation, etc.), including
>    forward looking research.
>
>    This is related to item 7.

I'm sorry. This point didn't parse. Could you give an example? I don't
see how the namei recursion method prevents catching // as a namespace
escape.

> 10. Stacking is broken
>
>    This is really an issue of not having a coherency protocol
>    which can be applied between stacks of files. It is somewhat
>    related to almost all of the above issues.
>
>    The current thinking which has been forwarded by Matt and
>    John is that a vnode should have an associated vm_object_t,
>    and that coherency should be maintained that way.
>
>    This thinking is flawed for a number of reasons:
>
>    a. The main utility of this would be for an MFS
>       implementation. While a "fast MFS" is a
>       laudable goal, it isn't sufficient to drive this.
>
>    b. A coherency protocol is required in any case,
>       since a proxied VOP is not necessarily on the
>       same machine or in the same VM space. This
>       approach would disallow the possibility of a
>       user space filesystem development framework.
>
>    c.
>       There already exist aliases (VM implementation
>       errors); intentionally adding aliases as an
>       implementation detail will further obfuscate them.
>       Minimally, the VM system should pass a full
>       branch path analysis based test procedure before
>       they are introduced. Even then, I would argue
>       that it would open up a large complexity space
>       that would prevent us from ever being sure about
>       problem resolution again.
>
>    d. Filesystems which need to transform data can
>       never operate correctly, since they need to
>       make local copies of the transformed content.
>       This includes cryptographic, character set
>       translation, compression, and similar stacking
>       layers.
>
>    Instead, I think the interface design issues (VOP_ADVLOCK,
>    VOP_GETPAGES, VOP_PUTPAGES, VOP_READ, VOP_WRITE, et al.)
>    that drive the desire to implement coherency in this
>    fashion should be examined. I believe that an ideal solution
>    would be to never have the pages replicated at more than a
>    single vnode. This would likewise solve the coherency
>    problem, without the additional complexity. The issue
>    would devolve into locating the real backing object, and
>    potentially, translating extents.

As NetBSD's UBC work is moving in a similar direction, and I'm
interested in working on a compressing fs, I'm interested in the
solution you propose.

> 11. The function call "footprint" of filesystems is too large
>
>    Attempt the following:
>
>       Compile up all of the files which make up an
>       individual filesystem. You can take all of
>       the files for the ufs/ffs objects and the
>       vnode_if.o from a compiled kernel for this
>       exercise.
>
>       Now link them. Ignore the missing "main"; how
>       many undefined functions are there?
>
>    The problem you are seeing is the incursion of the VM
>    system, and sloppy programming practices, into each VFS
>    implementation.
>
>    This footprint impacts filesystem portability, and is
>    one reason, among many (including some of the above), that
>    VFS modules are no longer very portable between BSD
>    flavors.
>
>    Minimally, the VFS incursions need to be macrotized, and
>    not assume a unified VM and buffer cache (or a non-unified
>    VM and buffer cache, as well, for that matter). This would
>    improve portability considerably.

Sounds good. :-)

>    In addition to this change, a function minimization effort
>    should take place.
>
>    If the underlying interface utilized by VFS layers was not
>    the kernel (for local media FS's, like FFS or NTFS), but
>    instead a variable granularity block store with a numeric
>    namespace, then the "top" and "bottom" interfaces could be
>    identical. For now, however, some work can be done (and
>    should be done) to reduce the function call footprint.
>    This is important work, which can only aid development
>    of future work (such as a user space filesystem framework
>    for use by developers and researchers).
>
>    I hesitate to suggest this, but it might be reasonable to
>    consider a struct containing externally referenced functions,
>    which is registered into the FS via mount, and which is
>    identical for all FS's. This would, likewise, promote the
>    idea of a user space framework.
>
>    Ideally, work would be done to port the Heidemann framework
>    to Linux, so that their developers could be leveraged.
>
>
> Some FFS-specific problems are:
>
> 1. The directory code in the UFS layer is intertwined with the
>    filespace code
>
>    Ideally, one would be able to mount a filesystem as a flat
>    numeric namespace (see #7, above), and then mount the idea
>    of directory management over top of that.
>
> 2. The quota subsystem is too tightly integrated
>
>    Quotas should be an abstract stacking layer that can be
>    applied to any FS, instead of an FFS specific monstrosity.

It should certainly be possible to add a quota layer on top of any leaf
fs. That way you could de-couple quotas.
:-)

>    The current quota system is also limited to 16 bits for a
>    number of values which, in FreeBSD, can be greater than
>    16 bits (e.g. UID's).
>
>    The current quota system is also broken for Y2038.
>
> 3. The filesystem itself is broken for Y2038
>
>    The space which was historically reserved for the Y2038 fix
>    (a 64 bit time_t) was absconded with for subsecond resolution.
>
>    This change should be reverted, and fsck modified to re-zero
>    the values, given a specific argument.
>
>    The subsecond resolution doesn't really matter, but if it is
>    seen as an issue which needs to be addressed, the only value
>    which could reasonably require this is the modification time,
>    and there is sufficient free space in the inode to be able
>    to provide for this (there are 2x32 bit spares).

I think all the *BSD's need to do the same thing here. :-)

One other suggestion I've heard is to split the 64 bits we have for time
into 44 bits for seconds, and 20 bits for microseconds. That's more than
enough modification resolution, and also pushes things to past year
500,000 AD. Versioning the inode would cover this easily.

> I have other suggestions, but the above covers the most obvious
> damage.

Well taken.
Take care,

Bill

From owner-freebsd-fs Mon Aug 16 14:18:44 1999
From: Terry Lambert
Message-Id: <199908162118.OAA04940@usr09.primenet.com>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
To: wrstuden@nas.nasa.gov
Date: Mon, 16 Aug 1999 21:18:45 +0000 (GMT)
Cc: tlambert@primenet.com, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
In-Reply-To: from "Bill Studenmund" at Aug 16, 99 01:48:16 pm

> > 2. Advisory locks are hung off private backing objects.
> >
> >    Advisory locks are passed into VOP_ADVLOCK in each FS
> >    instance, and then each FS applies this by hanging the
> >    locks off a list on a private backing object. For FFS,
> >    this is the in core inode.
> >
> >    A more correct approach would be to hang the lock off the
> >    vnode. This effectively obviates the need for having a
> >    VOP_ADVLOCK at all, except for the NFS client FS, which
> >    will need to propagate lock requests across the net.
> >    The most efficient mechanism for this would be to institute
> >    a pass/fail response for VOP_ADVLOCK calls, with a default
> >    of "pass", and an actual implementation of the operation only
> >    in the NFS client FS.
>
> I agree that it's better for all fs's to share this functionality as much
> as possible.
>
> I'd vote against your implementation suggestion for VOP_ADVLOCK on an
> efficiency concern. If we actually make a VOP call, that should be the
> end of the story. I.e. either add a vnode flag to indicate pass/fail-ness,
> or add a genfs/std call to handle the problem.
>
> I'd actually vote for the latter. Hang the byte-range locking off of the
> vnode, and add a genfs_advlock() or vop_stdadvlock() routine (depending on
> OS flavor) to handle the call. That way all fs's can share code, and
> the callers need only call VOP_ADVLOCK() - no other logic.

OK. Here's the problem with that: NFS client locks in a stacked FS on
top of the NFS client FS.

Specifically, you need to separate the idea of asserting a lock against
the local vnode, asserting the lock via NFS locking, and coalescing the
local lock list, after both have succeeded, or reverting the local
assertion, should the remote assertion fail.

This is particularly important for transformative layers, specifically
cryptographic or compressing layers. A similar issue exists for
character sets, e.g. a Unicode enabled OS mounting an ISO 8859-1
filesystem via NFS, and having to do the directory (de)bloat on the fly.

> NetBSD actually needs this to get unionfs to work. Do you want to talk
> privately about it?

If you want. FreeBSD needs it for unionfs and nullfs, so it's something
that would be worth airing.

I think you could say that the absence of a locking routine is an
approval of the upper level lock. This lets you bail on all FS's except
NFS, where you have to deal with the approve/reject from the remote
host.
The problem with this on FreeBSD is the vfs_default stuff, which puts a
non-NULL interface on all FS's for all VOP's.

> > 3. Object locks are implemented locally in many FS's.
> >
> >    The VOP_LOCK interface is implemented via vop_stdlock()
> >    calls in many FS's. This is done using the "vfs_default"
> >    mechanism. In other FS's, it's implemented locally.
> >
> >    The intent of the VOP_LOCK mechanism being implemented
> >    as a VOP at all was to allow it to be proxied to another
> >    machine over a network, using the original Heidemann
> >    design. This is also the reason for the use of descriptors
> >    for all VOP arguments, since they can be opaquely proxied to
> >    another machine via a general mechanism. Unlike NFS based
> >    network filesystems, this would allow you to add VOP's to
> >    both machines, without having to teach the transport about
> >    the new VOP for it to be usable remotely.
>
> Just for a point of comparison, I recently got almost all the NetBSD fs's
> to use common code. After our -Lite2 merge, all fs's were either calling
> the lock manager, or using genfs_nolock() (a version for non-locking
> fs's). Now there's a struct lock * and struct lock in struct vnode. The fs
> exports its locking behavior via the struct lock *. For most fs's, the
> struct lock * points to the struct lock, and genfs_lock() feeds that to
> the lock manager.
>
> But we've kept the ability to do something different (like call over the
> network) alive. If the struct lock * is NULL, you have to call VOP_LOCK on
> that fs. Note that this difference only matters for layered fs's -
> everything else should be calling VOP_LOCK() and letting the dispatch code
> figure out the right thing to do.

Yes, this NULL is the same NULL I suggested for advisory locks, above.

FreeBSD has moved to more common code, but it's all call-down based
because of the vfs_default stuff again.

> > 5. The idea of "root" vs. "non-root" mounts is inherently bad.
> > > > Right now, there are several operations, all wrapped into > > a single "mount" entry point. This is actually a partial > > transition to a more cannonically correct implemetnation. > > > > The reason for the "root" vs. "non-root" knowledge in the > > code has to do with several logical operations: > > > > 1) "Mounting" the filesystem; that is, getting the > > vnode for the device to be mounted, and doing any > > FS specific operations necessary to cause the > > correct in-core context to be established. > > > > 2) Covering the vnode at the mount point. > > > > This operation updates the vnode of the mount > > point so that traversals of the mount point will > > get you the root directory of the FS that was > > mounted instead of the directory that is covered > > by the mount. > > > > 3) Saving the "last mounted on" information. > > > > This is a clerical detail. Read-only FS's, and > > some read-write FS's, do not implement this. It > > is mostly a nicety for tools that manipulate FFS > > directly. > > > > 4) Initialize the FS stat information. > > > > Part of the in-core data for any FS is the mnt_stat > > data, which is what comes back from a VFS_STATFS() > > call > > You forgot: > > 5) Update export lists > > If you call the mount routine with no device name > (args.fspec == 0) and with MNT_UPDATE, you get > routed to the vfs_export routine This must be the job of the upper level code, so that there is a single control point for export information, instead of spreading it throughout ead FS's mount entry point. > > The first operation is invariant. It must be done for all > > FS's, whether they are "root" or "non-root". > > > > The second operation is specific to "non-root" FS's. It > > could be moved to common, higher level code -- specifically, > > it could be moved into the mount system call. > > I thought it was? Admitedly the only reference code I have is the ntfs > code in the NetBSD kernel. 
But given how full of #ifdef (__FreeBSD__)'s it > is, I thought it'd be an ok reference. No. Basically, what you would have is the equivalent of a variable length "mounted volume" table, from which mappings (and exports, based on the mappings) are externalized into the namespace. > > The third operation is also specific to "non-root" FS's. It > > could be discarded, or it could be moved to a separate VFS > > operation, e.g. VFS_SETMNTINFO(). I would recommend moving > > it to a separate VFSOP, instead of discarding it. The reason > > for this is that an intelligent person could reasonably decide > > to add the setting of this data in newfs and tunefs, and do > > away with /etc/fstab. > > > > The fourth operation is invariant. It must be done for all > > FS's, whether they are "root" or "non-root". > > For comparison, NetBSD has a mount entry point, and a mountroot entry > point. But all the other ick is there too. Right. It should just have a "mount" entry point, and the rest of the stuff moves to higher level code, called by the mount system call, and the mountroot stuff during boot, to externalize the root volume at the top of the hierarchy. An ideal world would mount a / that had a /dev under it, and then do transparent mounts over top of that. > > We can now see that we have two discrete operations: > > > > 1) Placement of any FS, regardless of how it is intended > > to be used, into the list of mounted filesystems. > > > > 2) Mapping a filesystem from the list of mounted FS's > > into the directory hierarchy. > > 3) Updating export information. Built into the higher level code, same place as #2. > > The job of the per FS mount code should be to take a mount > > structure, the vnode of a device, the FS specific arguments, > > the mount point credentials, and the process requesting the > > mount, and _only_ do #1 and #4.
> > > > The conversion of the root device into a vnode pointer, or > > a path to a device into a vnode pointer, is the job of upper > > level code -- specifically, the mount system call, and the > > common code for booting. > > My one concern about this is you've assumed that the user is mounting a > device onto a filesystem. No. Vnoide, not bdevvp. The bdevvp stuff is for the boot time stuff in the upper level code, and only applies to the root volume. > Layered filesystems won't do that. nullfs, > umaptfs, and unionfs will want a directory. The hierarchical storage > system I'm working on will want a file. kernfs, procfs, and an fs which I > haven't checked into the NetBSD tree don't really need the extra > parameter. Supporting all these different cases would be a hassle for > upstream code. > > > This removes a large amount of complex code from each of > > the file systems, and centralizes the maintenance task into > > one set of code that either works for everyone, or no one > > (removing the duplication of code/introduction of errors > > issue). > > Might I suggest a common library of routines which different mount > routines can call? That way we'd get code sharing while letting the fs > make decisions about what it expects of the input arguments. This is the "footprint" problem, all over again. Reject/accept (or "accept if no VOP") seems more elegant, and also reduces footprint. > I've been looking forward to ripping the export updating out of the mount > call. It'd be nice if we could rototill both FreeBSD & NetBSD's mount > interfaces the same way at the same time. :-) 8-). > > 7. The struct nameidata (namei.h) is broken in conception. > > > > One issue that recurrs frequently, and remains unaddressed, > > is the issue of namespace abstraction. > > > > This issue is nowhere more apparent than in the VFAT and NTFS > > filesystems, where there are two namespaces: one 8.3, and the > > second, 16 bit Unicode. 
> > > > The problem is one of coherency, and one of reference, and > > is not easily resolved in the context of the current nameidata > > structure. Both NTFS and the VFAT FS try to cover this issue, > > both with varing degress of success. > > > > The problem is that there is no cannonical format that the > > kernel can use to communicate namespace data to FS's. Unlike > > VOP_READDIR, which has the abstract (though ill-implemented) > > struct dirent, there is no abstract representation of the > > data in a pathname buffer, which would allow you to treat > > path components as opaque entities. > > > > One potential remedy for this situation would be to cannonize > > any path into an ordered list of components. Ideally, this > > would be done in 16 bit Unicode (looking toward the future), > > but would minimally be seperate components with length counts > > to allow faster rejection of non-matching components, and > > frequent recalculation of length. > > NetBSD's name cache is a bit different from FreeBSD's, and might win here. > We have just VOP_LOOKUP, which calls the cache lookup routine, rather than > both a VOP_LOOKUP and a VOP_CACHEDLOOKUP. > > Jaromir Dolecek has been discussing adding a canonicalized component name > to the cache entries. That way the VOP_LOOKUP routine gets called, > canonicalizes the name as it sees fit (say making it all upper case) if > it chooses to, and hands off to the cache lookup routine. The advantage is > that each fs can chose its on canonicalization, if it wants to. For > instance, ffs won't do anything (it's case sensetive), while other > case-insensitive fs's will do different things. Can you push a Unicode name down from an appropriate system call? I don't see any way to deal with an NT FS for characters outside ISO 8859-1, otherwise. 8-(. > > 9. 
The implementation of namei() is POSIX non-compliant > > > > The implementation of namei() is by means of coroutine > > "recursion"; this is similar to the only recursion you can > > achieve in FORTRAN. > > > > The upshot of this is that the use of the "//" namespace > > escape allowed by POSIX can not be usefully implemented. > > This is because it is not possible to inherit a namespace > > escape deeper than a single path component for a stack of > > more than one layer in depth. > > > > This needs to be fixed, both for "natural" SMBFS support, > > and for other uses of the namespace escape (HTTP "tunnels", > > extended attribute and/or resource fork access in an OS/2 > > HPFS or Macintosh HFS implementation, etc.), including > > forward looking research. > > > > This is related to item 7. > > I'm sorry. This point didn't parse. Could you give an example? > > I don't see how the namei recursion method prevents catching // as a > namespace escape. //apple-resource-fork/intermediate_dir/some_other_dir/file_with_fork You can't inherit the fact that you are looking at the resource fork in the terminal component, ONLY. > > Instead, I think the interface design issues (VOP_ADVLOCK, > > VOP_GETPAGES, VOP_PUTPAGES, VOP_READ, VOP_WRITE, et al.) > > that drive the desire to implement coherency in this > > fashion be examined. I believe that an ideal solution > > would be to never have the pages replicated at more than a > > single vnode. This would likewise solve the coherency > > problem, without the additional complexity. The issue > > would devolve into locating the real backing object, and > > potentially, translating extents. > > As NetBSD's UBC work is moving in a similar direction, and I'm interested > in working on a compressing fs, I'm interested in the solution you > propose. Matt Dillon is apparently the person doing the work here. It seems I am out of date on the current thinking, as the vm_object_t approach has apparently been discarded. > > 2.
The quota subsystem is too tightly integrated > > > > Quotas should be an abstract stacking layer that can be > > applied to any FS, instead of an FFS specific monstrosity. > > It should certainly be possible to add a quota layer on top of any leaf > fs. That way you could de-couple quotas. :-) Yes, assuming stacking works in the first place... > > 3. The filesystem itself is broken for Y2038 > > > > The space which was historically reserved for the Y2038 fix > > (a 64 bit time_t) was absconded with for subsecond resolution. > > > > This change should be reverted, and fsck modified to re-zero > > the values, given a specific argument. > > > > The subsecond resolution doesn't really matter, but if it is > > seen as an issue which needs to be addressed, the only value > > which could reasonably require this is the modification time, > > and there is sufficient free space in the inode to be able > > to provide for this (there are 2x32 bit spares). > > I think all the *BSD's need to do the same thing here. :-) > > One other suggestion I've heard is to split the 64 bits we have for time > into 44 bits for seconds, and 20 bits for microseconds. That's more than > enough modification resolution, and also pushes things to past year > 500,000 AD. Versioning the inode would cover this easily. Ugh. But possible... Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.
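The 44/20 bit split quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 44-bit seconds field is unsigned and counted from 1970, and the names (last_year, pack, unpack) are made up for illustration and are not taken from any BSD source.

```python
SECONDS_PER_YEAR = 31_556_952  # mean Gregorian year: 365.2425 days * 86400 s
SECOND_BITS = 44               # assumed unsigned, counted from 1970
MICRO_BITS = 20                # 2**20 = 1_048_576, enough for 0..999_999


def last_year(seconds_bits, epoch_year=1970):
    """Last calendar year an unsigned seconds counter of this width can reach."""
    return epoch_year + (2 ** seconds_bits - 1) // SECONDS_PER_YEAR


def pack(seconds, micros):
    """Pack (seconds, microseconds) into one 64-bit word, seconds in the high bits."""
    if not 0 <= seconds < 2 ** SECOND_BITS:
        raise ValueError("seconds out of range for 44 bits")
    if not 0 <= micros < 1_000_000:
        raise ValueError("microseconds out of range")
    return (seconds << MICRO_BITS) | micros


def unpack(word):
    """Split a packed 64-bit word back into (seconds, microseconds)."""
    return word >> MICRO_BITS, word & (2 ** MICRO_BITS - 1)


print(last_year(31))  # positive range of a signed 32-bit time_t
print(last_year(44))  # range of the proposed 44-bit seconds field
```

With 31 bits the counter runs out in 2038, which is the Y2038 problem itself. With 44 unsigned bits it runs well past year 500,000, as claimed. If the field were signed instead, only 43 bits would be available for future dates and the limit would drop to roughly half that.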
From owner-freebsd-fs Mon Aug 16 16:28:42 1999 Delivered-To: freebsd-fs@freebsd.org Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17]) by hub.freebsd.org (Postfix) with ESMTP id 24F1714DA5; Mon, 16 Aug 1999 16:28:28 -0700 (PDT) (envelope-from wrstuden@marcy.nas.nasa.gov) Received: from localhost (wrstuden@localhost) by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id QAA03157; Mon, 16 Aug 1999 16:04:11 -0700 (PDT) Date: Mon, 16 Aug 1999 16:04:11 -0700 (PDT) From: Bill Studenmund Reply-To: Bill Studenmund To: Terry Lambert Cc: Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-Reply-To: <199908162118.OAA04940@usr09.primenet.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Mon, 16 Aug 1999, Terry Lambert wrote: > > > 2. Advisory locks are hung off private backing objects. > > I'd vote against your implementation suggestion for VOP_ADVLOCK on an > > efficiency concern. If we actually make a VOP call, that should be the > > end of the story. I.e. either add a vnode flag to indicate pass/fail-ness, > > or add a genfs/std call to handle the problem. > > > > I'd actually vote for the latter. Hang the byte-range locking off of the > > vnode, and add a genfs_advlock() or vop_stdadvlock() routine (depending on > > OS flavor) to handle the call. That way all fs's can share code, and > > the callers need only call VOP_ADVLOCK() - no other logic. > > OK. Here's the problem with that: NFS client locks in a stacked > FS on top of the NFS client FS. Ahh, but it'd be the fs's decision to map genfs_advlock()/vop_stdadvlock() to its vop_advlock_desc entry or not. In this case, NFS wouldn't want to do that. Though it would mean growing the fs footprint.
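A toy model of that arrangement, for illustration only. Each fs puts either the shared routine or its own routine in the advlock slot of its operations vector, and callers only ever go through VOP_ADVLOCK(). Everything here is an assumption made for the sketch: the Vnode class, the dict-based vector, and make_nfs_advlock() are hypothetical, all locks are treated as exclusive, and no coalescing is done. The NFS-style entry follows the local-then-remote order with a revert on remote failure that was described earlier in the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Vnode:
    ops: dict                                   # per-fs operations vector (model only)
    locks: list = field(default_factory=list)   # byte-range locks hung off the vnode


def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open ranges [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def generic_advlock(vp, owner, start, end):
    """Stand-in for genfs_advlock()/vop_stdadvlock(): purely local locking."""
    for held_owner, held_start, held_end in vp.locks:
        if held_owner != owner and overlaps(held_start, held_end, start, end):
            return False                        # conflicts with another owner's lock
    vp.locks.append((owner, start, end))
    return True


def make_nfs_advlock(remote_lock):
    """Build an fs-specific entry: assert locally, then remotely, revert on failure."""
    def nfs_advlock(vp, owner, start, end):
        saved = list(vp.locks)                  # snapshot of the local lock list
        if not generic_advlock(vp, owner, start, end):
            return False
        if not remote_lock(owner, start, end):  # remote host rejected the lock
            vp.locks = saved                    # back out the local assertion
            return False
        return True
    return nfs_advlock


def VOP_ADVLOCK(vp, owner, start, end):
    """Callers use only this; the fs's vector decides what actually runs."""
    return vp.ops["advlock"](vp, owner, start, end)
```

A local fs would be set up as Vnode(ops={"advlock": generic_advlock}), and an NFS-like one as Vnode(ops={"advlock": make_nfs_advlock(some_remote_call)}), where some_remote_call is whatever stands in for the remote lock request. The snapshot is the simplest way to get a revert in a model this small. A real lock list that coalesces ranges would need more care to back out.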
> Specifically, you need to seperate the idea of asserting a lock > against the local vnode, asserting the lock via NFS locking, and > coelescing the local lock list, after both have succeeded, or > reverting the local assertion, should the remote assertion fail. Right. But my thought was that you'd be calling an NFS routine, so it could do the right thing. > > NetBSD actually needs this to get unionfs to work. Do you want to talk > > privately about it? > > If you want. FreeBSD needs it for unionfs and nullfs, so it's > something that would be worth airing. > > I think you could say that no locking routine was an approval of > the uuper level lock. This lets you bail on all FS's except NFS, > where you have to deal with the approve/reject from the remote > host. The problem with this on FreeBSD is the VFS_default stuff, > which puts a non-NULL interface on all FS's for all VOP's. I'm not familiar with the VFS_default stuff. All the vop_default_desc routines in NetBSD point to error routines. > Yes, this NULL is the same NULL I suggested for advisory locks, > above. I'm not sure. The struct lock * is only used by layered filesystems, so they can keep track both of the underlying vnode lock, and if needed their own vnode lock. For advisory locks, would we want to keep track both of locks on our layer and the layer below? Don't we want either one or the other? i.e. layers bypass to the one below, or deal with it all themselves. > > > 5. The idea of "root" vs. "non-root" mounts is inherently bad. > > You forgot: > > > > 5) Update export lists > > > > If you call the mount routine with no device name > > (args.fspec == 0) and with MNT_UPDATE, you get > > routed to the vfs_export routine > > This must be the job of the upper level code, so that there is > a single control point for export information, instead of spreading > it throughout ead FS's mount entry point. I agree it should be detangled, but think it should remain the fs's job to choose to call vfs_export. 
Otherwise an fs can't implement its own export policies. :-) > > I thought it was? Admittedly the only reference code I have is the ntfs > > code in the NetBSD kernel. But given how full of #ifdef (__FreeBSD__)'s it > > is, I thought it'd be an ok reference. > > No. We've lost the context, but what I was trying to say was that I thought the marking-the-vnode-as-mounted-on bit was done in the mount syscall at present. At least that's what http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_syscalls.c?rev=1.130 seems to be doing. > Basically, what you would have is the equivalent of a variable > length "mounted volume" table, from which mappings (and exports, > based on the mappings) are externalized into the namespace. Ahh, sounds like you're talking about a new formalism.. > Right. It should just have a "mount" entry point, and the rest > of the stuff moves to higher level code, called by the mount system > call, and the mountroot stuff during boot, to externalize the root > volume at the top of the hierarchy. > > An ideal world would mount a / that had a /dev under it, and then > do transparent mounts over top of that. That would be quite a different place than we have now. ;-) > > > The conversion of the root device into a vnode pointer, or > > > a path to a device into a vnode pointer, is the job of upper > > > level code -- specifically, the mount system call, and the > > > common code for booting. > > > > My one concern about this is you've assumed that the user is mounting a > > device onto a filesystem. > > No. Vnode, not bdevvp. The bdevvp stuff is for the boot time stuff > in the upper level code, and only applies to the root volume. Maybe I mis-parsed. I thought you were talking about parsing the first mount option (in mount /dev/disk there, the /dev/disk option) into a vnode. The concern below is that different fs's have different ideas as to what that node should be.
Some want it a device node which no one else is using (most leaf fs's), while some others want a directory (nullfs, etc), some want a file or device (the HSM system I'm working on) while others don't care (in mount -t kernfs /kern /kern , the first kern doesn't matter at all). But all is well with different support routines which the mount_foo() routine can call. > > Layered filesystems won't do that. nullfs, > > umapfs, and unionfs will want a directory. The hierarchical storage > > system I'm working on will want a file. kernfs, procfs, and an fs which I > > haven't checked into the NetBSD tree don't really need the extra > > parameter. Supporting all these different cases would be a hassle for > > upstream code. > > > > > This removes a large amount of complex code from each of > > > the file systems, and centralizes the maintenance task into > > > one set of code that either works for everyone, or no one > > > (removing the duplication of code/introduction of errors > > > issue). > > > > Might I suggest a common library of routines which different mount > > routines can call? That way we'd get code sharing while letting the fs > > make decisions about what it expects of the input arguments. > > This is the "footprint" problem, all over again. Reject/accept (or > "accept if no VOP") seems more elegant, and also reduces footprint. Very true. The problem is that the current VFS system was designed as a black box. It gets handed all calls, and it gets to decide policy, and do everything on its own. We're now basically discussing ways of having the plethora of fs's we now have do things the same way. :-) > > > 7. The struct nameidata (namei.h) is broken in conception. > > Can you push a Unicode name down from an appropriate system call? > > I don't see any way to deal with an NT FS for characters outside > ISO 8859-1, otherwise. 8-(. Hmmm. I think the real problem is that the kernel(s) is(are) not at all designed well for different languages. > > > 9.
The implementation of namei() is POSIX non-compliant > > > > > > The implementation of namei() is by means of coroutine > > > "recursion"; this is similar to the only recursion you can > > > achieve in FORTRAN. > > > > > > The upshot of this is that the use of the "//" namespace > > > escape allowed by POSIX can not be usefully implemented. > > > This is because it is not possible to inherit a namespace > > > escape deeper than a single path component for a stack of > > > more than one layer in depth. > > > > > > This needs to be fixed, both for "natural" SMBFS support, > > > and for other uses of the namespace escape (HTTP "tunnels", > > > extended attribute and/or resource fork access in an OS/2 > > > HPFS or Macintosh HFS implementation, etc.), including > > > forward looking research. > > > > > > This is related to item 7. > > > > I'm sorry. This point didn't parse. Could you give an example? > > > > I don't see how the namei recursion method prevents catching // as a > > namespace escape. > > > //apple-resource-fork/intermediate_dir/some_other_dir/file_with_fork > > You can't inherit the fact that you are looking at the resource fork > in the terminal component, ONLY. Yep, there's no easy way to do that now.. The one thing which comes to mind is to have lookup() rip out the first component and save it in the namei struct. Though the devil's advocate in me points out that this difficulty is not inherent in the recursion setup, but in how lookup() is designed. :-) > > > Quotas should be an abstract stacking layer that can be > > > applied to any FS, instead of an FFS specific monstrosity. > > > > It should certainly be possible to add a quota layer on top of any leaf > > fs. That way you could de-couple quotas. :-) > > Yes, assuming stacking works in the first place... Except for a minor buglet with device nodes, stacking works in NetBSD at present. 
:-) > > One other suggestion I've heard is to split the 64 bits we have for time > > into 44 bits for seconds, and 20 bits for microseconds. That's more than > > enough modification resolution, and also pushes things to past year > > 500,000 AD. Versioning the inode would cover this easily. > > Ugh. But possible... I agree it's ugly, but it has the advantage that it doesn't grow the on-disk inode. A lot of folks have designs on the remaining 64 bits free. :-) Take care, Bill From owner-freebsd-fs Mon Aug 16 19:33: 6 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132]) by hub.freebsd.org (Postfix) with ESMTP id E37FB14D15; Mon, 16 Aug 1999 19:32:53 -0700 (PDT) (envelope-from tlambert@usr02.primenet.com) Received: (from daemon@localhost) by smtp02.primenet.com (8.8.8/8.8.8) id TAA19029; Mon, 16 Aug 1999 19:31:20 -0700 (MST) Received: from usr02.primenet.com(206.165.6.202) via SMTP by smtp02.primenet.com, id smtpd019018; Mon Aug 16 19:31:18 1999 Received: (from tlambert@localhost) by usr02.primenet.com (8.8.5/8.8.5) id TAA08526; Mon, 16 Aug 1999 19:31:16 -0700 (MST) From: Terry Lambert Message-Id: <199908170231.TAA08526@usr02.primenet.com> Subject: Re: BSD XFS Port & BSD VFS Rewrite To: wrstuden@nas.nasa.gov Date: Tue, 17 Aug 1999 02:31:16 +0000 (GMT) Cc: tlambert@primenet.com, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG In-Reply-To: from "Bill Studenmund" at Aug 16, 99 04:04:11 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > > > > 2. Advisory locks are hung off private backing objects. > > > > > > I'd vote against your implementation suggestion for VOP_ADVLOCK on an > > > efficiency concern.
If we actually make a VOP call, that should be the > > > end of the story. I.e. either add a vnode flag to indicate pass/fail-ness, > > > or add a genfs/std call to handle the problem. > > > > > > I'd actually vote for the latter. Hang the byte-range locking off of the > > > vnode, and add a genfs_advlock() or vop_stdadvlock() routine (depending on > > > OS flavor) to handle the call. That way all fs's can share code, and > > > the callers need only call VOP_ADVLOCK() - no other logic. > > > > OK. Here's the problem with that: NFS client locks in a stacked > > FS on top of the NFS client FS. > > Ahh, but it'd be the fs's decision to map genfs_advlock()/vop_stdadvlock() > to its vop_advlock_desc entry or not. In this case, NFS wouldn't want to > do that. > > Though it would mean growing the fs footprint. Nope; that's not really the problem. The problem is if I have two local processes that get into a race in order to obtain a remote lock. Because the remote lock is not asserted, there's no way to ensure that the order of service for the request is the same as the order of request -- consider cooperating programs, like sendmail and pine or elm (or whatever). The only way to resolve this is to ensure that the cooperating programs on the same system are lockstepped: at the client. The only way to do this is to assert the lock locally, then remotely, if the local assertion succeeds. In the case of our cooperating local processes, this resolves the race condition (depending on F_SETLCK/F_SETLCKW): they behave as if the locks were local, which is what you want. > > Specifically, you need to separate the idea of asserting a lock > > against the local vnode, asserting the lock via NFS locking, and > > coalescing the local lock list, after both have succeeded, or > > reverting the local assertion, should the remote assertion fail. > > Right. But my thought was that you'd be calling an NFS routine, so it > could do the right thing.
The problem is that the local lock doesn't belong to NFS. Even if it did (I think this would be an error for a remotely mounted "whiteout" in a "translucent" local FS), the problem is that in doing the local assertion, you will intrinsically coalesce locks. Now if the lock mode you are requesting overlaps a previous lock, and the modes are not exactly the same, there's no way to back out the local promotion or demotion without a coalesce. This doesn't resolve the most complex cases you could contrive, with multiple stacking layers that don't support a distributed coherency protocol for locks for two or more players, but it handles the local vs. NFS issues acceptably. > > > NetBSD actually needs this to get unionfs to work. Do you want to talk > > > privately about it? > > > > If you want. FreeBSD needs it for unionfs and nullfs, so it's > > something that would be worth airing. > > > > I think you could say that no locking routine was an approval of > > the upper level lock. This lets you bail on all FS's except NFS, > > where you have to deal with the approve/reject from the remote > > host. The problem with this on FreeBSD is the VFS_default stuff, > > which puts a non-NULL interface on all FS's for all VOP's. > > I'm not familiar with the VFS_default stuff. All the vop_default_desc > routines in NetBSD point to error routines. In FreeBSD, they now point to default routines that are *not* error routines. This is the problem. I admit the change was very well intentioned, since it made the code a hell of a lot more readable, but choosing between readable and additional function, I take function over form (I think the way I would have "fixed" the readability is by making the operations that result in the descriptor set for a mounted FS instance be both discrete, and named for their specific function). > > Yes, this NULL is the same NULL I suggested for advisory locks, > > above. > > I'm not sure.
The struct lock * is only used by layered filesystems, so > they can keep track both of the underlying vnode lock, and if needed their > own vnode lock. For advisory locks, would we want to keep track both of > locks on our layer and the layer below? Don't we want either one or the > other? i.e. layers bypass to the one below, or deal with it all > themselves. I think you want the lock on the intermediate layer: basically, on every vnode that has data associated with it that is unique to a layer. Let's not forget, also, that you can expose a layer into the namespace in one place, and expose it covered under another layer, at another. If you locked down to the backing object, then the only issue you would be left with is one or more intermediate backing objects. For a layer with an intermediate backing object, I'm prepared to declare it "special", and proxy the operation down to any inferior backing object (e.g. a union FS that adds files from two FS's together, rather than just directory entry lists). I think such layers are the exception, not the rule. > > > > 5. The idea of "root" vs. "non-root" mounts is inherently bad. > > > You forgot: > > > > > > 5) Update export lists > > > > > > If you call the mount routine with no device name > > > (args.fspec == 0) and with MNT_UPDATE, you get > > > routed to the vfs_export routine > > > > This must be the job of the upper level code, so that there is > > a single control point for export information, instead of spreading > > it throughout each FS's mount entry point. > > I agree it should be detangled, but think it should remain the fs's job to > choose to call vfs_export. Otherwise an fs can't implement its own export > policies. :-) I think that export policies are the realm of /etc/exports. The problem with each FS implementing its own policy, is that this is another place that copyinstr() gets called, when it shouldn't. > > > I thought it was?
Admittedly the only reference code I have is the ntfs > > > code in the NetBSD kernel. But given how full of #ifdef (__FreeBSD__)'s it > > > is, I thought it'd be an ok reference. > > > > No. > > We've lost the context, but what I was trying to say was that I thought > the marking-the-vnode-as-mounted-on bit was done in the mount syscall at > present. At least that's what > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_syscalls.c?rev=1.130 > seems to be doing. > > > Basically, what you would have is the equivalent of a variable > > length "mounted volume" table, from which mappings (and exports, > > based on the mappings) are externalized into the namespace. > > Ahh, sounds like you're talking about a new formalism.. Right. The "covering" operation is not the same as the "marking as covered" operation. Both need to be at the higher level. If you wanted to get gross, you could say that it was a volume table, and then use POSIX namespace escapes, such as "//DISK/2/..." to access each disk as its own "/". This sounds gross, but if you had 4M extents on a very large disk, it would be nearly ideal for installing software: each package would get its own "disk", and you could share "packages" instead of "FS"'s. For something like mobile computing, consider a package as a shared resource. You have your presentation package which you mount off your local net, you fly to New York to present, and you mount the same presentation package from the network where you are a guest. Forget paths, installation, and all that crap. > > Right. It should just have a "mount" entry point, and the rest > > of the stuff moves to higher level code, called by the mount system > > call, and the mountroot stuff during boot, to externalize the root > > volume at the top of the hierarchy. > > > > An ideal world would mount a / that had a /dev under it, and then > > do transparent mounts over top of that. > > That would be quite a different place than we have now. ;-) Not really.
Julian Elischer had code that mounted a /devfs under / automatically, before the user was ever allowed to see /. As a result, the FS that you were left with was indistinguishable from what I describe. The only real difference is that, as a translucent mount over /devfs, the one I describe would be capable of implementing persistent changes to the /devfs, as whiteouts. I don't think this is really that desirable, but some people won't accept a devfs that doesn't have traditional persistence semantics (e.g. "chmod" vs. modifying a well known kernel data structure as an administrative operation). I guess the other difference is that you don't have to worry about large minor numbers when you are bringing up a new platform via NFS from an old platform that can't support large minors in its FS at all. ;-). > > > > The conversion of the root device into a vnode pointer, or > > > > a path to a device into a vnode pointer, is the job of upper > > > > level code -- specifically, the mount system call, and the > > > > common code for booting. > > > > > > My one concern about this is you've assumed that the user is mounting a > > > device onto a filesystem. > > > > No. Vnode, not bdevvp. The bdevvp stuff is for the boot time stuff > > in the upper level code, and only applies to the root volume. > > Maybe I mis-parsed. I thought you were talking about parsing the first > mount option (in mount /dev/disk there, the /dev/disk option) into a > vnode. The concern below is that different fs's have different ideas as to > what that node should be. Some want it a device node which no one else is > using (most leaf fs's), while some others want a directory (nullfs, etc), > some want a file or device (the HSM system I'm working on) while others > don't care (in mount -t kernfs /kern /kern , the first kern doesn't matter > at all). But all is well with different support routines which the > mount_foo() routine can call.
I would resolve this by passing a standard option to the mount code in user space. For root mounts, a vnode is passed down. For other mounts, the vnode is parsed and passed if the option is specified. I think that you will only be able to find rare examples of FS's that don't take device names as arguments. But for those, you don't specify the option, and it gets "NULL", and whatever local options you specify. The point is that, for FS's that can be both root and sub-root, the mount code doesn't have to make the decision, it can be punted to higher level code, in one place, where the code can be centrally maintained and kept from getting "stale" when things change out from under it. > > > Might I suggest a common library of routines which different mount > > > routines can call? That way we'd get code sharing while letting the fs > > > make decisions about what it expects of the input arguments. > > > > This is the "footprint" problem, all over again. Reject/accept (or > > "accept if no VOP") seems more elegant, and also reduces footprint. > > Very true. The problem is that the current VFS system was designed as a > black box. It gets handed all calls, and it gets to decide policy, and do > everything on its own. We're now basically discussing ways of having the > plethora of fs's we now have do things the same way. :-) I don't think so. I like to think in terms of "VFS consumer" and "VFS producer". The implied semantics are the province of the "VFS consumer". A good example of this is to look at another VFS consumer, the NFS server. It really doesn't want implied semantics, and, in fact, wants to have a set of semantics (server locking information) sent in through a separate communications channel. The way things are right now, as a VFS consumer, the NFS server is a second class citizen. One could imagine an AppleTalk or SMB server in the kernel, as well, also VFS consumers.
And one could imagine doing VFS operations against files _from within the kernel_ (say in a "quota" stacking layer, or a resource fork/extended attributes stacking layer). The point is, you want to stop implying some semantics for these consumers. Where you draw the line is where you imply semantics via call-down, or via reject/accept. If you don't want them implied all the time, for all consumers, then they belong in the system call layer; otherwise, they belong in the VFS layer doing the implementation. There's an abstraction here: is the VFS stacking layer you are talking about one that implements semantics? For an ACL stacking layer, your answer is yes. But for an NFS server stacked on a VFS? Or a namespace hiding layer? > > > > 7. The struct nameidata (namei.h) is broken in conception. > > > > Can you push a Unicode name down from an appropriate system call? > > > > I don't see any way to deal with an NT FS for characters outside > > ISO 8859-1, otherwise. 8-(. > > Hmmm. I think the real problem is that the kernel(s) is(are) not at all > designed well for different languages. Well, if you make the path component descriptor into an opaque object, you can pass it down to the point you get to someone who understands the encapsulated data. The interpretation is a rendezvous -- an agreement -- between the source providing the data, and the target interpreting it. > > > > 9. The implementation of namei() is POSIX non-compliant > > > > > > > > The implementation of namei() is by means of coroutine > > > > "recursion"; this is similar to the only recursion you can > > > > achieve in FORTRAN. [ ... ] > > > > > > I'm sorry. This point didn't parse. Could you give an example? > > > > > > I don't see how the namei recursion method prevents catching // as a > > > namespace escape. > > > > > > //apple-resource-fork/intermediate_dir/some_other_dir/file_with_fork > > > > You can't inherit the fact that you are looking at the resource fork > > in the terminal component, ONLY.
> > Yep, there's no easy way to do that now.. The one thing which comes to > mind is to have lookup() rip out the first component and save it in the > namei struct. > > Though the devil's advocate in me points out that this difficulty is not > inherent in the recursion setup, but in how lookup() is designed. :-) If it were a parameter, "namespace", to the function, it'd work, too. The problem is that you really want to install "namespace handlers" for these escapes, probably on a per FS basis. The only way I can see this working is to place the namespace into the path descriptor _separately_ from the path components (however they get parsed out by that namespace). This shows the evils of "copyinstr()" in the full light of day: I can't have a "//unicode/..." name space escape, unless I assume ISO-8859-1, like the NTFS currently does, or unless I engage in some unnatural act with my "..." following the escape (e.g. UTF-8). > > > > Quotas should be an abstract stacking layer that can be > > > > applied to any FS, instead of an FFS specific monstrosity. > > > > > > It should certainly be possible to add a quota layer on top of any leaf > > > fs. That way you could de-couple quotas. :-) > > > > Yes, assuming stacking works in the first place... > > Except for a minor buglet with device nodes, stacking works in NetBSD at > present. :-) Have you tried Heidemann's student's stacking layers? There is one encryption, and one per-file compression with namespace hiding, that I think it would be hard pressed to keep up with. But I'll give it the benefit of the doubt. 8-). > > > One other suggestion I've heard is to split the 64 bits we have for time > > > into 44 bits for seconds, and 20 bits for microseconds. That's more than > > > enough modification resolution, and also pushes things to past year > > > 500,000 AD. Versioning the inode would cover this easily. > > > > Ugh. But possible...
> > I agree it's ugly, but it has the advantage that it doesn't grow the > on-disk inode. A lot of flks have designs on the remaining 64 bits free. > :-) Well, so long as we can resolve the issue for a long, long time; I plan on being around to have to put up with the bugs, if I can wrangle it... 8-). Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Aug 17 7:18: 2 1999 Delivered-To: freebsd-fs@freebsd.org Received: from sv01.cet.co.jp (sv01.cet.co.jp [210.171.56.2]) by hub.freebsd.org (Postfix) with ESMTP id 0C5D9156C4; Tue, 17 Aug 1999 07:17:54 -0700 (PDT) (envelope-from michaelh@cet.co.jp) Received: from localhost (michaelh@localhost) by sv01.cet.co.jp (8.9.3/8.9.3) with SMTP id OAA17678; Tue, 17 Aug 1999 14:18:07 GMT Date: Tue, 17 Aug 1999 23:18:06 +0900 (JST) From: Michael Hancock To: Terry Lambert Cc: wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-Reply-To: <199908170231.TAA08526@usr02.primenet.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > > I'm not familiar with the VFS_default stuff. All the vop_default_desc > > routines in NetBSD point to error routines. > > In FreeBSD, they now point to default routines that are *not* error > routines. This is the problem. I admit the change was very well > intentioned, since it made the code a hell of a lot more readable, > but choosing between readable and additional function, I take function > over form (I think the way I would have "fixed" the readability is by > making the operations that result in the descriptor set for a mounted > FS instance be both discrete, and named for their specific function). 
As I recall most of FBSD's default routines are also error routines, if the exceptions were a problem it would be trivial to fix. I think fixing resource allocation/deallocation for things like vnodes, cnbufs, and locks is a higher priority for now. There are examples such as in detached threading where it might make sense for the detached child to be responsible for releasing resources allocated to it by the parent, but in stacking this model is very messy and unnatural. This is why the purpose of VOP_ABORTOP appears to be to release cnbufs but this is really just an ugly side effect. With stacking the code that allocates should be the code that deallocates. Substitute "code" with "layer" to be more correct. I fixed a lot of the vnode and locking cases, unfortunately the ones that remain are probably ugly cases where you have to reacquire locks that had to be unlocked somewhere in the executing layer. See VOP_RENAME for an example. Compare the number of WILLRELEs in vnode_if.src in FreeBSD and NetBSD, ideally there'd be none.
Regards, Mike Hancock To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Aug 17 9:20:30 1999 Delivered-To: freebsd-fs@freebsd.org Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17]) by hub.freebsd.org (Postfix) with ESMTP id 9B5BE1503B; Tue, 17 Aug 1999 09:20:27 -0700 (PDT) (envelope-from wrstuden@marcy.nas.nasa.gov) Received: from localhost (wrstuden@localhost) by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id JAA08787; Tue, 17 Aug 1999 09:20:29 -0700 (PDT) Date: Tue, 17 Aug 1999 09:20:29 -0700 (PDT) From: Bill Studenmund To: Michael Hancock Cc: Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Tue, 17 Aug 1999, Michael Hancock wrote: > As I recall most of FBSD's default routines are also error routines, if > the exceptions were a problem it would would be trivial to fix. > > I think fixing resource allocation/deallocation for things like vnodes, > cnbufs, and locks are a higher priority for now. There are examples such > as in detached threading where it might make sense for the detached child > to be responsible for releasing resources allocated to it by the parent, > but in stacking this model is very messy and unnatural. This is why the > purpose of VOP_ABORTOP appears to be to release cnbufs but this is really > just an ugly side effect. With stacking the code that allocates should be > the code that deallocates. Substitute, "code" with "layer" to be more > correct. > > I fixed a lot of the vnode and locking cases, unfortunately the ones that > remain are probably ugly cases where you have to reacquire locks that had > to be unlocked somewhere in the executing layer. See VOP_RENAME for an > example. 
Compare the number of WILLRELEs in vnode_if.src in FreeBSD and > NetBSD, ideally there'd be none. I've compared the two, and making the NetBSD number match the FreeBSD number is one of my goals. :-) Any suggestions, or just plod&fix? Take care, Bill To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Aug 17 9:59:43 1999 Delivered-To: freebsd-fs@freebsd.org Received: from sv01.cet.co.jp (sv01.cet.co.jp [210.171.56.2]) by hub.freebsd.org (Postfix) with ESMTP id 998E414F32; Tue, 17 Aug 1999 09:59:33 -0700 (PDT) (envelope-from michaelh@cet.co.jp) Received: from localhost (michaelh@localhost) by sv01.cet.co.jp (8.9.3/8.9.3) with SMTP id RAA18271; Tue, 17 Aug 1999 17:00:02 GMT Date: Wed, 18 Aug 1999 02:00:02 +0900 (JST) From: Michael Hancock To: Bill Studenmund Cc: Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Tue, 17 Aug 1999, Bill Studenmund wrote: > I've compared the two, and making the NetBSD number match the FreeBSD > number is one of my goals. :-) > > Any suggestions, or just plod&fix? It can be very cumbersome tracking down references being bumped by vref/VREF and other operations. Among the uncompleted operations are VOPs that pre-release the returned vpp to the caller. I think in VOP_MKNOD this was done as a convenience and you might have to add code to handle device vp aliases correctly. Just remember the rule, the allocating layer must be the layer that deallocates. 
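[Editor's sketch, not from the thread.] The rule above, that the allocating layer must be the layer that deallocates, can be illustrated with a minimal user-space sketch. Everything here is hypothetical: struct pathbuf merely stands in for a path component buffer such as cn_pnbuf, and upper_create()/lower_create() are invented names, not real VOPs. The point is only that the lower layer borrows the buffer and never frees it, on success or on failure.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a path component buffer (think cn_pnbuf). */
struct pathbuf {
    char *name;
};

/* Number of buffers currently allocated; only here to make leaks visible. */
static int outstanding;

static int pathbuf_alloc(struct pathbuf *pb, const char *name)
{
    size_t n = strlen(name) + 1;

    pb->name = malloc(n);
    if (pb->name == NULL)
        return -1;
    memcpy(pb->name, name, n);
    outstanding++;
    return 0;
}

static void pathbuf_free(struct pathbuf *pb)
{
    assert(pb->name != NULL);
    free(pb->name);
    pb->name = NULL;
    outstanding--;
}

/* Lower layer: borrows the buffer and never releases it, even on error. */
static int lower_create(const struct pathbuf *pb)
{
    return pb->name[0] == '\0' ? -1 : 0;   /* reject empty names */
}

/* Upper layer: it allocated the buffer, so it frees it on every path. */
static int upper_create(const char *name)
{
    struct pathbuf pb;
    int error;

    if (pathbuf_alloc(&pb, name) != 0)
        return -1;
    error = lower_create(&pb);
    pathbuf_free(&pb);
    return error;
}
```

With this shape there is no need for an abort-style call whose real job is cleaning up after another layer: whether lower_create() succeeds or fails, the count of outstanding buffers returns to zero when upper_create() returns.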
Regards, Mike To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Aug 17 13:45:48 1999 Delivered-To: freebsd-fs@freebsd.org Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17]) by hub.freebsd.org (Postfix) with ESMTP id CDD94157FD; Tue, 17 Aug 1999 13:44:24 -0700 (PDT) (envelope-from wrstuden@marcy.nas.nasa.gov) Received: from localhost (wrstuden@localhost) by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id NAA27035; Tue, 17 Aug 1999 13:44:34 -0700 (PDT) Date: Tue, 17 Aug 1999 13:44:34 -0700 (PDT) From: Bill Studenmund Reply-To: Bill Studenmund To: Terry Lambert Cc: Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-Reply-To: <199908170231.TAA08526@usr02.primenet.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Tue, 17 Aug 1999, Terry Lambert wrote: > > > > > 2. Advisory locks are hung off private backing objects. > > I'm not sure. The struct lock * is only used by layered filesystems, so > > they can keep track both of the underlying vnode lock, and if needed their > > own vnode lock. For advisory locks, would we want to keep track both of > > locks on our layer and the layer below? Don't we want either one or the > > other? i.e. layers bypass to the one below, or deal with it all > > themselves. > > I think you want the lock on the intermediate layer: basically, on > every vnode that has data associated with it that is unique to a > layer. Let's not forget, also, that you can expose a layer into > the namespace in one place, and expose it covered under another > layer, at another. If you locked down to the backing object, then > the only issue you would be left with is one or more intermediate > backing objects. Right. 
That exported struct lock * makes locking down to the lowest-level file easy - you just feed it to the lock manager, and you're locking the same lock the lowest level fs uses. You then lock all vnodes stacked over this one at the same time. Otherwise, you just call VOP_LOCK below and then lock yourself. > For a layer with an intermediate backing object, I'm prepared to > declare it "special", and proxy the operation down to any inferior > backing object (e.g. a union FS that adds files from two FS's > together, rather than just directory entry lists). I think such > layers are the exception, not the rule. Actually isn't the only problem when you have vnode fan-in (union FS)? i.e. a plain compressing layer should not introduce vnode locking problems. > I think that export policies are the realm of /etc/exports. > > The problem with each FS implementing its own policy, is that this > is another place that copyinstr() gets called, when it shouldn't. Well, my thought was that, like with current code, most every fs would just call vfs_export() when it's presented an export operation. But by retaining the option of having the fs do its own thing, we can support different export semantics if desired. > Right. The "covering" operation is not the same as the "marking as > covered" operation. Both need to be at the higher level. > Not really. Julian Elischer had code that mounted a /devfs under > / automatically, before the user was ever allowed to see /. As a > result, the FS that you were left with was indistinguishable from > what I describe. > > The only real difference is that, as a translucent mount over /devfs, > the one I describe would be capable of implementing persistent changes > to the /devfs, as whiteouts. I don't think this is really that > desirable, but some people won't accept a devfs that doesn't have > traditional persistence semantics (e.g. "chmod" vs. modifying a > well known kernel data structure as an administrative operation).
That wouldn't be hard to do. :-) > I guess the other difference is that you don't have to worry about > large minor numbers when you are bringing up a new platform via > NFS from an old platform that can't support large minors in its FS > at all. ;-). True. :-) > I would resolve this by passing a standard option to the mount code > in user space. For root mounts, a vnode is passed down. For other > mounts, the vnode is parsed and passed if the option is specified. Or maybe add a field to vfsops. This info says what the mount call will expect (I want a block device, a regular file, a directory, etc), so it fits. :-) Also, if we leave it to userland, what happens if someone writes a program which calls sys_mount with something the fs doesn't expect? :-) > I think that you will only be able to find rare examples of FS's > that don't take device names as arguments. But for those, you > don't specify the option, and it gets "NULL", and whatever local > options you specify. I agree I can't see a leaf fs not taking a device node. But layered fs's certainly will want something else. :-) > The point is that, for FS's that can be both root and sub-root, > the mount code doesn't have to make the decision, it can be punted > to higher level code, in one place, where the code can be centrally > maintained and kept from getting "stale" when things change out > from under it. True. And with good comments we can catch the times when the centrally located code changes & breaks an assumption made by the fs. :-) > > Except for a minor buglet with device nodes, stacking works in NetBSD at > > present. :-) > > Have you tried Heidemann's student's stacking layers? There is one > encryption, and one per-file compression with namespace hiding, that > I think it would be hard pressed to keep up with. But I'll give it > the benefit of the doubt. 8-). Nope.
The problem is that while stacking (null, umap, and overlay fs's) works, we don't have the coherency issues worked out so that upper layers can cache data. i.e. so that the lower fs knows it has to ask the upper layers to give pages back. :-) But multiple ls -lR's work fine. :-) > > I agree it's ugly, but it has the advantage that it doesn't grow the > > on-disk inode. A lot of folks have designs on the remaining 64 bits free. > > :-) > > Well, so long as we can resolve the issue for a long, long time; > I plan on being around to have to put up with the bugs, if I can > wrangle it... 8-). :-) I bet by then (559447 AD) we won't be using ffs, so the problem will be moot. :-) Take care, Bill To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Aug 17 14: 6:21 1999 Delivered-To: freebsd-fs@freebsd.org Received: from sv01.cet.co.jp (sv01.cet.co.jp [210.171.56.2]) by hub.freebsd.org (Postfix) with ESMTP id AD45B157DF; Tue, 17 Aug 1999 14:06:01 -0700 (PDT) (envelope-from michaelh@cet.co.jp) Received: from localhost (michaelh@localhost) by sv01.cet.co.jp (8.9.3/8.9.3) with SMTP id VAA18756; Tue, 17 Aug 1999 21:05:08 GMT Date: Wed, 18 Aug 1999 06:05:08 +0900 (JST) From: Michael Hancock To: Bill Studenmund Cc: Terry Lambert , Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > > Have you tried Heidemann's student's stacking layers? There is one > > encryption, and one per-file compression with namespace hiding, that > > I think it would be hard pressed to keep up with. But I'll give it > > the benefit of the doubt. 8-). > > Nope. The problem is that while stacking (null, umap, and overlay fs's) > work, we don't have the coherency issues worked out so that upper layers > can cache data. i.e.
so that the lower fs knows it has to ask the uper > layers to give pages back. :-) But multiple ls -lR's work fine. :-) Interesting, have you read the Heidemann paper that outlines a solution that uses a cache manager? You can probably find it somewhere here, http://www.isi.edu/~johnh/SOFTWARE/UCLA_STACKING/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Aug 17 14:12:15 1999 Delivered-To: freebsd-fs@freebsd.org Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17]) by hub.freebsd.org (Postfix) with ESMTP id 736DA157F4; Tue, 17 Aug 1999 14:12:11 -0700 (PDT) (envelope-from wrstuden@marcy.nas.nasa.gov) Received: from localhost (wrstuden@localhost) by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id OAA29074; Tue, 17 Aug 1999 14:12:22 -0700 (PDT) Date: Tue, 17 Aug 1999 14:12:22 -0700 (PDT) From: Bill Studenmund To: Michael Hancock Cc: Terry Lambert , Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, 18 Aug 1999, Michael Hancock wrote: > Interesting, have you read the Heidemann paper that outlines a solution > that uses a cache manager? > > You can probably find it somewhere here, > http://www.isi.edu/~johnh/SOFTWARE/UCLA_STACKING/ Nope. I've read his dissertation, and his discussion of the lock management inspired the struct lock * work I did for NetBSD (we use the address of the lock, not the vnode, but other than that it's the same). Thanks for the ref! 
Take care, Bill To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Aug 17 14:17:25 1999 Delivered-To: freebsd-fs@freebsd.org Received: from sv01.cet.co.jp (sv01.cet.co.jp [210.171.56.2]) by hub.freebsd.org (Postfix) with ESMTP id 1E41115818; Tue, 17 Aug 1999 14:17:12 -0700 (PDT) (envelope-from michaelh@cet.co.jp) Received: from localhost (michaelh@localhost) by sv01.cet.co.jp (8.9.3/8.9.3) with SMTP id VAA18787; Tue, 17 Aug 1999 21:14:47 GMT Date: Wed, 18 Aug 1999 06:14:47 +0900 (JST) From: Michael Hancock To: Bill Studenmund Cc: Terry Lambert , Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org I forgot I had some old diffs that may be of help, http://www.freebsd.org/~mch/vop1a.diff You'll notice that just about everywhere that I moved vput() to the appropriate layer a path component buffer was also freed in the wrong place. John Dyson put these buffers in zones so the free routine probably looks very different than in netbsd. 
zfree(namei_zone, cnp->cn_pnbuf); - vput(dvp); Regards, Mike To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Aug 17 14:49:45 1999 Delivered-To: freebsd-fs@freebsd.org Received: from gatekeeper.tsc.tdk.com (gatekeeper.tsc.tdk.com [207.113.159.21]) by hub.freebsd.org (Postfix) with ESMTP id 17C23157F3; Tue, 17 Aug 1999 14:49:39 -0700 (PDT) (envelope-from gdonl@tsc.tdk.com) Received: from sunrise.gv.tsc.tdk.com (root@sunrise.gv.tsc.tdk.com [192.168.241.191]) by gatekeeper.tsc.tdk.com (8.8.8/8.8.8) with ESMTP id OAA15932; Tue, 17 Aug 1999 14:48:45 -0700 (PDT) (envelope-from gdonl@tsc.tdk.com) Received: from salsa.gv.tsc.tdk.com (salsa.gv.tsc.tdk.com [192.168.241.194]) by sunrise.gv.tsc.tdk.com (8.8.5/8.8.5) with ESMTP id OAA21621; Tue, 17 Aug 1999 14:48:44 -0700 (PDT) Received: (from gdonl@localhost) by salsa.gv.tsc.tdk.com (8.8.5/8.8.5) id OAA02073; Tue, 17 Aug 1999 14:48:39 -0700 (PDT) From: Don Lewis Message-Id: <199908172148.OAA02073@salsa.gv.tsc.tdk.com> Date: Tue, 17 Aug 1999 14:48:39 -0700 In-Reply-To: Terry Lambert "Re: BSD XFS Port & BSD VFS Rewrite" (Aug 16, 9:18pm) X-Mailer: Mail User's Shell (7.2.6 alpha(3) 7/19/95) To: Terry Lambert , wrstuden@nas.nasa.gov Subject: Re: BSD XFS Port & BSD VFS Rewrite Cc: Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Aug 16, 9:18pm, Terry Lambert wrote: } Subject: Re: BSD XFS Port & BSD VFS Rewrite } > I don't see how the namei recursion method prevents catching // as a } > namespace escape. } } } //apple-resource-fork/intermediate_dir/some_other_dir/file_with_fork } } You can't inherit the fact that you are looking at the resource fork } in the terminal component, ONLY. I don't think this is a good example. How would you access the resource fork of a file relative to the current directory? 
IMHO, the necessary goop needs to go at the end of the path name. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Aug 17 15:46:54 1999 Delivered-To: freebsd-fs@freebsd.org Received: from xylan.com (postal.xylan.com [208.8.0.248]) by hub.freebsd.org (Postfix) with ESMTP id 0B11F14CC7; Tue, 17 Aug 1999 15:46:48 -0700 (PDT) (envelope-from wes@softweyr.com) Received: from mailhub.xylan.com by xylan.com (8.8.7/SMI-SVR4 (xylan-mgw 2.2 [OUT])) id PAA13293; Tue, 17 Aug 1999 15:44:35 -0700 (PDT) Received: from utah.XYLAN.COM by mailhub.xylan.com (SMI-8.6/SMI-SVR4 (mailhub 2.1 [HUB])) id PAA13692; Tue, 17 Aug 1999 15:38:34 -0700 Received: from softweyr.com by utah.XYLAN.COM (SMI-8.6/SMI-SVR4 (xylan utah [SPOOL])) id QAA27793; Tue, 17 Aug 1999 16:44:30 -0600 Message-ID: <37B9E5CE.8E7B8AFD@softweyr.com> Date: Tue, 17 Aug 1999 16:44:30 -0600 From: Wes Peters Organization: Softweyr LLC X-Mailer: Mozilla 4.5 [en] (X11; U; FreeBSD 3.1-RELEASE i386) X-Accept-Language: en MIME-Version: 1.0 To: Don Lewis Cc: Terry Lambert , wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite References: <199908172148.OAA02073@salsa.gv.tsc.tdk.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Don Lewis wrote: > > On Aug 16, 9:18pm, Terry Lambert wrote: > } Subject: Re: BSD XFS Port & BSD VFS Rewrite > > } > I don't see how the namei recursion method prevents catching // as a > } > namespace escape. > } > } > } //apple-resource-fork/intermediate_dir/some_other_dir/file_with_fork > } > } You can't inherit the fact that you are looking at the resource fork > } in the terminal component, ONLY. > > I don't think this is a good example. How would you access the resource > fork of a file relative to the current directory? 
IMHO, the necessary > goop needs to go at the end of the path name. Pick a separator character that nobody in their right mind would use in a file path. "\" strikes me as a good candidate. ;^) -- "Where am I, and what am I doing in this handbasket?" Wes Peters Softweyr LLC http://softweyr.com/ wes@softweyr.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 5:56:24 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (wandering-wizard.cybercity.dk [212.242.41.238]) by hub.freebsd.org (Postfix) with ESMTP id 7B30C14E6B; Wed, 18 Aug 1999 05:56:17 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id JAA00832; Wed, 18 Aug 1999 09:32:52 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Bill Studenmund Cc: Terry Lambert , Alton Matthew , Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-reply-to: Your message of "Mon, 16 Aug 1999 13:48:16 PDT." Date: Wed, 18 Aug 1999 09:32:52 +0200 Message-ID: <830.934961572@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message , Bill Studenmund writes: >On Sat, 14 Aug 1999, Terry Lambert wrote: >> > I am currently conducting a thorough study of the VFS subsystem >> > in preparation for an all-out effort to port SGI's XFS filesystem to >> > FreeBSD 4.x at such time as SGI gives up the code. Matt Dillon >> > has written in hackers- that the VFS subsystem is presently not >> > well understood by any of the active kernel code contributers and >> > that it will be rewritten later this year. This is obviously of great >> > concern to me in this port. >> >> It is of great concern to me that a rewrite, apparently because of >> non-understanding, is taking place at all. > >That concerns me too. 
Many aspects of the 4.4 vnode interface were there >for specific reasons. Even if they were hack solutions, to re-write them >because of a lack of understanding is dangerous as the new code will >likely run into the same problems as before. :-) Matt doesn't represent the FreeBSD project, and even if he rewrites the VFS subsystem so he can understand it, his rewrite would face considerable resistance on its way into FreeBSD. I don't think there is reason to rewrite it, but there certainly are areas that need fixing. >> The use of the "vfs_default" to make unimplemented VOP's >> fall through to code which implements function, while well >> intentioned, is misguided. I beg to differ. The only difference is that we pass through multiple layers before we hit the bottom of the stack. There is no loss of functionality but significant gain of clarity and modularity. Adding a new VOP entails the same thing as it has always done. >> 3. The filesystem itself is broken for Y2038 >> >> The space which was historically reserved for the Y2038 fix >> (a 64 bit time_t) was absconded with for subsecond resolution. >> >> This change should be reverted, and fsck modified to re-zero >> the values, given a specific argument. That would break make(1) on contemporary machines. >One other suggestion I've heard is to split the 64 bits we have for time >into 44 bits for seconds, and 20 bits for microseconds. That's more than >enough modification resolution, and also pushes things to past year >500,000 AD. Versioning the inode would cover this easily. This would be misguided, and given the current speed of evolution lead to other problems far before 2038. Both struct timespec and struct timeval are major mistakes, they make arithmetic on timestamps an expensive operation. Timestamps should be stored as integers using a fixed-point notation, for instance 64 bits with 32-bit fractional seconds (the NTP timestamp), or in the future 128/48.
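[Editor's sketch, not from the thread.] To make the fixed-point suggestion concrete, here is a minimal sketch of a 64-bit timestamp with 32 bits of whole seconds and 32 bits of binary fraction. The type name fxtime_t and the helper names are invented for illustration; nothing here is an existing kernel interface. It shows why arithmetic gets cheap: a difference is one integer subtraction, with no separate borrow between a seconds field and a microseconds field as struct timeval needs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical 32.32 fixed-point timestamp: the upper 32 bits hold whole
 * seconds, the lower 32 bits hold a binary fraction of a second.
 */
typedef uint64_t fxtime_t;

/* Build a timestamp from seconds and microseconds (usec < 1000000). */
static fxtime_t fx_from_parts(uint32_t sec, uint32_t usec)
{
    uint64_t frac;

    assert(usec < 1000000u);
    /* Scale microseconds to units of 2^-32 seconds, rounding down. */
    frac = ((uint64_t)usec << 32) / 1000000u;
    return ((uint64_t)sec << 32) | frac;
}

/* Whole-seconds part. */
static uint32_t fx_seconds(fxtime_t t)
{
    return (uint32_t)(t >> 32);
}

/* Fractional part converted back to microseconds, rounding down. */
static uint32_t fx_usec(fxtime_t t)
{
    return (uint32_t)(((t & 0xffffffffu) * 1000000u) >> 32);
}

/* Signed difference a - b in units of 2^-32 seconds: a single subtract. */
static int64_t fx_diff(fxtime_t a, fxtime_t b)
{
    return (int64_t)(a - b);
}
```

For example, fx_from_parts(1, 500000) is 0x180000000, since half a second is exactly 2^31 fractional units; subtracting it from fx_from_parts(2, 0) gives 0x80000000, i.e. half a second, without any field-by-field carry handling.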
Extending from 64 to 128bits would be a cheap shift and increased precision and range could go hand in hand. If we don't want to extend the size of the timestamps before 2038, (and we should not only look at filesystems here), then the correct fix will be to move the epoch and use the inode version to mark this fact. -- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far! To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 10:19: 3 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132]) by hub.freebsd.org (Postfix) with ESMTP id C52FF14C80; Wed, 18 Aug 1999 10:18:58 -0700 (PDT) (envelope-from tlambert@usr02.primenet.com) Received: (from daemon@localhost) by smtp02.primenet.com (8.8.8/8.8.8) id KAA01747; Wed, 18 Aug 1999 10:16:59 -0700 (MST) Received: from usr02.primenet.com(206.165.6.202) via SMTP by smtp02.primenet.com, id smtpd001627; Wed Aug 18 10:16:51 1999 Received: (from tlambert@localhost) by usr02.primenet.com (8.8.5/8.8.5) id KAA12220; Wed, 18 Aug 1999 10:16:46 -0700 (MST) From: Terry Lambert Message-Id: <199908181716.KAA12220@usr02.primenet.com> Subject: Re: BSD XFS Port & BSD VFS Rewrite To: michaelh@cet.co.jp (Michael Hancock) Date: Wed, 18 Aug 1999 17:16:46 +0000 (GMT) Cc: tlambert@primenet.com, wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG In-Reply-To: from "Michael Hancock" at Aug 17, 99 11:18:06 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > > > I'm not familiar with the VFS_default stuff. All the vop_default_desc > > > routines in NetBSD point to error routines. 
> > > > In FreeBSD, they now point to default routines that are *not* error > > routines. This is the problem. I admit the change was very well > > intentioned, since it made the code a hell of a lot more readable, > > but choosing between readable and additional function, I take function > > over form (I think the way I would have "fixed" the readability is by > > making the operations that result in the descriptor set for a mounted > > FS instance be both discrete, and named for their specific function). > > As I recall most of FBSD's default routines are also error routines, if > the exceptions were a problem it would would be trivial to fix. You would have to de-collapse several VOP lists that have been pre-collapsed. The pre-collapse is also an issue for stacking, since the collapse is supposed to be late bound to the stacking operation itself. This lets you revisit it later when you need to add a new VOP into the system, so that there's a NULL pointer in the VOP slot for older FS's, in case you stack on top of them. This is particularly true of an FS stacked on an FS stacked on a proxy layer. > I think fixing resource allocation/deallocation for things like vnodes, > cnbufs, and locks are a higher priority for now. There are examples such > as in detached threading where it might make sense for the detached child > to be responsible for releasing resources allocated to it by the parent, > but in stacking this model is very messy and unnatural. This is why the > purpose of VOP_ABORTOP appears to be to release cnbufs but this is really > just an ugly side effect. With stacking the code that allocates should be > the code that deallocates. Substitute, "code" with "layer" to be more > correct. Yes. That's actually maintenance, not rewrite, and I think it's very important to address. I'm rather pleased with the way the NFS stuff has turned out (so far), and I was the one calling for a return to first principles (i.e. a rewrite from the specification). 
> I fixed a lot of the vnode and locking cases, unfortunately the ones that > remain are probably ugly cases where you have to reacquire locks that had > to be unlocked somewhere in the executing layer. See VOP_RENAME for an > example. Compare the number of WILLRELEs in vnode_if.src in FreeBSD and > NetBSD, ideally there'd be none. The way I handled this in the rename case on my hacking box was by adding a flag to the namei() call. You could call this flag the same as WILLRELE, but it had inverse semantics. Really, this is another issue of reflexivity being absent from an interface. You really don't want asymmetric interfaces (VOP_LOCK is an example, in many cases, based on internal use in the FFS). Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 10:27:25 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131]) by hub.freebsd.org (Postfix) with ESMTP id 58B2B14C80; Wed, 18 Aug 1999 10:27:20 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id TAA01171; Wed, 18 Aug 1999 19:24:04 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Terry Lambert Cc: michaelh@cet.co.jp (Michael Hancock), wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-reply-to: Your message of "Wed, 18 Aug 1999 17:16:46 -0000." 
<199908181716.KAA12220@usr02.primenet.com> Date: Wed, 18 Aug 1999 19:24:04 +0200 Message-ID: <1169.934997044@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message <199908181716.KAA12220@usr02.primenet.com>, Terry Lambert writes: >> > > I'm not familiar with the VFS_default stuff. All the vop_default_desc >> > > routines in NetBSD point to error routines. >> > >> > In FreeBSD, they now point to default routines that are *not* error >> > routines. This is the problem. I admit the change was very well >> > intentioned, since it made the code a hell of a lot more readable, >> > but choosing between readable and additional function, I take function >> > over form (I think the way I would have "fixed" the readability is by >> > making the operations that result in the descriptor set for a mounted >> > FS instance be both discrete, and named for their specific function). >> >> As I recall most of FBSD's default routines are also error routines, if >> the exceptions were a problem it would would be trivial to fix. > >You would have to de-collapse several VOP lists that have been >pre-collapsed. You are talking gibberish here. Please show code where this is a problem. -- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far! 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 10:31: 5 1999 Delivered-To: freebsd-fs@freebsd.org Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17]) by hub.freebsd.org (Postfix) with ESMTP id CAB8914F2D; Wed, 18 Aug 1999 10:31:02 -0700 (PDT) (envelope-from wrstuden@marcy.nas.nasa.gov) Received: from localhost (wrstuden@localhost) by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id KAA16496; Wed, 18 Aug 1999 10:30:39 -0700 (PDT) Date: Wed, 18 Aug 1999 10:30:39 -0700 (PDT) From: Bill Studenmund To: Poul-Henning Kamp Cc: Terry Lambert , Alton Matthew , Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-Reply-To: <830.934961572@critter.freebsd.dk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, 18 Aug 1999, Poul-Henning Kamp wrote: > In message , Bill > Studenmund writes: > >On Sat, 14 Aug 1999, Terry Lambert wrote: > > Matt doesn't represent the FreeBSD project, and even if he rewrites > the VFS subsystem so he can understand it, his rewrite would face > considerable resistance on its way into FreeBSD. I don't think > there is reason to rewrite it, but there certainly are areas > that need fixing. Whew! That's reasuring. I agree there are things which need fixing. It'd be nice if both NetBSD and FreeBSD could fix things in the same way. > >> The use of the "vfs_default" to make unimplemented VOP's > >> fall through to code which implements function, while well > >> intentioned, is misguided. > > I beg to differ. The only difference is that we pass through > multiple layers before we hit the bottom of the stack. There is > no loss of functionality but significant gain of clarity and > modularity. If I understood the issue, it is that the leaf fs's (the bottom ones) would use a default routine for non-error functionality. 
I think Terry's point (which I agree with) was that a leaf fs's default
routine should only return errors.

> >> 3. The filesystem itself is broken for Y2038

> >One other suggestion I've heard is to split the 64 bits we have for time
> >into 44 bits for seconds, and 20 bits for microseconds. That's more than
> >enough modification resolution, and also pushes things to past year
> >500,000 AD. Versioning the inode would cover this easily.
>
> This would be misguided, and given the current speed of evolution
> would lead to other problems far before 2038.
>
> Both struct timespec and struct timeval are major mistakes: they
> make arithmetic on timestamps an expensive operation.  Timestamps
> should be stored as integers using a fixed-point notation, for
> instance 64 bits with 32-bit fractional seconds (the NTP timestamp),
> or in the future 128/48.

I like that idea. One thing I should probably mention is that I'm not
suggesting we ever do arithmetic on the 44/20 number, just that we store
it that way. struct inode would contain time fields in whatever format
the host prefers, with the 44/20 stuff only being in struct dinode.
Converting from 44/20 would only happen on initial read. Math would
happen on the host format version. :-)

If time structures go to 64/32 fixed-point math, then my suggestion can
be re-phrased as storing 44.20 worth of that number in the on-disk inode.

> Extending from 64 to 128bits would be a cheap shift and increased
> precision and range could go hand in hand.

I doubt we need more than 64 bit times. 2^63 seconds works out to
292,279,025,208 years, or 292 (American) billion years. Current theories
put the age of the universe at I think 12 to 16 billion years. So 64-bit
signed times in seconds will cover from before the big bang to way past
any time we'll be caring about.
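
A minimal sketch of the 44/20 on-disk split described above, assuming
unsigned seconds since the epoch in the upper 44 bits and microseconds
in the lower 20 bits. The names di_time_pack, di_time_unpack and
di_time_range_years are hypothetical; nothing here is taken from an
actual struct dinode. Conversion happens only at the pack and unpack
boundary, and all arithmetic would be done on the host format.

```c
#include <stdint.h>

#define DI_SEC_BITS   44
#define DI_USEC_BITS  20
#define DI_USEC_MASK  ((UINT64_C(1) << DI_USEC_BITS) - 1)

/* Pack seconds and microseconds into one 64-bit on-disk field.
 * Returns 0 on success, -1 if either value does not fit. */
static int di_time_pack(uint64_t sec, uint32_t usec, uint64_t *out)
{
    if ((sec >> DI_SEC_BITS) != 0 || usec >= 1000000)
        return -1;
    *out = (sec << DI_USEC_BITS) | usec;
    return 0;
}

/* Split the on-disk field back into seconds and microseconds. */
static void di_time_unpack(uint64_t packed, uint64_t *sec, uint32_t *usec)
{
    *sec = packed >> DI_USEC_BITS;
    *usec = (uint32_t)(packed & DI_USEC_MASK);
}

/* Whole years covered by 44 bits of seconds, using 365.25-day years. */
static uint64_t di_time_range_years(void)
{
    return (UINT64_C(1) << DI_SEC_BITS) / UINT64_C(31557600);
}
```

Twenty bits are enough for microseconds because 2^20 = 1,048,576 is
larger than 1,000,000, and 2^44 seconds is a little over 557,000 years,
which is consistent with the "past year 500,000 AD" estimate.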
:-) Take care, Bill To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 10:36:24 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131]) by hub.freebsd.org (Postfix) with ESMTP id 6636614C90; Wed, 18 Aug 1999 10:36:14 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id TAA01242; Wed, 18 Aug 1999 19:36:22 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Bill Studenmund Cc: Terry Lambert , Alton Matthew , Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-reply-to: Your message of "Wed, 18 Aug 1999 10:30:39 PDT." Date: Wed, 18 Aug 1999 19:36:22 +0200 Message-ID: <1240.934997782@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message , Bill Studenmund writes: >Whew! That's reasuring. I agree there are things which need fixing. It'd >be nice if both NetBSD and FreeBSD could fix things in the same way. Well, >that< still remains to be seen... >> >> The use of the "vfs_default" to make unimplemented VOP's >> >> fall through to code which implements function, while well >> >> intentioned, is misguided. >> >> I beg to differ. The only difference is that we pass through >> multiple layers before we hit the bottom of the stack. There is >> no loss of functionality but significant gain of clarity and >> modularity. > >If I understood the issue, it is that the leaf fs's (the bottom ones) >would use a default routine for non-error functionality. I think Terry's >point (which I agree with) was that a leaf fs's default routine should >only return errors. I beg to differ. It is far more likely, in my mind, that you will want to handle a currently existing, unimplemented VOP than add a new one. 
Using the default for >all< unimplemented VOPs makes this possible, using the same logic which makes adding a VOP possible. Go back and review the diffs from when I did this, and my other argument why this is a good idea should be obvious. >I doubt we need more than 64 bit times. 2^63 seconds works out to >292,279,025,208 years, or 292 (american) billion years. Current theories >put the age of the universe at I think 12 to 16 billion years. So 64-bit >signed times in seconds will cover from before the big bang to way past >any time we'll be caring about. :-) But we cannot do time in seconds resolution, we need to resolve at least the cpu clock frequency, which right now is approaching 1GHz (30bit!) -- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far! To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 10:55:58 1999 Delivered-To: freebsd-fs@freebsd.org Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17]) by hub.freebsd.org (Postfix) with ESMTP id F1C6C14D15; Wed, 18 Aug 1999 10:55:54 -0700 (PDT) (envelope-from wrstuden@marcy.nas.nasa.gov) Received: from localhost (wrstuden@localhost) by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id KAA19003; Wed, 18 Aug 1999 10:56:27 -0700 (PDT) Date: Wed, 18 Aug 1999 10:56:27 -0700 (PDT) From: Bill Studenmund To: Poul-Henning Kamp Cc: Terry Lambert , Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-Reply-To: <1240.934997782@critter.freebsd.dk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, 18 Aug 1999, Poul-Henning Kamp wrote: > In message , Bill Studenmund writes: > > >Whew! That's reasuring. I agree there are things which need fixing. 
It'd > >be nice if both NetBSD and FreeBSD could fix things in the same way. > > Well, >that< still remains to be seen... :-) > >I doubt we need more than 64 bit times. 2^63 seconds works out to > >292,279,025,208 years, or 292 (american) billion years. Current theories > >put the age of the universe at I think 12 to 16 billion years. So 64-bit > >signed times in seconds will cover from before the big bang to way past > >any time we'll be caring about. :-) I was unclear. I was refering to the seconds side of things. Sub-second resolution would need other bits. Take care, Bill To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 11: 6:14 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131]) by hub.freebsd.org (Postfix) with ESMTP id 0FF5C14F5D; Wed, 18 Aug 1999 11:06:09 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id UAA01364; Wed, 18 Aug 1999 20:04:49 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Bill Studenmund Cc: Terry Lambert , Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-reply-to: Your message of "Wed, 18 Aug 1999 10:56:27 PDT." Date: Wed, 18 Aug 1999 20:04:49 +0200 Message-ID: <1362.934999489@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message , Bill Studenmund writes: >> >I doubt we need more than 64 bit times. 2^63 seconds works out to >> >292,279,025,208 years, or 292 (american) billion years. Current theories >> >put the age of the universe at I think 12 to 16 billion years. So 64-bit >> >signed times in seconds will cover from before the big bang to way past >> >any time we'll be caring about. :-) > >I was unclear. I was refering to the seconds side of things. 
Sub-second >resolution would need other bits. Yes, but we need subsecond in the filesystems. Think about make(1) on a blinding fast machine... -- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far! To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 11: 8: 2 1999 Delivered-To: freebsd-fs@freebsd.org Received: from alpo.whistle.com (alpo.whistle.com [207.76.204.38]) by hub.freebsd.org (Postfix) with ESMTP id 13FD81505D; Wed, 18 Aug 1999 11:07:53 -0700 (PDT) (envelope-from julian@whistle.com) Received: from current1.whistle.com (current1.whistle.com [207.76.205.22]) by alpo.whistle.com (8.9.1a/8.9.1) with SMTP id LAA09356; Wed, 18 Aug 1999 11:00:47 -0700 (PDT) Date: Wed, 18 Aug 1999 11:01:58 -0700 (PDT) From: Julian Elischer To: Poul-Henning Kamp Cc: Bill Studenmund , Terry Lambert , Alton Matthew , Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-Reply-To: <830.934961572@critter.freebsd.dk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, 18 Aug 1999, Poul-Henning Kamp wrote: > Matt doesn't represent the FreeBSD project, and even if he rewrites > the VFS subsystem so he can understand it, his rewrite would face > considerable resistance on its way into FreeBSD. I don't think > there is reason to rewrite it, but there certainly are areas > that need fixing. You are misinformed as far as I know.. From discussions I saw, th main architect of a VFS rewrite would be Kirk, and Matt would be acting as Kirk's right-hand-man. > > >> The use of the "vfs_default" to make unimplemented VOP's > >> fall through to code which implements function, while well > >> intentioned, is misguided. > > I beg to differ. 
The only difference is that we pass through > multiple layers before we hit the bottom of the stack. There is > no loss of functionality but significant gain of clarity and > modularity. Well I believe that Kirk considers them misguided too, but he stated that he wasn't going to remove them without serious thought about the alternatives. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 11:16: 1 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131]) by hub.freebsd.org (Postfix) with ESMTP id 68E981505D; Wed, 18 Aug 1999 11:15:54 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id UAA01443; Wed, 18 Aug 1999 20:15:59 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Julian Elischer Cc: Bill Studenmund , Terry Lambert , Alton Matthew , Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-reply-to: Your message of "Wed, 18 Aug 1999 11:01:58 PDT." Date: Wed, 18 Aug 1999 20:15:59 +0200 Message-ID: <1441.935000159@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message , Julian Elischer writes: >On Wed, 18 Aug 1999, Poul-Henning Kamp wrote: > >> Matt doesn't represent the FreeBSD project, and even if he rewrites >> the VFS subsystem so he can understand it, his rewrite would face >> considerable resistance on its way into FreeBSD. I don't think >> there is reason to rewrite it, but there certainly are areas >> that need fixing. > >You are misinformed as far as I know.. From discussions I saw, th >main architect of a VFS rewrite would be Kirk, and Matt would be acting as >Kirk's right-hand-man. I bet that Matt and Kirk uses "rewrite" for two very different concepts. The resulting reviews will be equally different. 
>> >> The use of the "vfs_default" to make unimplemented VOP's >> >> fall through to code which implements function, while well >> >> intentioned, is misguided. >> >> I beg to differ. The only difference is that we pass through >> multiple layers before we hit the bottom of the stack. There is >> no loss of functionality but significant gain of clarity and >> modularity. > >Well I believe that Kirk considers them misguided too, but he stated that >he wasn't going to remove them without serious thought about the alternatives. I'll be more than ready to discuss this with Kirk. -- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far! To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 11:20:35 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134]) by hub.freebsd.org (Postfix) with ESMTP id C51B81595D; Wed, 18 Aug 1999 11:20:18 -0700 (PDT) (envelope-from tlambert@usr02.primenet.com) Received: (from daemon@localhost) by smtp04.primenet.com (8.9.3/8.9.3) id LAA04794; Wed, 18 Aug 1999 11:19:40 -0700 (MST) Received: from usr02.primenet.com(206.165.6.202) via SMTP by smtp04.primenet.com, id smtpdAAAFFaOvj; Wed Aug 18 11:19:33 1999 Received: (from tlambert@localhost) by usr02.primenet.com (8.8.5/8.8.5) id LAA14096; Wed, 18 Aug 1999 11:19:43 -0700 (MST) From: Terry Lambert Message-Id: <199908181819.LAA14096@usr02.primenet.com> Subject: Re: BSD XFS Port & BSD VFS Rewrite To: wrstuden@nas.nasa.gov Date: Wed, 18 Aug 1999 18:19:42 +0000 (GMT) Cc: tlambert@primenet.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG In-Reply-To: from "Bill Studenmund" at Aug 17, 99 01:44:34 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: 
owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > > > > > > 2. Advisory locks are hung off private backing objects. > > > I'm not sure. The struct lock * is only used by layered filesystems, so > > > they can keep track both of the underlying vnode lock, and if needed their > > > own vnode lock. For advisory locks, would we want to keep track both of > > > locks on our layer and the layer below? Don't we want either one or the > > > other? i.e. layers bypass to the one below, or deal with it all > > > themselves. > > > > I think you want the lock on the intermediate layer: basically, on > > every vnode that has data associated with it that is unique to a > > layer. Let's not forget, also, that you can expose a layer into > > the namespace in one place, and expose it covered under another > > layer, at another. If you locked down to the backing object, then > > the only issue you would be left with is one or more intermediate > > backing objects. > > Right. That exported struct lock * makes locking down to the lowest-level > file easy - you just feed it to the lock manager, and you're locking the > same lock the lowest level fs uses. You then lock all vnodes stacked over > this one at the same time. Otherwise, you just call VOP_LOCK below and > then lock yourself. I think this defeats the purpose of the stacking architecture; I think that if you look at an unadulterated NULLFS, you'll see what I mean. Intermediate FS's should not trap VOP's that are not applicable to them. One of the purposes of doing a VOP_LOCK on intermediate vnodes that aren't backing objects is to deal with the global vnode pool management. I'd really like FS's to own their vnode pools, but even without that, you don't need the locking, since you only need to flush data on vnodes that are backing objects. 
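
The idea that only vnodes acting as backing objects need to take a lock
can be sketched with a toy model. Everything here is hypothetical
(struct toy_layer, toy_stack_lock, toy_stack_unlock); it is a
single-threaded illustration of the argument, not the lock manager or
VOP_LOCK. It uses the same five layers as the table that follows.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_layer {
    const char *name;
    bool exposed;       /* visible in the filesystem namespace */
    bool has_backing;   /* vnode at this layer has a backing object */
    bool locked;
};

/* The example stack, ordered from top to bottom. */
static void toy_stack_init(struct toy_layer s[5])
{
    s[0] = (struct toy_layer){ "top",            true,  false, false };
    s[1] = (struct toy_layer){ "intermediate_1", false, false, false };
    s[2] = (struct toy_layer){ "intermediate_2", false, true,  false };
    s[3] = (struct toy_layer){ "intermediate_3", true,  false, false };
    s[4] = (struct toy_layer){ "bottom",         false, true,  false };
}

/* Lock from layer `start` down to the bottom. Only layers with a
 * backing object take a lock. On conflict, undo and return EBUSY. */
static int toy_stack_lock(struct toy_layer *s, size_t n, size_t start)
{
    for (size_t i = start; i < n; i++) {
        if (!s[i].has_backing)
            continue;
        if (s[i].locked) {
            /* Release what this call acquired so far. */
            for (size_t j = start; j < i; j++)
                if (s[j].has_backing)
                    s[j].locked = false;
            return EBUSY;
        }
        s[i].locked = true;
    }
    return 0;
}

/* Release the locks taken by a successful toy_stack_lock(start). */
static void toy_stack_unlock(struct toy_layer *s, size_t n, size_t start)
{
    for (size_t i = start; i < n; i++)
        if (s[i].has_backing)
            s[i].locked = false;
}
```

Locking through "top" marks only intermediate_2 and bottom. A second
attempt entering at intermediate_3 then fails, not because
intermediate_3 holds a lock, but because bottom does.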
If we look at a stack of FS's with intermediate exposure into the
namespace, then it's clear that the issue is really only applicable
to objects that act as a backing store:

----------------------  ----------------------  --------------------
FS                      Exposed in hierarchy    Backing object
----------------------  ----------------------  --------------------
top                     yes                     no
intermediate_1          no                      no
intermediate_2          no                      yes
intermediate_3          yes                     no
bottom                  no                      yes
----------------------  ----------------------  --------------------

So when we lock "top", we only lock in intermediate_2 and in bottom.

Then we attempt to lock in intermediate_3, but it fails: not because
there is a lock on the vnode in intermediate_3, but because there is
a lock in bottom.

It's unnecessary to lock the vnodes in the intermediate path, or
even at the exposure level, unless they are vnodes that have an
associated backing store.

The need to lock in intermediate_2 exists because it is a translation
layer or a namespace escape.  It deals with compression, or it deals
with file-as-a-directory folding, or it deals with file-hiding
(perhaps for a quota file), etc.  If it didn't, it wouldn't need
backing store (and therefore wouldn't need to be locked).

> > For a layer with an intermediate backing object, I'm prepared to
> > declare it "special", and proxy the operation down to any inferior
> > backing object (e.g. a union FS that adds files from two FS's
> > together, rather than just directory entry lists).  I think such
> > layers are the exception, not the rule.
>
> Actually isn't the only problem when you have vnode fan-in (union FS)?
> i.e. a plain compressing layer should not introduce vnode locking
> problems.

If it's a block compression layer, it will.  Also a translation layer;
consider a pure Unicode system that wants to remotely mount an FS
from a legacy system.  To do this, it needs to expand the pages from
the legacy system [only it can, since the legacy system doesn't know
about Unicode] in a 2:1 ratio.
Now consider doing a byte-range lock
on a file on such a system.  To propagate the lock, you have to do
an arithmetic conversion at the translation layer.  This gets worse
if the lower end FS is exposed in the namespace as well.

You could make the same arguments for other types of translation or
namespace escapes.

> > I think that export policies are the realm of /etc/exports.
> >
> > The problem with each FS implementing its own policy, is that this
> > is another place that copyinstr() gets called, when it shouldn't.
>
> Well, my thought was that, like with current code, most every fs would
> just call vfs_export() when it's presented an export operation. But by
> retaining the option of having the fs do its own thing, we can support
> different export semantics if desired.

I think this bears down on whether the NFS server VFS consumer is
allowed access to the VFS stack at the particular intermediate
layer.  I think this is really an administrative policy decision,
and not an option for the VFS.

I think it would be bad if a given VFS could refuse to participate
in a stacking operation because it didn't like who was stacking.

If we insist on the ability for a VFS to refuse stacking, then we
should generalize the idea, such that an intermediate VFS could
refuse exposure into the filesystem namespace accessible to users.

Consider the case of a VFS without quota support, stacked under a
VFS layer that provided quota support by hiding a file in the top
level directory ("quota") and then folding the directory closed by
rerooting in a subdirectory of the top level directory ("root/").

It's reasonable to assume that most admins that want to enforce
quotas would *not* want the possibility of exposing the VFS without
quota support in the user accessible namespace.

Should the VFS without quotas refuse such exposure?  I think the
answer is "no", and that it is an administrative control issue,
not a VFS's preference issue.
Administrators enforce this by
protecting the path to exposure points, or by mounting stacks over
top of exposure points, which results in the exposure being hidden
under another mount.

Using the QUOTAFS example, you mount the FS to be quota-enforced on
/home, and then you mount the QUOTAFS over top of it, and have it
cover "/home" itself, hiding the underlying FS from exposure.

> > I would resolve this by passing a standard option to the mount code
> > in user space.  For root mounts, a vnode is passed down.  For other
> > mounts, the vnode is parsed and passed if the option is specified.
>
> Or maybe add a field to vfsops. This info says what the mount call will
> expect (I want a block device, a regular file, a directory, etc), so it
> fits. :-)

This is actually an elegant solution to the problem.  Much of the
time, we don't consider data interfaces when they are appropriate
because of their widespread use in inappropriate ways (e.g. "ps").

> Also, if we leave it to userland, what happens if someone writes a
> program which calls sys_mount with something the fs doesn't expect. :-)

Well, that gets to another grail of mine: when a device containing
a filesystem "arrives", I believe it should trigger a mount into the
list of mounted filesystems.  I don't necessarily mean that it should
also be exported into the filesystem hierarchy at that point (but
it's an option, using the "last mounted on" information).

> > I think that you will only be able to find rare examples of FS's
> > that don't take device names as arguments.  But for those, you
> > don't specify the option, and it gets "NULL", and whatever local
> > options you specify.
>
> I agree I can't see a leaf fs not taking a device node. But layered
> fs's certainly will want something else. :-)

I think they want a vnode of an already mounted FS.  The trick is to
enforce the "already mounted" part of that.  I'm comfortable with
doing this by saying "it's not already mounted until you can look up
a vnode on it".
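
The "add a field to vfsops" suggestion quoted above could look roughly
like the following sketch. All names (toy_vfsops, toy_mnt_arg,
toy_vtype, toy_mount_check) are hypothetical and are not part of any
real vfsops; the point is only that each filesystem declares what kind
of object it expects, and one central routine enforces it.

```c
#include <errno.h>

/* What a filesystem says it wants as its mount argument. */
enum toy_mnt_arg {
    TOY_MNT_NONE,       /* takes no object (argument is "NULL") */
    TOY_MNT_BLKDEV,     /* leaf fs on a block device */
    TOY_MNT_REGFILE,    /* fs stored in a regular file */
    TOY_MNT_DIR         /* layered fs over an already mounted tree */
};

/* Type of the vnode the caller actually resolved. */
enum toy_vtype { TOY_VNON, TOY_VBLK, TOY_VREG, TOY_VDIR };

struct toy_vfsops {
    const char *name;
    enum toy_mnt_arg wants;     /* the proposed data field */
};

/* Central check, done once in common mount code rather than per fs.
 * Returns 0 if the supplied vnode type matches, else an errno value. */
static int toy_mount_check(const struct toy_vfsops *ops, enum toy_vtype got)
{
    switch (ops->wants) {
    case TOY_MNT_NONE:
        return got == TOY_VNON ? 0 : EINVAL;
    case TOY_MNT_BLKDEV:
        return got == TOY_VBLK ? 0 : ENOTBLK;
    case TOY_MNT_REGFILE:
        return got == TOY_VREG ? 0 : EINVAL;
    case TOY_MNT_DIR:
        return got == TOY_VDIR ? 0 : ENOTDIR;
    }
    return EINVAL;
}
```

A userland program that hands the wrong kind of object to mount would
then be rejected in one place, before the filesystem's own mount code
runs.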
> > The point is that, for FS's that can be both root and sub-root, > > the mount code doesn't have to make the decision, it can be punted > > to higher level code, in one place, where the code can be centrally > > maintained and kept from getting "stale" when things change out > > from under it. > > True. > > And with good comments we can catch the times when the centrally located > code changes & brakes an assumption made by the fs. :-) 8-). > > > Except for a minor buglet with device nodes, stacking works in NetBSD at > > > present. :-) > > > > Have you tried Heidemann's student's stacking layers? There is one > > encryption, and one per-file compression with namespace hiding, that > > I think it would be hard pressed to keep up with. But I'll give it > > the benefit of the doubt. 8-). > > Nope. The problem is that while stacking (null, umap, and overlay fs's) > work, we don't have the coherency issues worked out so that upper layers > can cache data. i.e. so that the lower fs knows it has to ask the uper > layers to give pages back. :-) But multiple ls -lR's work fine. :-) With UVM in NetBSD, this is (supposedly) not an issue. You could actually think of it this way, as well: only FS's that contain vnodes that provide backing should implement VOP_GETPAGES and VOP_PUTPAGES, and all I/O should be done through paging. > > > I agree it's ugly, but it has the advantage that it doesn't grow the > > > on-disk inode. A lot of flks have designs on the remaining 64 bits free. > > > :-) > > > > Well, so long as we can resolve the issue for a long, long time; > > I plan on being around to have to put up with the bugs, if I can > > wrangle it... 8-). > > :-) > > I bet by then (559447 AD) we won't be using ffs, so the problem will be > moot. :-) Unless I'm the curator of a computer museum... 8-). Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 11:23:21 1999 Delivered-To: freebsd-fs@freebsd.org Received: from apollo.backplane.com (apollo.backplane.com [209.157.86.2]) by hub.freebsd.org (Postfix) with ESMTP id 7878C1597A; Wed, 18 Aug 1999 11:23:10 -0700 (PDT) (envelope-from dillon@apollo.backplane.com) Received: (from dillon@localhost) by apollo.backplane.com (8.9.3/8.9.1) id LAA48344; Wed, 18 Aug 1999 11:22:20 -0700 (PDT) (envelope-from dillon) Date: Wed, 18 Aug 1999 11:22:20 -0700 (PDT) From: Matthew Dillon Message-Id: <199908181822.LAA48344@apollo.backplane.com> To: Julian Elischer Cc: Poul-Henning Kamp , Bill Studenmund , Terry Lambert , Alton Matthew , Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite References: Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org :On Wed, 18 Aug 1999, Poul-Henning Kamp wrote: : :> Matt doesn't represent the FreeBSD project, and even if he rewrites :> the VFS subsystem so he can understand it, his rewrite would face :> considerable resistance on its way into FreeBSD. I don't think :> there is reason to rewrite it, but there certainly are areas :> that need fixing. : :You are misinformed as far as I know.. From discussions I saw, th :main architect of a VFS rewrite would be Kirk, and Matt would be acting as :Kirk's right-hand-man. Yes, this is correct. Kirk is going to be the main architect. I have been heavily involved and will continue to be. :> >> The use of the "vfs_default" to make unimplemented VOP's : :> I beg to differ. The only difference is that we pass through :> multiple layers before we hit the bottom of the stack. There is :... :Well I believe that Kirk considers them misguided too, but he stated that :he wasn't going to remove them without serious thought about the alternatives. The vfs op callout layering has not been on the radar screen. 
There are much too many other more serious problems. I really doubt that any changes will be made to this piece any time in the next year or even two, if at all. The main items on the radar screen are related to buffer management (struct buf stuff. For example, preventing VM blockages due to pages being wired by write I/O's), VFS locking and reference count issues (for example, namei lookups, blockages in the pager and syncer due to vnode locks held by blocked processes, etc...), and interactions between VFS and VM (for example: moving away from VOP_READ/VOP_WRITE and moving more towards a getpages/putpages model). None of the items have been set in stone yet. We're waiting for Kirk to get back from vacation and get back into the groove. -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 11:48: 4 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132]) by hub.freebsd.org (Postfix) with ESMTP id 9761B14F8A; Wed, 18 Aug 1999 11:47:54 -0700 (PDT) (envelope-from tlambert@usr02.primenet.com) Received: (from daemon@localhost) by smtp02.primenet.com (8.8.8/8.8.8) id LAA08775; Wed, 18 Aug 1999 11:48:06 -0700 (MST) Received: from usr02.primenet.com(206.165.6.202) via SMTP by smtp02.primenet.com, id smtpd008709; Wed Aug 18 11:48:03 1999 Received: (from tlambert@localhost) by usr02.primenet.com (8.8.5/8.8.5) id LAA14960; Wed, 18 Aug 1999 11:48:01 -0700 (MST) From: Terry Lambert Message-Id: <199908181848.LAA14960@usr02.primenet.com> Subject: Re: BSD XFS Port & BSD VFS Rewrite To: phk@critter.freebsd.dk (Poul-Henning Kamp) Date: Wed, 18 Aug 1999 18:48:01 +0000 (GMT) Cc: tlambert@primenet.com, michaelh@cet.co.jp, wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG In-Reply-To: <1169.934997044@critter.freebsd.dk> from "Poul-Henning Kamp" at Aug 18, 99 07:24:04 pm X-Mailer: ELM [version 
2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> >> > > I'm not familiar with the VFS_default stuff. All the vop_default_desc
> >> > > routines in NetBSD point to error routines.
> >> >
> >> > In FreeBSD, they now point to default routines that are *not* error
> >> > routines. This is the problem. I admit the change was very well
> >> > intentioned, since it made the code a hell of a lot more readable,
> >> > but choosing between readable and additional function, I take function
> >> > over form (I think the way I would have "fixed" the readability is by
> >> > making the operations that result in the descriptor set for a mounted
> >> > FS instance be both discrete, and named for their specific function).
> >>
> >> As I recall most of FBSD's default routines are also error routines, if
> >> the exceptions were a problem it would be trivial to fix.
> >
> >You would have to de-collapse several VOP lists that have been
> >pre-collapsed.
>
> You are talking gibberish here. Please show code where this is
> a problem.

When you write a proxy stacking layer, such as John Heidemann's
network proxy stacking layer (an NFS alternative), VOP's which
would normally be handled by vfs_default have to be handled on
the other end of the proxy, instead, in the same way that they
would be handled by the vfs_default stuff.

Some VOP's, like advisory locking, need both local assertion and
remote proxy of the VOP to avoid introducing race windows.

The result of this is that, if you rely on the vfs_default stuff,
then you can't proxy those VOP's into a different address space,
either on another machine, or to a user space VFS stacking layer
development environment.

This is the same problem that embedding VM references directly
into any FS causes, and that vm_object_t aliases would exacerbate.
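[Archive note] The argument above can be modelled in a few lines. This is only a toy sketch in Python, not FreeBSD code: `dispatch`, `proxy_call`, `local_default`, `remote_side` and `NotSupported` are all hypothetical names, and the string "VOP_ADVLOCK" is just a label standing in for the advisory-locking VOP mentioned in the message. It illustrates the claim as stated (a non-error default satisfies the call locally, so a proxy layer that relies on "unimplemented" failing never gets to forward it); it is not a verdict on whether that claim holds for the real kernel.

```python
# Toy model only; every name here is hypothetical, nothing is kernel API.

class NotSupported(Exception):
    """Stands in for the error a missing (NULL) entry point would produce."""


def dispatch(vector, default, vop, *args):
    """Call the layer's own entry for `vop`, else the default if one exists."""
    handler = vector.get(vop)
    if handler is not None:
        return handler(*args)
    if default is not None:
        # A non-error default silently answers for the layer.
        return default(vop, *args)
    raise NotSupported(vop)


def local_default(vop, *args):
    return ("handled locally by default", vop)


def remote_side(vop, *args):
    return ("handled on far side of proxy", vop)


def proxy_call(vector, default, vop, *args):
    """A proxy layer that forwards whatever cannot be satisfied locally."""
    try:
        return dispatch(vector, default, vop, *args)
    except NotSupported:
        return remote_side(vop, *args)


proxy_vector = {}  # the proxy layer implements no VOPs itself

with_default = proxy_call(proxy_vector, local_default, "VOP_ADVLOCK")
without_default = proxy_call(proxy_vector, None, "VOP_ADVLOCK")
print(with_default)
print(without_default)
```

With a default installed the call never leaves the local machine; with an empty slot it is forwarded. A proxy could of course install its own handler for every VOP it knows about, which this sketch does not model.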
John has, in the past, sent me a number of stacking layers done by various people, with the requirement that I not redistribute them, as they are not what he would consider to be properly representative of finished work. Since John himself did the network proxy, you could perhaps get him to send you a copy, so you could have direct access to code where this was a problem. Make sure that the system you are talking to over the proxy is not assumed to be a FreeBSD system (e.g. don't assume that the vfs_default stuff exists on the other end of the proxy, or that it would be functional). Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 11:57:27 1999 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131]) by hub.freebsd.org (Postfix) with ESMTP id DC46A158DC; Wed, 18 Aug 1999 11:57:22 -0700 (PDT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id UAA01776; Wed, 18 Aug 1999 20:56:58 +0200 (CEST) (envelope-from phk@critter.freebsd.dk) To: Terry Lambert Cc: michaelh@cet.co.jp, wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-reply-to: Your message of "Wed, 18 Aug 1999 18:48:01 -0000." <199908181848.LAA14960@usr02.primenet.com> Date: Wed, 18 Aug 1999 20:56:58 +0200 Message-ID: <1774.935002618@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message <199908181848.LAA14960@usr02.primenet.com>, Terry Lambert writes: >> >You would have to de-collapse several VOP lists that have been >> >pre-collapsed. >> >> You are talking gibberish here. 
Please show code where this is >> a problem. > >When you write a proxy stacking layer, such as John Heidemann's >network proxy stacking layer (an NFS alternative), VOP's which >would normally be handled by vfs_default have to be handled on >the other end of the proxy, instead, in the same way that they >would be handled by the vfs_default stuff. And what prevents you from taking over the default op ? -- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far! To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 11:59:16 1999 Delivered-To: freebsd-fs@freebsd.org Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17]) by hub.freebsd.org (Postfix) with ESMTP id 18C031513F; Wed, 18 Aug 1999 11:59:06 -0700 (PDT) (envelope-from wrstuden@marcy.nas.nasa.gov) Received: from localhost (wrstuden@localhost) by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id LAA23974; Wed, 18 Aug 1999 11:59:01 -0700 (PDT) Date: Wed, 18 Aug 1999 11:59:01 -0700 (PDT) From: Bill Studenmund Reply-To: Bill Studenmund To: Terry Lambert Cc: Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-Reply-To: <199908181819.LAA14096@usr02.primenet.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, 18 Aug 1999, Terry Lambert wrote: > > Right. That exported struct lock * makes locking down to the lowest-level > > file easy - you just feed it to the lock manager, and you're locking the > > same lock the lowest level fs uses. You then lock all vnodes stacked over > > this one at the same time. Otherwise, you just call VOP_LOCK below and > > then lock yourself. 
>
> I think this defeats the purpose of the stacking architecture; I
> think that if you look at an unadulterated NULLFS, you'll see what I
> mean.

Please be more precise. I have looked at an unadulterated NULLFS, and
found it lacking. I don't see how this change breaks stacking.

> Intermediate FS's should not trap VOP's that are not applicable
> to them.

True. But VOP_LOCK is applicable to layered fs's. :-)

> One of the purposes of doing a VOP_LOCK on intermediate vnodes
> that aren't backing objects is to deal with the global vnode
> pool management. I'd really like FS's to own their vnode pools,
> but even without that, you don't need the locking, since you
> only need to flush data on vnodes that are backing objects.
>
> If we look at a stack of FS's with intermediate exposure into the
> namespace, then it's clear that the issue is really only applicable
> to objects that act as a backing store:
>
>
> ----------------------  ----------------------  --------------------
> FS                      Exposed in hierarchy    Backing object
> ----------------------  ----------------------  --------------------
> top                     yes                     no
> intermediate_1          no                      no
> intermediate_2          no                      yes
> intermediate_3          yes                     no
> bottom                  no                      yes
> ----------------------  ----------------------  --------------------
>
> So when we lock "top", we only lock in intermediate_2 and in bottom.

No. One of the things Heidemann notes in his dissertation is that to
prevent deadlock, you have to lock the whole stack of vnodes at once, not
bit by bit.

i.e. there is one lock for the whole thing.

> > Actually isn't the only problem when you have vnode fan-in (union FS)?
> > i.e. a plain compressing layer should not introduce vnode locking
> > problems.
>
> If it's a block compression layer, it will. Also a translation layer;
> consider a pure Unicode system that wants to remotely mount an FS
> from a legacy system.
> To do this, it needs to expand the pages from
> the legacy system [only it can, since the legacy system doesn't know
> about Unicode] in a 2:1 ratio. Now consider doing a byte-range lock
> on a file on such a system. To propagate the lock, you have to do
> an arithmetic conversion at the translation layer. This gets worse
> if the lower end FS is exposed in the namespace as well.

Wait. byte-range locking is different from vnode locking. I've been
talking about vnode locking, which is different from the byte-range
locking you're discussing above.

> > Nope. The problem is that while stacking (null, umap, and overlay fs's)
> > work, we don't have the coherency issues worked out so that upper layers
> > can cache data. i.e. so that the lower fs knows it has to ask the upper
> > layers to give pages back. :-) But multiple ls -lR's work fine. :-)
>
> With UVM in NetBSD, this is (supposedly) not an issue.

UBC. UVM is a new memory manager. UBC unifies the buffer cache with the VM
system.

> You could actually think of it this way, as well: only FS's that
> contain vnodes that provide backing should implement VOP_GETPAGES
> and VOP_PUTPAGES, and all I/O should be done through paging.

Right. That's part of UBC.
:-) Take care, Bill To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 12: 8:34 1999 Delivered-To: freebsd-fs@freebsd.org Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17]) by hub.freebsd.org (Postfix) with ESMTP id CABFA15870; Wed, 18 Aug 1999 12:08:30 -0700 (PDT) (envelope-from wrstuden@marcy.nas.nasa.gov) Received: from localhost (wrstuden@localhost) by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id MAA25525; Wed, 18 Aug 1999 12:08:22 -0700 (PDT) Date: Wed, 18 Aug 1999 12:08:22 -0700 (PDT) From: Bill Studenmund Reply-To: Bill Studenmund To: Poul-Henning Kamp Cc: Terry Lambert , Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite In-Reply-To: <1362.934999489@critter.freebsd.dk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, 18 Aug 1999, Poul-Henning Kamp wrote: > Yes, but we need subsecond in the filesystems. Think about make(1) on > a blinding fast machine... Oh yes, I realize that. :-) It's just that I thought you were at one point suggesting having 128 bits to the left of the decimal point (128 bits worth of seconds). I was trying to say that'd be a bit much. 
:-) Take care, Bill To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 13:44:48 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135]) by hub.freebsd.org (Postfix) with ESMTP id EA4F215D9F; Wed, 18 Aug 1999 13:43:47 -0700 (PDT) (envelope-from tlambert@usr06.primenet.com) Received: (from daemon@localhost) by smtp05.primenet.com (8.9.1/8.9.1) id NAA113206; Wed, 18 Aug 1999 13:43:27 -0700 Received: from usr06.primenet.com(206.165.6.206) via SMTP by smtp05.primenet.com, id smtpdDReHUa; Wed Aug 18 13:43:17 1999 Received: (from tlambert@localhost) by usr06.primenet.com (8.8.5/8.8.5) id NAA28863; Wed, 18 Aug 1999 13:43:14 -0700 (MST) From: Terry Lambert Message-Id: <199908182043.NAA28863@usr06.primenet.com> Subject: Re: BSD XFS Port & BSD VFS Rewrite To: wrstuden@nas.nasa.gov Date: Wed, 18 Aug 1999 20:43:14 +0000 (GMT) Cc: tlambert@primenet.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG In-Reply-To: from "Bill Studenmund" at Aug 18, 99 11:59:01 am X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > > > Right. That exported struct lock * makes locking down to the lowest-level > > > file easy - you just feed it to the lock manager, and you're locking the > > > same lock the lowest level fs uses. You then lock all vnodes stacked over > > > this one at the same time. Otherwise, you just call VOP_LOCK below and > > > then lock yourself. > > > > I think this defeats the purpose of the stacking architecture; I > > think that if you look at an unadulterated NULLFS, you'll see what I > > mean. > > Please be more precise. I have looked at an unadulterated NULLFS, and > found it lacking. I don't see how this change breaks stacking. OK, there's the concept of "collapse" of stacking layer. 
This was first introduced in the Rosenthal stacking vnode architecture,
out of Sun Microsystems.

Rosenthal was concerned that, when you stack 500 putatively "null"
NULLFS's, the amount of function call overhead should not increase
proportionally.

To resolve this, he introduced the concept of a "collapsed" VFS stack.
That is, the array of function vectors is actually a one dimensional
projection of a two dimensional stack, and the visible entry for each
VOP comes from the first layer, on the way down the stack, that
implements that VOP. We can visualize this like so:

                  VOPs
    Layer |  VOP1   VOP2   VOP3   VOP4   VOP5   VOP6   ...
    -----------------------------------------------------------
    L1       -      -      -      imp    -      -      ...
    L2       imp    -      -      imp    -      imp    ...
    L3       imp    -      -      imp    imp    -      ...
    L4       -      -      imp    -      -      -      ...
    L5       imp    imp    imp    imp    imp    imp    ...

The resulting "collapsed" array of entry vectors looks like so:

    L2VOP1  L5VOP2  L4VOP3  L1VOP4  L3VOP5  L2VOP6  ...

There is an implicit assumption here that most stacks will not be
randomly staggered like this example. The idea behind this assumption
is that additional layers will most frequently add functionality,
rather than replacing it.

Heidemann carried this idea over into his architecture, to be employed
at the point that a VFS stack is first instanced.

The BSD4.4 implementation of this is partially flawed. There is an
implicit implementation of this for the UFS/FFS "stack" of layers: the
VOP descriptor array exported by the combination of the two is hard
coded as a precollapsed stack. This is actually antithetical to the
design.

The second place this flaw is apparent is in the inability to add
VOP's into an existing kernel, since the entry point vector is a fixed
size, and is not expanded implicitly by the act of adding a VFS layer
containing a new VOP.
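[Archive note] The collapse rule described above (for each VOP, take the first layer on the way down the stack that implements it) can be sketched in a few lines. This is a toy model in Python, not kernel code: the layer and VOP labels come from the table in the message, while `STACK`, `VOPS` and `collapse` are hypothetical names chosen for the illustration.

```python
# Toy model of collapsing a VFS stack into a single entry vector.
# Layer names (L1..L5) and VOP names mirror the table in the message;
# the data structures and function name are hypothetical.

VOPS = ["VOP1", "VOP2", "VOP3", "VOP4", "VOP5", "VOP6"]

# Top of the stack first. Each layer lists only the VOPs it implements.
STACK = [
    ("L1", {"VOP4"}),
    ("L2", {"VOP1", "VOP4", "VOP6"}),
    ("L3", {"VOP1", "VOP4", "VOP5"}),
    ("L4", {"VOP3"}),
    ("L5", set(VOPS)),
]


def collapse(stack, vops):
    """Map each VOP to the first layer, walking down, that implements it."""
    vector = {}
    for vop in vops:
        for name, implemented in stack:
            if vop in implemented:
                vector[vop] = name
                break
        else:
            # No layer implements this VOP: leave the slot empty.
            vector[vop] = None
    return vector


collapsed = collapse(STACK, VOPS)
print(" ".join(f"{layer}{vop}" for vop, layer in collapsed.items()))
# prints: L2VOP1 L5VOP2 L4VOP3 L1VOP4 L3VOP5 L2VOP6
```

The walk is done once, when the stack is put together, so a later call through the collapsed vector costs one lookup no matter how many pass-through layers sit in between.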
For the use of non-error vfs_defaults, this is also flawed for proxies,
but not for the consumer of the VFS stack, only for the producer end
on the other side of the proxy, which, although it does not implement a
particular VOP, needs to _NOT_ use the local vfs_default for the VOP,
but instead needs to proxy the VOP over to the other side for remote
processing.

The act of getting a vfs_default VOP after a collapse, instead of
having a NULL entry point that the descriptor call mechanism treats
as a call failure, damages the ability to proxy unknown VOP's.

> > Intermediate FS's should not trap VOP's that are not applicable
> > to them.
>
> True. But VOP_LOCK is applicable to layered fs's. :-)

Only for translation layers that require local backing store. I'm
prepared to make an exception for them, and require that they
explicitly call the VOP in the underlying vnode over which they
are stacked. This is the same compromise that both Rosenthal and
Heidemann consciously chose.

> > One of the purposes of doing a VOP_LOCK on intermediate vnodes
> > that aren't backing objects is to deal with the global vnode
> > pool management. I'd really like FS's to own their vnode pools,
> > but even without that, you don't need the locking, since you
> > only need to flush data on vnodes that are backing objects.
> >
> > If we look at a stack of FS's with intermediate exposure into the
> > namespace, then it's clear that the issue is really only applicable
> > to objects that act as a backing store:
> >
> >
> > ----------------------  ----------------------  --------------------
> > FS                      Exposed in hierarchy    Backing object
> > ----------------------  ----------------------  --------------------
> > top                     yes                     no
> > intermediate_1          no                      no
> > intermediate_2          no                      yes
> > intermediate_3          yes                     no
> > bottom                  no                      yes
> > ----------------------  ----------------------  --------------------
> >
> > So when we lock "top", we only lock in intermediate_2 and in bottom.
>
> No.
> One of the things Heidemann notes in his dissertation is that to
> prevent deadlock, you have to lock the whole stack of vnodes at once, not
> bit by bit.
>
> i.e. there is one lock for the whole thing.

This is not true for a unified VM and buffer cache environment, and a
significant reduction in overhead can be achieved thereby.

Heidemann did his work on SVR4, which does not have a unified VM and
buffer cache. The deadlock discussion in his dissertation is only
applicable to systems where the coherency model is such that each
and every vnode has buffers associated with it. That is, it applies
to vnodes which act as backing store (buffer cache object references).

If you separate the concept, such that you don't have to deal with
vnodes that do not have coherency issues, then you can drastically
reduce the number of coherency operations required (locking is a
coherency operation).

In addition to this, you can effectively obtain what neither the
Rosenthal nor the SVR4 version of the Heidemann stacking framework can
otherwise obtain: intermediate VFS layer NULL VOP call collapse.

The way you obtain this is by caching the vnode of the backing object
in the intermediate layer, and dereferencing it to get at its VOP
vector directly. This means that a functional layer that shadows an
underlying VOP, separated by 1,000 NULLFS layers, does not result in
a 1,000 function call overhead.

> > > Actually isn't the only problem when you have vnode fan-in (union FS)?
> > > i.e. a plain compressing layer should not introduce vnode locking
> > > problems.
> >
> > If it's a block compression layer, it will. Also a translation layer;
> > consider a pure Unicode system that wants to remotely mount an FS
> > from a legacy system. To do this, it needs to expand the pages from
> > the legacy system [only it can, since the legacy system doesn't know
> > about Unicode] in a 2:1 ratio. Now consider doing a byte-range lock
> > on a file on such a system.
> > To propagate the lock, you have to do
> > an arithmetic conversion at the translation layer. This gets worse
> > if the lower end FS is exposed in the namespace as well.
>
> Wait. byte-range locking is different from vnode locking. I've been
> talking about vnode locking, which is different from the byte-range
> locking you're discussing above.

Conceptually, they're not really different at all. You want to apply
an operation against a stack of vnodes, and only involve the relevant
vnodes when you do it.

> > > Nope. The problem is that while stacking (null, umap, and overlay fs's)
> > > work, we don't have the coherency issues worked out so that upper layers
> > > can cache data. i.e. so that the lower fs knows it has to ask the upper
> > > layers to give pages back. :-) But multiple ls -lR's work fine. :-)
> >
> > With UVM in NetBSD, this is (supposedly) not an issue.
>
> UBC. UVM is a new memory manager. UBC unifies the buffer cache with the VM
> system.

I was under the impression that the "U" in "UVM" was for "Unified".

Does NetBSD not have a unified VM and buffer cache? Is the "U" in
"UVM" referring not to buffer cache unification, but to platform
unification?

It was my understanding from John Dyson, who had to work on NetBSD
for NCI, that the new NetBSD stuff actually unified the VM and the
buffer cache.

If this isn't the case, then, yes, you will need to lock all the way
up and down, and eat the copy overhead for the concurrency for the
intermediate vnodes. 8-(.

> > You could actually think of it this way, as well: only FS's that
> > contain vnodes that provide backing should implement VOP_GETPAGES
> > and VOP_PUTPAGES, and all I/O should be done through paging.
>
> Right. That's part of UBC. :-)

Yep.

Again, if NetBSD doesn't have this, it's really important that it
obtain it. 8-(.

Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 14: 2:47 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135]) by hub.freebsd.org (Postfix) with ESMTP id 472E2151E1; Wed, 18 Aug 1999 14:02:05 -0700 (PDT) (envelope-from tlambert@usr06.primenet.com) Received: (from daemon@localhost) by smtp05.primenet.com (8.9.1/8.9.1) id OAA435022; Wed, 18 Aug 1999 14:02:07 -0700 Received: from usr06.primenet.com(206.165.6.206) via SMTP by smtp05.primenet.com, id smtpdaToVMa; Wed Aug 18 14:01:59 1999 Received: (from tlambert@localhost) by usr06.primenet.com (8.8.5/8.8.5) id OAA29646; Wed, 18 Aug 1999 14:01:54 -0700 (MST) From: Terry Lambert Message-Id: <199908182101.OAA29646@usr06.primenet.com> Subject: Re: BSD XFS Port & BSD VFS Rewrite To: phk@critter.freebsd.dk (Poul-Henning Kamp) Date: Wed, 18 Aug 1999 21:01:53 +0000 (GMT) Cc: tlambert@primenet.com, michaelh@cet.co.jp, wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG In-Reply-To: <1774.935002618@critter.freebsd.dk> from "Poul-Henning Kamp" at Aug 18, 99 08:56:58 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > >> >You would have to de-collapse several VOP lists that have been > >> >pre-collapsed. > >> > >> You are talking gibberish here. Please show code where this is > >> a problem. > > > >When you write a proxy stacking layer, such as John Heidemann's > >network proxy stacking layer (an NFS alternative), VOP's which > >would normally be handled by vfs_default have to be handled on > >the other end of the proxy, instead, in the same way that they > >would be handled by the vfs_default stuff. > > And what prevents you from taking over the default op ? 
It needs to be NULL, not taken over.

machine 1               machine 2               machine 3

vfs consumer
upper proxy <---------> lower proxy
                        vfs stacking layer
                        upper proxy <---------> lower proxy
                                                vfs producer

How do I get a VOP, unknown to machine 2, from the vfs consumer
on machine 1 that does know about it, to the vfs producer on
machine 3 that also knows about it?

My understanding is that it is very hard, given vfs_default:

On machine 1, since the upper proxy doesn't know about the VOP, it
wants to locally satisfy it from vfs_default on machine 1. Taking
over the default op doesn't really help me; I have to do surgery to
the in core dispatch vector instance to do the job properly (e.g.
zapping it out, not taking it over).

On machine 2, it is out of range, but still needs to be passed
through the stacking layer, from the lower proxy to the upper
proxy (and the response, back).

Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message

From owner-freebsd-fs Wed Aug 18 14:17:57 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131]) by hub.freebsd.org (Postfix) with ESMTP id 9C92214ED3; Wed, 18 Aug 1999 14:17:30 -0700 (PDT) (envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id XAA02921; Wed, 18 Aug 1999 23:15:49 +0200 (CEST) (envelope-from phk@critter.freebsd.dk)
To: Terry Lambert
Cc: michaelh@cet.co.jp, wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
In-reply-to: Your message of "Wed, 18 Aug 1999 21:01:53 -0000."
<199908182101.OAA29646@usr06.primenet.com> Date: Wed, 18 Aug 1999 23:15:49 +0200 Message-ID: <2919.935010949@critter.freebsd.dk> From: Poul-Henning Kamp Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Terry, It is very fine with this example, but I'm not even going to bother much with it for several reasons, most of which you can find codified in the development rules for X11 which you can find in Scheiflers book. But for the record: your example would get even shorter on the code we had before I started using the default op sensibly because all the layers tended to shunt things they didn't understand to errno rather than pass them through, so in fact my change took us closer to being able to handle the rather lofty example you have here. Once you show me an actual implementation which has a problem with it, I will look at it again, until then, I think pretty much everything else is more important (Scheiflers 1st rule :-) Poul-Henning >> And what prevents you from taking over the default op ? > >It needs to be NULL, not taken over. > > >machine 1 machine2 machine 3 > >vfs consumer >upper proxy <---------> lower proxy > vfs stacking layer > upper proxy <---------> lower proxy > vfs producer > >How do I get a VOP, unknown to machine 2, from the vfs consumer >on machine 1 that does know about it, to the vfs producer on >machine 3 that also knows about it? -- Poul-Henning Kamp FreeBSD coreteam member phk@FreeBSD.ORG "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far! 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 17:18:55 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133]) by hub.freebsd.org (Postfix) with ESMTP id 6F08615987; Wed, 18 Aug 1999 17:18:51 -0700 (PDT) (envelope-from tlambert@usr06.primenet.com) Received: (from daemon@localhost) by smtp03.primenet.com (8.9.3/8.9.3) id RAA11206; Wed, 18 Aug 1999 17:18:41 -0700 (MST) Received: from usr06.primenet.com(206.165.6.206) via SMTP by smtp03.primenet.com, id smtpdAAA4ka42v; Wed Aug 18 17:18:36 1999 Received: (from tlambert@localhost) by usr06.primenet.com (8.8.5/8.8.5) id RAA09816; Wed, 18 Aug 1999 17:18:41 -0700 (MST) From: Terry Lambert Message-Id: <199908190018.RAA09816@usr06.primenet.com> Subject: Re: BSD XFS Port & BSD VFS Rewrite To: phk@critter.freebsd.dk (Poul-Henning Kamp) Date: Thu, 19 Aug 1999 00:18:41 +0000 (GMT) Cc: tlambert@primenet.com, michaelh@cet.co.jp, wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG In-Reply-To: <2919.935010949@critter.freebsd.dk> from "Poul-Henning Kamp" at Aug 18, 99 11:15:49 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > Terry, > > It is very fine with this example, but I'm not even going to bother > much with it for several reasons, most of which you can find codified > in the development rules for X11 which you can find in Scheiflers > book. > > But for the record: your example would get even shorter on > the code we had before I started using the default op sensibly > because all the layers tended to shunt things they didn't > understand to errno rather than pass them through, so in > fact my change took us closer to being able to handle the > rather lofty example you have here. 
> > Once you show me an actual implementation which has a problem > with it, I will look at it again, until then, I think pretty > much everything else is more important (Scheiflers 1st rule :-) > > Poul-Henning That's a fair requirement. I have some of Heidemann's code that runs into the problem, but I don't have any that I can redistribute. Would it be OK if I asked John to send you his code as well, if you will abide with the non-redistribution requirement? I understand the prioritization process, and FWIW, I agree with it, in a resource-starved situation (e.g.g FreeBSD). Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Aug 18 22:41:27 1999 Delivered-To: freebsd-fs@freebsd.org Received: from peach.ocn.ne.jp (peach.ocn.ne.jp [210.145.254.87]) by hub.freebsd.org (Postfix) with ESMTP id 7C26E14F15; Wed, 18 Aug 1999 22:41:17 -0700 (PDT) (envelope-from dcs@newsguy.com) Received: from newsguy.com by peach.ocn.ne.jp (8.9.1a/OCN) id OAA11981; Thu, 19 Aug 1999 14:39:39 +0900 (JST) Message-ID: <37BB88F3.7184305@newsguy.com> Date: Thu, 19 Aug 1999 13:32:51 +0900 From: "Daniel C. Sobral" X-Mailer: Mozilla 4.6 [en] (Win98; I) X-Accept-Language: en,pt-BR,ja MIME-Version: 1.0 To: Terry Lambert Cc: Poul-Henning Kamp , michaelh@cet.co.jp, wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite References: <199908181848.LAA14960@usr02.primenet.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Terry Lambert wrote: > > Make sure that the system you are talking to over the proxy is > not assumed to be a FreeBSD system (e.g. 
don't assume that the > vfs_default stuff exists on the other end of the proxy, or that > it would be functional). Now, Terry, that is ridiculous. One has to assume that both ends play by the same rules. That is not only a reasonably expectation, it's minimum requirement for any protocol to work. -- Daniel C. Sobral (8-DCS) dcs@newsguy.com dcs@freebsd.org - Can I speak to your superior? - There's some religious debate on that question. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 19 7:21:25 1999 Delivered-To: freebsd-fs@freebsd.org Received: from lupo.thebarn.com (x101-182-203.unreg.umn.edu [128.101.182.203]) by hub.freebsd.org (Postfix) with ESMTP id 6599C150F5; Thu, 19 Aug 1999 07:21:15 -0700 (PDT) (envelope-from cattelan@thebarn.com) Received: from thebarn.com ([128.101.182.201]) by lupo.thebarn.com (8.9.3/8.9.1) with ESMTP id BAA86916; Thu, 19 Aug 1999 01:01:15 -0500 (CDT) Message-ID: <37BB9DAB.E7F0FED0@thebarn.com> Date: Thu, 19 Aug 1999 01:01:15 -0500 From: Russell Cattelan X-Mailer: Mozilla 4.61 [en] (X11; I; FreeBSD 4.0-CURRENT i386) X-Accept-Language: en MIME-Version: 1.0 To: "Alton, Matthew" Cc: "'Hackers@FreeBSD.ORG'" , "'fs@FreeBSD.ORG'" Subject: Re: BSD-XFS Update References: <0740CBD1D149D31193EB0008C7C56836EB8B05@STLABCEXG012> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org "Alton, Matthew" wrote: > SGI has released a portion of the XFS source code under the GPL: > > http://oss.sgi.com/projects/xfs/download/ > > the source file is xfs_log.tar.gz. > > Of greater interest at this stage are the documents in: > > http://oss.sgi.com/projects/xfs/design_docs/ > > I am currently researching methods for implementing the 64-bit > syscalls stat64(), fstat64(), lseek64() &etc. delineated in the > SGI design doc _64 Bit File Access_ by Adam Sweeney. 
The xxxx64 calls are no longer an issue: as of IRIX 6.(something, 2 I think) all the standard calls were converted to use 64-bit types directly. I have a better one for you to research: find out whether buffers can be pinned, and if not, what it is going to take to fix that. > > The BSD-XFS port will be made available as a patch to the RELEASE > FreeBSD kernels. Given the size of XFS it might be easier to make FreeBSD a patch to XFS. <- major humor here. :-) :-) > > > Matthew Alton > Computer Services - UNIX Systems Administration > (314)632-6644 matthew.alton@anheuser-busch.com > alton@plantnet.com > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-hackers" in the body of the message -- Russell Cattelan cattelan@thebarn.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 19 7:21:37 1999 Delivered-To: freebsd-fs@freebsd.org Received: from lupo.thebarn.com (x101-182-203.unreg.umn.edu [128.101.182.203]) by hub.freebsd.org (Postfix) with ESMTP id AA72D1511C; Thu, 19 Aug 1999 07:21:15 -0700 (PDT) (envelope-from cattelan@thebarn.com) Received: from thebarn.com ([128.101.182.201]) by lupo.thebarn.com (8.9.3/8.9.1) with ESMTP id AAA86822; Thu, 19 Aug 1999 00:41:29 -0500 (CDT) Message-ID: <37BB9909.53D356FE@thebarn.com> Date: Thu, 19 Aug 1999 00:41:29 -0500 From: Russell Cattelan X-Mailer: Mozilla 4.61 [en] (X11; I; FreeBSD 4.0-CURRENT i386) X-Accept-Language: en MIME-Version: 1.0 To: "Alton, Matthew" Cc: "'Hackers@FreeBSD.ORG'" , "'fs@FreeBSD.ORG'" Subject: Re: BSD XFS Port & BSD VFS Rewrite References: <0740CBD1D149D31193EB0008C7C56836EB8AFC@STLABCEXG012> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Glad to hear somebody is willing to dive into XFS.
Right now I am one of three people working on the XFS-to-Linux port, so I have a pretty good view of what is currently happening. When is it going to be ready? Don't hold your breath. Officially SGI has said by the end of the year; technically... whew, frankly I can't even guess. I would hope that within a month or so we will have the basics of a FS. There are a lot of hurdles to overcome. XFS is a very, very complex file system that relies on some of the more advanced features of IRIX. The buffer cache and chunk cache (chunking buffers together to do large I/O) are two examples that come to mind. SGI is rewriting the buffer cache (calling it the page cache) so that it will be able to support XFS. Chunk cache...? Not sure what we are going to do with that. We have been having several discussions about the best way to "interface". IRIX uses VFS, VNODE, BEHAVIOR, which is similar to the BSDs' interface but of course very IRIX-specific. Linux's vfs/vnode is different from either. Realizing this, a lot of our discussions have been about how to go about making a new interface layer, or modifying an existing one, that might be more "universal", i.e. not specific to IRIX, Linux, BSD, etc. Reading Terry's and Bill's comments, it seems there is a lot of room for improvement. Initially we are trying to make as few changes as possible to XFS to get an initial implementation running on Linux. After we get things running we will start to analyze where the problems exist, and decide at that time what direction to take in terms of interface. I would like any constructive input people have on this matter. I have a pretty good chance of setting design direction. Be warned: SGI at the moment is committed to Linux, and development directions will favor that platform. They are not against other OSes being XFS'atized, but SGI is in the business of selling hardware and solutions based on that hardware, and Linux is one of the OSes they have decided to use for their Intel-based boxes.
Also, as far as the GPL issue goes, get over it! I understand the issues and agree with many of the points. My suggestion: let's find a way to work with the GPL (e.g. the loadable kernel module / softupdates model). If somebody has a very, very good argument for or solution to the licensing debate, let me know and I can present it to the people dealing with the lawyers. The license issue has slowed the release of the actual code more than anything else, and will not be revisited without great pain. > I am currently conducting a thorough study of the VFS subsystem > in preparation for an all-out effort to port SGI's XFS filesystem to > FreeBSD 4.x at such time as SGI gives up the code. Matt Dillon > has written in hackers- that the VFS subsystem is presently not > well understood by any of the active kernel code contributers and > that it will be rewritten later this year. This is obviously of great > concern to me in this port. I greatly appreciate all assistance in > answering the following questions: > > 1) What are the perceived problems with the current VFS? > 2) What options are available to us as remedies? > 3) To what extent will existing FS code require revision in order > to be useful after the rewrite? > 4) Will Chapters 6,7,8 & 9 of "The Design and Implementation of > the 4.4BSD Operating System" still pertain after the rewrite? > 5) How important are questions 3 & 4 in the design of the new > VFS? > > I believe that the VFS is conceptually sound and that the existing > semantics should be strictly retained in the new code. Any new > functionality should be added in the form of entirely new kernel > routines and system calls, or possibly by such means as > converting the existing routines to the vararg format &etc. > > Does anyone know when SGI will release XFS?
> > -- Russell Cattelan cattelan@thebarn.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 19 8:47: 1 1999 Delivered-To: freebsd-fs@freebsd.org Received: from gatewaya.anheuser-busch.com (gatewaya.anheuser-busch.com [151.145.250.252]) by hub.freebsd.org (Postfix) with SMTP id 06922150B0; Thu, 19 Aug 1999 08:46:42 -0700 (PDT) (envelope-from Matthew.Alton@anheuser-busch.com) Received: by gatewaya.anheuser-busch.com; id KAA02280; Thu, 19 Aug 1999 10:47:29 -0500 Received: from stlexggtw002-pozzoli.fw-users.busch.com(151.145.101.130) by gatewaya.anheuser-busch.com via smap (V5.0) id xma002157; Thu, 19 Aug 99 10:46:42 -0500 Received: from stlabcexg006.anheuser-busch.com ([151.145.101.161]) by 151.145.101.130 (Norton AntiVirus for Internet Email Gateways 1.0) ; Thu, 19 Aug 1999 15:44:28 0000 (GMT) Received: by stlabcexg006.anheuser-busch.com with Internet Mail Service (5.5.2448.0) id ; Thu, 19 Aug 1999 10:44:06 -0500 Message-ID: <0740CBD1D149D31193EB0008C7C56836EB8B14@STLABCEXG012> From: "Alton, Matthew" To: "'Russell Cattelan'" Cc: "'Hackers@FreeBSD.ORG'" , "'fs@FreeBSD.ORG'" Subject: RE: BSD XFS Port & BSD VFS Rewrite Date: Thu, 19 Aug 1999 10:44:27 -0500 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2448.0) Content-Type: text/plain; charset="iso-8859-1" Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Do you have access to more of the code than is currently posted on SGI's web page? I am willing to sign an NDA in order to get access to all relevant source. I would like to assist in porting XFS to Linux also. I would very much like to see SGI succeed by using open source software in the commercial realm. As for licensing issues, I am purely agnostic -- I trust that any legal issues can be worked out after the fact by the proper people. 
> -----Original Message----- > From: Russell Cattelan [SMTP:cattelan@thebarn.com] > Sent: Thursday, August 19, 1999 12:41 AM > To: Alton, Matthew > Cc: 'Hackers@FreeBSD.ORG'; 'fs@FreeBSD.ORG' > Subject: Re: BSD XFS Port & BSD VFS Rewrite > > Glad to hear somebody is willing to dive in to XFS. > > > Right now I am one of three people working on the XFS to linux port, so I > have > pretty good view of what is currently happening. > > When is it going to be ready? > Don't hold your breath. Officially SGI has said by the end of the year, > technically... whew > frankly I can't even guess. I would hope within a month or so we will > have the basics of a FS. > > There are a lot of hurtles to overcome. XFS is a very very complex file > system that relies on > some of the more advanced features of IRIX. The buffer cache and chunk > cache (chunking > buffers together to do large IO) are two examples that come to mind. SGI > is rewriting > the buffer cache (calling it the page cache) such that is will be able to > support XFS. > chunk cache... ? not sure what we are going to do with that. > > We have been having several discussions about the best way to > "interface". > IRIX uses VFS,VNODE,BEHAVIOR which is similar to the BSD's interface > but of course very IRIX specific. Linux's vfs/vnode is different from > either. > Realizing this, a lot of our discussions have been around how to go at > making a > new/modify existing interface layer that might be more "universal" > i.e. not irix not linux not bsd not etc.... specific. > > In reading Terry's & Bill's comments seems there is a a lot of room for > improvement. > > Initially we trying to make as few changes as possible to XFS to get an > initial implementation > running on linux. After we get things running we will start to analyze > where the problems exist, > and decide what direction in terms of interface to take at that time. > > I would like any constructive input people have on this matter. 
I have a > pretty good > chance of setting design direction. > Be waned: SGI at the moment is committed to linux, development directions > will favor that platform. > They are not against other OS's being XFS'atized but SGI is in the > business of selling > hardware/solutions based on that hardware and linux one of the OS they > have decided to use for > their intel based boxes. > > Also as far as the GPL issue goes, get over it! I understand the issues > and agree with many > of the points. > My suggestion lets find a way to work with the GPL (i.e. loadable kernel > module / > softupdates model) > If somebody has a very very good argument/solution to the licensing > debate let me > know, I can present it to the people dealing with the lawyers. > The license issue has slowed the release of the actual code more than > anything else, > and will not be revisited again without great pain. > > > > I am currently conducting a thorough study of the VFS subsystem > > in preparation for an all-out effort to port SGI's XFS filesystem to > > FreeBSD 4.x at such time as SGI gives up the code. Matt Dillon > > has written in hackers- that the VFS subsystem is presently not > > well understood by any of the active kernel code contributers and > > that it will be rewritten later this year. This is obviously of great > > concern to me in this port. I greatly appreciate all assistance in > > answering the following questions: > > > > 1) What are the perceived problems with the current VFS? > > 2) What options are available to us as remedies? > > 3) To what extent will existing FS code require revision in order > > to be useful after the rewrite? > > 4) Will Chapters 6,7,8 & 9 of "The Design and Implementation of > > the 4.4BSD Operating System" still pertain after the rewrite? > > 5) How important are questions 3 & 4 in the design of the new > > VFS? > > > > I believe that the VFS is conceptually sound and that the existing > > semantics should be strictly retained in the new code. 
Any new > > functionality should be added in the form of entirely new kernel > > routines and system calls, or possibly by such means as > > converting the existing routines to the vararg format &etc. > > > > Does anyone know when SGI will release XFS? > > > > > > -- > Russell Cattelan > cattelan@thebarn.com > > > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-hackers" in the body of the message To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 19 8:56: 6 1999 Delivered-To: freebsd-fs@freebsd.org Received: from gatewaya.anheuser-busch.com (gatewaya.anheuser-busch.com [151.145.250.252]) by hub.freebsd.org (Postfix) with SMTP id 0B16914C37; Thu, 19 Aug 1999 08:56:00 -0700 (PDT) (envelope-from Matthew.Alton@anheuser-busch.com) Received: by gatewaya.anheuser-busch.com; id KAA04634; Thu, 19 Aug 1999 10:57:39 -0500 Received: from stlexggtw002-pozzoli.fw-users.busch.com(151.145.101.130) by gatewaya.anheuser-busch.com via smap (V5.0) id xma004561; Thu, 19 Aug 99 10:57:18 -0500 Received: from stlabcexg004.anheuser-busch.com ([151.145.101.160]) by 151.145.101.130 (Norton AntiVirus for Internet Email Gateways 1.0) ; Thu, 19 Aug 1999 15:55:03 0000 (GMT) Received: by stlabcexg004.anheuser-busch.com with Internet Mail Service (5.5.2448.0) id ; Thu, 19 Aug 1999 10:54:54 -0500 Message-ID: <0740CBD1D149D31193EB0008C7C56836EB8B15@STLABCEXG012> From: "Alton, Matthew" To: "'Russell Cattelan'" Cc: "'Hackers@FreeBSD.ORG'" , "'fs@FreeBSD.ORG'" Subject: RE: BSD-XFS Update Date: Thu, 19 Aug 1999 10:55:11 -0500 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2448.0) Content-Type: text/plain Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Pinned in the AIX-style "pinned memory" sense? Succinctly, AIX allows userland programs to tag memory pages so as to guarantee that they will not be swapped to backing store. 
Portions of the _KERNEL_ are paged out instead if necessary. I assume that the pinning is of the AIX sort and that it is desirable, if not necessary, for the realtime throughput guarantee policy. Nes pas? > -----Original Message----- > From: Russell Cattelan [SMTP:cattelan@thebarn.com] > Sent: Thursday, August 19, 1999 1:01 AM > To: Alton, Matthew > Cc: 'Hackers@FreeBSD.ORG'; 'fs@FreeBSD.ORG' > Subject: Re: BSD-XFS Update > > "Alton, Matthew" wrote: > > > SGI has released a portion of the XFS source code under the GPL: > > > > http://oss.sgi.com/projects/xfs/download/ > > > > the source file is xfs_log.tar.gz. > > > > Of greater interest at this stage are the documents in: > > > > http://oss.sgi.com/projects/xfs/design_docs/ > > > > I am currently researching methods for implementing the 64-bit > > syscalls stat64(), fstat64(), lseek64() &etc. delineated in the > > SGI design doc _64 Bit File Access_ by Adam Sweeney. > > The xxxx64 calls are no longer an issue as of IRIX 6.(something 2 I > think) all > the standard calls were converted to use 64 bit types directly. > > Have a better one for you to research. > Find out if buffers can be pined? if not what is it going to take to fix > that. > > > > > The BSD-XFS port will be made available as a patch to the RELEASE > > FreeBSD kernels. > > Given the size of XFS it might be easier to make FreeBSD a patch to XFS. > <- major humor here. 
> :-) :-) > > > > > > > Matthew Alton > > Computer Services - UNIX Systems Administration > > (314)632-6644 matthew.alton@anheuser-busch.com > > alton@plantnet.com > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > > with "unsubscribe freebsd-hackers" in the body of the message > > -- > Russell Cattelan > cattelan@thebarn.com > > > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-fs" in the body of the message To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 19 11: 3:35 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134]) by hub.freebsd.org (Postfix) with ESMTP id 4D249159DE; Thu, 19 Aug 1999 11:03:25 -0700 (PDT) (envelope-from tlambert@usr06.primenet.com) Received: (from daemon@localhost) by smtp04.primenet.com (8.9.3/8.9.3) id LAA13024; Thu, 19 Aug 1999 11:02:27 -0700 (MST) Received: from usr06.primenet.com(206.165.6.206) via SMTP by smtp04.primenet.com, id smtpdAAAH8aWyz; Thu Aug 19 11:02:23 1999 Received: (from tlambert@localhost) by usr06.primenet.com (8.8.5/8.8.5) id LAA25563; Thu, 19 Aug 1999 11:02:34 -0700 (MST) From: Terry Lambert Message-Id: <199908191802.LAA25563@usr06.primenet.com> Subject: Re: BSD XFS Port & BSD VFS Rewrite To: dcs@newsguy.com (Daniel C. Sobral) Date: Thu, 19 Aug 1999 18:02:34 +0000 (GMT) Cc: tlambert@primenet.com, phk@critter.freebsd.dk, michaelh@cet.co.jp, wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG In-Reply-To: <37BB88F3.7184305@newsguy.com> from "Daniel C. 
Sobral" at Aug 19, 99 01:32:51 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > Terry Lambert wrote: > > > > Make sure that the system you are talking to over the proxy is > > not assumed to be a FreeBSD system (e.g. don't assume that the > > vfs_default stuff exists on the other end of the proxy, or that > > it would be functional). > > Now, Terry, that is ridiculous. One has to assume that both ends > play by the same rules. That is not only a reasonably expectation, > it's minimum requirement for any protocol to work. That's kind of the point. No other VFS stacking system out there plays by FreeBSD's revamped rules. Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 19 12:18:14 1999 Delivered-To: freebsd-fs@freebsd.org Received: from oceana.nlanr.net (oceana.sdsc.edu [132.249.40.200]) by hub.freebsd.org (Postfix) with ESMTP id DCD94151D0 for ; Thu, 19 Aug 1999 12:18:00 -0700 (PDT) (envelope-from tshansen@oceana.nlanr.net) Received: from localhost (tshansen@localhost) by oceana.nlanr.net (8.8.6/8.8.6) with SMTP id MAA29339; Thu, 19 Aug 1999 12:16:28 -0700 (PDT) Date: Thu, 19 Aug 1999 12:16:28 -0700 (PDT) From: Todd Hansen To: freebsd-fs@freebsd.org Cc: Tony McGregor Subject: turning of filesystem caching for specific filesystems Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org I was wondering if there was some hidden method in the kernel configuration or in sysctl that would allow me to turn off kernel level filesystem cachine for a specific filesystem? 
The reason I want to do this is that I have one very large ccd0c filesystem that is accessed randomly but at a very high frequency (both reads and writes). I also have system disks holding the programs that are run to process the data on the ccd filesystem. The problem is that while running these programs I see about 0.5 MB/s of access to the system disk, even though I am only calling one or two sub-programs. I believe that is because the ccd0c filesystem is used so heavily that it is exhausting the cache. Thanks in advance for your help. -todd To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 19 14:46:43 1999 Delivered-To: freebsd-fs@freebsd.org Received: from alpo.whistle.com (alpo.whistle.com [207.76.204.38]) by hub.freebsd.org (Postfix) with ESMTP id 24FF71535C for ; Thu, 19 Aug 1999 14:46:38 -0700 (PDT) (envelope-from julian@whistle.com) Received: from current1.whistle.com (current1.whistle.com [207.76.205.22]) by alpo.whistle.com (8.9.1a/8.9.1) with SMTP id OAA70240 for ; Thu, 19 Aug 1999 14:44:44 -0700 (PDT) Date: Thu, 19 Aug 1999 14:46:01 -0700 (PDT) From: Julian Elischer To: fs@freebsd.org Subject: BUG in 3.2 fsck! (fwd) Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org FS types.. thoughts? An ex-colleague writes to me: ------------------------------------------- I have created and am testing an fsck version which will make lost+found much larger (until it fills the first indirect disk page) and which has the ability to suppress output for errors fixed by preen (which is not exactly what I proposed before but which achieves the same end with less code and less risk). It isn't really time yet to discuss merging these features into FreeBSD, but I think that day will come. What this mail is really about is a bug in fsck.
I am not currently competent to submit the fix or to even be positive that the bug is current. I think this is a potentially serious bug in the current sources. Are you interested?

BUG IN FSCK: When the 3.2 version of fsck has to create the lost+found directory, it may fail to flag the appropriate inode busy!

Patch: mkdir lost+found in the root directory of all your file systems.

Discussion: fsck allocates only enough space to keep track of the first inodes in each cylinder group. This is clever and good - inode usage tends to occur at the front of the cylinder group and this saves space. Unfortunately, it does not work out well when a directory is created which increases the highest inode number for the cylinder group - the inode usage doesn't get recorded in the right place and the inode will be flagged available during pass 5.

Fix: The code change causes fsck to check the cylinder group allocation when adding an inode and expand the inode list for the cylinder group if necessary.

In inode.c::allocino (near line 605):

    for (ino = request; ino < maxino; ino++)
        if (inoinfo(ino)->ino_state == USTATE)
            break;
    if (ino == maxino)
        return (0);
    inoallocinfo (ino);    /* **** one new line of code. */

In fsck.h, add the prototype for the new function inoallocinfo.

In utility.c (near line 138), replace the function inoinfo with the following:

    static struct inostat unallocated = { USTATE, 0, 0 };

    /*
     * Look up state information for an inode.
     */
    struct inostat *
    inoinfo(inum)
        ino_t inum;
    {
        struct inostatlist *ilp;
        int iloff;

        if (inum > maxino)
            errx(EEXIT, "inoinfo: inumber %d out of range", inum);
        ilp = &inostathead[inum / sblock.fs_ipg];
        iloff = inum % sblock.fs_ipg;
        if (iloff >= ilp->il_numalloced)
            return (&unallocated);
        return (&ilp->il_stat[iloff]);
    }

    /*
     * Make it safe to allocate this inode!
     */
    void
    inoallocinfo (inum)
        ino_t inum;
    {
        struct inostat *info;
        struct inostatlist *ilp;
        unsigned i, iloff;

        if (inum > maxino)
            errx(EEXIT, "inoinfo: inumber %d out of range", inum);
        ilp = &inostathead[inum / sblock.fs_ipg];
        iloff = inum % sblock.fs_ipg;
        if (iloff >= (unsigned)ilp->il_numalloced) {
            info = calloc (iloff + 1, sizeof *info);
            if (info == NULL)
                errx(EEXIT, "cannot alloc %u bytes for inoinfo\n",
                    (unsigned)(sizeof *info * (iloff + 1)));
            memmove (info, ilp->il_stat, ilp->il_numalloced * sizeof *info);
            free(ilp->il_stat);
            ilp->il_stat = info;
            for (i = ilp->il_numalloced; i <= iloff; ++i)
                memmove (info + i, &unallocated, sizeof unallocated);
            ilp->il_numalloced = iloff + 1;
        }
    }

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 19 17: 6:30 1999 Delivered-To: freebsd-fs@freebsd.org Received: from cygnus.rush.net (cygnus.rush.net [209.45.245.133]) by hub.freebsd.org (Postfix) with ESMTP id 4B7411520D for ; Thu, 19 Aug 1999 17:06:24 -0700 (PDT) (envelope-from bright@rush.net) Received: from localhost (bright@localhost) by cygnus.rush.net (8.9.3/8.9.3) with SMTP id UAA02736; Thu, 19 Aug 1999 20:13:25 -0400 (EDT) Date: Thu, 19 Aug 1999 20:13:24 -0400 (EDT) From: Alfred Perlstein To: "Alton, Matthew" Cc: "'Russell Cattelan'" , "'fs@FreeBSD.ORG'" Subject: RE: BSD-XFS Update In-Reply-To: <0740CBD1D149D31193EB0008C7C56836EB8B15@STLABCEXG012> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Thu, 19 Aug 1999, Alton, Matthew wrote: > Pinned in the AIX-style "pinned memory" sense? Succinctly, AIX > allows userland programs to tag memory pages so as to guarantee that > they will not be swapped to backing store. Portions of the _KERNEL_ > are paged out instead if necessary.
> > I assume that the pinning is of the AIX sort and that it is desirable, if > not necessary, for the realtime throughput guarantee policy. Nes pas? man mlock -Alfred To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Aug 19 20:11:20 1999 Delivered-To: freebsd-fs@freebsd.org Received: from chuq.com (w130.z209220044.sjc-ca.dsl.cnc.net [209.220.44.130]) by hub.freebsd.org (Postfix) with ESMTP id 0BCC614DA0; Thu, 19 Aug 1999 20:11:15 -0700 (PDT) (envelope-from chuq@chuq.com) Received: (from chs@localhost) by chuq.com (8.8.8/8.8.8) id UAA02199; Thu, 19 Aug 1999 20:10:58 -0700 (PDT) Date: Thu, 19 Aug 1999 20:10:57 -0700 From: Chuck Silvers To: Terry Lambert Cc: wrstuden@nas.nasa.gov, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite Message-ID: <19990819201057.A2185@chuq.chuq.com> References: <199908182043.NAA28863@usr06.primenet.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <199908182043.NAA28863@usr06.primenet.com>; from Terry Lambert on Wed, Aug 18, 1999 at 08:43:14PM +0000 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wed, Aug 18, 1999 at 08:43:14PM +0000, Terry Lambert wrote: > > > > Nope. The problem is that while stacking (null, umap, and overlay fs's) > > > > work, we don't have the coherency issues worked out so that upper layers > > > > can cache data. i.e. so that the lower fs knows it has to ask the uper > > > > layers to give pages back. :-) But multiple ls -lR's work fine. :-) > > > > > > With UVM in NetBSD, this is (supposedly) not an issue. > > > > UBC. UVM is a new memory manager. UBC unifies the buffer cache with the VM > > system. > > I was under the impression that th "U" in "UVM" was for "Unified". > > Does NetBSD not have a unified VM and buffer cache? is th "U" in > "UVM" referring not to buffer cache unification, but to platform > unification? 
> > It was my understanding from John Dyson, who had to work on NetBSD > for NCI, that the new NetBSD stuff actually unified the VM and the > buffer cache. > > If this isn't the case, then, yes, you will need to lock all the way > up and down, and eat the copy overhead for the concurrency for the > intermediate vnodes. 8-(. netbsd w/UVM currently doesn't have unified caches. that feature is what I named UBC, for "unified buffer cache" (ala DEC's UBC). the U in UVM doesn't actually stand for anything. :-) > > > You could actually think of it this way, as well: only FS's that > > > contain vnodes that provide backing should implement VOP_GETPAGES > > > and VOP_PUTPAGES, and all I/O should be done through paging. > > > > Right. That's part of UBC. :-) > > Yep. Again, if NetBSD doesn't have this, it's really important > that it obtain it. 8-(. I'm workin' on it... it'll go in soon after the branch for the next release is created (ie. it won't be in the next release, but the one after that). -Chuck To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 20 11:16:39 1999 Delivered-To: freebsd-fs@freebsd.org Received: from alpo.whistle.com (alpo.whistle.com [207.76.204.38]) by hub.freebsd.org (Postfix) with ESMTP id 6509B15370 for ; Fri, 20 Aug 1999 11:16:35 -0700 (PDT) (envelope-from julian@whistle.com) Received: from current1.whistle.com (current1.whistle.com [207.76.205.22]) by alpo.whistle.com (8.9.1a/8.9.1) with SMTP id LAA03603 for ; Fri, 20 Aug 1999 11:11:49 -0700 (PDT) Date: Fri, 20 Aug 1999 11:13:13 -0700 (PDT) From: Julian Elischer To: fs@freebsd.org Subject: Re: BUG in 3.2 fsck! (fwd) Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org further discussion... 
---------- Forwarded message ---------- Date: Fri, 20 Aug 1999 02:57:25 -0700 (PDT) From: milt To: julian@whistle.com, milt@vicor-nb.com Cc: cayford@vicor-nb.com, conor@vicor-nb.com, davep@vicor-nb.com, daver@vicor-nb.com, jrh@vicor-nb.com Subject: Re: BUG in 3.2 fsck! HIYA Well, soft updates sure sounds interesting. One of our current problems is that our damn raids don't preserve the disk write order requested by unix. I'm fighting that - maybe soft updates will give me enough ammunition to win that fight. If not, soft updates won't help us! I haven't read the soft updates paper yet - I will, but not tonight. One question: I am not sure how soft updates are intended to interact with fsck. Is it your intention that fsck behave differently only in preen mode? If so, you have misspelled at least one if statement (the LINK COUNT INCREASING test in dir.c::adjust). I am writing this now because I want to let you see my current status for fsck fixes before you install my previous patch. I will repeat all of this in later mail (with test instructions yet) once I figure out what the heck I want to do. What I have right now is:

BUG 1: When lost+found is allocated on a new highest inode number for a cylinder group it ends up without an inoinfo entry and will be flagged available during pass 5. This is the one in my previous mail.

BUG 2: When an orphan directory happens to start with a parent pointer to an inode which will become a newly allocated lost+found, the loop in pass 2 will skip the i_dotdot update because it points to a USTATE inode, but pass 3 will unwind the update which wasn't done because it unwinds i_dotdot for everything it connects! (The inode isn't USTATE anymore because it's now lost+found's inode.)

BUG 3: It has become virtually impossible to learn things from redirected output. Some lines go partially to stderr and partially to stdout with disastrous results even when both stdout and stderr are redirected!
What I am running right now is an fsck that does not mention stderr. (2.2.8's fsck mentions stderr only for fatal setup problems - that works too, but it requires less thought to just eliminate stderr.)

BUG 4: When Milt's new code puts over 32768 files in lost+found it is committing a grave error (di_nlink is a signed, 16 bit quantity). Milt better get his act together before he publishes this. NOTE: there is no problem with allocation or extra passes here. fsck has long been allocating disk pages as it extends lost+found.

NEW FEATURE: a q switch which suppresses output for and questions about things that would be(/are) fixed in preen mode. When q is in effect, preen mode fixes just happen - no notification to the operator and no questions. This allows us to get a screen which shows only the interesting errors. The preen mode problems get fixed quietly and only the serious stuff ends up in the operator's face or on the redirected output file! My original intention was to have preen mode keep running after some errors, but I now understand why you thought that would be hard. This new switch achieves my goal of seeing only the real problems and is lots easier to implement.

DISCUSSION: As you can deduce from my discovery of bug 4, I really am having lots of fun testing all this junk. Current solution to bug 2 is to update the i_dotdot count even in USTATE inodes during pass2. That causes lost+found to come out right but precludes adding inodes in mid stream, invalidating my previous patch for bug 1. Currently, I am pre-allocating one extra inoinfo slot per cylinder group (which prevents bug 1) and updating USTATE counts (which fixes bug 2). I realized that bug 4 was out there only a few minutes ago. Two solutions occur to me:

a. Switch to a new directory under a different name when lost+found has 32760 entries.

b. Bag it and claim lost+found is full when it has 32760 files in it.
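
Solution a could be sketched roughly as follows. This is only an illustration under stated assumptions, not fsck code: the helper names lf_nlink_would_overflow and lf_dir_name, the two constants, and the numbered-suffix naming are all hypothetical, and the sketch takes the signed 16-bit link count described above as the ceiling.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: the link count is a signed 16-bit field, as stated for
 * di_nlink in the message, so INT16_MAX (32767) is the hard ceiling. */
#define LF_NLINK_MAX   INT16_MAX
/* The switch-over threshold proposed in the message (a little headroom). */
#define LF_ENTRY_LIMIT 32760UL

/* Hypothetical helper: nonzero if one more link would overflow the field. */
int
lf_nlink_would_overflow(int16_t nlink)
{
    return nlink >= LF_NLINK_MAX;
}

/*
 * Hypothetical helper: pick the directory name for the entry with the
 * given zero-based index. Entries 0..32759 go to "lost+found", the next
 * 32760 go to "lost+found.01", and so on. Returns snprintf's result.
 */
int
lf_dir_name(unsigned long entry_index, char *buf, size_t len)
{
    unsigned long overflow_dir = entry_index / LF_ENTRY_LIMIT;

    if (overflow_dir == 0)
        return snprintf(buf, len, "lost+found");
    return snprintf(buf, len, "lost+found.%02lu", overflow_dir);
}
```

A caller would ask lf_dir_name for the target directory before reconnecting each orphan, so that no single directory passes the 32760-entry threshold. How the extra directories themselves get allocated without reintroducing bugs 1 and 2 is exactly the open question in this mail and is not addressed by the sketch.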
With 5 to 8 million files/directories in a file system, 32760 isn't very many so I am not enthused about b. On the other hand, I can't think of a fix for bugs 1/2 which is compatible with solution a. So, I think I'll go to bed! Hmmm, pondering and rereading this, an interesting possibility occurs to me. On a bad hardware day, it would help if we put each fsck run in a different lost+found directory (lost+found.01, lost+found.02, etc.). fsck would ALWAYS allocate a new lost+found and if you had multiple crashes on one day it would be easier to tell which lost+found files should be recovered to where. (We really do have tools to recover these beasties and are working on improving them.) Which makes the unimplementable solution "a" more interesting! To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Aug 20 21:59:24 1999 Delivered-To: freebsd-fs@freebsd.org Received: from peach.ocn.ne.jp (peach.ocn.ne.jp [210.145.254.87]) by hub.freebsd.org (Postfix) with ESMTP id 314E514A09; Fri, 20 Aug 1999 21:59:21 -0700 (PDT) (envelope-from dcs@newsguy.com) Received: from newsguy.com by peach.ocn.ne.jp (8.9.1a/OCN) id NAA08836; Sat, 21 Aug 1999 13:57:19 +0900 (JST) Message-ID: <37BE317E.4B1D7791@newsguy.com> Date: Sat, 21 Aug 1999 13:56:30 +0900 From: "Daniel C. Sobral" X-Mailer: Mozilla 4.6 [en] (Win98; I) X-Accept-Language: en,pt-BR,ja MIME-Version: 1.0 To: Terry Lambert Cc: phk@critter.freebsd.dk, michaelh@cet.co.jp, wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite References: <199908191802.LAA25563@usr06.primenet.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Terry Lambert wrote: > > That's kind of the point.
No other VFS stacking system out there > plays by FreeBSD's revamped rules. I look around and I see no standards. It is still time to be experimental. -- Daniel C. Sobral (8-DCS) dcs@newsguy.com dcs@freebsd.org - Can I speak to your superior? - There's some religious debate on that question. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 21 2:44: 5 1999 Delivered-To: freebsd-fs@freebsd.org Received: from peach.ocn.ne.jp (peach.ocn.ne.jp [210.145.254.87]) by hub.freebsd.org (Postfix) with ESMTP id 1BA3E14EBE; Sat, 21 Aug 1999 02:43:56 -0700 (PDT) (envelope-from dcs@newsguy.com) Received: from newsguy.com by peach.ocn.ne.jp (8.9.1a/OCN) id SAA05512; Sat, 21 Aug 1999 18:39:36 +0900 (JST) Message-ID: <37BE6CE8.D59FF19C@newsguy.com> Date: Sat, 21 Aug 1999 18:10:00 +0900 From: "Daniel C. Sobral" X-Mailer: Mozilla 4.6 [en] (Win98; I) X-Accept-Language: en,pt-BR,ja MIME-Version: 1.0 To: Terry Lambert , phk@critter.freebsd.dk, michaelh@cet.co.jp, wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: BSD XFS Port & BSD VFS Rewrite References: <199908191802.LAA25563@usr06.primenet.com> <37BE317E.4B1D7791@newsguy.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org "Daniel C. Sobral" wrote: > > Terry Lambert wrote: > > > > That's kind of the point. No other VFS stacking system out there > > plays by FreeBSD's revamped rules. > > I look around and I see no standards. It is still time to be > experimental. Since someone complained of my meekness, let me restate that... :-) 1) BS. That was not your point. Your point, on which you spent many paragraphs, was that the way FreeBSD presently does its stuff cannot support passing a method through an intermediate host/fs that does not know it. 
If your "point" was the above, you could just have said "no one else does it this way, so we won't be able to have non-FreeBSD intermediate/frontend/backend hosts". Only that does not prove that "our" way is not right. 2) There is *no* compatibility in the VFS out there. It's a jungle. If we implemented something compatible with anyone else, it would be a first. And given that everything out there has its problems, it would be a huge mistake to adopt someone's standard just for the sake of being compatible. And if you disagree with point 2, feel free to argue against it. But in no way will it justify that absurd comment you made. Either that paragraph was trying to cover a flaw in your logic, or you just lost your train of thought. It certainly detracted from the content of the message. "You must assume that the intermediate host doesn't play by your rules". Bah. [not that I don't generally agree with you more often than it would be prudent to let it be publicly known :-) ] -- Daniel C. Sobral (8-DCS) dcs@newsguy.com dcs@freebsd.org - Can I speak to your superior? - There's some religious debate on that question. 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Aug 21 18:11:27 1999 Delivered-To: freebsd-fs@freebsd.org Received: from dt011n65.san.rr.com (dt010nb9.san.rr.com [204.210.12.185]) by hub.freebsd.org (Postfix) with ESMTP id 4E493154B2 for ; Sat, 21 Aug 1999 18:11:09 -0700 (PDT) (envelope-from Doug@gorean.org) Received: from gorean.org (master [10.0.0.2]) by dt011n65.san.rr.com (8.9.3/8.8.8) with ESMTP id SAA97209; Sat, 21 Aug 1999 18:09:22 -0700 (PDT) (envelope-from Doug@gorean.org) Message-ID: <37BF4DCB.1E9B7F82@gorean.org> Date: Sat, 21 Aug 1999 18:09:31 -0700 From: Doug Organization: Triborough Bridge & Tunnel Authority X-Mailer: Mozilla 4.61 [en] (X11; U; FreeBSD 4.0-CURRENT-0815 i386) X-Accept-Language: en MIME-Version: 1.0 To: alk@pobox.com Cc: freebsd-fs@FreeBSD.ORG Subject: Re: blocking References: <14250.853.418320.65158@avalon.east> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Anthony Kimball wrote: > > An NFS blocking behaviour which doesn't seem correct to me: > > 1. background a long /bin/cp to /foo from an NFS-mounted file system. > 2. ls /foo > > note that (2) hangs until (1) completes. Is this a bug? Someone smarter than me will probably respond to tell me that I'm wrong, but in my nascent understanding of NFS I'd say no, although I can't quite explain exactly what I'm thinking about it. The best way I can express it is to say that while one client is already making a change on a file system more requests from the same client get queued. I believe that if you were to do the 'ls' from a different system it would not block. Ok, there's the slow hanging curve, someone else can step up and hit it out of the park. :) Doug To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message