From owner-freebsd-hackers Mon Dec 16 14:32:43 1996
From: Terry Lambert
Message-Id: <199612162229.PAA02172@phaeton.artisoft.com>
Subject: Re: Heidemann Framework integration (Re: Other filesystems under FreeBSD)
To: koshy@india.hp.com (A JOSEPH KOSHY)
Date: Mon, 16 Dec 1996 15:29:12 -0700 (MST)
Cc: terry@lambert.org, freebsd-hackers@freefall.freebsd.org
In-Reply-To: <199612161415.AA106675713@fakir.india.hp.com> from "A JOSEPH KOSHY" at Dec 16, 96 07:15:13 pm

> tl> FreeBSD is well suited to other file systems.  What I was discussing
> tl> was a number of small changes to the mechanism to separate the
> tl> implementation from the instantiation.  Basically, I wanted to be
> tl> able to simplify the code I needed to write to write a new FS, even
> tl> more than it is already simplified over that for Linux.  This is not
> tl> the same thing as the code being impossible without the changes,
> tl> only "more difficult than it would be in Terry's ideal world".
>
> Do we have any plans of integrating the Heidemann framework more completely
> into the 3.0 development tree?
>
> IMO this would be a good idea.

The default VFS framework in BSD *is* the Heidemann framework.

The problems with it are all interface issues, where CSRG pounded it
into 4.4BSD in reaction to the USL/UCB consent decree.
One of the terms of the decree was that UCB agreed to drop files,
effectively rendering inoperable 5 major kernel subsystems, including
the file system.  My take on this is that USL's intent was to damage
the ability of 4.4BSD-Lite derived code to be (easily) converted into
a running system.

CSRG was routing around the damage under heavy time pressure; the
results were to be expected.

The main problem areas in the 4.4BSD/FreeBSD implementation of the
Heidemann framework are:

1)  The size of the vnode_op_desc structure is determined from the
    FFS vfs structures in vfs_init.c.

    This is in error because:

    o   The FFS must be compiled into the kernel

    o   The FFS must be recompiled when vnode_if.c and vnode_if.h
        are recreated to add new vfsops types to the vector, since
        the structure size in a precompiled FFS will not account
        for these new ops otherwise

    o   There is a requirement for at least one FS to be compiled
        into the kernel, even if the FFS explicit dependency were
        to be removed, since the initialization code depends on a
        static FS declaration for sizing no matter what

    Corrective actions are:

    o   Get the size of the vnode_op_desc structure from a generated
        integer value from vnode_if.c; the actual implementation is
        to add two output lines (one of them blank) to vnode_if.c to
        declare the integer and initialize it with a manifest divide
        of the vector size by the element size, and to add an extern
        declaration (two more lines, one of them blank) to
        vnode_if.h.  This is accomplished by a four line change to
        vnode_if.sh.

    o   Change the initialization so that FS's self-register using
        the same interface that LKM's do.  Cause a linker set of
        FS's wanting to be registered to be created (and then
        called) through init_main.c; a side effect of this will be
        that all LKM and non-LKM FS modules can be the same object
        code; that is, it is a link-time, rather than a config-time,
        decision about whether an FS is a module or statically
        bound.
    o   You must still link one FS with the kernel (the one that
        mounts / for you), but this is only a temporary requirement
        until the boot blocks can preload modules from the boot
        media before jumping to the kernel.

2)  The vnode_op_desc structure is used to make the VOP calls order
    independent in the structure; basically, as long as you put the
    right calls in there, the order doesn't matter.  This was done
    to allow insertion of new VOPs without damaging the ability of
    structure users to use them.

    This is wrong because:

    o   It also depends on the vnode_if.c being recompiled for each
        change.

    o   The intentional order independence rules out linker set
        technology, which was unavailable to (or unused by) the
        4.4BSD team.  Compared to use of linker sets, it is a royal
        kludge to have to recompile the vnode_if.c each time.

    o   The use of dynamic ordering means that the VOP reference
        must be made by descriptor.  This is the reason for the
        inline functions in vnode_if.h, and the additional overhead
        of their use (non-manifest array dereference, repush of
        arguments), as opposed to direct function dereference.

    o   Kill vclean.  It requires VOCALL, which is evil.

    Corrective actions are:

    o   Change the vnode_if.src/vnode_if.sh to use a linker set to
        gather the VOPs so that VOPs can be added to a running
        kernel.

    o   Destroy the order dependence; this can be easily done by
        *sorting* the descriptor list for each FS at time of
        insertion to make the VOP offset constant (and consistent
        with the gathered list).

    o   Allow runtime gathering of the list by reallocating the
        linker set as necessary.  This means putting the VOP list
        into reallocatable memory in the first place, or doing so
        during init (init_main.c) before use.

    o   By using constant offsets in the internalized reference copy
        of the desc vector for each FS, the vnode_if.h can be
        changed to make the references direct references instead of
        through descriptors

    o   In lieu of the immediate death of vclean, the use of the
        struct fileops should be discontinued.
        This should be discontinued anyway, when devfs becomes
        default, and specfs is destroyed.  The remaining fileops
        can be rolled back out; vclean will have to use a flag
        marker on vnodes it wishes to invalidate, instead.

3)  The VFS is not treated as a consumer interface; specifically,
    its objects are treated non-opaquely.  The biggest offender is
    struct nameidata, which is allocated by the caller, but freed by
    the callee.

    This is bad because:

    o   It locks BSD into a single name space (ie: it can not
        support, easily, multiple name spaces for a single file
        system object).  That is, it can not support, easily, VFAT,
        HPFS, NTFS, or MACFS.

    o   It prevents the use of alternate storage for name space
        objects by making their appearance visible to the VFS
        consumers (currently, the system calls and the kernel NFS
        server code).  That is, it can not support Unicode storage
        of data.  This is an error for VFAT and NTFS, and for most
        modern FS work being done in the rest of the academic
        community.  In fact, CIFS (the successor to LANMan) uses
        native Unicode wire data.

    Corrective actions are:

    o   Change the nameidata interface to implement a corresponding
        nameifree() for each namei().  Make equivalents for the NFS
        server code, which is also a VFS consumer interface, and
        expect the VFS to free its allocated data.

    o   Change every file system's interfaces for those interfaces
        which operate on nameidata, so that the FS's themselves
        never free the data.

    o   Allow namei()/nameifree() to deal with Unicode and name
        space conversions, transparently.

4)  The VFS stacking is broken.

    The VFS stacking fails, mainly because of bad interactions with
    the FS specific VOP_LOCK and VOP_ADVLOCK code.

    o   The VOP_LOCK code fails because it maintains a promiscuous
        lock knowledge for use by vclean.  This is not a per-FS
        issue.

    o   The VOP_ADVLOCK is "corrected" in 4.4BSD-Lite2 by moving to
        a "common interface"; unfortunately, the "common interface"
        is accessed via call-down... that is, code in the FS calls
        code in kern_lock.c.
        This is a violation of interface direction, and means that
        anyone writing an FS stacking module must implement similar
        code, and, further, if stacking occurs, treat a stacked FS
        differently than an FS that accesses physical media.

    Corrective actions are:

    o   Murder vclean.  Alternately, move the vclean locking into
        the VOP_LOCK inline function (or if you have corrected #2,
        above, then put it in the macro definition that replaces
        the inline function); at least that way, the crap is out of
        the FS.

    o   Change each FS to not attempt the locking.  Not all FS's do
        it correctly (or at all) anyway, leading to lots of bad
        behaviours.  In particular, because directory entries *are*
        inodes in FAT/VFAT, there is a nice race condition that it
        is impossible to get rid of if the FS specific VOP_LOCK
        code is expected to manage the vclean lock.

    o   Convert the VOP_ADVLOCK interface into a veto interface; in
        other words, the default code would assert the lock on the
        top level vnode, then call down the stack.  By default, the
        stack call would be a NULL function returning success to
        allow the lock.  For stacked FS's, the same is true.  For
        FS stacking layers that "fan out" from one inode to two or
        more (union FS, quota FS, umsdos FS), they would
        specifically iterate the veto calls to each underlying
        vnode down.  Any failure and all locks are released, and
        the lock operation fails all the way up; the upper layer is
        free to retry after sleeping on the first node down.  This
        prevents a deadly embrace deadlock, something that is not
        possible with the call-down code.  If it is a non-blocking
        lock request, the top level code does not sleep.  In case
        of a VOP_LOCK calldown failure, the top level lock is
        released, and the failure propagates back to the caller.

5)  SMP and kernel multithreading issues regarding VFS reentrancy
    have not been considered in the current design:

    o   They should be considered before much longer.

    Corrective actions are:

    o   Consider them.
        One possible simplification exercise to make it much easier
        to debug would be to make all FS subsystem functions single
        entry/single exit, preparatory to "pushing down" the global
        entrancy mutex through the trap code for the system call
        interface.  Another would be comment documentation of
        expected lock state for all objects going into functions,
        and resulting lock states for objects on the way out (for
        instance, you must lock the dir vnode on the way into a
        lookup, and the resulting vnode will be locked coming out,
        with the parent vnode for the resulting vnode (the
        second-to-terminal path component) potentially being left
        locked as well).  This is not well documented, and the
        reasons are unclear (the actual reasons are related to
        calls for create or rename returning instead of failing if
        the file exists, but the create parameters from the user
        specify that the call should fail if the file exists).

6)  Etc. (exclusion interfaces, VOP_READDIR interfaces and the
    "cookie" hack for NFS directory iteration restart, and so on,
    and so on...).

Regards,
Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
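
The veto-style VOP_ADVLOCK described under item 4 could be sketched roughly as follows. This is a user-space illustration only, not kernel code: struct node, adv_lock(), adv_unlock() and always_veto() are hypothetical names made up for the sketch, the lock is reduced to one flag per node, and sleeping/retry is left out. It is meant to show the shape of the idea (lock the top node, call down, release everything on any veto), under those assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOWER 4

/* Hypothetical, simplified stand-in for a vnode in an FS stack. */
struct node {
    int locked;                     /* 1 while the advisory lock is held */
    int (*veto)(struct node *);     /* NULL means "no objection" */
    struct node *lower[MAX_LOWER];  /* underlying nodes; >1 is a fan-out */
    size_t nlower;                  /* number of valid entries in lower[] */
};

/* Release the lock on n and on everything beneath it. */
void adv_unlock(struct node *n)
{
    size_t i;

    assert(n != NULL);
    for (i = 0; i < n->nlower; i++)
        adv_unlock(n->lower[i]);
    n->locked = 0;
}

/* Try to lock n and everything beneath it: 0 on success, -1 on veto. */
int adv_lock(struct node *n)
{
    size_t i;

    assert(n != NULL);
    /* Already held, or this layer objects: fail without side effects. */
    if (n->locked || (n->veto != NULL && n->veto(n) != 0))
        return -1;

    n->locked = 1;                  /* assert the lock at this level first */
    for (i = 0; i < n->nlower; i++) {
        if (adv_lock(n->lower[i]) != 0) {
            /* A lower layer vetoed: undo the siblings already locked,
             * drop our own lock, and let the failure propagate up. */
            while (i-- > 0)
                adv_unlock(n->lower[i]);
            n->locked = 0;
            return -1;
        }
    }
    return 0;
}

/* Example layer that always objects, used to exercise the rollback. */
int always_veto(struct node *n)
{
    (void)n;
    return -1;
}
```

A layer with no opinion leaves veto as NULL, which plays the role of the "NULL function returning success"; a fan-out layer simply lists more than one entry in lower[]. A blocking caller would sleep and call adv_lock() again after a failure, while a non-blocking caller would just return the error.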