From owner-freebsd-hackers  Mon Dec 16 14:32:43 1996
Return-Path: <owner-hackers>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.4/8.8.4) id OAA09361
          for hackers-outgoing; Mon, 16 Dec 1996 14:32:43 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
          by freefall.freebsd.org (8.8.4/8.8.4) with SMTP id OAA09355
          for <freebsd-hackers@freefall.freebsd.org>; Mon, 16 Dec 1996 14:32:38 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id PAA02172; Mon, 16 Dec 1996 15:29:12 -0700
From: Terry Lambert <terry@lambert.org>
Message-Id: <199612162229.PAA02172@phaeton.artisoft.com>
Subject: Re: Heidemann Framework integration (Re: Other filesystems under FreeBSD)
To: koshy@india.hp.com (A JOSEPH KOSHY)
Date: Mon, 16 Dec 1996 15:29:12 -0700 (MST)
Cc: terry@lambert.org, freebsd-hackers@freefall.freebsd.org
In-Reply-To: <199612161415.AA106675713@fakir.india.hp.com> from "A JOSEPH KOSHY" at Dec 16, 96 07:15:13 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

> tl> FreeBSD is well suited to other file systems.  What I was discussing
> tl> was a number of small changes to the mechanism to separate the
> tl> implementation from the instantiation.  Basically, I wanted to be
> tl> able to simplify the code I needed to write to write a new FS, even
> tl> more than it is already simplified over that for Linux.  This is not
> tl> the same thing as the code being impossible without the changes,
> tl> only "more difficult than it would be in Terry's ideal world".
> 
> Do we have any plans of integrating the Heidemann framework more completely
> into the 3.0 development tree?
> 
> IMO this would be a good idea.  

The default VFS framework in BSD *is* the Heidemann framework.

The problems with it are all interface issues, where CSRG pounded
it into 4.4BSD in reaction to the USL/UCB consent decree.  One
of the terms of the decree was that UCB agreed to drop files,
effectively rendering inoperable 5 major kernel subsystems, including
the file system.  My take on this is that USL's intent was to damage
the ability of 4.4BSD-Lite derived code to be (easily) converted
into a running system.  CSRG was routing around the damage under
heavy time pressure; the results were to be expected.


The main problem areas in the 4.4BSD/FreeBSD implementation of the
Heidemann framework are:

1)	The size of the vnode_op_desc structure is determined from
	the FFS vfs structures in vfs_init.c.  This is in error
	because:

	o	The FFS must be compiled into the kernel
	o	The FFS must be recompiled when vnode_if.c
		and vnode_if.h are recreated to add new vfsops
		types to the vector, since the structure size
		in a precompiled FFS will not account for these
		new ops otherwise
	o	There is a requirement for at least one FS to
		be compiled into the kernel, even if the FFS
		explicit dependency were to be removed, since
		the initialization code depends on a static
		FS declaration for sizing no matter what

	Corrective actions are:

	o	Get the size of the vnode_op_desc structure from
		a generated integer value from vnode_if.c; the
		actual implementation is to add two output lines
		(one of them blank) to vnode_if.c to declare and
		make a manifest divide by the element size, and
		to add an extern declaration (two more lines, one
		of them blank) to vnode_if.h.  This is accomplished
		by a four line change to vnode_if.sh.
	o	Change the initialization so that FS's self-register
		using the same interface that LKM's do.  Cause a
		linker set of FS's wanting to be registered to be
		created (and then called) through init_main.c; a
		side effect of this will be that all LKM and non-LKM
		FS modules can be the same object code; that is, it is
		a link time, rather than a config-time, decision about
		whether or not an FS is a module or statically bound.
	o	You must still link one FS with the kernel (the one
		that mounts / for you), but this is only a temporary
		requirement until the boot blocks can preload modules
		from the boot media before jumping to the kernel.
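
A minimal sketch of the two ideas above, not the actual FreeBSD
code.  It assumes a simplified descriptor type, and every name in it
(op_descs, op_desc_count, fs_register) is hypothetical.  The count
is emitted next to the generated descriptor array, so nothing needs
a statically compiled FS to learn the vector size, and each FS
registers itself through one call:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the real descriptor type, which carries more fields. */
struct op_desc {
	int		 offset;	/* slot in each FS's operations vector */
	const char	*name;
};

/* What a generated vnode_if.c might contain (names hypothetical). */
static struct op_desc lookup_desc = { 0, "lookup" };
static struct op_desc create_desc = { 1, "create" };
static struct op_desc read_desc   = { 2, "read" };

struct op_desc *op_descs[] = { &lookup_desc, &create_desc, &read_desc };

/* The added line: a manifest divide by the element size. */
int op_desc_count = (int)(sizeof(op_descs) / sizeof(op_descs[0]));

/* What vnode_if.h would add so other files can see the count. */
extern int op_desc_count;

/* Self-registration: the same call for static and loadable FS's. */
typedef int (*op_fn)(void *);

struct fs_entry {
	const char	*name;
	op_fn		*vector;	/* op_desc_count slots */
};

#define MAX_FS 8
static struct fs_entry fs_table[MAX_FS];
static int fs_count;

/* Returns the new table index, or -1 on failure. */
int
fs_register(const char *name)
{
	op_fn *v;

	assert(name != NULL);
	if (fs_count == MAX_FS)
		return -1;
	/* Sized from the generated count, not from any one FS. */
	v = calloc((size_t)op_desc_count, sizeof(*v));
	if (v == NULL)
		return -1;
	fs_table[fs_count].name = name;
	fs_table[fs_count].vector = v;
	return fs_count++;
}
```

In a kernel the calls to fs_register() would come from walking a
linker set at init time, or from the module load path; here they
are ordinary function calls.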


2)	The vnode_op_desc structure is used to make the VOP calls
	order independent in the structure; basically, as long as you
	put the right calls in there, the order doesn't matter.  This
	was done to allow insertion of new VOPs without damaging
	the ability of structure users to use them.  This is wrong
	because:

	o	It also depends on the vnode_if.c being recompiled for
		each change.
	o	The intentional order independence denies linker set
		technology, which was unavailable to (or unused by)
		the 4.4BSD team.  Compared to use of linker sets,
		it is a royal kludge to have to recompile the vnode_if.c
		each time.
	o	The use of dynamic ordering means that the VOP reference
		must be made by descriptor.  This is the reason for the
		inline functions in vnode_if.h, and the additional
		overhead of their use (non-manifest array dereference,
		repush of arguments), as opposed to direct function
		dereference.
	o	Kill vclean.  It requires VOCALL, which is evil.

	Corrective actions are:

	o	Change the vnode_if.src/vnode_if.sh to use a linker set
		to gather the VOPs so that VOPS can be added to a running
		kernel.
	o	Destroy the order dependence; this can be easily done
		by *sorting* the descriptor list for each FS at time
		of insertion to make the VOP offset constant (and
		consistent with the gathered list).
	o	Allow runtime gathering of the list by reallocating
		the linker set as necessary.  This means putting the
		VOP list into reallocable memory in the first place, or
		doing so during init (init_main.c) before use.
	o	By using constant offsets in the internalized reference
		copy of the desc vector for each FS, the vnode_if.h
		can be changed to make the references direct references
		instead of through descriptors.
	o	In lieu of the immediate death of vclean, the use of
		the struct fileops should be discontinued.  This should
		be discontinued anyway, when devfs becomes default, and
		specfs is destroyed.  The remaining fileops can be
		rolled back out; vclean will have to use a flag marker
		on vnodes it wishes to invalidate, instead.
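
A rough illustration of the constant-offset idea, assuming a fixed
set of three operations and hypothetical names throughout.  Each FS
supplies its operations in any order; at insertion time they are
placed at the slot matching the gathered list (a bucket placement,
which has the same effect as sorting by slot), so callers can index
the vector directly instead of going through a descriptor:

```c
#include <assert.h>
#include <stddef.h>

typedef int (*vop_fn)(int);

/* Hypothetical gathered list; the slot numbers are the constants. */
enum { OP_LOOKUP, OP_READ, OP_WRITE, OP_COUNT };

struct op_entry {
	int	op;	/* which operation this entry implements */
	vop_fn	fn;
};

/* Used for any operation the FS did not supply. */
static int
default_op(int arg)
{
	(void)arg;
	return -1;
}

/* Place each entry at its constant slot, whatever order it came in. */
void
build_vector(const struct op_entry *list, size_t n, vop_fn vec[OP_COUNT])
{
	size_t i;

	for (i = 0; i < OP_COUNT; i++)
		vec[i] = default_op;
	for (i = 0; i < n; i++) {
		assert(list[i].op >= 0 && list[i].op < OP_COUNT);
		vec[list[i].op] = list[i].fn;
	}
}

/* Direct reference: a manifest index, no descriptor lookup. */
#define VOP_CALL(vec, op, arg)	((vec)[(op)]((arg)))

/* A demo FS that lists its operations out of order and omits read. */
static int demo_lookup(int x) { return x + 100; }
static int demo_write(int x)  { return x + 300; }

const struct op_entry demo_ops[] = {
	{ OP_WRITE,  demo_write  },
	{ OP_LOOKUP, demo_lookup },
};
```

With the vector built, VOP_CALL(vec, OP_LOOKUP, 1) reaches
demo_lookup() through one array index and one indirect call.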

3)	The VFS is not treated as a consumer interface; specifically,
	its objects are treated non-opaquely.  The biggest offender
	is struct nameidata, which is allocated by the caller, but
	freed by the callee.  This is bad because:

	o	It locks BSD into a single name space (ie: it can not
		support, easily, multiple name spaces for a single
		file system object).  That is, it can not support,
		easily, VFAT, HPFS, NTFS, or MACFS.
	o	It prevents the use of alternate storage for name
		space objects by making their appearance visible to
		the VFS consumers (currently, the system calls and
		the kernel NFS server code).  That is, it can not
		support Unicode storage of data.  This is an error
		for VFAT and NTFS, and for most modern FS work being
		done in the rest of the academic community.  In fact,
		CIFS (the successor to LANMan) uses native Unicode
		wire data.

	Corrective actions are:

	o	Change the nameidata interface to implement a corresponding
		nameifree() for each namei().  Make equivalents for the
		NFS server code, which is also a VFS consumer interface
		and expects the VFS to free its allocated data.
	o	Change every file system's interfaces for those interfaces
		which operate on nameidata, so that the FS's themselves
		never free the data.
	o	Allow namei()/nameifree() to deal with Unicode and name
		space conversions, transparently.
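
The paired allocate/free discipline might look like the sketch
below.  This is not the real namei() interface; struct name_ctx,
name_lookup(), name_free() and fs_name_length() are hypothetical
stand-ins.  The point is only that the layer which allocates the
object also frees it, and the FS borrows it without freeing:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Consumers would treat this as opaque; the layout is private. */
struct name_ctx {
	char	*path;	/* private copy; storage format is hidden */
};

static int live_ctx;	/* outstanding contexts, for illustration */

/* Allocation side of the pair. */
struct name_ctx *
name_lookup(const char *path)
{
	struct name_ctx *nc;
	size_t len;

	assert(path != NULL);
	nc = malloc(sizeof(*nc));
	if (nc == NULL)
		return NULL;
	len = strlen(path) + 1;
	nc->path = malloc(len);
	if (nc->path == NULL) {
		free(nc);
		return NULL;
	}
	memcpy(nc->path, path, len);
	live_ctx++;
	return nc;
}

/* Release side of the pair; the only place the object is freed. */
void
name_free(struct name_ctx *nc)
{
	if (nc == NULL)
		return;
	free(nc->path);
	free(nc);
	live_ctx--;
}

/* An FS operation borrows the context and never frees it. */
size_t
fs_name_length(const struct name_ctx *nc)
{
	return strlen(nc->path);
}

int
name_live_count(void)
{
	return live_ctx;
}
```

Because only name_lookup() and name_free() see the layout, the
stored form of the name could change (to Unicode, say) without
touching the callers.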

4)	The VFS stacking is broken.  The VFS stacking fails, mainly
	because of bad interactions with the FS specific VOP_LOCK and
	VOP_ADVLOCK code.

	o	The VOP_LOCK code fails because it maintains a promiscuous
		lock knowledge for use by vclean.  This is not a per-FS
		issue.
	o	The VOP_ADVLOCK is "corrected" in 4.4BSD-Lite2
		by moving to a "common interface"; unfortunately, the
		"common interface" is accessed via call-down... that is,
		code in the FS calls code in kern_lock.c.  This is a
		violation of interface direction, and means that anyone
		writing an FS stacking module must implement similar
		code, and, further, if stacking occurs, treat a stacked
		FS differently than an FS that accesses physical media.

	Corrective actions are:

	o	Murder vclean.  Alternately, move the vclean locking
		into the VOP_LOCK inline function (or if you have
		corrected #2, above, then put it in the macro
		definition that replaces the inline function; at least
		that way, the crap is out of the FS).
	o	Change each FS to not attempt the locking.  Not all FS's
		do it correctly (or at all) anyway, leading to lots of
		bad behaviours.  In particular, because directory entries
		*are* inodes in FAT/VFAT, there is a nice race condition
		that it is impossible to get rid of if the FS specific
		VOP_LOCK code is expected to manage the vclean lock.
	o	Convert the VOP_ADVLOCK interface into a veto interface;
		in other words, the default code would assert the lock
		on the top level vnode, then call down the stack.  By
		default, the stack call would be a NULL function returning
		success to allow the lock.  For stacked FS's, the same
		is true.  For FS stacking layers that "fan out" from
		one inode to two or more (union FS, quota FS, umsdos FS),
		they would specifically iterate the veto calls to each
		underlying vnode down.  Any failure and all locks are
		released, and the lock operation fails all the way up;
		the upper layer is free to retry after sleeping on the
		first node down.  This prevents a deadly embrace deadlock;
		such prevention is not possible with the call-down code.  If
		it is a non-blocking lock request, the top level code does not
		sleep.  In case of a VOP_LOCK calldown failure, the top
		level lock is released, and the failure propagates back
		to the caller.
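
The veto scheme can be sketched as follows, under heavy
simplification.  The struct node type and both functions are
hypothetical, each node carries a single advisory lock flag, and
there is no sleeping or retry.  The lock is asserted on the upper
node first, then each underlying node is asked in turn; any refusal
releases everything taken so far and the failure propagates up:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOWER 4

struct node {
	int		 locked;		/* lock held on this node */
	int		 veto;			/* nonzero: refuse the lock */
	struct node	*lower[MAX_LOWER];	/* more than one: fan-out */
	size_t		 nlower;
};

/* Release a node and everything beneath it. */
void
node_unlock(struct node *n)
{
	size_t i;

	for (i = 0; i < n->nlower; i++)
		node_unlock(n->lower[i]);
	n->locked = 0;
}

/*
 * Returns 0 on success, -1 if this node or any node beneath it
 * refuses.  On failure no lock taken by this call remains held.
 */
int
node_lock(struct node *n)
{
	size_t i;

	assert(n->nlower <= MAX_LOWER);
	if (n->veto || n->locked)
		return -1;
	n->locked = 1;			/* assert the lock at this level */
	for (i = 0; i < n->nlower; i++) {
		if (node_lock(n->lower[i]) != 0) {
			/* Veto: undo only what this call locked. */
			while (i > 0)
				node_unlock(n->lower[--i]);
			n->locked = 0;
			return -1;
		}
	}
	return 0;
}
```

A layer over physical media has nlower == 0 and veto == 0, which
plays the role of the NULL function returning success.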

5)	SMP and kernel multithreading issues regarding VFS reentrancy
	have not been considered in the current design:

	o	They should be considered before much longer

	Corrective actions are:

	o	Consider them.  One possible simplification exercise
		to make it much easier to debug would be to make all
		FS subsystem functions single entry/single exit,
		preparatory to "pushing down" the global entrancy mutex
		through the trap code for the system call interface.
		Another would be comment documentation of expected
		lock state for all objects going into functions, and
		resulting lock states for objects on the way out (for
		instance, you must lock the dir vnode on the way into
		a lookup, and the resulting vnode will be locked coming
		out, with the parent vnode for the resulting vnode
		(the second-to-terminal path component) potentially being
		left locked as well).  This is not well documented, and
		the reasons are unclear (the actual reasons are related
		to call for create or rename returning instead of failing
		if the file exists, but the create parameters from the
		user specify that the call should fail if the file exists).
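
As an illustration of the documentation and single entry/single
exit conventions suggested above, here is a toy lookup.  The types
and the function are hypothetical, a directory holds at most one
child, and the parent is always left locked, which is simpler than
the real lookup rules:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct obj {
	int		 locked;
	const char	*name;
	struct obj	*child;		/* toy directory: one entry */
};

/*
 * dir_lookup: find `name` under `dir`.
 *
 * Lock state on entry:	dir is locked by the caller.
 * Lock state on exit:	dir is still locked.  On success *out is
 *			returned locked; on failure *out is NULL
 *			and no additional lock is held.
 */
int
dir_lookup(struct obj *dir, const char *name, struct obj **out)
{
	int error = 0;
	struct obj *found = NULL;

	assert(dir->locked);		/* check the documented entry state */

	if (dir->child != NULL && strcmp(dir->child->name, name) == 0) {
		found = dir->child;
		found->locked = 1;
	} else {
		error = -1;
	}

	*out = found;
	assert(dir->locked);		/* check the documented exit state */
	return error;			/* the only exit */
}
```

With one exit per function, a later change that drops or takes a
mutex on the way out has exactly one place to go.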

6)	Etc. (exclusion interfaces, VOP_READDIR interfaces and the "cookie"
	hack for NFS directory iteration restart, and so on, and so on...).


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.