Date:      Mon, 16 Aug 1999 13:48:16 -0700 (PDT)
From:      Bill Studenmund <wrstuden@nas.nasa.gov>
To:        Terry Lambert <tlambert@primenet.com>
Cc:        Alton Matthew <Matthew.Alton@anheuser-busch.com>, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject:   Re: BSD XFS Port & BSD VFS Rewrite
Message-ID:  <Pine.SOL.3.96.990816105106.27345H-100000@marcy.nas.nasa.gov>
In-Reply-To: <199908140150.SAA23891@usr04.primenet.com>

On Sat, 14 Aug 1999, Terry Lambert wrote:

> > I am currently conducting a thorough study of the VFS subsystem
> > in preparation for an all-out effort to port SGI's XFS filesystem to
> > FreeBSD 4.x at such time as SGI gives up the code.  Matt Dillon
> > has written in hackers- that the VFS subsystem is presently not
> > well understood by any of the active kernel code contributers and
> > that it will be rewritten later this year.  This is obviously of great
> > concern to me in this port.
> 
> It is of great concern to me that a rewrite, apparently because of
> non-understanding, is taking place at all.

That concerns me too. Many aspects of the 4.4 vnode interface were there
for specific reasons. Even if they were hack solutions, rewriting them
out of a lack of understanding is dangerous, as the new code will likely
run into the same problems as before. :-)

Also, it behooves all the *BSD's not to get too divergent; sharing code
between us helps all of us. Given that I'm working on the kernel side of a
data migration file system using NetBSD, I can assure you there are things
which FreeBSD would get access to more easily the more similar the two VFS
interfaces are. :-)

> I would suggest that anyone planning on this rewrite should talk,
> in depth, with John Heidemann prior to engaging in such activity.
> John is very approachable, and is a deep thinker.  Any rewrite
> that does not meet his original design goals for his stacking
> architecture is, I think, a Very Bad Idea(tm).
> 
> 
> > I greatly appreciate all assistance in answering the following
> > questions:
> > 
> > 1)  What are the perceived problems with the current VFS?
> > 2)  What options are available to us as remedies?
> > 3)  To what extent will existing FS code require revision in order
> >      to be useful after the rewrite?
> > 4)  Will Chapters 6,7,8 & 9 of "The Design and Implementation of
> >      the 4.4BSD Operating System" still pertain after the rewrite?
> > 5)  How important are questions 3 & 4 in the design of the new
> >      VFS?
> > 
> > I believe that the VFS is conceptually sound and that the existing
> > semantics should be strictly retained in the new code.  Any new
> > functionality should be added in the form of entirely new kernel 
> > routines and system calls, or possibly by such means as
> > converting the existing routines to the vararg format &etc.
> 
> Here are some of the problems I'm aware of, and my suggested remedies:
> 
> 1.	The interface is not reflexive, with regard to cn_pnbuf.
> 
> 	Specifically, path buffers are allocated by the caller, but
> 	not freed by the caller, and various routines in each FS
> 	implementation are expected to deal with this.
> 
> 	Each FS duplicates code, and such duplication is subject
> 	to error.  Not to mention that it makes your kernel fat.

Yep, that's not good.

> 2.	Advisory locks are hung off private backing objects.
> 
> 	Advisory locks are passed into VOP_ADVLOCK in each FS
> 	instance, and then each FS applies this by hanging the
> 	locks off a list on a private backing object.  For FFS,
> 	this is the in core inode.
> 
> 	A more correct approach would be to hang the lock off the
> 	vnode.  This effectively obviates the need for having a
> 	VOP_ADVLOCK at all, except for the NFS client FS, which
> 	will need to propagate lock requests across the net.  The
> 	most efficient mechanism for this would be to institute
> 	a pass/fail response for VOP_ADVLOCK calls, with a default
> 	of "pass", and an actual implementation of the operand only
> 	in the NFS client FS.

I agree that it's better for all fs's to share this functionality as much
as possible.

I'd vote against your implementation suggestion for VOP_ADVLOCK on an
efficiency concern. If we actually make a VOP call, that should be the
end of the story. I.e. either add a vnode flag to indicate pass/fail-ness,
or add a genfs/std call to handle the problem.

I'd actually vote for the latter. Hang the byte-range locking off of the
vnode, and add a genfs_advlock() or vop_stdadvlock() routine (depending on
OS flavor) to handle the call. That way all fs's can share code, and the
callers need only call VOP_ADVLOCK() - no other logic.
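
A minimal sketch of what such a default routine might look like, written
against the FreeBSD-style argument structure. The v_lockf member on struct
vnode is the proposed addition (it doesn't exist today), and the lf_advlock()
signature is assumed from what ufs_advlock() hands it; all names are
illustrative rather than definitive:

	/*
	 * Hypothetical default advisory-lock VOP: hang the lock list off
	 * the vnode instead of the per-fs inode, so only fs's like the
	 * NFS client need to override it.
	 */
	int
	vop_stdadvlock(ap)
		struct vop_advlock_args /* {
			struct vnode *a_vp;
			caddr_t a_id;
			int a_op;
			struct flock *a_fl;
			int a_flags;
		} */ *ap;
	{
		struct vnode *vp = ap->a_vp;
		struct vattr va;
		int error;

		/* Need the file size for SEEK_END-relative lock ranges. */
		error = VOP_GETATTR(vp, &va, curproc->p_ucred, curproc);
		if (error)
			return (error);

		/* The real work stays in the common byte-range lock code. */
		return (lf_advlock(ap, &vp->v_lockf, va.va_size));
	}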

NetBSD actually needs this to get unionfs to work. Do you want to talk
privately about it?

> 	Again, each FS must duplicate the advisory locking code,
> 	at present, and such duplication is subject to error.

Agreed.

> 3.	Object locks are implemented locally in many FS's.
> 
> 	The VOP_LOCK interface is implemented via vop_stdlock()
> 	calls in many FS's.  This is done using the "vfs_default"
> 	mechanism.  In other FS's, it's implemented locally.
> 
> 	The intent of the VOP_LOCK mechanism being implemented
> 	as a VOP at all was to allow it to be proxied to another
> 	machine over a network, using the original Heidemann
> 	design.  This is also the reason for the use of descriptors
> 	for all VOP arguments, since they can be opaquely proxied to
> 	another machine via a general mechanism.  Unlike NFS based
> 	network filesystems, this would allow you to add VOP's to
> 	both machines, without having to teach the transport about
> 	the new VOP for it to be usable remotely.

Just for a point of comparison, I recently got almost all the NetBSD fs's
to use common code. After our -Lite2 merge, all fs's were either calling
the lock manager, or using genfs_nolock() (a version for non-locking
fs's). Now there's a struct lock * and struct lock in struct vnode. The fs
exports its locking behavior via the struct lock *. For most fs's, the
struct lock * points to the struct lock, and genfs_lock() feeds that to
the lock manager.

But we've kept the ability to do something different (like call over the
network) alive. If the struct lock * is NULL, you have to call VOP_LOCK on
that fs. Note that this difference only matters for layered fs's -
everything else should be calling VOP_LOCK() and letting the dispatch code
figure out the right thing to do.
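
To make that concrete, here is a rough sketch (not the literal NetBSD code;
the field names should be taken as assumptions) of how a layered fs can use
the exported struct lock *:

	/*
	 * Sketch only: a layered fs checks whether the fs below it
	 * exports its lock.  If so, it can feed that lock straight to
	 * the lock manager; if not (NULL), it must go through VOP_LOCK
	 * so the lower fs can do its own thing (e.g. call over the net).
	 */
	int
	layerfs_lock_sketch(struct vnode *lowervp, int flags)
	{
		if (lowervp->v_vnlock != NULL)
			return (lockmgr(lowervp->v_vnlock, flags,
			    &lowervp->v_interlock));
		return (VOP_LOCK(lowervp, flags));
	}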

> 	Like the VOP_ADVLOCK, the need for VOP_LOCK is for proxy
> 	purposes, and it, too, should generate a pass/fail response,
> 	and be largely implemented in non-filesystem specific
> 	higher level code.

To an extent, that's what the exported struct lock * does, though the only
clients are layered filesystems. Everyone else calls VOP_LOCK. :-)

> 	Again, each FS which duplicates code for this function is
> 	subject to duplication errors.

Agreed.

> 4.	The VOP_READDIR interface is irrational.
> 
> 	The VOP_READDIR interface returns its responses in "host
> 	canonical format" (struct dirent, in sys/dirent.h).
> 	Internally, FFS operates on "directory entry blocks" that
> 	contain exactly these structures (an intentional coincidence).
> 
> 	The problem with this approach is that it makes the getdents
> 	system call sensitive to file systems for which some of the
> 	information returned (e.g. d_fileno, d_reclen, d_type, d_namlen)
> 	is synthetic.  What this means is that a single directory block
> 	of a native file system's directory implementation must be able
> 	to fit into the buffer passed to the getdirentries(2) system
> 	call, or a directory listing is not a valid snapshot of the
> 	current state of the directory.
> 
> 	It also vastly complicates directory traversal restarts (hence
> 	the ncookies and a_cookies arguments, since the NFS server
> 	requires the ability to restart traversal, mid-block, since
> 	the NFSv2 protocol returns directory entries one at a time).
> 
> 	The "cookie" idea must be carried out faithfully, in an FS
> 	specific fashion, for each FS which is allowed to be NFS
> 	exported.  This code duplication is subject to error, or
> 	worse, non-implementation due to its complexity.
> 
> 	A more rational approach would be to split the operation
> 	into two separate VOP's: one to acquire a snapshot of a set
> 	of FS specific directory entries of an arbitrary size, and
> 	the second to extract entries into the user's buffer, in
> 	canonical format.

Sounds interesting...
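
If it went that way, the split might look something like this (purely
hypothetical names and signatures, just to make the shape concrete):

	/* Snapshot an arbitrary-sized run of fs-native directory entries. */
	int VOP_READDIR_SNAP(struct vnode *vp, struct uio *uio,
	    struct ucred *cred, void **fs_snapshot);

	/*
	 * Extract entries from that snapshot into canonical struct dirent
	 * records in the user's buffer; restart state lives in the
	 * snapshot, so the NFS server's cookie machinery can go away.
	 */
	int VOP_READDIR_EXTRACT(struct vnode *vp, void *fs_snapshot,
	    struct uio *uio, int *eofflag);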

> 5.	The idea of "root" vs. "non-root" mounts is inherently bad.
> 
> 	Right now, there are several operations, all wrapped into
> 	a single "mount" entry point.  This is actually a partial
> 	transition to a more canonically correct implementation.
> 
> 	The reason for the "root" vs. "non-root" knowledge in the
> 	code has to do with several logical operations:
> 
> 	1)	"Mounting" the filesystem; that is, getting the
> 		vnode for the device to be mounted, and doing any
> 		FS specific operations necessary to cause the
> 		correct in-core context to be established.
> 
> 	2)	Covering the vnode at the mount point.
> 
> 		This operation updates the vnode of the mount
> 		point so that traversals of the mount point will
> 		get you the root directory of the FS that was
> 		mounted instead of the directory that is covered
> 		by the mount.
> 
> 	3)	Saving the "last mounted on" information.
> 
> 		This is a clerical detail.  Read-only FS's, and
> 		some read-write FS's, do not implement this.  It
> 		is mostly a nicety for tools that manipulate FFS
> 		directly.
> 
> 	4)	Initialize the FS stat information.
> 
> 		Part of the in-core data for any FS is the mnt_stat
> 		data, which is what comes back from a VFS_STATFS()
> 		call

You forgot:

	5)	Update export lists

		If you call the mount routine with no device name
		(args.fspec == 0) and with MNT_UPDATE, you get
		routed to the vfs_export routine

> 	The first operation is invariant.  It must be done for all
> 	FS's, whether they are "root" or "non-root".
> 
> 	The second operation is specific to "non-root" FS's.  It
> 	could be moved to common, higher level code -- specifically,
> 	it could be moved into the mount system call.

I thought it was? Admittedly, the only reference code I have is the ntfs
code in the NetBSD kernel. But given how full of #ifdef (__FreeBSD__)'s it
is, I thought it'd be an OK reference.

> 	The third operation is also specific to "non-root" FS's.  It
> 	could be discarded, or it could be moved to a separate VFS
> 	operation, e.g. VFS_SETMNTINFO().  I would recommend moving
> 	it to a separate VFSOP, instead of discarding it.  The reason
> 	for this is that an intelligent person could reasonably decide
> 	to add the setting of this data in newfs and tunefs, and do
> 	away with /etc/fstab.
> 
> 	The fourth operation is invariant.  It must be done for all
> 	FS's, whether they are "root" or "non-root".

For comparison, NetBSD has a mount entry point, and a mountroot entry
point. But all the other ick is there too.
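
For the curious, the NetBSD split shows up in struct vfsops roughly like
this (abbreviated and from memory -- member order and the other entry
points are elided):

	struct vfsops {
		const char *vfs_name;
		int	(*vfs_mount) __P((struct mount *mp, const char *path,
			    void *data, struct nameidata *ndp, struct proc *p));
		int	(*vfs_mountroot) __P((void));
		/* ... the rest of the usual VFS entry points ... */
	};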

> 	We can now see that we have two discrete operations:
> 
> 	1)	Placement of any FS, regardless of how it is intended
> 		to be used, into the list of mounted filesystems.
> 
> 	2)	Mapping a filesystem from the list of mounted FS's
> 		into the directory hierarchy.

	3)	Updating export information.

> 	The job of the per FS mount code should be to take a mount
> 	structure, the vnode of a device, the FS specific arguments,
> 	the mount point credentials, and the process requesting the
> 	mount, and _only_ do #1 and #4.
> 
> 	The conversion of the root device into a vnode pointer, or
> 	a path to a device into a vnode pointer, is the job of upper
> 	level code -- specifically, the mount system call, and the
> 	common code for booting.

My one concern about this is you've assumed that the user is mounting a
device onto a filesystem. Layered filesystems won't do that. nullfs,
umapfs, and unionfs will want a directory. The hierarchical storage
system I'm working on will want a file. kernfs, procfs, and an fs which I
haven't checked into the NetBSD tree don't really need the extra
parameter. Supporting all these different cases would be a hassle for
upstream code.

> 	This removes a large amount of complex code from each of
> 	the file systems, and centralizes the maintenance task into
> 	one set of code that either works for everyone, or no one
> 	(removing the duplication of code/introduction of errors
> 	issue).

Might I suggest a common library of routines which different mount
routines can call? That way we'd get code sharing while letting the fs
make decisions about what it expects of the input arguments.
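
Something along these lines, perhaps (entirely hypothetical -- the name,
the flag, and the error returns are made up just to show the shape): the fs
states what kind of object it expects to mount, and the shared helper does
the namei() and type check.

	int
	mount_getsource(fspec, expected, p, vpp)
		char *fspec;		/* user-space path from the mount args */
		enum vtype expected;	/* VBLK, VDIR, VREG, ... per the fs */
		struct proc *p;
		struct vnode **vpp;
	{
		struct nameidata nd;
		int error;

		NDINIT(&nd, LOOKUP, FOLLOW, UIO_USERSPACE, fspec, p);
		if ((error = namei(&nd)) != 0)
			return (error);
		if (nd.ni_vp->v_type != expected) {
			vrele(nd.ni_vp);
			return (expected == VBLK ? ENOTBLK : EINVAL);
		}
		*vpp = nd.ni_vp;
		return (0);
	}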

I've been looking forward to ripping the export updating out of the mount
call. It'd be nice if we could rototill both FreeBSD & NetBSD's mount
interfaces the same way at the same time. :-)

> 	In addition, the lack of "root" specific code in many FS's
> 	VFS_MOUNT entry points is the reason that they can not be
> 	mounted as "/".  This change would open it up, such that any
> 	FS that was supported by the kernel could be used as the
> 	root filesystem.
> 
> 6.	The "vfs_default" code damages stacking
> 
> 	The intent of the stacking architecture was to have the
> 	default operation for any VOP unknown to an FS fall through
> 	to the lower level code, and fail if it was not implemented.
> 
> 	The use of the "vfs_default" to make unimplemented VOP's
> 	fall through to code which implements the function, while well
> 	intentioned, is misguided.
> 
> 	Consider the case of a VOP proxy that proxies requests.  These
> 	might be requests to another machine, as in the previous
> 	proxy example, or they might be requests to user space, to
> 	allow for easy development of new filesystem layers.
> 
> 	In addition, in order to get a default operation to actually
> 	fail, you have to intentionally create a failing VOP for that
> 	particular FS.
> 
> 	Finally, the paradigm can not support new VOP's without a
> 	kernel recompilation.  This means that in order to add to
> 	the list of VOP's known to the system when you add a new FS,
> 	you don't merely have to reallocate the in-core copy of the
> 	vnodeop_desc to include a new (failing) member, you have to
> 	create a default behaviour for it, and modify the default
> 	operations table.  In other words, it's not extensible, as
> 	it was architected to be.

This problem is FreeBSD-specific. Your analysis seems sound.

> 7.	The struct nameidata (namei.h) is broken in conception.
> 
> 	One issue that recurs frequently, and remains unaddressed,
> 	is the issue of namespace abstraction.
> 
> 	This issue is nowhere more apparent than in the VFAT and NTFS
> 	filesystems, where there are two namespaces: one 8.3, and the
> 	second, 16 bit Unicode.
> 
> 	The problem is one of coherency, and one of reference, and
> 	is not easily resolved in the context of the current nameidata
> 	structure.  Both NTFS and the VFAT FS try to cover this issue,
> 	both with varying degrees of success.
> 
> 	The problem is that there is no canonical format that the
> 	kernel can use to communicate namespace data to FS's.  Unlike
> 	VOP_READDIR, which has the abstract (though ill-implemented)
> 	struct dirent, there is no abstract representation of the
> 	data in a pathname buffer, which would allow you to treat
> 	path components as opaque entities.
> 
> 	One potential remedy for this situation would be to canonicalize
> 	any path into an ordered list of components.  Ideally, this
> 	would be done in 16 bit Unicode (looking toward the future),
> 	but would minimally be separate components with length counts
> 	to allow faster rejection of non-matching components, and
> 	frequent recalculation of length.

NetBSD's name cache is a bit different from FreeBSD's, and might win here.
We have just VOP_LOOKUP, which calls the cache lookup routine, rather than
both a VOP_LOOKUP and a VOP_CACHEDLOOKUP.

Jaromir Dolecek has been discussing adding a canonicalized component name
to the cache entries. That way the VOP_LOOKUP routine gets called,
canonicalizes the name as it sees fit (say making it all upper case) if
it chooses to, and hands off to the cache lookup routine. The advantage is
that each fs can choose its own canonicalization, if it wants to. For
instance, ffs won't do anything (it's case-sensitive), while other
case-insensitive fs's will do different things.
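
A very rough sketch of how that might look in a case-insensitive fs's
lookup routine (the helper name is invented; the point is just that the
case-folding policy stays in the fs, while the name cache and the rest of
lookup are shared):

	int
	myfs_lookup(dvp, vpp, cnp)
		struct vnode *dvp;
		struct vnode **vpp;
		struct componentname *cnp;
	{
		char *cp;

		/*
		 * Fs policy: fold the component to upper case before the
		 * shared name cache ever sees it, so "FOO" and "foo" hit
		 * the same entry.
		 */
		for (cp = cnp->cn_nameptr;
		    cp < cnp->cn_nameptr + cnp->cn_namelen; cp++)
			if (*cp >= 'a' && *cp <= 'z')
				*cp += 'A' - 'a';

		/*
		 * From here on, proceed as any fs does today: consult the
		 * name cache with the canonicalized name, then fall back
		 * to the on-disk directory scan on a miss.
		 */
		return (myfs_lookup_internal(dvp, vpp, cnp));	/* invented */
	}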

> 8.	The filesystems have knowledge of the name cache.
> 
> 	Entries into the name cache, and deletion of entries from
> 	the name cache, should be handled in FS independent code
> 	at a higher level.  This can avoid expensive VFS_LOOKUP calls
> 	in many cases, and save marshalling arguments into and out of
> 	the descriptor structure, in addition to drastically reducing
> 	the function call overhead.
> 
> 	Someone recently profiling FreeBSD's FS to determine speed
> 	bottlenecks (I believe it was Mike Smith, attempting to
> 	optimize for a ZD Labs benchmark) found that FreeBSD spends
> 	much of its time in namei().

I'm interested in what you suggest, because I'd expect all *BSD's could
use a more efficient namei. But I'm concerned that pushing too much into
upper-level routines would remove the fs's ability to make policy
decisions.

> 9.	The implementation of namei() is POSIX non-compliant
> 
> 	The implementation of namei() is by means of coroutine
> 	"recursion"; this is similar to the only recursion you can
> 	achieve in FORTRAN.
> 
> 	The upshot of this is that the use of the "//" namespace
> 	escape allowed by POSIX can not be usefully implemented.
> 	This is because it is not possible to inherit a namespace
> 	escape deeper than a single path component for a stack of
> 	more than one layer in depth.
> 
> 	This needs to be fixed, both for "natural" SMBFS support,
> 	and for other uses of the namespace escape (HTTP "tunnels",
> 	extended attribute and/or resource fork access in an OS/2
> 	HPFS or Macintosh HFS implementation, etc.), including
> 	forward looking research.
> 
> 	This is related to item 7.

I'm sorry. This point didn't parse. Could you give an example?

I don't see how the namei recursion method prevents catching // as a
namespace escape.

> 10.	Stacking is broken
> 
> 	This is really an issue of not having a coherency protocol
> 	which can be applied between stacks of files.  It is somewhat
> 	related to almost all of the above issues.
> 
> 	The current thinking which has been forwarded by Matt and
> 	John is that a vnode should have an associated vm_object_t,
> 	and that coherency should be maintained that way.
> 
> 	This thinking is flawed for a number of reasons:
> 
> 	a.	The main utility of this would be for an MFS
> 		implementation.  While a "fast MFS" is a
> 		laudable goal, it isn't sufficient to drive this.
> 
> 	b.	A coherency protocol is required in any case,
> 		since a proxied VOP is not necessarily on the
> 		same machine or in the same VM space.  This
> 		approach would disallow the possibility of a
> 		user space filesystem development framework.
> 
> 	c.	There already exist aliases (VM implementation
> 		errors); intentionally adding aliases as an
> 		implementation detail will further obfuscate them.
> 		Minimally, the VM system should pass a full
> 		branch path analysis based test procedure before
> 		they are introduced.  Even then, I would argue
> 		that it would open up a large complexity space
> 		that would prevent us from ever being sure about
> 		problem resolution again.
> 
> 	d.	Filesystems which need to transform data can
> 		never operate correctly, since they need to
> 		make local copies of the transformed content.
> 		This includes cryptographic, character set
> 		translation, compression, and similar stacking
> 		layers.
> 
> 	Instead, I think the interface design issues (VOP_ADVLOCK,
> 	VOP_GETPAGES, VOP_PUTPAGES, VOP_READ, VOP_WRITE, et al.)
> 	that drive the desire to implement coherency in this
> 	fashion should be examined.  I believe that an ideal solution
> 	would be to never have the pages replicated at more than a
> 	single vnode.  This would likewise solve the coherency
> 	problem, without the additional complexity.  The issue
> 	would devolve into locating the real backing object, and
> 	potentially, translating extents.

As NetBSD's UBC work is moving in a similar direction, and I'm interested
in working on a compressing fs, I'd like to hear more about the solution
you propose.

> 11.	The function call "footprint" of filesystems is too large
> 
> 	Attempt the following:
> 
> 		Compile up all of the files which make up an
> 		individual filesystem.  You can take all of
> 		the files for the ufs/ffs objects and the
> 		vnode_if.o from a compiled kernel for this
> 		exercise.
> 
> 		Now link them.  Ignore the missing "main"; how
> 		many undefined functions are there?
> 
> 	The problem you are seeing is the incursion of the VM
> 	system, and sloppy programming practices, into each VFS
> 	implementation.
> 
> 	This footprint impacts filesystem portability, and is
> 	one reason, among many (including some of the above) that
> 	VFS modules are no longer very portable between BSD
> 	flavors.
> 
> 	Minimally, the VFS incursions need to be macrotized, and
> 	not assume a unified VM and buffer cache (or a non-unified
> 	VM and buffer cache, as well, for that matter).  This would
> 	improve portability considerably.

Sounds good. :-)

> 	In addition to this change, a function minimization effort
> 	should take place.
> 
> 	If the underlying interface utilized by VFS layers was not
> 	the kernel (for local media FS's, like FFS or NTFS), but
> 	instead a variable granularity block store with a numeric
> 	namespace, then the "top" and "bottom" interfaces could be
> 	identical.  For now, however, some work can be done (and
> 	should be done) to reduce the function call footprint.
> 	This is important work, which can only aid development
> 	of future work (such as a user space filesystem framework
> 	for use by developers and researchers).
> 
> 	I hesitate to suggest this, but it might be reasonable to
> 	consider a struct containing externally referenced functions,
> 	which is registered into the FS via mount, and which is
> 	identical for all FS's.  This would, likewise, promote the
> 	idea of a user space framework.
> 
> 	Ideally, work would be done to port the Heidemann framework
> 	to Linux, so that their developers could be leveraged.
> 
> 
> 
> Some FFS-specific problems are:
> 
> 1.	The directory code in the UFS layer is intertwined with the
> 	filespace code
> 
> 	Ideally, one would be able to mount a filesystem as a flat
> 	numeric namespace (see #7, above), and then mount the idea
> 	of directory management over top of that.
> 
> 2.	The quota subsystem is too tightly integrated
> 
> 	Quotas should be an abstract stacking layer that can be
> 	applied to any FS, instead of an FFS specific monstrosity.

It should certainly be possible to add a quota layer on top of any leaf
fs. That way you could de-couple quotas. :-)

> 	The current quota system is also limited to 16 bits for a
> 	number of values which, in FreeBSD, can be greater than
> 	16 bits (e.g. UID's).
> 
> 	The current quota system is also broken for Y2038.
> 
> 3.	The filesystem itself is broken for Y2038
> 
> 	The space which was historically reserved for the Y2038 fix
> 	(a 64 bit time_t) was absconded with for subsecond resolution.
> 
> 	This change should be reverted, and fsck modified to re-zero
> 	the values, given a specific argument.
> 
> 	The subsecond resolution doesn't really matter, but if it is
> 	seen as an issue which needs to be addressed, the only value
> 	which could reasonably require this is the modification time,
> 	and there is sufficient free space in the inode to be able
> 	to provide for this (there are 2x32 bit spares).

I think all the *BSD's need to do the same thing here. :-)

One other suggestion I've heard is to split the 64 bits we have for time
into 44 bits for seconds, and 20 bits for microseconds. That's more than
enough modification resolution, and also pushes things to past year
500,000 AD. Versioning the inode would cover this easily.
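
To put numbers on that (simple arithmetic, plus a hypothetical packing
sketch, not an existing on-disk format): 2^44 seconds is about 557,000
years, and 2^20 = 1048576 comfortably covers 0-999999 microseconds.

	#define	TS_USEC_BITS	20
	#define	TS_USEC_MASK	((1 << TS_USEC_BITS) - 1)

	/*
	 * Pack/unpack the proposed 44-bit-seconds / 20-bit-microseconds
	 * split into one 64-bit field.
	 */
	u_int64_t
	ts_pack(u_int64_t sec, u_int32_t usec)
	{
		return ((sec << TS_USEC_BITS) | (usec & TS_USEC_MASK));
	}

	void
	ts_unpack(u_int64_t packed, u_int64_t *sec, u_int32_t *usec)
	{
		*sec = packed >> TS_USEC_BITS;
		*usec = packed & TS_USEC_MASK;
	}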

> I have other suggestions, but the above covers the most obvious
> damage.

Well taken.

Take care,

Bill






