From owner-freebsd-fs  Mon Aug 16  5:30:43 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from worf.qntm.com (worf.qntm.com [146.174.250.100])
	by hub.freebsd.org (Postfix) with ESMTP id DA28C14F83
	for <freebsd-fs@FreeBSD.ORG>; Mon, 16 Aug 1999 05:30:29 -0700 (PDT)
	(envelope-from Stephen.Byan@quantum.com)
Received: from mail3.qntm.com by worf.qntm.com with ESMTP
	(1.40.112.12/16.2) id AA110606569; Mon, 16 Aug 1999 05:29:29 -0700
Received: from milcmima.qntm.com (milcmima.qntm.com [146.174.18.61])
	by mail3.qntm.com (8.8.6/8.8.6) with ESMTP id FAA06209;
	Mon, 16 Aug 1999 05:29:36 -0700 (PDT)
Received: by milcmima.qntm.com with Internet Mail Service (5.5.2448.0)
	id <Q01MX7HS>; Mon, 16 Aug 1999 05:29:26 -0700
Message-Id: <8133266FE373D11190CD00805FA768BF02EE9D26@SHRCMSG1>
From: Stephen Byan <Stephen.Byan@quantum.com>
To: "'Terry Lambert'" <tlambert@primenet.com>,
	zzhang@cs.binghamton.edu
Cc: phk@critter.freebsd.dk, roberto@keltia.freenix.fr,
	freebsd-fs@FreeBSD.ORG
Subject: RE: Help with understand file system performance
Date: Mon, 16 Aug 1999 05:29:23 -0700
Mime-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2448.0)
Content-Type: text/plain;
	charset="iso-8859-1"
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Terry Lambert wrote:

>I am becoming convinced that an intermediate abstraction is really
>what is called for, to turn the bottom end into what is, in effect,
>nothing more than a flat, numeric namespace on top of a variable
>granularity block store.  A nice topic for much research... 8-).

There's an effort to create such a beast as part of CMU's Network Attached
Secure Disk research <http://www.pdl.cs.cmu.edu/NASD/>,
<http://www.pdl.cs.cmu.edu/extreme/>, and develop and implement it as a disk
drive interface, as part of NSIC's Network Attached Storage Device working
group <http://www.nsic.org/nasd/index.html>, then standardize it through
ANSI T10 as a SCSI-4 command-set. If the file system development community
has something to say to the drive vendors, now is the time to do it.
Personally, I'd be vocal about atomicity requirements.

FWIW, the next NSIC NASD public meeting is tomorrow, Aug 17, at the Clarion
Hotel in Millbrae, CA (i.e. at the San Francisco airport). 

Regards,
-Steve

Steve Byan <stephen.byan@quantum.com>
Design Engineer 
Quantum Corporation <http://www.quantum.com>
MS 1-3/E23
333 South Street
Shrewsbury, MA 01545
voice: (508) 770-3414 
fax: (508) 770-2604


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Mon Aug 16 13:50:28 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17])
	by hub.freebsd.org (Postfix) with ESMTP
	id DF2F8156AB; Mon, 16 Aug 1999 13:49:57 -0700 (PDT)
	(envelope-from wrstuden@marcy.nas.nasa.gov)
Received: from localhost (wrstuden@localhost)
	by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id NAA20782;
	Mon, 16 Aug 1999 13:48:16 -0700 (PDT)
Date: Mon, 16 Aug 1999 13:48:16 -0700 (PDT)
From: Bill Studenmund <wrstuden@nas.nasa.gov>
Reply-To: Bill Studenmund <wrstuden@nas.nasa.gov>
To: Terry Lambert <tlambert@primenet.com>
Cc: Alton Matthew <Matthew.Alton@anheuser-busch.com>,
	Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
In-Reply-To: <199908140150.SAA23891@usr04.primenet.com>
Message-ID: <Pine.SOL.3.96.990816105106.27345H-100000@marcy.nas.nasa.gov>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Sat, 14 Aug 1999, Terry Lambert wrote:

> > I am currently conducting a thorough study of the VFS subsystem
> > in preparation for an all-out effort to port SGI's XFS filesystem to
> > FreeBSD 4.x at such time as SGI gives up the code.  Matt Dillon
> > has written in hackers- that the VFS subsystem is presently not
> > well understood by any of the active kernel code contributors and
> > that it will be rewritten later this year.  This is obviously of great
> > concern to me in this port.
> 
> It is of great concern to me that a rewrite, apparently because of
> non-understanding, is taking place at all.

That concerns me too. Many aspects of the 4.4 vnode interface were there  
for specific reasons. Even if they were hack solutions, to re-write them  
because of a lack of understanding is dangerous as the new code will
likely run into the same problems as before. :-)

Also, it behooves all the *BSD's to not get too divergent. Sharing code
between us all helps all. Given that I'm working on the kernel side of a
data migration file system using NetBSD, I can assure you there are things
which FreeBSD would get access to more easily the more similar the two VFS
interfaces are. :-)

> I would suggest that anyone planning on this rewrite should talk,
> in depth, with John Heidemann prior to engaging in such activity.
> John is very approachable, and is a deep thinker.  Any rewrite
> that does not meet his original design goals for his stacking
> architecture is, I think, a Very Bad Idea(tm).
> 
> 
> > I greatly appreciate all assistance in answering the following
> > questions:
> > 
> > 1)  What are the perceived problems with the current VFS?
> > 2)  What options are available to us as remedies?
> > 3)  To what extent will existing FS code require revision in order
> >      to be useful after the rewrite?
> > 4)  Will Chapters 6,7,8 & 9 of "The Design and Implementation of
> >      the 4.4BSD Operating System" still pertain after the rewrite?
> > 5)  How important are questions 3 & 4 in the design of the new
> >      VFS?
> > 
> > I believe that the VFS is conceptually sound and that the existing
> > semantics should be strictly retained in the new code.  Any new
> > functionality should be added in the form of entirely new kernel 
> > routines and system calls, or possibly by such means as
> > converting the existing routines to the vararg format &etc.
> 
> Here are some of the problems I'm aware of, and my suggested remedies:
> 
> 1.	The interface is not reflexive, with regard to cn_pnbuf.
> 
> 	Specifically, path buffers are allocated by the caller, but
> 	not freed by the caller, and various routines in each FS
> 	implementation are expected to deal with this.
> 
> 	Each FS duplicates code, and such duplication is subject
> 	to error.  Not to mention that it makes your kernel fat.

Yep, that's not good.

> 2.	Advisory locks are hung off private backing objects.
> 
> 	Advisory locks are passed into VOP_ADVLOCK in each FS
> 	instance, and then each FS applies this by hanging the
> 	locks off a list on a private backing object.  For FFS,
> 	this is the in core inode.
> 
> 	A more correct approach would be to hang the lock off the
> 	vnode.  This effectively obviates the need for having a
> 	VOP_ADVLOCK at all, except for the NFS client FS, which
> 	will need to propagate lock requests across the net.  The
> 	most efficient mechanism for this would be to institute
> 	a pass/fail response for VOP_ADVLOCK calls, with a default
> 	of "pass", and an actual implementation of the operand only
> 	in the NFS client FS.

I agree that it's better for all fs's to share this functionality as much
as possible.

I'd vote against your implementation suggestion for VOP_ADVLOCK on an
efficiency concern. If we actually make a VOP call, that should be the
end of the story. I.e., either add a vnode flag to indicate pass/fail-ness,
or add a genfs/std call to handle the problem.

I'd actually vote for the latter. Hang the byte-range locking off of the
vnode, and add a genfs_advlock() or vop_stdadvlock() routine (depending on
OS flavor) to handle the call. That way all fs's can share code, and
the callers need only call VOP_ADVLOCK() - no other logic.
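
A rough userland sketch of what such a shared routine might look like: the
byte-range lock list hangs off the vnode itself, and a single generic
function does the conflict check for every fs. All of the names here
(struct sk_vnode, struct sk_lockf, sk_std_advlock) are made up for
illustration and are not the actual NetBSD or FreeBSD structures. Unlock,
blocking, and coalescing of adjacent ranges are left out.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical byte-range lock record; the range [start, end] is inclusive. */
struct sk_lockf {
    long start, end;
    int owner;               /* stand-in for a process / lock-owner id */
    int exclusive;           /* 1 = write lock, 0 = read lock */
    struct sk_lockf *next;
};

/* Hypothetical vnode: the lock list lives here, not in a private inode. */
struct sk_vnode {
    struct sk_lockf *v_lockf;
};

static int sk_conflicts(const struct sk_lockf *l, long start, long end,
                        int owner, int exclusive)
{
    if (l->owner == owner)
        return 0;                        /* an owner never blocks itself */
    if (l->end < start || end < l->start)
        return 0;                        /* ranges do not overlap */
    return l->exclusive || exclusive;    /* two read locks may coexist */
}

/* Generic set-lock routine that any fs could point its advlock op at. */
int sk_std_advlock(struct sk_vnode *vp, long start, long end,
                   int owner, int exclusive)
{
    struct sk_lockf *l;

    for (l = vp->v_lockf; l != NULL; l = l->next)
        if (sk_conflicts(l, start, end, owner, exclusive))
            return EAGAIN;

    l = malloc(sizeof(*l));
    if (l == NULL)
        return ENOMEM;
    l->start = start;
    l->end = end;
    l->owner = owner;
    l->exclusive = exclusive;
    l->next = vp->v_lockf;               /* push onto the vnode's list */
    vp->v_lockf = l;
    return 0;
}
```

With this shape, a caller only ever makes the one call; whether the fs uses
the generic routine or its own is decided by the ops table.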

NetBSD actually needs this to get unionfs to work. Do you want to talk
privately about it?

> 	Again, each FS must duplicate the advisory locking code,
> 	at present, and such duplication is subject to error.

Agreed.

> 3.	Object locks are implemented locally in many FS's.
> 
> 	The VOP_LOCK interface is implemented via vop_stdlock()
> 	calls in many FS's.  This is done using the "vfs_default"
> 	mechanism.  In other FS's, it's implemented locally.
> 
> 	The intent of the VOP_LOCK mechanism being implemented
> 	as a VOP at all was to allow it to be proxied to another
> 	machine over a network, using the original Heidemann
> 	design.  This is also the reason for the use of descriptors
> 	for all VOP arguments, since they can be opaquely proxied to
> 	another machine via a general mechanism.  Unlike NFS based
> 	network filesystems, this would allow you to add VOP's to
> 	both machines, without having to teach the transport about
> 	the new VOP for it to be usable remotely.

Just for a point of comparison, I recently got almost all the NetBSD fs's
to use common code. After our -Lite2 merge, all fs's were either calling
the lock manager, or using genfs_nolock() (a version for non-locking
fs's). Now there's a struct lock * and struct lock in struct vnode. The fs
exports its locking behavior via the struct lock *. For most fs's, the
struct lock * points to the struct lock, and genfs_lock() feeds that to
the lock manager.

But we've kept the ability to do something different (like call over the
network) alive. If the struct lock * is NULL, you have to call VOP_LOCK on
that fs. Note that this difference only matters for layered fs's -
everything else should be calling VOP_LOCK() and letting the dispatch code
figure out the right thing to do.
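
To make the NULL-pointer convention concrete, here is a minimal sketch of
how a layered fs might lock the vnode beneath it. The field and function
names (v_vnlock, v_lock, sk_lockmgr, sk_layer_lock_lower, sk_remote_lock)
are assumptions chosen for the example, and the "lock manager" is a
trivial stand-in.

```c
#include <stddef.h>

/* Hypothetical stand-in for a lock-manager lock. */
struct sk_lock {
    int held;
};

struct sk_vnode {
    struct sk_lock *v_vnlock;            /* exported lock, or NULL */
    struct sk_lock  v_lock;              /* storage most fs's export */
    int (*vop_lock)(struct sk_vnode *);  /* stand-in for VOP_LOCK dispatch */
};

/* Trivial stand-in for handing a lock to the lock manager. */
static int sk_lockmgr(struct sk_lock *lk)
{
    if (lk->held)
        return -1;                       /* a real lock manager would sleep */
    lk->held = 1;
    return 0;
}

/* Example of an fs that does "something different", e.g. a network call. */
static int sk_remote_calls;

int sk_remote_lock(struct sk_vnode *vp)
{
    (void)vp;
    sk_remote_calls++;                   /* pretend the remote side said yes */
    return 0;
}

/* What a layered fs might do to lock the vnode below it. */
int sk_layer_lock_lower(struct sk_vnode *lower)
{
    if (lower->v_vnlock != NULL)
        return sk_lockmgr(lower->v_vnlock);  /* share the exported lock */
    return lower->vop_lock(lower);           /* NULL: must go through the VOP */
}
```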

> 	Like the VOP_ADVLOCK, the need for VOP_LOCK is for proxy
> 	purposes, and it, too, should generate a pass/fail response,
> 	and be largely implemented in non-filesystem specific
> 	higher level code.

To an extent, that's what the exported struct lock * does, though the only
clients are layered filesystems. Everyone else calls VOP_LOCK. :-)

> 	Again, each FS which duplicates code for this function is
> 	subject to duplication errors.

Agreed.

> 4.	The VOP_READDIR interface is irrational.
> 
> 	The VOP_READDIR interface returns its responses in "host
> 	canonical format" (struct dirent, in sys/dirent.h).
> 	Internally, FFS operates on "directory entry blocks" that
> 	contain exactly these structures (an intentional coincidence).
> 
> 	The problem with this approach, is that it makes the getdents
> 	system call sensitive to file systems for which some of the
> 	information returned (e.g. d_fileno, d_reclen, d_type, d_namlen)
> 	are synthetic.  What this means is that a native file system
> 	directory implementation single directory block must be able
> 	to fit into the buffer passed to the getdirentries(2) system
> 	call, or a directory listing is not a valid snapshot of the
> 	current state of the directory.
> 
> 	It also vastly complicates directory traversal restarts (hence
> 	the ncookies and a_cookies arguments, since the NFS server
> 	requires the ability to restart traversal, mid-block, since
> 	the NFSv2 protocol returns directory entries one at a time).
> 
> 	The "cookie" idea must be carried out faithfully, in an FS
> 	specific fashion, for each FS which is allowed to be NFS
> 	exported.  This code duplication is subject to error, or
> 	worse, non-implementation due to its complexity.
> 
> 	A more rational approach would be to split the operation
> 	into two separate VOP's: one to acquire a snapshot of a set
> 	of FS specific directory entries of an arbitrary size, and
> 	the second to extract entries into the user's buffer, in
> 	canonical format.

Sounds interesting...

> 5.	The idea of "root" vs. "non-root" mounts is inherently bad.
> 
> 	Right now, there are several operations, all wrapped into
> 	a single "mount" entry point.  This is actually a partial
> 	transition to a more canonically correct implementation.
> 
> 	The reason for the "root" vs. "non-root" knowledge in the
> 	code has to do with several logical operations:
> 
> 	1)	"Mounting" the filesystem; that is, getting the
> 		vnode for the device to be mounted, and doing any
> 		FS specific operations necessary to cause the
> 		correct in-core context to be established.
> 
> 	2)	Covering the vnode at the mount point.
> 
> 		This operation updates the vnode of the mount
> 		point so that traversals of the mount point will
> 		get you the root directory of the FS that was
> 		mounted instead of the directory that is covered
> 		by the mount.
> 
> 	3)	Saving the "last mounted on" information.
> 
> 		This is a clerical detail.  Read-only FS's, and
> 		some read-write FS's, do not implement this.  It
> 		is mostly a nicety for tools that manipulate FFS
> 		directly.
> 
> 	4)	Initialize the FS stat information.
> 
> 		Part of the in-core data for any FS is the mnt_stat
> 		data, which is what comes back from a VFS_STATFS()
> 		call

You forgot:

	5)	Update export lists

		If you call the mount routine with no device name
		(args.fspec == 0) and with MNT_UPDATE, you get
		routed to the vfs_export routine

> 	The first operation is invariant.  It must be done for all
> 	FS's, whether they are "root" or "non-root".
> 
> 	The second operation is specific to "non-root" FS's.  It
> 	could be moved to common, higher level code -- specifically,
> 	it could be moved into the mount system call.

I thought it was? Admittedly the only reference code I have is the ntfs
code in the NetBSD kernel. But given how full of #ifdef (__FreeBSD__)'s it
is, I thought it'd be an ok reference.

> 	The third operation is also specific to "non-root" FS's.  It
> 	could be discarded, or it could be moved to a separate VFS
> 	operation, e.g. VFS_SETMNTINFO().  I would recommend moving
> 	it to a separate VFSOP, instead of discarding it.  The reason
> 	for this is that an intelligent person could reasonably decide
> 	to add the setting of this data in newfs and tunefs, and do
> 	away with /etc/fstab.
> 
> 	The fourth operation is invariant.  It must be done for all
> 	FS's, whether they are "root" or "non-root".

For comparison, NetBSD has a mount entry point, and a mountroot entry
point. But all the other ick is there too.

> 	We can now see that we have two discrete operations:
> 
> 	1)	Placement of any FS, regardless of how it is intended
> 		to be used, into the list of mounted filesystems.
> 
> 	2)	Mapping a filesystem from the list of mounted FS's
> 		into the directory hierarchy.

	3)	Updating export information.

> 	The job of the per FS mount code should be to take a mount
> 	structure, the vnode of a device, the FS specific arguments,
> 	the mount point credentials, and the process requesting the
> 	mount, and _only_ do #1 and #4.
> 
> 	The conversion of the root device into a vnode pointer, or
> 	a path to a device into a vnode pointer, is the job of upper
> 	level code -- specifically, the mount system call, and the
> 	common code for booting.

My one concern about this is you've assumed that the user is mounting a
device onto a filesystem. Layered filesystems won't do that. nullfs,
umapfs, and unionfs will want a directory. The hierarchical storage
system I'm working on will want a file. kernfs, procfs, and an fs which I
haven't checked into the NetBSD tree don't really need the extra
parameter. Supporting all these different cases would be a hassle for
upstream code.

> 	This removes a large amount of complex code from each of
> 	the file systems, and centralizes the maintenance task into
> 	one set of code that either works for everyone, or no one
> 	(removing the duplication of code/introduction of errors
> 	issue).

Might I suggest a common library of routines which different mount
routines can call? That way we'd get code sharing while letting the fs
make decisions about what it expects of the input arguments.

I've been looking forward to ripping the export updating out of the mount
call. It'd be nice if we could rototill both FreeBSD & NetBSD's mount
interfaces the same way at the same time. :-)

> 	In addition, the lack of "root" specific code in many FS's
> 	VFS_MOUNT entry points is the reason that they can not be
> 	mounted as "/".  This change would open it up, such that any
> 	FS that was supported by the kernel could be used as the
> 	root filesystem.
> 
> 6.	The "vfs_default" code damages stacking
> 
> 	The intent of the stacking architecture was to have the
> 	default operation for any VOP unknown to an FS fall through
> 	to the lower level code, and fail if it was not implemented.
> 
> 	The use of the "vfs_default" to make unimplemented VOP's
> 	fall through to code which implements function, while well
> 	intentioned, is misguided.
> 
> 	Consider the case of a VOP proxy that proxies requests.  These
> 	might be requests to another machine, as in the previous
> 	proxy example, or they might be requests to user space, to
> 	allow for easy development of new filesystem layers.
> 
> 	In addition, in order to get a default operation to actually
> 	fail, you have to intentionally create a failing VOP for that
> 	particular FS.
> 
> 	Finally, the paradigm can not support new VOP's without a
> 	kernel recompilation.  This means that in order to add to
> 	the list of VOP's known to the system when you add a new FS,
> 	you don't merely have to reallocate the in-core copy of the
> 	vnodeop_desc to include a new (failing) member, you have to
> 	create a default behaviour for it, and modify the default
> 	operations table.  In other words, it's not extensible, as
> 	it was architected to be.

This problem is FreeBSD-specific. Your analysis seems sound.

> 7.	The struct nameidata (namei.h) is broken in conception.
> 
> 	One issue that recurs frequently, and remains unaddressed,
> 	is the issue of namespace abstraction.
> 
> 	This issue is nowhere more apparent than in the VFAT and NTFS
> 	filesystems, where there are two namespaces: one 8.3, and the
> 	second, 16 bit Unicode.
> 
> 	The problem is one of coherency, and one of reference, and
> 	is not easily resolved in the context of the current nameidata
> 	structure.  Both NTFS and the VFAT FS try to cover this issue,
> 	both with varying degrees of success.
> 
> 	The problem is that there is no canonical format that the
> 	kernel can use to communicate namespace data to FS's.  Unlike
> 	VOP_READDIR, which has the abstract (though ill-implemented)
> 	struct dirent, there is no abstract representation of the
> 	data in a pathname buffer, which would allow you to treat
> 	path components as opaque entities.
> 
> 	One potential remedy for this situation would be to canonicalize
> 	any path into an ordered list of components.  Ideally, this
> 	would be done in 16 bit Unicode (looking toward the future),
> 	but would minimally be separate components with length counts
> 	to allow faster rejection of non-matching components, and
> 	to avoid frequent recalculation of length.

NetBSD's name cache is a bit different from FreeBSD's, and might win here.
We have just VOP_LOOKUP, which calls the cache lookup routine, rather than
both a VOP_LOOKUP and a VOP_CACHEDLOOKUP.

Jaromir Dolecek has been discussing adding a canonicalized component name
to the cache entries. That way the VOP_LOOKUP routine gets called,
canonicalizes the name as it sees fit (say making it all upper case) if
it chooses to, and hands off to the cache lookup routine. The advantage is
that each fs can choose its own canonicalization, if it wants to. For
instance, ffs won't do anything (it's case sensitive), while other
case-insensitive fs's will do different things.
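
As an illustration only, per-fs canonicalization could be modelled as a
function pointer that the lookup path applies before comparing against the
canonical name stored in the cache entry. The names below (sk_canon_fn,
sk_canon_identity, sk_canon_upper, sk_cache_match) are hypothetical, the
256-byte buffer is an arbitrary choice, and the upper-casing is ASCII-only.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-fs hook: write the canonical form of name into out. */
typedef void (*sk_canon_fn)(const char *name, char *out, size_t outlen);

/* Case-sensitive fs (ffs-like): the canonical form is the name itself. */
void sk_canon_identity(const char *name, char *out, size_t outlen)
{
    if (outlen == 0)
        return;
    strncpy(out, name, outlen - 1);
    out[outlen - 1] = '\0';
}

/* Case-insensitive fs: fold everything to upper case. */
void sk_canon_upper(const char *name, char *out, size_t outlen)
{
    size_t i;

    if (outlen == 0)
        return;
    for (i = 0; i + 1 < outlen && name[i] != '\0'; i++)
        out[i] = (char)toupper((unsigned char)name[i]);
    out[i] = '\0';
}

/* Compare a looked-up component against a cached canonical name. */
int sk_cache_match(sk_canon_fn canon, const char *cached_canon,
                   const char *lookup)
{
    char buf[256];

    canon(lookup, buf, sizeof(buf));
    return strcmp(buf, cached_canon) == 0;
}
```

The cache code itself stays policy-free; only the hook differs between a
case-sensitive and a case-insensitive fs.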

> 8.	The filesystems have knowledge of the name cache.
> 
> 	Entries into the name cache, and deletion of entries from
> 	the name cache, should be handled in FS independent code
> 	at a higher level.  This can avoid expensive VFS_LOOKUP calls
> 	in many cases, and save marshalling arguments into and out of
> 	the descriptor structure, in addition to drastically reducing
> 	the function call overhead.
> 
> 	Someone recently profiling FreeBSD's FS to determine speed
> 	bottlenecks (I believe it was Mike Smith, attempting to
> 	optimize for a ZD Labs benchmark) found that FreeBSD spends
> 	much of its time in namei().

I'm interested in what you suggest, because I'd expect all *BSD's could
use a more efficient namei. But I'm concerned that pushing too much into
upper-level routines would remove the fs's ability to make policy
decisions.

> 9.	The implementation of namei() is POSIX non-compliant
> 
> 	The implementation of namei() is by means of coroutine
> 	"recursion"; this is similar to the only recursion you can
> 	achieve in FORTRAN.
> 
> 	The upshot of this is that the use of the "//" namespace
> 	escape allowed by POSIX can not be usefully implemented.
> 	This is because it is not possible to inherit a namespace
> 	escape deeper than a single path component for a stack of
> 	more than one layer in depth.
> 
> 	This needs to be fixed, both for "natural" SMBFS support,
> 	and for other uses of the namespace escape (HTTP "tunnels",
> 	extended attribute and/or resource fork access in an OS/2
> 	HPFS or Macintosh HFS implementation, etc.), including
> 	forward looking research.
> 
> 	This is related to item 7.

I'm sorry. This point didn't parse. Could you give an example?

I don't see how the namei recursion method prevents catching // as a
namespace escape.

> 10.	Stacking is broken
> 
> 	This is really an issue of not having a coherency protocol
> 	which can be applied between stacks of files.  It is somewhat
> 	related to almost all of the above issues.
> 
> 	The current thinking which has been forwarded by Matt and
> 	John is that a vnode should have an associated vm_object_t,
> 	and that coherency should be maintained that way.
> 
> 	This thinking is flawed for a number of reasons:
> 
> 	a.	The main utility of this would be for an MFS
> 		implementation.  While a "fast MFS" is a
> 		laudable goal, it isn't sufficient to drive this.
> 
> 	b.	A coherency protocol is required in any case,
> 		since a proxied VOP is not necessarily on the
> 		same machine or in the same VM space.  This
> 		approach would disallow the possibility of a
> 		user space filesystem development framework.
> 
> 	c.	There already exist aliases (VM implementation
> 		errors); intentionally adding aliases as an
> 		implementation detail will further obfuscate them.
> 		Minimally, the VM system should pass a full
> 		branch path analysis based test procedure before
> 		they are introduced.  Even then, I would argue
> 		that it would open up a large complexity space
> 		that would prevent us from ever being sure about
> 		problem resolution again.
> 
> 	d.	Filesystems which need to transform data can
> 		never operate correctly, since they need to
> 		make local copies of the transformed content.
> 		This includes cryptographic, character set
> 		translation, compression, and similar stacking
> 		layers.
> 
> 	Instead, I think the interface design issues (VOP_ADVLOCK,
> 	VOP_GETPAGES, VOP_PUTPAGES, VOP_READ, VOP_WRITE, et al.)
> 	that drive the desire to implement coherency in this
> 	fashion should be examined.  I believe that an ideal solution
> 	would be to never have the pages replicated at more than a
> 	single vnode.  This would likewise solve the coherency
> 	problem, without the additional complexity.  The issue
> 	would devolve into locating the real backing object, and
> 	potentially, translating extents.

As NetBSD's UBC work is moving in a similar direction, and I'm interested
in working on a compressing fs, I'm interested in the solution you
propose.

> 11.	The function call "footprint" of filesystems is too large
> 
> 	Attempt the following:
> 
> 		Compile up all of the files which make up an
> 		individual filesystem.  You can take all of
> 		the files for the ufs/ffs objects and the
> 		vnode_if.o from a compiled kernel for this
> 		exercise.
> 
> 		Now link them.  Ignore the missing "main"; how
> 		many undefined functions are there?
> 
> 	The problem you are seeing is the incursion of the VM
> 	system, and sloppy programming practices, into each VFS
> 	implementation.
> 
> 	This footprint impacts filesystem portability, and is
> 	one reason, among many (including some of the above) that
> 	VFS modules are no longer very portable between BSD
> 	flavors.
> 
> 	Minimally, the VFS incursions need to be macrotized, and
> 	not assume a unified VM and buffer cache (or a non-unified
> 	VM and buffer cache, as well, for that matter).  This would
> 	improve portability considerably.

Sounds good. :-)

> 	In addition to this change, a function minimization effort
> 	should take place.
> 
> 	If the underlying interface utilized by VFS layers was not
> 	the kernel (for local media FS's, like FFS or NTFS), but
> 	instead a variable granularity block store with a numeric
> 	namespace, then the "top" and "bottom" interfaces could be
> 	identical.  For now, however, some work can be done (and
> 	should be done) to reduce the function call footprint.
> 	This is important work, which can only aid development
> 	of future work (such as a user space filesystem framework
> 	for use by developers and researchers).
> 
> 	I hesitate to suggest this, but it might be reasonable to
> 	consider a struct containing externally referenced functions,
> 	which is registered into the FS via mount, and which is
> 	identical for all FS's.  This would, likewise, promote the
> 	idea of a user space framework.
> 
> 	Ideally, work would be done to port the Heidemann framework
> 	to Linux, so that their developers could be leveraged.
> 
> 
> 
> Some FFS-specific problems are:
> 
> 1.	The directory code in the UFS layer is intertwined with the
> 	filespace code
> 
> 	Ideally, one would be able to mount a filesystem as a flat
> 	numeric namespace (see #7, above), and then mount the idea
> 	of directory management over top of that.
> 
> 2.	The quota subsystem is too tightly integrated
> 
> 	Quotas should be an abstract stacking layer that can be
> 	applied to any FS, instead of an FFS specific monstrosity.

It should certainly be possible to add a quota layer on top of any leaf
fs. That way you could de-couple quotas. :-)

> 	The current quota system is also limited to 16 bits for a
> 	number of values which, in FreeBSD, can be greater than
> 	16 bits (e.g. UID's).
> 
> 	The current quota system is also broken for Y2038.
> 
> 3.	The filesystem itself is broken for Y2038
> 
> 	The space which was historically reserved for the Y2038 fix
> 	(a 64 bit time_t) was absconded with for subsecond resolution.
> 
> 	This change should be reverted, and fsck modified to re-zero
> 	the values, given a specific argument.
> 
> 	The subsecond resolution doesn't really matter, but if it is
> 	seen as an issue which needs to be addressed, the only value
> 	which could reasonably require this is the modification time,
> 	and there is sufficient free space in the inode to be able
> 	to provide for this (there are 2x32 bit spares).

I think all the *BSD's need to do the same thing here. :-)

One other suggestion I've heard is to split the 64 bits we have for time
into 44 bits for seconds, and 20 bits for microseconds. That's more than
enough modification resolution, and also pushes things to past year
500,000 AD. Versioning the inode would cover this easily.
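
The arithmetic checks out: 2^20 = 1,048,576, so 20 bits hold any
microsecond count below 1,000,000, and 2^44 seconds is roughly 557,000
years. A minimal sketch of the packing follows. The function names and the
choice to put seconds in the high bits are assumptions for illustration,
not an on-disk format anyone has settled on, and callers are assumed to
pass sec < 2^44 and usec < 1,000,000.

```c
#include <stdint.h>

#define SK_USEC_BITS 20
#define SK_USEC_MASK ((UINT64_C(1) << SK_USEC_BITS) - 1)

/* Pack seconds into the upper 44 bits, microseconds into the lower 20. */
uint64_t sk_ts_pack(uint64_t sec, uint32_t usec)
{
    return (sec << SK_USEC_BITS) | ((uint64_t)usec & SK_USEC_MASK);
}

/* Recover the seconds field. */
uint64_t sk_ts_sec(uint64_t packed)
{
    return packed >> SK_USEC_BITS;
}

/* Recover the microseconds field. */
uint32_t sk_ts_usec(uint64_t packed)
{
    return (uint32_t)(packed & SK_USEC_MASK);
}
```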

> I have other suggestions, but the above covers the most obvious
> damage.

Well taken.

Take care,

Bill





From owner-freebsd-fs  Mon Aug 16 14:18:44 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp01.primenet.com (smtp01.primenet.com [206.165.6.131])
	by hub.freebsd.org (Postfix) with ESMTP
	id E270A14BD5; Mon, 16 Aug 1999 14:18:29 -0700 (PDT)
	(envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost)
	by smtp01.primenet.com (8.8.8/8.8.8) id OAA24762;
	Mon, 16 Aug 1999 14:18:56 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209)
 via SMTP by smtp01.primenet.com, id smtpd024727; Mon Aug 16 14:18:47 1999
Received: (from tlambert@localhost)
	by usr09.primenet.com (8.8.5/8.8.5) id OAA04940;
	Mon, 16 Aug 1999 14:18:45 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199908162118.OAA04940@usr09.primenet.com>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
To: wrstuden@nas.nasa.gov
Date: Mon, 16 Aug 1999 21:18:45 +0000 (GMT)
Cc: tlambert@primenet.com, Matthew.Alton@anheuser-busch.com,
	Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
In-Reply-To: <Pine.SOL.3.96.990816105106.27345H-100000@marcy.nas.nasa.gov> from "Bill Studenmund" at Aug 16, 99 01:48:16 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> > 2.	Advisory locks are hung off private backing objects.
> > 
> > 	Advisory locks are passed into VOP_ADVLOCK in each FS
> > 	instance, and then each FS applies this by hanging the
> > 	locks off a list on a private backing object.  For FFS,
> > 	this is the in core inode.
> > 
> > 	A more correct approach would be to hang the lock off the
> > 	vnode.  This effectively obviates the need for having a
> > 	VOP_ADVLOCK at all, except for the NFS client FS, which
> > 	will need to propagate lock requests across the net.  The
> > 	most efficient mechanism for this would be to institute
> > 	a pass/fail response for VOP_ADVLOCK calls, with a default
> > 	of "pass", and an actual implementation of the operand only
> > 	in the NFS client FS.
> 
> I agree that it's better for all fs's to share this functionality as much
> as possible.
> 
> I'd vote against your implementation suggestion for VOP_ADVLOCK on an
> efficiency concern. If we actually make a VOP call, that should be the
> end of the story. I.e., either add a vnode flag to indicate pass/fail-ness,
> or add a genfs/std call to handle the problem.
> 
> I'd actually vote for the latter. Hang the byte-range locking off of the
> vnode, and add a genfs_advlock() or vop_stdadvlock() routine (depending on
> OS flavor) to handle the call. That way all fs's can share code, and
> the callers need only call VOP_ADVLOCK() - no other logic.

OK.  Here's the problem with that:  NFS client locks in a stacked
FS on top of the NFS client FS.

Specifically, you need to separate the idea of asserting a lock
against the local vnode, asserting the lock via NFS locking, and
coalescing the local lock list, after both have succeeded, or
reverting the local assertion, should the remote assertion fail.
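
The ordering described above can be sketched as follows. This is a toy
model under stated assumptions: counters stand in for the real lock lists,
the remote_ok flag stands in for the NFS server's answer, and every name
(struct sk_nfs_vnode, sk_nfs_advlock, and the helpers) is hypothetical.

```c
#include <errno.h>

/* Hypothetical per-vnode lock state; counters stand in for lock lists. */
struct sk_nfs_vnode {
    int tentative;    /* asserted locally, not yet confirmed remotely */
    int committed;    /* merged into the coalesced local lock list */
    int remote_ok;    /* knob: does the remote host grant the lock? */
};

static int sk_local_assert(struct sk_nfs_vnode *vp)
{
    vp->tentative++;
    return 0;
}

static void sk_local_revert(struct sk_nfs_vnode *vp)
{
    vp->tentative--;
}

static void sk_local_coalesce(struct sk_nfs_vnode *vp)
{
    vp->tentative--;
    vp->committed++;
}

static int sk_remote_assert(struct sk_nfs_vnode *vp)
{
    return vp->remote_ok ? 0 : EAGAIN;
}

/* Local assert, then remote assert, then coalesce or revert. */
int sk_nfs_advlock(struct sk_nfs_vnode *vp)
{
    int error;

    if ((error = sk_local_assert(vp)) != 0)
        return error;
    if ((error = sk_remote_assert(vp)) != 0) {
        sk_local_revert(vp);          /* remote refused: undo local step */
        return error;
    }
    sk_local_coalesce(vp);            /* both succeeded: make it permanent */
    return 0;
}
```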

This is particularly important for transformative layers, specifically
cryptographic or compressing layers.  A similar issue exists for
character sets, e.g. a Unicode-enabled OS mounting an ISO 8859-1
filesystem via NFS, and having to do the directory (de)bloat
on the fly.


> NetBSD actually needs this to get unionfs to work. Do you want to talk
> privately about it?

If you want.  FreeBSD needs it for unionfs and nullfs, so it's
something that would be worth airing.

I think you could say that no locking routine was an approval of
the upper level lock.  This lets you bail on all FS's except NFS,
where you have to deal with the approve/reject from the remote
host.  The problem with this on FreeBSD is the VFS_default stuff,
which puts a non-NULL interface on all FS's for all VOP's.


> > 3.	Object locks are implemented locally in many FS's.
> > 
> > 	The VOP_LOCK interface is implemented via vop_stdlock()
> > 	calls in many FS's.  This is done using the "vfs_default"
> > 	mechanism.  In other FS's, it's implemented locally.
> > 
> > 	The intent of the VOP_LOCK mechanism being implemented
> > 	as a VOP at all was to allow it to be proxied to another
> > 	machine over a network, using the original Heidemann
> > 	design.  This is also the reason for the use of descriptors
> > 	for all VOP arguments, since they can be opaquely proxied to
> > 	another machine via a general mechanism.  Unlike NFS based
> > 	network filesystems, this would allow you to add VOP's to
> > 	both machines, without having to teach the transport about
> > 	the new VOP for it to be usable remotely.
> 
> Just for a point of comparison, I recently got almost all the NetBSD fs's
> to use common code. After our -Lite2 merge, all fs's were either calling
> the lock manager, or using genfs_nolock() (a version for non-locking
> fs's). Now there's a struct lock * and struct lock in struct vnode. The fs
> exports its locking behavior via the struct lock *. For most fs's, the
> struct lock * points to the struct lock, and genfs_lock() feeds that to
> the lock manager.
> 
> But we've kept the ability to do something different (like call over the
> network) alive. If the struct lock * is NULL, you have to call VOP_LOCK on
> that fs. Note that this difference only matters for layered fs's -
> everything else should be calling VOP_LOCK() and letting the dispatch code
> figure out the right thing to do.

Yes, this NULL is the same NULL I suggested for advisory locks,
above.
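
A minimal sketch of the "exported lock pointer, or NULL" convention
described above, as seen from a layered fs locking the vnode below it.
The structure layout, the field names and the helper functions are all
hypothetical stand-ins chosen for illustration; they are not taken from
the NetBSD or FreeBSD sources.

```c
#include <assert.h>
#include <stddef.h>

struct lock { int held; };

struct vnode {
    struct lock  v_lock;                    /* lock storage owned by the vnode */
    struct lock *v_vnlock;                  /* exported lock, or NULL */
    int        (*v_lockop)(struct vnode *); /* stand-in for VOP_LOCK dispatch */
    int          v_lockop_calls;            /* counts calls, for illustration */
};

/* Stand-in for handing a lock to the lock manager. */
static int lockmgr_acquire(struct lock *lk)
{
    lk->held = 1;
    return 0;
}

/* Stand-in for an fs that does its own locking, e.g. over a network. */
int private_lock_op(struct vnode *vp)
{
    vp->v_lockop_calls++;
    return 0;
}

/*
 * What a layered fs would do to lock the vnode below it: use the
 * exported lock directly when there is one, otherwise go through
 * the lower fs's own lock operation.
 */
int layer_lock_lower(struct vnode *lower)
{
    assert(lower != NULL);
    if (lower->v_vnlock != NULL)
        return lockmgr_acquire(lower->v_vnlock);
    assert(lower->v_lockop != NULL);
    return lower->v_lockop(lower);
}
```

Only the layered fs needs to care about the difference; any other
caller would just go through the normal dispatch.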

FreeBSD has moved to more common code, but it's all call-down
based because of the vfs_default stuff again.


> > 5.	The idea of "root" vs. "non-root" mounts is inherently bad.
> > 
> > 	Right now, there are several operations, all wrapped into
> > 	a single "mount" entry point.  This is actually a partial
> > 	transition to a more canonically correct implementation.
> > 
> > 	The reason for the "root" vs. "non-root" knowledge in the
> > 	code has to do with several logical operations:
> > 
> > 	1)	"Mounting" the filesystem; that is, getting the
> > 		vnode for the device to be mounted, and doing any
> > 		FS specific operations necessary to cause the
> > 		correct in-core context to be established.
> > 
> > 	2)	Covering the vnode at the mount point.
> > 
> > 		This operation updates the vnode of the mount
> > 		point so that traversals of the mount point will
> > 		get you the root directory of the FS that was
> > 		mounted instead of the directory that is covered
> > 		by the mount.
> > 
> > 	3)	Saving the "last mounted on" information.
> > 
> > 		This is a clerical detail.  Read-only FS's, and
> > 		some read-write FS's, do not implement this.  It
> > 		is mostly a nicety for tools that manipulate FFS
> > 		directly.
> > 
> > 	4)	Initialize the FS stat information.
> > 
> > 		Part of the in-core data for any FS is the mnt_stat
> > 		data, which is what comes back from a VFS_STATFS()
> > 		call
> 
> You forgot:
> 
> 	5)	Update export lists
> 
> 		If you call the mount routine with no device name
> 		(args.fspec == 0) and with MNT_UPDATE, you get
> 		routed to the vfs_export routine

This must be the job of the upper level code, so that there is
a single control point for export information, instead of spreading
it throughout each FS's mount entry point.

> > 	The first operation is invariant.  It must be done for all
> > 	FS's, whether they are "root" or "non-root".
> > 
> > 	The second operation is specific to "non-root" FS's.  It
> > 	could be moved to common, higher level code -- specifically,
> > 	it could be moved into the mount system call.
> 
> I thought it was? Admittedly the only reference code I have is the ntfs
> code in the NetBSD kernel. But given how full of #ifdef (__FreeBSD__)'s it
> is, I thought it'd be an ok reference.

No.

Basically, what you would have is the equivalent of a variable
length "mounted volume" table, from which mappings (and exports,
based on the mappings) are externalized into the namespace.


> > 	The third operation is also specific to "non-root" FS's.  It
> > 	could be discarded, or it could be moved to a separate VFS
> > 	operation, e.g. VFS_SETMNTINFO().  I would recommend moving
> > 	it to a separate VFSOP, instead of discarding it.  The reason
> > 	for this is that an intelligent person could reasonably decide
> > 	to add the setting of this data in newfs and tunefs, and do
> > 	away with /etc/fstab.
> > 
> > 	The fourth operation is invariant.  It must be done for all
> > 	FS's, whether they are "root" or "non-root".
> 
> For comparison, NetBSD has a mount entry point, and a mountroot entry
> point. But all the other ick is there too.

Right.  It should just have a "mount" entry point, and the rest
of the stuff moves to higher level code, called by the mount system
call, and the mountroot stuff during boot, to externalize the root
volume at the top of the hierarchy.

An ideal world would mount a / that had a /dev under it, and then
do transparent mounts over top of that.



> > 	We can now see that we have two discrete operations:
> > 
> > 	1)	Placement of any FS, regardless of how it is intended
> > 		to be used, into the list of mounted filesystems.
> > 
> > 	2)	Mapping a filesystem from the list of mounted FS's
> > 		into the directory hierarchy.
> 
> 	3)	Updating export information.

Built into the higher level code, same place as #2.
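
To make the split concrete, here is a rough sketch of a "mounted
volume" table in which registering a volume, mapping it into the
hierarchy, and exporting it are three separate upper-level operations.
Every name in it is hypothetical, and refusing a second mapping on the
same path is a simplification for the example, not a claim about how
stacked mounts should behave.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_VOLUMES 8

struct volume {
    const char *fs_type;    /* e.g. "ffs" */
    const char *mapped_at;  /* NULL: mounted, but not in the namespace */
    int         exported;
};

struct volume_table {
    struct volume vol[MAX_VOLUMES];
    size_t        count;
};

/* Operation 1: place an FS into the list of mounted filesystems. */
int volume_register(struct volume_table *t, const char *fs_type)
{
    assert(t != NULL && fs_type != NULL);
    if (t->count == MAX_VOLUMES)
        return -1;
    t->vol[t->count] = (struct volume){ fs_type, NULL, 0 };
    return (int)t->count++;
}

/* Operation 2: map a registered volume into the directory hierarchy. */
int volume_map(struct volume_table *t, int idx, const char *path)
{
    if (idx < 0 || (size_t)idx >= t->count || t->vol[idx].mapped_at != NULL)
        return -1;
    for (size_t i = 0; i < t->count; i++)
        if (t->vol[i].mapped_at != NULL &&
            strcmp(t->vol[i].mapped_at, path) == 0)
            return -1;      /* simplification: one volume per path */
    t->vol[idx].mapped_at = path;
    return 0;
}

/* Operation 3: exports are based on mappings, so require one. */
int volume_export(struct volume_table *t, int idx)
{
    if (idx < 0 || (size_t)idx >= t->count || t->vol[idx].mapped_at == NULL)
        return -1;
    t->vol[idx].exported = 1;
    return 0;
}
```

The per-FS mount code would only ever be involved in the first step;
the other two never need to know what kind of FS they are handling.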

> > 	The job of the per FS mount code should be to take a mount
> > 	structure, the vnode of a device, the FS specific arguments,
> > 	the mount point credentials, and the process requesting the
> > 	mount, and _only_ do #1 and #4.
> > 
> > 	The conversion of the root device into a vnode pointer, or
> > 	a path to a device into a vnode pointer, is the job of upper
> > 	level code -- specifically, the mount system call, and the
> > 	common code for booting.
> 
> My one concern about this is you've assumed that the user is mounting a
> device onto a filesystem.

No.  Vnode, not bdevvp.  The bdevvp stuff is for the boot time stuff
in the upper level code, and only applies to the root volume.

> Layered filesystems won't do that. nullfs,
> umaptfs, and unionfs will want a directory. The hierarchical storage
> system I'm working on will want a file. kernfs, procfs, and an fs which I
> haven't checked into the NetBSD tree don't really need the extra
> parameter. Supporting all these different cases would be a hassle for
> upstream code.
> 
> > 	This removes a large amount of complex code from each of
> > 	the file systems, and centralizes the maintenance task into
> > 	one set of code that either works for everyone, or no one
> > 	(removing the duplication of code/introduction of errors
> > 	issue).
> 
> Might I suggest a common library of routines which different mount
> routines can call? That way we'd get code sharing while letting the fs
> make decisions about what it expects of the input arguments.

This is the "footprint" problem, all over again.  Reject/accept (or 
"accept if no VOP") seems more elegant, and also reduces footprint.


> I've been looking forward to ripping the export updating out of the mount
> call. It'd be nice if we could rototill both FreeBSD & NetBSD's mount
> interfaces the same way at the same time. :-)

8-).


> > 7.	The struct nameidata (namei.h) is broken in conception.
> > 
> > 	One issue that recurs frequently, and remains unaddressed,
> > 	is the issue of namespace abstraction.
> > 
> > 	This issue is nowhere more apparent than in the VFAT and NTFS
> > 	filesystems, where there are two namespaces: one 8.3, and the
> > 	second, 16 bit Unicode.
> > 
> > 	The problem is one of coherency, and one of reference, and
> > 	is not easily resolved in the context of the current nameidata
> > 	structure.  Both NTFS and the VFAT FS try to cover this issue,
> > 	both with varying degrees of success.
> > 
> > 	The problem is that there is no canonical format that the
> > 	kernel can use to communicate namespace data to FS's.  Unlike
> > 	VOP_READDIR, which has the abstract (though ill-implemented)
> > 	struct dirent, there is no abstract representation of the
> > 	data in a pathname buffer, which would allow you to treat
> > 	path components as opaque entities.
> > 
> > 	One potential remedy for this situation would be to canonicalize
> > 	any path into an ordered list of components.  Ideally, this
> > 	would be done in 16 bit Unicode (looking toward the future),
> > 	but would minimally be separate components with length counts
> > 	to allow faster rejection of non-matching components, and
> > 	to avoid frequent recalculation of length.
> 
> NetBSD's name cache is a bit different from FreeBSD's, and might win here.
> We have just VOP_LOOKUP, which calls the cache lookup routine, rather than
> both a VOP_LOOKUP and a VOP_CACHEDLOOKUP.
> 
> Jaromir Dolecek has been discussing adding a canonicalized component name
> to the cache entries. That way the VOP_LOOKUP routine gets called,
> canonicalizes the name as it sees fit (say making it all upper case) if
> it chooses to, and hands off to the cache lookup routine. The advantage is
> that each fs can choose its own canonicalization, if it wants to. For
> instance, ffs won't do anything (it's case sensitive), while other
> case-insensitive fs's will do different things.

Can you push a Unicode name down from an appropriate system call?

I don't see any way to deal with an NT FS for characters outside
ISO 8859-1, otherwise.  8-(.
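
As an illustration of the "ordered list of components with length
counts" idea quoted above, a minimal sketch follows.  It works on
8 bit bytes rather than 16 bit Unicode to keep the example short, and
the type and function names are made up for the example; nothing here
is the actual nameidata layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_COMPONENTS 32

struct path_component {
    const char *name;   /* points into the original path, not NUL-terminated */
    size_t      len;    /* computed once, reused by every comparison */
};

struct component_list {
    struct path_component comp[MAX_COMPONENTS];
    size_t                count;
};

/*
 * Split a path on '/' into an ordered component list, skipping empty
 * components.  Returns 0 on success, -1 if there are too many.
 */
int path_canonize(const char *path, struct component_list *out)
{
    assert(path != NULL && out != NULL);
    out->count = 0;
    while (*path != '\0') {
        while (*path == '/')
            path++;
        size_t len = strcspn(path, "/");
        if (len == 0)
            break;                      /* trailing slashes only */
        if (out->count == MAX_COMPONENTS)
            return -1;
        out->comp[out->count].name = path;
        out->comp[out->count].len = len;
        out->count++;
        path += len;
    }
    return 0;
}

/* Reject on length first; only compare bytes when the lengths agree. */
int component_matches(const struct path_component *c,
                      const char *name, size_t len)
{
    return c->len == len && memcmp(c->name, name, len) == 0;
}
```

An FS that wants its own canonicalization (case folding, say) could
rewrite each component in place of the plain byte comparison.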


> > 9.	The implementation of namei() is POSIX non-compliant
> > 
> > 	The implementation of namei() is by means of coroutine
> > 	"recursion"; this is similar to the only recursion you can
> > 	achieve in FORTRAN.
> > 
> > 	The upshot of this is that the use of the "//" namespace
> > 	escape allowed by POSIX can not be usefully implemented.
> > 	This is because it is not possible to inherit a namespace
> > 	escape deeper than a single path component for a stack of
> > 	more than one layer in depth.
> > 
> > 	This needs to be fixed, both for "natural" SMBFS support,
> > 	and for other uses of the namespace escape (HTTP "tunnels",
> > 	extended attribute and/or resource fork access in an OS/2
> > 	HPFS or Macintosh HFS implementation, etc.), including
> > 	forward looking research.
> > 
> > 	This is related to item 7.
> 
> I'm sorry. This point didn't parse. Could you give an example?
> 
> I don't see how the namei recursion method prevents catching // as a
> namespace escape.


//apple-resource-fork/intermediate_dir/some_other_dir/file_with_fork

You can't inherit the fact that you are looking at the resource fork
in the terminal component, ONLY.


> > 	Instead, I think the interface design issues (VOP_ADVLOCK,
> > 	VOP_GETPAGES, VOP_PUTPAGES, VOP_READ, VOP_WRITE, et al.)
> > 	that drive the desire to implement coherency in this
> > 	fashion should be examined.  I believe that an ideal solution
> > 	would be to never have the pages replicated at more than a
> > 	single vnode.  This would likewise solve the coherency
> > 	problem, without the additional complexity.  The issue
> > 	would devolve into locating the real backing object, and
> > 	potentially, translating extents.
> 
> As NetBSD's UBC work is moving in a similar direction, and I'm interested
> in working on a compressing fs, I'm interested in the solution you
> propose.

Matt Dillon is apparently the person doing the work here.  It seems
I am out of date on the current thinking, as the vm_object_t
approach has apparently been discarded.


> > 2.	The quota subsystem is too tightly integrated
> > 
> > 	Quotas should be an abstract stacking layer that can be
> > 	applied to any FS, instead of an FFS specific monstrosity.
> 
> It should certainly be possible to add a quota layer on top of any leaf
> fs. That way you could de-couple quotas. :-)

Yes, assuming stacking works in the first place...


> > 3.	The filesystem itself is broken for Y2038
> > 
> > 	The space which was historically reserved for the Y2038 fix
> > 	(a 64 bit time_t) was absconded with for subsecond resolution.
> > 
> > 	This change should be reverted, and fsck modified to re-zero
> > 	the values, given a specific argument.
> > 
> > 	The subsecond resolution doesn't really matter, but if it is
> > 	seen as an issue which needs to be addressed, the only value
> > 	which could reasonably require this is the modification time,
> > 	and there is sufficient free space in the inode to be able
> > 	to provide for this (there are 2x32 bit spares).
> 
> I think all the *BSD's need to do the same thing here. :-)
> 
> One other suggestion I've heard is to split the 64 bits we have for time
> into 44 bits for seconds, and 20 bits for microseconds. That's more than
> enough modification resolution, and also pushes things to past year
> 500,000 AD. Versioning the inode would cover this easily.

Ugh.  But possible...
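
For reference, the 44/20 split is simple to sketch.  The code below
assumes an unsigned count of seconds since the epoch (so dates before
the epoch are not representable) and uses made-up names; it is not an
on-disk format from any BSD.  44 bits of seconds covers roughly 557,000
years, and 20 bits holds values up to 1,048,575, which is enough for
microseconds.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_BITS 20
#define USEC_MASK ((UINT64_C(1) << USEC_BITS) - 1)
#define SEC_MAX   ((UINT64_C(1) << 44) - 1)

/* Pack 44 bits of seconds and 20 bits of microseconds into 64 bits. */
uint64_t time_pack(uint64_t sec, uint32_t usec)
{
    assert(sec <= SEC_MAX && usec < 1000000);
    return (sec << USEC_BITS) | usec;
}

/* Recover the seconds field. */
uint64_t time_sec(uint64_t packed)
{
    return packed >> USEC_BITS;
}

/* Recover the microseconds field. */
uint32_t time_usec(uint64_t packed)
{
    return (uint32_t)(packed & USEC_MASK);
}
```

Because seconds sit in the high bits, two packed values compare in
time order with a plain integer comparison.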


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Mon Aug 16 16:28:42 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17])
	by hub.freebsd.org (Postfix) with ESMTP
	id 24F1714DA5; Mon, 16 Aug 1999 16:28:28 -0700 (PDT)
	(envelope-from wrstuden@marcy.nas.nasa.gov)
Received: from localhost (wrstuden@localhost)
	by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id QAA03157;
	Mon, 16 Aug 1999 16:04:11 -0700 (PDT)
Date: Mon, 16 Aug 1999 16:04:11 -0700 (PDT)
From: Bill Studenmund <wrstuden@nas.nasa.gov>
Reply-To: Bill Studenmund <wrstuden@nas.nasa.gov>
To: Terry Lambert <tlambert@primenet.com>
Cc: Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG,
	fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
In-Reply-To: <199908162118.OAA04940@usr09.primenet.com>
Message-ID: <Pine.SOL.3.96.990816143421.27345M-100000@marcy.nas.nasa.gov>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Mon, 16 Aug 1999, Terry Lambert wrote:

> > > 2.	Advisory locks are hung off private backing objects.
> > I'd vote against your implementation suggestion for VOP_ADVLOCK on an 
> > efficiency concern. If we actually make a VOP call, that should be the
> > end of the story. I.e. either add a vnode flag to indicate pass/fail-ness,
> > or add a genfs/std call to handle the problem.
> > 
> > I'd actually vote for the latter. Hang the byte-range locking off of the
> > vnode, and add a genfs_advlock() or vop_stdadvlock() routine (depending on
> > OS flavor) to handle the call. That way all fs's that can will share code,
> > and the callers need only call VOP_ADVLOCK() - no other logic.
> 
> OK.  Here's the problem with that:  NFS client locks in a stacked
> FS on top of the NFS client FS.

Ahh, but it'd be the fs's decision to map genfs_advlock()/vop_stdadvlock()
to its vop_advlock_desc entry or not. In this case, NFS wouldn't want to
do that.

Though it would mean growing the fs footprint.

> Specifically, you need to separate the idea of asserting a lock
> against the local vnode, asserting the lock via NFS locking, and
> coalescing the local lock list, after both have succeeded, or
> reverting the local assertion, should the remote assertion fail.

Right. But my thought was that you'd be calling an NFS routine, so it
could do the right thing.

> > NetBSD actually needs this to get unionfs to work. Do you want to talk
> > privately about it?
> 
> If you want.  FreeBSD needs it for unionfs and nullfs, so it's
> something that would be worth airing.
> 
> I think you could say that no locking routine was an approval of
> the upper level lock.  This lets you bail on all FS's except NFS,
> where you have to deal with the approve/reject from the remote
> host.  The problem with this on FreeBSD is the VFS_default stuff,
> which puts a non-NULL interface on all FS's for all VOP's.

I'm not familiar with the VFS_default stuff. All the vop_default_desc
routines in NetBSD point to error routines.

> Yes, this NULL is the same NULL I suggested for advisory locks,
> above.

I'm not sure. The struct lock * is only used by layered filesystems, so
they can keep track both of the underlying vnode lock, and if needed their
own vnode lock. For advisory locks, would we want to keep track both of
locks on our layer and the layer below? Don't we want either one or the
other? i.e. layers bypass to the one below, or deal with it all
themselves.

> > > 5.	The idea of "root" vs. "non-root" mounts is inherently bad.
> > You forgot:
> > 
> > 	5)	Update export lists
> > 
> > 		If you call the mount routine with no device name
> > 		(args.fspec == 0) and with MNT_UPDATE, you get
> > 		routed to the vfs_export routine
> 
> This must be the job of the upper level code, so that there is
> a single control point for export information, instead of spreading
> it throughout each FS's mount entry point.

I agree it should be detangled, but think it should remain the fs's job to
choose to call vfs_export. Otherwise an fs can't implement its own export
policies. :-)

> > I thought it was? Admittedly the only reference code I have is the ntfs
> > code in the NetBSD kernel. But given how full of #ifdef (__FreeBSD__)'s it
> > is, I thought it'd be an ok reference.
> 
> No.

We've lost the context, but what I was trying to say was that I thought
the marking-the-vnode-as-mounted-on bit was done in the mount syscall at
present. At least that's what
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_syscalls.c?rev=1.130
seems to be doing.

> Basically, what you would have is the equivalent of a variable
> length "mounted volume" table, from which mappings (and exports,
> based on the mappings) are externalized into the namespace.

Ahh, sounds like you're talking about a new formalism..

> Right.  It should just have a "mount" entry point, and the rest
> of the stuff moves to higher level code, called by the mount system
> call, and the mountroot stuff during boot, to externalize the root
> volume at the top of the hierarchy.
> 
> An ideal world would mount a / that had a /dev under it, and then
> do transparent mounts over top of that.

That would be quite a different place than we have now. ;-)

> > > 	The conversion of the root device into a vnode pointer, or
> > > 	a path to a device into a vnode pointer, is the job of upper
> > > 	level code -- specifically, the mount system call, and the
> > > 	common code for booting.
> > 
> > My one concern about this is you've assumed that the user is mounting a
> > device onto a filesystem.
> 
> No.  Vnode, not bdevvp.  The bdevvp stuff is for the boot time stuff
> in the upper level code, and only applies to the root volume.

Maybe I mis-parsed. I thought you were talking about parsing the first
mount option (in mount /dev/disk there, the /dev/disk option) into a
vnode. The concern below is that different fs's have different ideas as to
what that node should be. Some want it a device node which no one else is
using (most leaf fs's), while some others want a directory (nullfs, etc),
some want a file or device (the HSM system I'm working on) while others
don't care (in mount -t kernfs /kern /kern , the first kern doesn't matter
at all). But all is well with different support routines which the
mount_foo() routine can call.

> > Layered filesystems won't do that. nullfs,
> > umaptfs, and unionfs will want a directory. The hierarchical storage
> > system I'm working on will want a file. kernfs, procfs, and an fs which I
> > haven't checked into the NetBSD tree don't really need the extra
> > parameter. Supporting all these different cases would be a hassle for
> > upstream code.
> > 
> > > 	This removes a large amount of complex code from each of
> > > 	the file systems, and centralizes the maintenance task into
> > > 	one set of code that either works for everyone, or no one
> > > 	(removing the duplication of code/introduction of errors
> > > 	issue).
> > 
> > Might I suggest a common library of routines which different mount
> > routines can call? That way we'd get code sharing while letting the fs
> > make decisions about what it expects of the input arguments.
> 
> This is the "footprint" problem, all over again.  Reject/accept (or 
> "accept if no VOP") seems more elegant, and also reduces footprint.

Very true. The problem is that the current VFS system was designed as a
black box. It gets handed all calls, and it gets to decide policy, and do
everything on its own. We're now basically discussing ways of having the
plethora of fs's we now have do things the same way. :-)

> > > 7.	The struct nameidata (namei.h) is broken in conception.
> 
> Can you push a Unicode name down from an appropriate system call?
> 
> I don't see any way to deal with an NT FS for characters outside
> ISO 8859-1, otherwise.  8-(.

Hmmm. I think the real problem is that the kernel(s) is(are) not at all
designed well for different languages.

> > > 9.	The implementation of namei() is POSIX non-compliant
> > > 
> > > 	The implementation of namei() is by means of coroutine
> > > 	"recursion"; this is similar to the only recursion you can
> > > 	achieve in FORTRAN.
> > > 
> > > 	The upshot of this is that the use of the "//" namespace
> > > 	escape allowed by POSIX can not be usefully implemented.
> > > 	This is because it is not possible to inherit a namespace
> > > 	escape deeper than a single path component for a stack of
> > > 	more than one layer in depth.
> > > 
> > > 	This needs to be fixed, both for "natural" SMBFS support,
> > > 	and for other uses of the namespace escape (HTTP "tunnels",
> > > 	extended attribute and/or resource fork access in an OS/2
> > > 	HPFS or Macintosh HFS implementation, etc.), including
> > > 	forward looking research.
> > > 
> > > 	This is related to item 7.
> > 
> > I'm sorry. This point didn't parse. Could you give an example?
> > 
> > I don't see how the namei recursion method prevents catching // as a
> > namespace escape.
> 
> 
> //apple-resource-fork/intermediate_dir/some_other_dir/file_with_fork
> 
> You can't inherit the fact that you are looking at the resource fork
> in the terminal component, ONLY.

Yep, there's no easy way to do that now.. The one thing which comes to
mind is to have lookup() rip out the first component and save it in the
namei struct.

Though the devil's advocate in me points out that this difficulty is not
inherent in the recursion setup, but in how lookup() is designed. :-)
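
A rough sketch of the "rip out the first component and save it" idea:
a helper that recognizes a leading "//name" escape and records it so
the rest of the lookup can still see it at the terminal component.
The structure and field names are invented for the example and are
not the real struct nameidata.  Three or more leading slashes are
treated as an ordinary path, which is how POSIX describes them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical slice of lookup state; not the real struct nameidata. */
struct nameidata_sketch {
    const char *ni_escape;     /* escape name, or NULL if none */
    size_t      ni_escapelen;  /* length of the escape name */
    const char *ni_rest;       /* path left for the normal lookup */
};

/*
 * If the path starts with exactly two slashes followed by a name,
 * save that name as the namespace escape and leave the remainder
 * for ordinary component-by-component lookup.
 */
void strip_namespace_escape(const char *path, struct nameidata_sketch *ni)
{
    assert(path != NULL && ni != NULL);
    ni->ni_escape = NULL;
    ni->ni_escapelen = 0;
    ni->ni_rest = path;
    if (path[0] == '/' && path[1] == '/' &&
        path[2] != '/' && path[2] != '\0') {
        const char *name = path + 2;
        size_t len = strcspn(name, "/");
        ni->ni_escape = name;
        ni->ni_escapelen = len;
        ni->ni_rest = name + len;
    }
}
```

With the escape stored alongside the lookup state, every layer in the
stack can consult it, rather than only the one that parsed the first
component.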

> > > 	Quotas should be an abstract stacking layer that can be
> > > 	applied to any FS, instead of an FFS specific monstrosity.
> > 
> > It should certainly be possible to add a quota layer on top of any leaf
> > fs. That way you could de-couple quotas. :-)
> 
> Yes, assuming stacking works in the first place...

Except for a minor buglet with device nodes, stacking works in NetBSD at
present. :-)

> > One other suggestion I've heard is to split the 64 bits we have for time
> > into 44 bits for seconds, and 20 bits for microseconds. That's more than
> > enough modification resolution, and also pushes things to past year
> > 500,000 AD. Versioning the inode would cover this easily.
> 
> Ugh.  But possible...

I agree it's ugly, but it has the advantage that it doesn't grow the
on-disk inode. A lot of folks have designs on the remaining 64 bits free.
:-)

Take care,

Bill





From owner-freebsd-fs  Mon Aug 16 19:33: 6 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132])
	by hub.freebsd.org (Postfix) with ESMTP
	id E37FB14D15; Mon, 16 Aug 1999 19:32:53 -0700 (PDT)
	(envelope-from tlambert@usr02.primenet.com)
Received: (from daemon@localhost)
	by smtp02.primenet.com (8.8.8/8.8.8) id TAA19029;
	Mon, 16 Aug 1999 19:31:20 -0700 (MST)
Received: from usr02.primenet.com(206.165.6.202)
 via SMTP by smtp02.primenet.com, id smtpd019018; Mon Aug 16 19:31:18 1999
Received: (from tlambert@localhost)
	by usr02.primenet.com (8.8.5/8.8.5) id TAA08526;
	Mon, 16 Aug 1999 19:31:16 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199908170231.TAA08526@usr02.primenet.com>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
To: wrstuden@nas.nasa.gov
Date: Tue, 17 Aug 1999 02:31:16 +0000 (GMT)
Cc: tlambert@primenet.com, Matthew.Alton@anheuser-busch.com,
	Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
In-Reply-To: <Pine.SOL.3.96.990816143421.27345M-100000@marcy.nas.nasa.gov> from "Bill Studenmund" at Aug 16, 99 04:04:11 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> > > > 2.	Advisory locks are hung off private backing objects.
> > >
> > > I'd vote against your implementation suggestion for VOP_ADVLOCK on an 
> > > efficiency concern. If we actually make a VOP call, that should be the
> > > end of the story. I.e. either add a vnode flag to indicate pass/fail-ness,
> > > or add a genfs/std call to handle the problem.
> > > 
> > > I'd actually vote for the latter. Hang the byte-range locking off of the
> > > vnode, and add a genfs_advlock() or vop_stdadvlock() routine (depending on
> > > OS flavor) to handle the call. That way all fs's that can will share code,
> > > and the callers need only call VOP_ADVLOCK() - no other logic.
> > 
> > OK.  Here's the problem with that:  NFS client locks in a stacked
> > FS on top of the NFS client FS.
> 
> Ahh, but it'd be the fs's decision to map genfs_advlock()/vop_stdadvlock()
> to its vop_advlock_desc entry or not. In this case, NFS wouldn't want to
> do that.
> 
> Though it would mean growing the fs footprint.

Nope; that's not really the problem.

The problem is if I have two local processes that get into a race
in order to obtain a remote lock.

Because the remote lock is not asserted, there's no way to ensure
that the order of service for the request is the same as the order
of request -- consider cooperating programs, like sendmail and pine
or elm (or whatever).

The only way to resolve this is to ensure that the cooperating
programs on the same system are lockstepped: at the client.  The
only way to do this is to assert the lock locally, then remotely,
if the local assertion succeeds.

In the case of our cooperating local processes, this resolves the
race condition: depending on F_SETLK/F_SETLKW, they behave as if
the locks were local, which is what you want.


> > Specifically, you need to separate the idea of asserting a lock
> > against the local vnode, asserting the lock via NFS locking, and
> > coalescing the local lock list, after both have succeeded, or
> > reverting the local assertion, should the remote assertion fail.
> 
> Right. But my thought was that you'd be calling an NFS routine, so it
> could do the right thing.

The problem is that the local lock doesn't belong to NFS.  Even if it
did (I think this would be an error for a remotely mounted "whiteout"
in a "translucent" local FS), the problem is that in doing the local
assertion, you will intrinsically coalesce locks.

Now if the lock mode you are requesting overlaps a previous lock,
and the modes are not exactly the same, there's no way to back out
the local promotion or demotion without a coalesce.

This doesn't resolve the most complex cases you could contrive, with
multiple stacking layers that don't support a distributed coherency
protocol for locks for two or more players, but it handles the local
vs. NFS issues acceptably.
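
The ordering argued for above (assert locally, then remotely, and
coalesce only once both have succeeded) can be sketched roughly as
follows.  This is a toy model with exclusive locks only and a fixed
size list; the types, the function names, and the convention that a
NULL remote hook means "approve" are assumptions made for the example,
not existing kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCKS 16

struct range_lock { long start, end; int owner; };   /* covers [start, end) */

struct lock_list {
    struct range_lock held[MAX_LOCKS];
    size_t            count;
};

/* Hypothetical per-FS hook; NULL means the FS has no veto. */
typedef int (*remote_advlock_t)(const struct range_lock *);

/* Stand-ins for a remote host approving or rejecting a request. */
int remote_accept(const struct range_lock *rl) { (void)rl; return 0; }
int remote_reject(const struct range_lock *rl) { (void)rl; return -1; }

static int conflicts(const struct lock_list *l, const struct range_lock *rl)
{
    for (size_t i = 0; i < l->count; i++) {
        const struct range_lock *h = &l->held[i];
        if (h->owner != rl->owner && h->start < rl->end && rl->start < h->end)
            return 1;
    }
    return 0;
}

/* Merge rl with the same owner's overlapping or adjacent ranges. */
static void coalesce_into(struct lock_list *l, struct range_lock rl)
{
    size_t keep = 0;
    for (size_t i = 0; i < l->count; i++) {
        struct range_lock h = l->held[i];
        if (h.owner == rl.owner && h.start <= rl.end && rl.start <= h.end) {
            if (h.start < rl.start) rl.start = h.start;
            if (h.end > rl.end) rl.end = h.end;
        } else {
            l->held[keep++] = h;
        }
    }
    l->held[keep++] = rl;
    l->count = keep;
}

/* Returns 0 if the lock was granted, -1 if it was rejected. */
int advlock_two_phase(struct lock_list *l, struct range_lock rl,
                      remote_advlock_t remote)
{
    assert(l != NULL && rl.start < rl.end);
    if (l->count == MAX_LOCKS || conflicts(l, &rl))
        return -1;                      /* local assertion failed */
    l->held[l->count++] = rl;           /* tentative, not yet coalesced */
    if (remote != NULL && remote(&rl) != 0) {
        l->count--;                     /* revert the local assertion */
        return -1;
    }
    l->count--;                         /* both succeeded: commit by */
    coalesce_into(l, rl);               /* coalescing into the list  */
    return 0;
}
```

Keeping the tentative entry separate until the remote side answers is
what makes the revert trivial; once ranges have been merged there is
nothing left to back out to.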


> > > NetBSD actually needs this to get unionfs to work. Do you want to talk
> > > privately about it?
> > 
> > If you want.  FreeBSD needs it for unionfs and nullfs, so it's
> > something that would be worth airing.
> > 
> > I think you could say that no locking routine was an approval of
> > the upper level lock.  This lets you bail on all FS's except NFS,
> > where you have to deal with the approve/reject from the remote
> > host.  The problem with this on FreeBSD is the VFS_default stuff,
> > which puts a non-NULL interface on all FS's for all VOP's.
> 
> I'm not familiar with the VFS_default stuff. All the vop_default_desc
> routines in NetBSD point to error routines.

In FreeBSD, they now point to default routines that are *not* error
routines.  This is the problem.  I admit the change was very well
intentioned, since it made the code a hell of a lot more readable,
but choosing between readable and additional function, I take function
over form (I think the way I would have "fixed" the readability is by
making the operations that result in the descriptor set for a mounted
FS instance be both discrete, and named for their specific function).


> > Yes, this NULL is the same NULL I suggested for advisory locks,
> > above.
> 
> I'm not sure. The struct lock * is only used by layered filesystems, so
> they can keep track both of the underlying vnode lock, and if needed their
> own vnode lock. For advisory locks, would we want to keep track both of
> locks on our layer and the layer below? Don't we want either one or the
> other? i.e. layers bypass to the one below, or deal with it all
> themselves.

I think you want the lock on the intermediate layer: basically, on
every vnode that has data associated with it that is unique to a
layer.  Let's not forget, also, that you can expose a layer into
the namespace in one place, and expose it covered under another
layer, at another.  If you locked down to the backing object, then
the only issue you would be left with is one or more intermediate
backing objects.

For a layer with an intermediate backing object, I'm prepared to
declare it "special", and proxy the operation down to any inferior
backing object (e.g. a union FS that adds files from two FS's
together, rather than just directory entry lists).  I think such
layers are the exception, not the rule.


> > > > 5.	The idea of "root" vs. "non-root" mounts is inherently bad.
> > > You forgot:
> > > 
> > > 	5)	Update export lists
> > > 
> > > 		If you call the mount routine with no device name
> > > 		(args.fspec == 0) and with MNT_UPDATE, you get
> > > 		routed to the vfs_export routine
> > 
> > This must be the job of the upper level code, so that there is
> > a single control point for export information, instead of spreading
> > it throughout each FS's mount entry point.
> 
> I agree it should be detangled, but think it should remain the fs's job to
> choose to call vfs_export. Otherwise an fs can't implement its own export
> policies. :-)

I think that export policies are the realm of /etc/exports.

The problem with each FS implementing its own policy, is that this
is another place that copyinstr() gets called, when it shouldn't.


> > > I thought it was? Admittedly the only reference code I have is the ntfs
> > > code in the NetBSD kernel. But given how full of #ifdef (__FreeBSD__)'s it
> > > is, I thought it'd be an ok reference.
> > 
> > No.
> 
> We've lost the context, but what I was trying to say was that I thought
> the marking-the-vnode-as-mounted-on bit was done in the mount syscall at
> present. At least that's what
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_syscalls.c?rev=1.130
> seems to be doing.
> 
> > Basically, what you would have is the equivalent of a variable
> > length "mounted volume" table, from which mappings (and exports,
> > based on the mappings) are externalized into the namespace.
> 
> Ahh, sounds like you're talking about a new formalism..

Right.  The "covering" operation is not the same as the "marking as
covered" operation.  Both need to be at the higher level.

If you wanted to get gross, you could say that it was a volume table,
and then use POSIX namespace escapes, such as "//DISK/2/..." to
access each disk as its own "/".

This sounds gross, but if you had 4M extents on a very large disk,
it would be nearly ideal for installing software: each package would
get its own "disk", and you could share "packages" instead of "FS"'s.

For something like mobile computing, consider a package as a shared
resource.  You have your presentation package which you mount off
your local net, you fly to New York to present, and you mount the
same presentation package from the network where you are a guest.
Forget paths, installation, and all that crap.


> > Right.  It should just have a "mount" entry point, and the rest
> > of the stuff moves to higher level code, called by the mount system
> > call, and the mountroot stuff during boot, to externalize the root
> > volume at the top of the hierarchy.
> > 
> > An ideal world would mount a / that had a /dev under it, and then
> > do transparent mounts over top of that.
> 
> That would be quite a different place than we have now. ;-)

Not really.  Julian Elischer had code that mounted a /devfs under
/ automatically, before the user was ever allowed to see /.  As a
result, the FS that you were left with was indistinguishable from
what I describe.

The only real difference is that, as a translucent mount over /devfs,
the one I describe would be capable of implementing persistent changes
to the /devfs, as whiteouts.  I don't think this is really that
desirable, but some people won't accept a devfs that doesn't have
traditional persistence semantics (e.g. "chmod" vs. modifying a
well known kernel data structure as an administrative operation).

I guess the other difference is that you don't have to worry about
large minor numbers when you are bringing up a new platform via
NFS from an old platform that can't support large minors in its FS
at all.  ;-).


> > > > 	The conversion of the root device into a vnode pointer, or
> > > > 	a path to a device into a vnode pointer, is the job of upper
> > > > 	level code -- specifically, the mount system call, and the
> > > > 	common code for booting.
> > > 
> > > My one concern about this is you've assumed that the user is mounting a
> > > device onto a filesystem.
> > 
> > No.  Vnode, not bdevvp.  The bdevvp stuff is for the boot time stuff
> > in the upper level code, and only applies to the root volume.
> 
> Maybe I mis-parsed. I thought you were talking about parsing the first
> mount option (in mount /dev/disk there, the /dev/disk option) into a
> vnode. The concern below is that different fs's have different ideas as to
> what that node should be. Some want it a device node which no one else is
> using (most leaf fs's), while some others want a directory (nullfs, etc),
> some want a file or device (the HSM system I'm working on) while others
> don't care (in mount -t kernfs /kern /kern , the first kern doesn't matter
> at all). But all is well with different support routines which the
> mount_foo() routine can call.

I would resolve this by passing a standard option to the mount code
in user space.  For root mounts, a vnode is passed down.  For other
mounts, the vnode is parsed and passed if the option is specified.

I think that you will only be able to find rare examples of FS's
that don't take device names as arguments.  But for those, you
don't specify the option, and it gets "NULL", and whatever local
options you specify.

The point is that, for FS's that can be both root and sub-root,
the mount code doesn't have to make the decision, it can be punted
to higher level code, in one place, where the code can be centrally
maintained and kept from getting "stale" when things change out
from under it.


> > > Might I suggest a common library of routines which different mount
> > > routines can call? That way we'd get code sharing while letting the fs
> > > make decisions about what it expects of the input arguments.
> > 
> > This is the "footprint" problem, all over again.  Reject/accept (or 
> > "accept if no VOP") seems more elegant, and also reduces footprint.
> 
> Very true. The problem is that the current VFS system was designed as a
> black box. It gets handed all calls, and it gets to decide policy, and do
> everything on its own. We're now basically discussing ways of having the
> plethora of fs's we now have do things the same way. :-)

I don't think so.

I like to think in terms of "VFS consumer" and "VFS producer".  The
implied semantics are the province of the "VFS consumer".

A good example of this is to look at another VFS consumer, the NFS
server.  It really doesn't want implied semantics, and, in fact,
wants to have a set of semantics (server locking information) sent
in through a separate communications channel.  The way things are
right now, as a VFS consumer, the NFS server is a second class citizen.

One could imagine an AppleTalk or SMB server in the kernel, as well,
also VFS consumers.  And one could imagine doing VFS operations
against files _from within the kernel_ (say in a "quota" stacking
layer, or a resource fork/extended attributes stacking layer).  The
point is, you want to stop implying some semantics for these consumers.
Where you draw the line is where you imply semantics via call-down, or
via reject/accept.  If you don't want them implied all the time, for
all consumers, then they belong in the system call layer; otherwise,
they belong in the VFS layer doing the implementation.

There's an abstraction here: is the VFS stacking layer you are
talking about one that implements semantics?  For an ACL stacking
layer, your answer is yes.  But for an NFS server stacked on a
VFS?  Or a namespace hiding layer?



> > > > 7.	The struct nameidata (namei.h) is broken in conception.
> > 
> > Can you push a Unicode name down from an appropriate system call?
> > 
> > I don't see any way to deal with an NT FS for characters outside
> > ISO 8859-1, otherwise.  8-(.
> 
> Hmmm. I think the real problem is that the kernel(s) is(are) not at all
> designed well for different languages.

Well, if you make the path component descriptor into an opaque object,
you can pass it down to the point you get to someone who understands
the encapsulated data.  The interpretation is a rendezvous -- an
agreement -- between the source providing the data, and the target
interpreting it.


> > > > 9.	The implementation of namei() is POSIX non-compliant
> > > > 
> > > > 	The implementation of namei() is by means of coroutine
> > > > 	"recursion"; this is similar to the only recursion you can
> > > > 	achieve in FORTRAN.

[ ... ]

> > > 
> > > I'm sorry. This point didn't parse. Could you give an example?
> > > 
> > > I don't see how the namei recursion method prevents catching // as a
> > > namespace escape.
> > 
> > 
> > //apple-resource-fork/intermediate_dir/some_other_dir/file_with_fork
> > 
> > You can't inherit the fact that you are looking at the resource fork
> > in the terminal component, ONLY.
> 
> Yep, there's no easy way to do that now.. The one thing which comes to
> mind is to have lookup() rip out the first component and save it in the
> namei struct.
> 
> Though the devil's advocate in me points out that this difficulty is not
> inherent in the recursion setup, but in how lookup() is designed. :-)

If it were a parameter, "namespace", to the function, it'd work, too.

The problem is that you really want to install "namespace handlers"
for these escapes, probably on a per FS basis.  The only way I can
see this working is to place the namespace into the path descriptor
_separately_ from the path components (however they get parsed out by
that namespace).

This shows the evils of "copyinstr()" in the full light of day:  I can't
have a "//unicode/..." name space escape, unless I assume ISO-8859-1,
like the NTFS currently does, or unless I engage in some unnatural act
with my "..." following the escape (e.g. UTF-8).
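
As a rough sketch of what "namespace carried separately from the
components" could look like: the descriptor below keeps each component
as a length-counted opaque byte range, so a UCS-2 name with embedded
zero bytes survives the trip down, which a NUL-terminated copyinstr()
string cannot manage.  All names here (struct path_desc, pd_enc, and
so on) are hypothetical and not taken from namei.h.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical encodings a namespace handler might agree on. */
enum pd_encoding { PD_ENC_BYTES, PD_ENC_UTF8, PD_ENC_UCS2 };

/* One opaque component: the bytes are not interpreted in transit. */
struct pd_component {
	const void	*pc_data;
	size_t		 pc_len;	/* length in bytes, not characters */
};

/* Path descriptor: namespace and encoding travel beside the components. */
struct path_desc {
	const char		*pd_namespace;	/* e.g. "unicode"; NULL if none */
	enum pd_encoding	 pd_enc;
	size_t			 pd_ncomp;
	struct pd_component	 pd_comp[8];	/* fixed size keeps the sketch short */
};

/* Only the layer that understands the encoding compares names. */
int
pd_component_equal(const struct path_desc *pd, size_t i,
    const void *name, size_t len)
{
	if (i >= pd->pd_ncomp || pd->pd_comp[i].pc_len != len)
		return (0);
	return (memcmp(pd->pd_comp[i].pc_data, name, len) == 0);
}
```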


> > > > 	Quotas should be an abstract stacking layer that can be
> > > > 	applied to any FS, instead of an FFS specific monstrosity.
> > > 
> > > It should certainly be possible to add a quota layer on top of any leaf
> > > fs. That way you could de-couple quotas. :-)
> > 
> > Yes, assuming stacking works in the first place...
> 
> Except for a minor buglet with device nodes, stacking works in NetBSD at
> present. :-)

Have you tried Heidemann's student's stacking layers?  There is one
encryption, and one per-file compression with namespace hiding, that
I think it would be hard pressed to keep up with.  But I'll give it
the benefit of the doubt.  8-).


> > > One other suggestion I've heard is to split the 64 bits we have for time
> > > into 44 bits for seconds, and 20 bits for microseconds. That's more than
> > > enough modification resolution, and also pushes things to past year
> > > 500,000 AD. Versioning the inode would cover this easily.
> > 
> > Ugh.  But possible...
> 
> I agree it's ugly, but it has the advantage that it doesn't grow the
> on-disk inode. A lot of folks have designs on the remaining 64 bits free.
> :-)

Well, so long as we can resolve the issue for a long, long time;
I plan on being around to have to put up with the bugs, if I can
wrangle it... 8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Tue Aug 17  7:18: 2 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from sv01.cet.co.jp (sv01.cet.co.jp [210.171.56.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 0C5D9156C4; Tue, 17 Aug 1999 07:17:54 -0700 (PDT)
	(envelope-from michaelh@cet.co.jp)
Received: from localhost (michaelh@localhost)
	by sv01.cet.co.jp (8.9.3/8.9.3) with SMTP id OAA17678;
	Tue, 17 Aug 1999 14:18:07 GMT
Date: Tue, 17 Aug 1999 23:18:06 +0900 (JST)
From: Michael Hancock <michaelh@cet.co.jp>
To: Terry Lambert <tlambert@primenet.com>
Cc: wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com,
	Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
In-Reply-To: <199908170231.TAA08526@usr02.primenet.com>
Message-ID: <Pine.BSF.3.95LJ1.1b3.990817224323.17508B-100000@sv01.cet.co.jp>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> > I'm not familiar with the VFS_default stuff. All the vop_default_desc
> > routines in NetBSD point to error routines.
> 
> In FreeBSD, they now point to default routines that are *not* error
> routines.  This is the problem.  I admit the change was very well
> intentioned, since it made the code a hell of a lot more readable,
> but choosing between readable and additional function, I take function
> over form (I think the way I would have "fixed" the readability is by
> making the operations that result in the descriptor set for a mounted
> FS instance be both discrete, and named for their specific function).

As I recall, most of FBSD's default routines are also error routines; if
the exceptions were a problem it would be trivial to fix.
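
For readers who have not looked at the two trees, a toy sketch of the
mechanism being argued about: an operations vector where an empty slot
falls through to a per-FS default.  Whether that default is an error
routine or a routine that implements function is exactly the policy
question.  The types and names below are made up for the example and
are not the real vnodeop_desc machinery.

```c
#include <errno.h>
#include <stddef.h>

typedef int (*vop_fn)(void *args);

enum { OP_LOOKUP, OP_READ, OP_COUNT };	/* toy operation set */

struct opvec {
	vop_fn	 ov_ops[OP_COUNT];	/* NULL means "not implemented" */
	vop_fn	 ov_default;		/* consulted for NULL slots */
};

/* Default in the "error routine" style: say so and do nothing. */
int
default_eopnotsupp(void *args)
{
	(void)args;
	return (EOPNOTSUPP);
}

/* A toy implementation so the table has one real entry. */
int
toy_read(void *args)
{
	(void)args;
	return (0);
}

/* Dispatch: use the FS's own entry if present, else the default. */
int
vop_call(const struct opvec *ov, int op, void *args)
{
	vop_fn fn;

	if (op < 0 || op >= OP_COUNT)
		return (EINVAL);
	fn = ov->ov_ops[op] != NULL ? ov->ov_ops[op] : ov->ov_default;
	return (fn != NULL ? fn(args) : EOPNOTSUPP);
}
```

Swapping ov_default for a routine that does real work changes what an
unimplemented operation means for every FS that uses the table.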

I think fixing resource allocation/deallocation for things like vnodes,
cnbufs, and locks is a higher priority for now.  There are examples such
as in detached threading where it might make sense for the detached child
to be responsible for releasing resources allocated to it by the parent,
but in stacking this model is very messy and unnatural.  This is why the
purpose of VOP_ABORTOP appears to be to release cnbufs but this is really
just an ugly side effect.  With stacking the code that allocates should be
the code that deallocates.  Substitute "layer" for "code" to be more
correct.

I fixed a lot of the vnode and locking cases; unfortunately, the ones that
remain are probably ugly cases where you have to reacquire locks that had
to be unlocked somewhere in the executing layer.  See VOP_RENAME for an
example.  Compare the number of WILLRELEs in vnode_if.src in FreeBSD and
NetBSD; ideally there'd be none.

Regards,


Mike Hancock






From owner-freebsd-fs  Tue Aug 17  9:20:30 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17])
	by hub.freebsd.org (Postfix) with ESMTP
	id 9B5BE1503B; Tue, 17 Aug 1999 09:20:27 -0700 (PDT)
	(envelope-from wrstuden@marcy.nas.nasa.gov)
Received: from localhost (wrstuden@localhost)
	by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id JAA08787;
	Tue, 17 Aug 1999 09:20:29 -0700 (PDT)
Date: Tue, 17 Aug 1999 09:20:29 -0700 (PDT)
From: Bill Studenmund <wrstuden@nas.nasa.gov>
To: Michael Hancock <michaelh@cet.co.jp>
Cc: Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
In-Reply-To: <Pine.BSF.3.95LJ1.1b3.990817224323.17508B-100000@sv01.cet.co.jp>
Message-ID: <Pine.SOL.3.96.990817091538.6014B-100000@marcy.nas.nasa.gov>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Tue, 17 Aug 1999, Michael Hancock wrote:

> As I recall most of FBSD's default routines are also error routines, if
> the exceptions were a problem it would be trivial to fix.
> 
> I think fixing resource allocation/deallocation for things like vnodes,
> cnbufs, and locks is a higher priority for now.  There are examples such
> as in detached threading where it might make sense for the detached child
> to be responsible for releasing resources allocated to it by the parent,
> but in stacking this model is very messy and unnatural.  This is why the
> purpose of VOP_ABORTOP appears to be to release cnbufs but this is really
> just an ugly side effect.  With stacking the code that allocates should be
> the code that deallocates. Substitute, "code"  with "layer" to be more
> correct. 
> 
> I fixed a lot of the vnode and locking cases, unfortunately the ones that
> remain are probably ugly cases where you have to reacquire locks that had
> to be unlocked somewhere in the executing layer.  See VOP_RENAME for an
> example.  Compare the number of WILLRELEs in vnode_if.src in FreeBSD and
> NetBSD, ideally there'd be none.

I've compared the two, and making the NetBSD number match the FreeBSD
number is one of my goals. :-)

Any suggestions, or just plod&fix?

Take care,

Bill





From owner-freebsd-fs  Tue Aug 17  9:59:43 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from sv01.cet.co.jp (sv01.cet.co.jp [210.171.56.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 998E414F32; Tue, 17 Aug 1999 09:59:33 -0700 (PDT)
	(envelope-from michaelh@cet.co.jp)
Received: from localhost (michaelh@localhost)
	by sv01.cet.co.jp (8.9.3/8.9.3) with SMTP id RAA18271;
	Tue, 17 Aug 1999 17:00:02 GMT
Date: Wed, 18 Aug 1999 02:00:02 +0900 (JST)
From: Michael Hancock <michaelh@cet.co.jp>
To: Bill Studenmund <wrstuden@nas.nasa.gov>
Cc: Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
In-Reply-To: <Pine.SOL.3.96.990817091538.6014B-100000@marcy.nas.nasa.gov>
Message-ID: <Pine.BSF.3.95LJ1.1b3.990818014355.18030A-100000@sv01.cet.co.jp>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Tue, 17 Aug 1999, Bill Studenmund wrote:

> I've compared the two, and making the NetBSD number match the FreeBSD
> number is one of my goals. :-)
> 
> Any suggestions, or just plod&fix?

It can be very cumbersome tracking down references being bumped by
vref/VREF and other operations.

Among the uncompleted operations are VOPs that pre-release the returned
vpp to the caller.  I think in VOP_MKNOD this was done as a convenience
and you might have to add code to handle device vp aliases correctly.

Just remember the rule, the allocating layer must be the layer that
deallocates.
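
A toy illustration of the rule, in plain user-space C rather than
kernel code: the upper layer allocates the path buffer, lends it to the
lower layer, and frees it itself on both the success and the error
path.  The function names and the live_bufs counter are invented for
the example.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Toy "lower layer": uses the caller's buffer but never frees it. */
int
lower_create(const char *pnbuf)
{
	return (pnbuf[0] == '\0' ? EINVAL : 0);
}

/* Counts live buffers so a leak or double free shows up in a test. */
int live_bufs;

/*
 * Toy "upper layer": it allocates the path buffer, so it also frees it,
 * on the success path and the error path alike.
 */
int
upper_create(const char *name)
{
	size_t len = strlen(name) + 1;
	char *pnbuf;
	int error;

	pnbuf = malloc(len);
	if (pnbuf == NULL)
		return (ENOMEM);
	live_bufs++;
	memcpy(pnbuf, name, len);

	error = lower_create(pnbuf);	/* lower layer only borrows pnbuf */

	free(pnbuf);			/* released by the layer that allocated */
	live_bufs--;
	return (error);
}
```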

Regards,


Mike





From owner-freebsd-fs  Tue Aug 17 13:45:48 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17])
	by hub.freebsd.org (Postfix) with ESMTP
	id CDD94157FD; Tue, 17 Aug 1999 13:44:24 -0700 (PDT)
	(envelope-from wrstuden@marcy.nas.nasa.gov)
Received: from localhost (wrstuden@localhost)
	by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id NAA27035;
	Tue, 17 Aug 1999 13:44:34 -0700 (PDT)
Date: Tue, 17 Aug 1999 13:44:34 -0700 (PDT)
From: Bill Studenmund <wrstuden@nas.nasa.gov>
Reply-To: Bill Studenmund <wrstuden@nas.nasa.gov>
To: Terry Lambert <tlambert@primenet.com>
Cc: Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
In-Reply-To: <199908170231.TAA08526@usr02.primenet.com>
Message-ID: <Pine.SOL.3.96.990817092121.6014C-100000@marcy.nas.nasa.gov>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Tue, 17 Aug 1999, Terry Lambert wrote:

> > > > > 2.	Advisory locks are hung off private backing objects.
> > I'm not sure. The struct lock * is only used by layered filesystems, so
> > they can keep track both of the underlying vnode lock, and if needed their
> > own vnode lock. For advisory locks, would we want to keep track both of
> > locks on our layer and the layer below? Don't we want either one or the
> > other? i.e. layers bypass to the one below, or deal with it all
> > themselves.
> 
> I think you want the lock on the intermediate layer: basically, on
> every vnode that has data associated with it that is unique to a
> layer.  Let's not forget, also, that you can expose a layer into
> the namespace in one place, and expose it covered under another
> layer, at another.  If you locked down to the backing object, then
> the only issue you would be left with is one or more intermediate
> backing objects.

Right. That exported struct lock * makes locking down to the lowest-level
file easy - you just feed it to the lock manager, and you're locking the
same lock the lowest level fs uses. You then lock all vnodes stacked over
this one at the same time. Otherwise, you just call VOP_LOCK below and
then lock yourself.
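
Roughly, and with a hold counter standing in for the real lock manager,
the two cases look like this.  The struct and field names are toy
stand-ins chosen for the sketch, not the actual NetBSD definitions.

```c
#include <stddef.h>

/* Toy lock: a hold count stands in for the real lock manager. */
struct toy_lock {
	int	tl_holds;
};

struct toy_vnode {
	struct toy_lock	 tv_lock;	/* storage, used by the lowest layer */
	struct toy_lock	*tv_vnlock;	/* exported lock; NULL = not shared */
	struct toy_vnode *tv_lower;	/* vnode stacked under this one */
};

/* Leaf vnode exports the address of its own lock. */
void
toy_leaf_init(struct toy_vnode *vp)
{
	vp->tv_lock.tl_holds = 0;
	vp->tv_vnlock = &vp->tv_lock;
	vp->tv_lower = NULL;
}

/* A simple layer reuses whatever lock the vnode below exports. */
void
toy_layer_init(struct toy_vnode *vp, struct toy_vnode *lower)
{
	vp->tv_lock.tl_holds = 0;
	vp->tv_vnlock = lower->tv_vnlock;
	vp->tv_lower = lower;
}

/*
 * Lock a vnode.  With an exported lock, one acquisition covers the
 * whole stack; without one, lock the layer below and then ourselves.
 */
void
toy_vn_lock(struct toy_vnode *vp)
{
	if (vp->tv_vnlock != NULL) {
		vp->tv_vnlock->tl_holds++;
		return;
	}
	if (vp->tv_lower != NULL)
		toy_vn_lock(vp->tv_lower);
	vp->tv_lock.tl_holds++;
}
```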

> For a layer with an intermediate backing object, I'm prepared to
> declare it "special", and proxy the operation down to any inferior
> backing object (e.g. a union FS that adds files from two FS's
> together, rather than just directory entry lists).  I think such
> layers are the exception, not the rule.

Actually isn't the only problem when you have vnode fan-in (union FS)? 
i.e.  a plain compressing layer should not introduce vnode locking
problems. 

> I think that export policies are the realm of /etc/exports.
> 
> The problem with each FS implementing its own policy, is that this
> is another place that copyinstr() gets called, when it shouldn't.

Well, my thought was that, like with current code, most every fs would
just call vfs_export() when it's presented an export operation. But by
retaining the option of having the fs do its own thing, we can support
different export semantics if desired.

> Right.  The "covering" operation is not the same as the "marking as
> covered" operation.  Both need to be at the higher level.
>
> Not really.  Julian Elischer had code that mounted a /devfs under
> / automatically, before the user was ever allowed to see /.  As a
> result, the FS that you were left with was indistinguishable from
> what I describe.
> 
> The only real difference is that, as a translucent mount over /devfs,
> the one I describe would be capable of implementing persistent changes
> to the /devfs, as whiteouts.  I don't think this is really that
> desirable, but some people won't accept a devfs that doesn't have
> traditional persistence semantics (e.g. "chmod" vs. modifying a
> well known kernel data structure as an administrative operation).

That wouldn't be hard to do. :-)

> I guess the other difference is that you don't have to worry about
> large minor numbers when you are bringing up a new platform via
> NFS from an old platform that can't support large minors in its FS
> at all.  ;-).

True. :-)

> I would resolve this by passing a standard option to the mount code
> in user space.  For root mounts, a vnode is passed down.  For other
> mounts, the vnode is parsed and passed if the option is specified.

Or maybe add a field to vfsops. This info says what the mount call will
expect (I want a block device, a regular file, a directory, etc), so it
fits. :-)

Also, if we leave it to userland, what happens if someone writes a
program which calls sys_mount with something the fs doesn't expect. :-)
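
Something along these lines, as a sketch only.  The enum values, the
toy vnode types, and the choice of errno values are assumptions made
for illustration, not an existing vfsops field.

```c
#include <errno.h>

/* Hypothetical: what kind of object a filesystem wants to mount from. */
enum mnt_src {
	MNT_SRC_NONE,	/* ignored, e.g. kernfs */
	MNT_SRC_BLKDEV,	/* most leaf filesystems */
	MNT_SRC_DIR,	/* layered filesystems such as nullfs */
	MNT_SRC_FILE_OR_DEV
};

/* Stand-in for the type of the vnode the mount argument resolved to. */
enum toy_vtype { TOY_VNON, TOY_VREG, TOY_VDIR, TOY_VBLK };

/* Central check, done once above the per-FS mount entry point. */
int
mount_src_check(enum mnt_src want, enum toy_vtype got)
{
	switch (want) {
	case MNT_SRC_NONE:
		return (0);
	case MNT_SRC_BLKDEV:
		return (got == TOY_VBLK ? 0 : ENOTBLK);
	case MNT_SRC_DIR:
		return (got == TOY_VDIR ? 0 : ENOTDIR);
	case MNT_SRC_FILE_OR_DEV:
		return (got == TOY_VREG || got == TOY_VBLK ? 0 : EINVAL);
	}
	return (EINVAL);
}
```

With the check done in the kernel, a program that hands sys_mount the
wrong kind of object gets an error before the fs ever sees it.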

> I think that you will only be able to find rare examples of FS's
> that don't take device names as arguments.  But for those, you
> don't specify the option, and it gets "NULL", and whatever local
> options you specify.

I agree I can't see a leaf fs not taking a device node. But layered fs's
certainly will want something else. :-)

> The point is that, for FS's that can be both root and sub-root,
> the mount code doesn't have to make the decision, it can be punted
> to higher level code, in one place, where the code can be centrally
> maintained and kept from getting "stale" when things change out
> from under it.

True.

And with good comments we can catch the times when the centrally located
code changes & breaks an assumption made by the fs. :-)

> > Except for a minor buglet with device nodes, stacking works in NetBSD at
> > present. :-)
> 
> Have you tried Heidemann's student's stacking layers?  There is one
> encryption, and one per-file compression with namespace hiding, that
> I think it would be hard pressed to keep up with.  But I'll give it
> the benefit of the doubt.  8-).

Nope. The problem is that while stacking (null, umap, and overlay fs's)
works, we don't have the coherency issues worked out so that upper layers
can cache data, i.e. so that the lower fs knows it has to ask the upper
layers to give pages back. :-) But multiple ls -lR's work fine. :-)

> > I agree it's ugly, but it has the advantage that it doesn't grow the
> on-disk inode. A lot of folks have designs on the remaining 64 bits free.
> > :-)
> 
> Well, so long as we can resolve the issue for a long, long time;
> I plan on being around to have to put up with the bugs, if I can
> wrangle it... 8-).

:-)

I bet by then (559447 AD) we won't be using ffs, so the problem will be
moot. :-)
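
To make the arithmetic concrete, here is a minimal sketch of the 44/20
split.  The function names are invented, and the year calculation
assumes a 365.2425-day mean year counted from 1970; with that
assumption it comes out at 559444, within a few years of the figure
above (the exact number depends on the year length used).  Note that
20 bits hold values up to 1048575, so 0..999999 microseconds fit.

```c
#include <stdint.h>

#define TS_USEC_BITS	20
#define TS_USEC_MASK	((UINT64_C(1) << TS_USEC_BITS) - 1)

/* Pack seconds (low 44 bits used) and microseconds (0..999999). */
uint64_t
ts_pack(uint64_t sec, uint32_t usec)
{
	return ((sec << TS_USEC_BITS) | (usec & TS_USEC_MASK));
}

uint64_t
ts_sec(uint64_t ts)
{
	return (ts >> TS_USEC_BITS);
}

uint32_t
ts_usec(uint64_t ts)
{
	return ((uint32_t)(ts & TS_USEC_MASK));
}

/* Last representable calendar year, using a 365.2425-day mean year. */
uint64_t
ts_last_year(void)
{
	const uint64_t max_sec = (UINT64_C(1) << 44) - 1;
	const uint64_t sec_per_year = 31556952;	/* 365.2425 * 86400 */

	return (1970 + max_sec / sec_per_year);
}
```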

Take care,

Bill





From owner-freebsd-fs  Tue Aug 17 14: 6:21 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from sv01.cet.co.jp (sv01.cet.co.jp [210.171.56.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id AD45B157DF; Tue, 17 Aug 1999 14:06:01 -0700 (PDT)
	(envelope-from michaelh@cet.co.jp)
Received: from localhost (michaelh@localhost)
	by sv01.cet.co.jp (8.9.3/8.9.3) with SMTP id VAA18756;
	Tue, 17 Aug 1999 21:05:08 GMT
Date: Wed, 18 Aug 1999 06:05:08 +0900 (JST)
From: Michael Hancock <michaelh@cet.co.jp>
To: Bill Studenmund <wrstuden@nas.nasa.gov>
Cc: Terry Lambert <tlambert@primenet.com>, Hackers@FreeBSD.ORG,
	fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
In-Reply-To: <Pine.SOL.3.96.990817092121.6014C-100000@marcy.nas.nasa.gov>
Message-ID: <Pine.BSF.3.95LJ1.1b3.990818055720.18717A-100000@sv01.cet.co.jp>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> > Have you tried Heidemann's student's stacking layers?  There is one
> > encryption, and one per-file compression with namespace hiding, that
> > I think it would be hard pressed to keep up with.  But I'll give it
> > the benefit of the doubt.  8-).
> 
> Nope. The problem is that while stacking (null, umap, and overlay fs's)
> works, we don't have the coherency issues worked out so that upper layers
> can cache data, i.e. so that the lower fs knows it has to ask the upper
> layers to give pages back. :-) But multiple ls -lR's work fine. :-)

Interesting, have you read the Heidemann paper that outlines a solution
that uses a cache manager?

You can probably find it somewhere here,
http://www.isi.edu/~johnh/SOFTWARE/UCLA_STACKING/







From owner-freebsd-fs  Tue Aug 17 14:12:15 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17])
	by hub.freebsd.org (Postfix) with ESMTP
	id 736DA157F4; Tue, 17 Aug 1999 14:12:11 -0700 (PDT)
	(envelope-from wrstuden@marcy.nas.nasa.gov)
Received: from localhost (wrstuden@localhost)
	by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id OAA29074;
	Tue, 17 Aug 1999 14:12:22 -0700 (PDT)
Date: Tue, 17 Aug 1999 14:12:22 -0700 (PDT)
From: Bill Studenmund <wrstuden@nas.nasa.gov>
To: Michael Hancock <michaelh@cet.co.jp>
Cc: Terry Lambert <tlambert@primenet.com>, Hackers@FreeBSD.ORG,
	fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
In-Reply-To: <Pine.BSF.3.95LJ1.1b3.990818055720.18717A-100000@sv01.cet.co.jp>
Message-ID: <Pine.SOL.3.96.990817141101.22897B-100000@marcy.nas.nasa.gov>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Wed, 18 Aug 1999, Michael Hancock wrote:

> Interesting, have you read the Heidemann paper that outlines a solution
> that uses a cache manager?
> 
> You can probably find it somewhere here,
> http://www.isi.edu/~johnh/SOFTWARE/UCLA_STACKING/

Nope. I've read his dissertation, and his discussion of the lock
management inspired the struct lock * work I did for NetBSD (we use the
address of the lock, not the vnode, but other than that it's the same).

Thanks for the ref!

Take care,

Bill





From owner-freebsd-fs  Tue Aug 17 14:17:25 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from sv01.cet.co.jp (sv01.cet.co.jp [210.171.56.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 1E41115818; Tue, 17 Aug 1999 14:17:12 -0700 (PDT)
	(envelope-from michaelh@cet.co.jp)
Received: from localhost (michaelh@localhost)
	by sv01.cet.co.jp (8.9.3/8.9.3) with SMTP id VAA18787;
	Tue, 17 Aug 1999 21:14:47 GMT
Date: Wed, 18 Aug 1999 06:14:47 +0900 (JST)
From: Michael Hancock <michaelh@cet.co.jp>
To: Bill Studenmund <wrstuden@nas.nasa.gov>
Cc: Terry Lambert <tlambert@primenet.com>, Hackers@FreeBSD.ORG,
	fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
In-Reply-To: <Pine.BSF.3.95LJ1.1b3.990818055720.18717A-100000@sv01.cet.co.jp>
Message-ID: <Pine.BSF.3.95LJ1.1b3.990818060939.18769A-100000@sv01.cet.co.jp>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

I forgot I had some old diffs that may be of help,
http://www.freebsd.org/~mch/vop1a.diff

You'll notice that just about everywhere that I moved vput() to the
appropriate layer a path component buffer was also freed in the wrong
place.  John Dyson put these buffers in zones, so the free routine probably
looks very different from the one in NetBSD.

zfree(namei_zone, cnp->cn_pnbuf);
-       vput(dvp);

Regards,


Mike





From owner-freebsd-fs  Tue Aug 17 14:49:45 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from gatekeeper.tsc.tdk.com (gatekeeper.tsc.tdk.com [207.113.159.21])
	by hub.freebsd.org (Postfix) with ESMTP
	id 17C23157F3; Tue, 17 Aug 1999 14:49:39 -0700 (PDT)
	(envelope-from gdonl@tsc.tdk.com)
Received: from sunrise.gv.tsc.tdk.com (root@sunrise.gv.tsc.tdk.com [192.168.241.191])
	by gatekeeper.tsc.tdk.com (8.8.8/8.8.8) with ESMTP id OAA15932;
	Tue, 17 Aug 1999 14:48:45 -0700 (PDT)
	(envelope-from gdonl@tsc.tdk.com)
Received: from salsa.gv.tsc.tdk.com (salsa.gv.tsc.tdk.com [192.168.241.194])
	by sunrise.gv.tsc.tdk.com (8.8.5/8.8.5) with ESMTP id OAA21621;
	Tue, 17 Aug 1999 14:48:44 -0700 (PDT)
Received: (from gdonl@localhost)
	by salsa.gv.tsc.tdk.com (8.8.5/8.8.5) id OAA02073;
	Tue, 17 Aug 1999 14:48:39 -0700 (PDT)
From: Don Lewis <Don.Lewis@tsc.tdk.com>
Message-Id: <199908172148.OAA02073@salsa.gv.tsc.tdk.com>
Date: Tue, 17 Aug 1999 14:48:39 -0700
In-Reply-To: Terry Lambert <tlambert@primenet.com>
       "Re: BSD XFS Port & BSD VFS Rewrite" (Aug 16,  9:18pm)
X-Mailer: Mail User's Shell (7.2.6 alpha(3) 7/19/95)
To: Terry Lambert <tlambert@primenet.com>, wrstuden@nas.nasa.gov
Subject: Re: BSD XFS Port & BSD VFS Rewrite
Cc: Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG,
	fs@FreeBSD.ORG
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Aug 16,  9:18pm, Terry Lambert wrote:
} Subject: Re: BSD XFS Port & BSD VFS Rewrite

} > I don't see how the namei recursion method prevents catching // as a
} > namespace escape.
} 
} 
} //apple-resource-fork/intermediate_dir/some_other_dir/file_with_fork
} 
} You can't inherit the fact that you are looking at the resource fork
} in the terminal component, ONLY.

I don't think this is a good example.  How would you access the resource
fork of a file relative to the current directory?  IMHO, the necessary
goop needs to go at the end of the path name.




From owner-freebsd-fs  Tue Aug 17 15:46:54 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from xylan.com (postal.xylan.com [208.8.0.248])
	by hub.freebsd.org (Postfix) with ESMTP
	id 0B11F14CC7; Tue, 17 Aug 1999 15:46:48 -0700 (PDT)
	(envelope-from wes@softweyr.com)
Received: from mailhub.xylan.com by xylan.com (8.8.7/SMI-SVR4 (xylan-mgw 2.2 [OUT]))
	id PAA13293; Tue, 17 Aug 1999 15:44:35 -0700 (PDT)
Received: from utah.XYLAN.COM by mailhub.xylan.com (SMI-8.6/SMI-SVR4 (mailhub 2.1 [HUB]))
	id PAA13692; Tue, 17 Aug 1999 15:38:34 -0700
Received: from softweyr.com by utah.XYLAN.COM (SMI-8.6/SMI-SVR4 (xylan utah [SPOOL]))
	id QAA27793; Tue, 17 Aug 1999 16:44:30 -0600
Message-ID: <37B9E5CE.8E7B8AFD@softweyr.com>
Date: Tue, 17 Aug 1999 16:44:30 -0600
From: Wes Peters <wes@softweyr.com>
Organization: Softweyr LLC
X-Mailer: Mozilla 4.5 [en] (X11; U; FreeBSD 3.1-RELEASE i386)
X-Accept-Language: en
MIME-Version: 1.0
To: Don Lewis <Don.Lewis@tsc.tdk.com>
Cc: Terry Lambert <tlambert@primenet.com>, wrstuden@nas.nasa.gov,
	Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
References: <199908172148.OAA02073@salsa.gv.tsc.tdk.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Don Lewis wrote:
> 
> On Aug 16,  9:18pm, Terry Lambert wrote:
> } Subject: Re: BSD XFS Port & BSD VFS Rewrite
> 
> } > I don't see how the namei recursion method prevents catching // as a
> } > namespace escape.
> }
> }
> } //apple-resource-fork/intermediate_dir/some_other_dir/file_with_fork
> }
> } You can't inherit the fact that you are looking at the resource fork
> } in the terminal component, ONLY.
> 
> I don't think this is a good example.  How would you access the resource
> fork of a file relative to the current directory?  IMHO, the necessary
> goop needs to go at the end of the path name.

Pick a separator character that nobody in their right mind would use in
a file path.  "\" strikes me as a good candidate.  ;^)

-- 
            "Where am I, and what am I doing in this handbasket?"

Wes Peters                                                         Softweyr LLC
http://softweyr.com/                                           wes@softweyr.com




From owner-freebsd-fs  Wed Aug 18  5:56:24 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from critter.freebsd.dk (wandering-wizard.cybercity.dk [212.242.41.238])
	by hub.freebsd.org (Postfix) with ESMTP
	id 7B30C14E6B; Wed, 18 Aug 1999 05:56:17 -0700 (PDT)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id JAA00832;
	Wed, 18 Aug 1999 09:32:52 +0200 (CEST)
	(envelope-from phk@critter.freebsd.dk)
To: Bill Studenmund <wrstuden@nas.nasa.gov>
Cc: Terry Lambert <tlambert@primenet.com>,
	Alton Matthew <Matthew.Alton@anheuser-busch.com>,
	Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite 
In-reply-to: Your message of "Mon, 16 Aug 1999 13:48:16 PDT."
             <Pine.SOL.3.96.990816105106.27345H-100000@marcy.nas.nasa.gov> 
Date: Wed, 18 Aug 1999 09:32:52 +0200
Message-ID: <830.934961572@critter.freebsd.dk>
From: Poul-Henning Kamp <phk@critter.freebsd.dk>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

In message <Pine.SOL.3.96.990816105106.27345H-100000@marcy.nas.nasa.gov>, Bill 
Studenmund writes:
>On Sat, 14 Aug 1999, Terry Lambert wrote:

>> > I am currently conducting a thorough study of the VFS subsystem
>> > in preparation for an all-out effort to port SGI's XFS filesystem to
>> > FreeBSD 4.x at such time as SGI gives up the code.  Matt Dillon
>> > has written in -hackers that the VFS subsystem is presently not
>> > well understood by any of the active kernel code contributors and
>> > that it will be rewritten later this year.  This is obviously of great
>> > concern to me in this port.
>> 
>> It is of great concern to me that a rewrite, apparently because of
>> non-understanding, is taking place at all.
>
>That concerns me too. Many aspects of the 4.4 vnode interface were there  
>for specific reasons. Even if they were hack solutions, to re-write them  
>because of a lack of understanding is dangerous as the new code will
>likely run into the same problems as before. :-)

Matt doesn't represent the FreeBSD project, and even if he rewrites
the VFS subsystem so he can understand it, his rewrite would face
considerable resistance on its way into FreeBSD.  I don't think
there is reason to rewrite it, but there certainly are areas
that need fixing.

>> 	The use of the "vfs_default" to make unimplemented VOP's
>> 	fall through to code which implements function, while well
>> 	intentioned, is misguided.

I beg to differ.  The only difference is that we pass through
multiple layers before we hit the bottom of the stack.  There is
no loss of functionality but significant gain of clarity and
modularity.

Adding a new VOP entails the same thing as it has always done.

>> 3.	The filesystem itself is broken for Y2038
>> 
>> 	The space which was historically reserved for the Y2038 fix
>> 	(a 64 bit time_t) was absconded with for subsecond resolution.
>> 
>> 	This change should be reverted, and fsck modified to re-zero
>> 	the values, given a specific argument.

That would break make(1) on contemporary machines.

>One other suggestion I've heard is to split the 64 bits we have for time
>into 44 bits for seconds, and 20 bits for microseconds. That's more than
>enough modification resolution, and also pushes things to past year
>500,000 AD. Versioning the inode would cover this easily.

This would be misguided and, given the current speed of evolution,
would lead to other problems long before 2038.

Both struct timespec and struct timeval are major mistakes, they
make arithmetic on timestamps an expensive operation.  Timestamps
should be stored as integers using a fixed-point notation, for
instance 64bits with 32bit fractional seconds (the NTP timestamp),
or in the future 128/48.

Extending from 64 to 128bits would be a cheap shift and increased
precision and range could go hand in hand.
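
A minimal sketch of the 64-bit, 32.32 fixed-point idea in C, assuming
the NTP-style layout described above.  The names fxtime_t,
fx_from_timespec, fx_diff and ts_diff are hypothetical and are not
FreeBSD interfaces; seconds beyond 32 bits are simply truncated here:

```c
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical 32.32 fixed-point timestamp: the high 32 bits hold whole
 * seconds, the low 32 bits hold a binary fraction of a second.
 */
typedef uint64_t fxtime_t;

/* Convert a struct timespec to 32.32, truncating the fraction. */
static fxtime_t
fx_from_timespec(const struct timespec *ts)
{
	uint64_t frac = ((uint64_t)ts->tv_nsec << 32) / 1000000000ULL;

	return (((uint64_t)ts->tv_sec << 32) | frac);
}

/* On the fixed-point form, a difference is one integer subtraction. */
static int64_t
fx_diff(fxtime_t a, fxtime_t b)
{
	return ((int64_t)(a - b));
}

/* The same difference on struct timespec needs a borrow step. */
static struct timespec
ts_diff(struct timespec a, struct timespec b)
{
	struct timespec r;

	r.tv_sec = a.tv_sec - b.tv_sec;
	r.tv_nsec = a.tv_nsec - b.tv_nsec;
	if (r.tv_nsec < 0) {
		r.tv_sec--;
		r.tv_nsec += 1000000000L;
	}
	return (r);
}
```

Comparison works the same way: two fxtime_t values compare as plain
integers, while two timespecs need a field-by-field test.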

If we don't want to extend the size of the timestamps before 2038,
(and we should not only look at filesystems here), then the correct
fix will be to move the epoch and use the inode version to mark
this fact.
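
One way the moved-epoch idea could look, as a sketch only.  The struct
dinode_sketch, its fields, and the choice of 2^31 seconds as the new
epoch offset are all assumptions made for illustration; nothing here
reflects the real on-disk inode layout:

```c
#include <stdint.h>

/*
 * Hypothetical on-disk inode fragment: a signed 32-bit seconds field
 * plus a version number that says which epoch the field counts from.
 */
struct dinode_sketch {
	uint32_t di_version;	/* 0 = epoch 1970, 1 = moved epoch */
	int32_t  di_mtime;	/* seconds relative to that epoch */
};

/* Assumed offset of the moved epoch: 2^31 seconds after 1970. */
#define EPOCH1_OFFSET	((int64_t)1 << 31)

/* Return the modification time as 64-bit seconds since 1970. */
static int64_t
dinode_mtime(const struct dinode_sketch *dp)
{
	int64_t base = (dp->di_version == 0) ? 0 : EPOCH1_OFFSET;

	return (base + (int64_t)dp->di_mtime);
}
```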

--
Poul-Henning Kamp             FreeBSD coreteam member
phk@FreeBSD.ORG               "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!




From owner-freebsd-fs  Wed Aug 18 10:19: 3 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132])
	by hub.freebsd.org (Postfix) with ESMTP
	id C52FF14C80; Wed, 18 Aug 1999 10:18:58 -0700 (PDT)
	(envelope-from tlambert@usr02.primenet.com)
Received: (from daemon@localhost)
	by smtp02.primenet.com (8.8.8/8.8.8) id KAA01747;
	Wed, 18 Aug 1999 10:16:59 -0700 (MST)
Received: from usr02.primenet.com(206.165.6.202)
 via SMTP by smtp02.primenet.com, id smtpd001627; Wed Aug 18 10:16:51 1999
Received: (from tlambert@localhost)
	by usr02.primenet.com (8.8.5/8.8.5) id KAA12220;
	Wed, 18 Aug 1999 10:16:46 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199908181716.KAA12220@usr02.primenet.com>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
To: michaelh@cet.co.jp (Michael Hancock)
Date: Wed, 18 Aug 1999 17:16:46 +0000 (GMT)
Cc: tlambert@primenet.com, wrstuden@nas.nasa.gov,
	Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
In-Reply-To: <Pine.BSF.3.95LJ1.1b3.990817224323.17508B-100000@sv01.cet.co.jp> from "Michael Hancock" at Aug 17, 99 11:18:06 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> > > I'm not familiar with the VFS_default stuff. All the vop_default_desc
> > > routines in NetBSD point to error routines.
> > 
> > In FreeBSD, they now point to default routines that are *not* error
> > routines.  This is the problem.  I admit the change was very well
> > intentioned, since it made the code a hell of a lot more readable,
> > but choosing between readable and additional function, I take function
> > over form (I think the way I would have "fixed" the readability is by
> > making the operations that result in the descriptor set for a mounted
> > FS instance be both discrete, and named for their specific function).
> 
> As I recall most of FBSD's default routines are also error routines, if
> the exceptions were a problem it would be trivial to fix.

You would have to de-collapse several VOP lists that have been
pre-collapsed.  The pre-collapse is also an issue for stacking,
since the collapse is supposed to be late bound to the stacking
operation itself.  This lets you revisit it later when you need
to add a new VOP into the system, so that there's a NULL pointer
in the VOP slot for older FS's, in case you stack on top of them.
This is particularly true of an FS stacked on an FS stacked on a
proxy layer.
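
A toy model of the late-bound collapse described above, in C.  It is a
sketch under loud assumptions: struct layer_sketch, collapse() and the
sample routines are hypothetical, and the real vnode operations vector
machinery is considerably more involved.  The point it illustrates is
that a NULL slot in a layer is resolved only when the stack is built:

```c
#include <errno.h>
#include <stddef.h>

#define NVOPS	3			/* toy number of VOP slots */

typedef int (*vop_t)(void *);

/* Hypothetical layer: NULL means "this layer has no routine here". */
struct layer_sketch {
	vop_t ops[NVOPS];
	struct layer_sketch *lower;	/* next layer down, or NULL */
};

/* Error routine used when no layer in the stack implements a VOP. */
static int
vop_eopnotsupp(void *ap)
{
	(void)ap;
	return (EOPNOTSUPP);
}

/* Stand-in for a routine that a bottom FS really implements. */
static int
sample_read(void *ap)
{
	(void)ap;
	return (0);
}

/*
 * Collapse at stacking time: each slot resolves to the first non-NULL
 * routine at or below `top`, else to the error routine.
 */
static void
collapse(const struct layer_sketch *top, vop_t out[NVOPS])
{
	for (int i = 0; i < NVOPS; i++) {
		const struct layer_sketch *l = top;

		while (l != NULL && l->ops[i] == NULL)
			l = l->lower;
		out[i] = (l != NULL) ? l->ops[i] : vop_eopnotsupp;
	}
}
```

Because the per-layer vectors keep their NULLs, the collapse can be
redone when a new VOP slot is added or a new layer is stacked.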


> I think fixing resource allocation/deallocation for things like vnodes,
> cnbufs, and locks are a higher priority for now.  There are examples such
> as in detached threading where it might make sense for the detached child
> to be responsible for releasing resources allocated to it by the parent,
> but in stacking this model is very messy and unnatural.  This is why the
> purpose of VOP_ABORTOP appears to be to release cnbufs but this is really
> just an ugly side effect.  With stacking the code that allocates should be
> the code that deallocates. Substitute, "code"  with "layer" to be more
> correct. 

Yes.  That's actually maintenance, not rewrite, and I think it's
very important to address.  I'm rather pleased with the way the
NFS stuff has turned out (so far), and I was the one calling for
a return to first principles (i.e. a rewrite from the specification).


> I fixed a lot of the vnode and locking cases, unfortunately the ones that
> remain are probably ugly cases where you have to reacquire locks that had
> to be unlocked somewhere in the executing layer.  See VOP_RENAME for an
> example.  Compare the number of WILLRELEs in vnode_if.src in FreeBSD and
> NetBSD, ideally there'd be none.

The way I handled this in the rename case on my hacking box was by
adding a flag to the namei() call.  You could call this flag the
same as WILLRELE, but it had inverse semantics.

Really, this is another issue of reflexivity being absent from an
interface.  You really don't want asymmetric interfaces (VOP_LOCK
is an example, in many cases, based on internal use in the FFS).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.




From owner-freebsd-fs  Wed Aug 18 10:27:25 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131])
	by hub.freebsd.org (Postfix) with ESMTP
	id 58B2B14C80; Wed, 18 Aug 1999 10:27:20 -0700 (PDT)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id TAA01171;
	Wed, 18 Aug 1999 19:24:04 +0200 (CEST)
	(envelope-from phk@critter.freebsd.dk)
To: Terry Lambert <tlambert@primenet.com>
Cc: michaelh@cet.co.jp (Michael Hancock), wrstuden@nas.nasa.gov,
	Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite 
In-reply-to: Your message of "Wed, 18 Aug 1999 17:16:46 -0000."
             <199908181716.KAA12220@usr02.primenet.com> 
Date: Wed, 18 Aug 1999 19:24:04 +0200
Message-ID: <1169.934997044@critter.freebsd.dk>
From: Poul-Henning Kamp <phk@critter.freebsd.dk>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

In message <199908181716.KAA12220@usr02.primenet.com>, Terry Lambert writes:
>> > > I'm not familiar with the VFS_default stuff. All the vop_default_desc
>> > > routines in NetBSD point to error routines.
>> > 
>> > In FreeBSD, they now point to default routines that are *not* error
>> > routines.  This is the problem.  I admit the change was very well
>> > intentioned, since it made the code a hell of a lot more readable,
>> > but choosing between readable and additional function, I take function
>> > over form (I think the way I would have "fixed" the readability is by
>> > making the operations that result in the descriptor set for a mounted
>> > FS instance be both discrete, and named for their specific function).
>> 
>> As I recall most of FBSD's default routines are also error routines, if
>> the exceptions were a problem it would be trivial to fix.
>
>You would have to de-collapse several VOP lists that have been
>pre-collapsed.

You are talking gibberish here.  Please show code where this is
a problem.

--
Poul-Henning Kamp             FreeBSD coreteam member
phk@FreeBSD.ORG               "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!




From owner-freebsd-fs  Wed Aug 18 10:31: 5 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17])
	by hub.freebsd.org (Postfix) with ESMTP
	id CAB8914F2D; Wed, 18 Aug 1999 10:31:02 -0700 (PDT)
	(envelope-from wrstuden@marcy.nas.nasa.gov)
Received: from localhost (wrstuden@localhost)
	by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id KAA16496;
	Wed, 18 Aug 1999 10:30:39 -0700 (PDT)
Date: Wed, 18 Aug 1999 10:30:39 -0700 (PDT)
From: Bill Studenmund <wrstuden@nas.nasa.gov>
To: Poul-Henning Kamp <phk@critter.freebsd.dk>
Cc: Terry Lambert <tlambert@primenet.com>,
	Alton Matthew <Matthew.Alton@anheuser-busch.com>,
	Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite 
In-Reply-To: <830.934961572@critter.freebsd.dk>
Message-ID: <Pine.SOL.3.96.990818101005.14430B-100000@marcy.nas.nasa.gov>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Wed, 18 Aug 1999, Poul-Henning Kamp wrote:

> In message <Pine.SOL.3.96.990816105106.27345H-100000@marcy.nas.nasa.gov>, Bill 
> Studenmund writes:
> >On Sat, 14 Aug 1999, Terry Lambert wrote:
> 
> Matt doesn't represent the FreeBSD project, and even if he rewrites
> the VFS subsystem so he can understand it, his rewrite would face
> considerable resistance on its way into FreeBSD.  I don't think
> there is reason to rewrite it, but there certainly are areas
> that need fixing.

Whew! That's reassuring. I agree there are things which need fixing. It'd
be nice if both NetBSD and FreeBSD could fix things in the same way.

> >> 	The use of the "vfs_default" to make unimplemented VOP's
> >> 	fall through to code which implements function, while well
> >> 	intentioned, is misguided.
> 
> I beg to differ.  The only difference is that we pass through
> multiple layers before we hit the bottom of the stack.  There is
> no loss of functionality but significant gain of clarity and
> modularity.

If I understood the issue, it is that the leaf fs's (the bottom ones)
would use a default routine for non-error functionality. I think Terry's
point (which I agree with) was that a leaf fs's default routine should
only return errors.

> >> 3.	The filesystem itself is broken for Y2038
> >One other suggestion I've heard is to split the 64 bits we have for time
> >into 44 bits for seconds, and 20 bits for microseconds. That's more than
> >enough modification resolution, and also pushes things to past year
> >500,000 AD. Versioning the inode would cover this easily.
> 
> This would be misguided and, given the current speed of evolution,
> would lead to other problems long before 2038.
> 
> Both struct timespec and struct timeval are major mistakes, they
> make arithmetic on timestamps an expensive operation.  Timestamps
> should be stored as integers using a fixed-point notation, for
> instance 64bits with 32bit fractional seconds (the NTP timestamp),
> or in the future 128/48.

I like that idea.

One thing I should probably mention is that I'm not suggesting we ever do
arithmetic on the 44/20 number, just that we store it that way. struct inode
would contain time fields in whatever format the host prefers, with the
44/20 stuff only being in struct dinode. Converting from 44/20 would only
happen on initial read. Math would happen on the host format version. :-)

If time structures go to 64/32 fixed-point math, then my suggestion can be
re-phrased as storing 44.20 worth of that number in the on-disk inode.
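
A sketch of the 44/20 on-disk packing in C, assuming an unsigned
44-bit seconds count and 20 bits of microseconds.  The function and
macro names are hypothetical; the conversion would run only when the
on-disk inode is read or written:

```c
#include <stdint.h>

#define USEC_BITS	20
#define USEC_MASK	((UINT64_C(1) << USEC_BITS) - 1)

/* Pack seconds (low 44 bits used) and microseconds into 64 bits. */
static uint64_t
pack_44_20(uint64_t sec, uint32_t usec)
{
	return ((sec << USEC_BITS) | ((uint64_t)usec & USEC_MASK));
}

/* Recover the seconds part for the in-core (host format) copy. */
static uint64_t
unpack_sec(uint64_t t)
{
	return (t >> USEC_BITS);
}

/* Recover the microseconds part. */
static uint32_t
unpack_usec(uint64_t t)
{
	return ((uint32_t)(t & USEC_MASK));
}
```

Twenty bits hold values up to 1,048,575, which is enough for the
0..999,999 microsecond range.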

> Extending from 64 to 128bits would be a cheap shift and increased
> precision and range could go hand in hand.

I doubt we need more than 64 bit times. 2^63 seconds works out to
292,279,025,208 years, or 292 (american) billion years. Current theories
put the age of the universe at I think 12 to 16 billion years. So 64-bit
signed times in seconds will cover from before the big bang to way past
any time we'll be caring about. :-)
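
The figure can be checked with integer arithmetic.  The sketch below
assumes a 365.24-day year (31,556,736 seconds), which is the year
length that reproduces the number quoted above; the function name is
hypothetical:

```c
#include <stdint.h>

/* Assumed year length: 365.24 days * 86,400 seconds per day. */
#define SECS_PER_YEAR	UINT64_C(31556736)

/* Whole years that fit in 2^63 seconds. */
static uint64_t
years_in_2_63_seconds(void)
{
	return ((UINT64_C(1) << 63) / SECS_PER_YEAR);
}
```

With that year length the quotient is 292,279,025,208.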

Take care,

Bill





From owner-freebsd-fs  Wed Aug 18 10:36:24 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131])
	by hub.freebsd.org (Postfix) with ESMTP
	id 6636614C90; Wed, 18 Aug 1999 10:36:14 -0700 (PDT)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id TAA01242;
	Wed, 18 Aug 1999 19:36:22 +0200 (CEST)
	(envelope-from phk@critter.freebsd.dk)
To: Bill Studenmund <wrstuden@nas.nasa.gov>
Cc: Terry Lambert <tlambert@primenet.com>,
	Alton Matthew <Matthew.Alton@anheuser-busch.com>,
	Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite 
In-reply-to: Your message of "Wed, 18 Aug 1999 10:30:39 PDT."
             <Pine.SOL.3.96.990818101005.14430B-100000@marcy.nas.nasa.gov> 
Date: Wed, 18 Aug 1999 19:36:22 +0200
Message-ID: <1240.934997782@critter.freebsd.dk>
From: Poul-Henning Kamp <phk@critter.freebsd.dk>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

In message <Pine.SOL.3.96.990818101005.14430B-100000@marcy.nas.nasa.gov>, Bill Studenmund writes:

>Whew! That's reassuring. I agree there are things which need fixing. It'd
>be nice if both NetBSD and FreeBSD could fix things in the same way.

Well, >that< still remains to be seen...

>> >> 	The use of the "vfs_default" to make unimplemented VOP's
>> >> 	fall through to code which implements function, while well
>> >> 	intentioned, is misguided.
>> 
>> I beg to differ.  The only difference is that we pass through
>> multiple layers before we hit the bottom of the stack.  There is
>> no loss of functionality but significant gain of clarity and
>> modularity.
>
>If I understood the issue, it is that the leaf fs's (the bottom ones)
>would use a default routine for non-error functionality. I think Terry's
>point (which I agree with) was that a leaf fs's default routine should
>only return errors.

I beg to differ.  It is far more likely, in my mind, that you will
want to handle a currently existing, unimplemented VOP than add a
new one.  Using the default for >all< unimplemented VOPs makes this
possible, using the same logic which makes adding a VOP possible.
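
A small C sketch of the dispatch pattern under discussion, with every
name hypothetical (struct vopvec_sketch, vop_call and the two sample
defaults are not the real vfs_default code).  Any slot the filesystem
leaves empty is routed to one default entry, and the argument in this
thread is over whether that entry may do real work or must only
return an error:

```c
#include <errno.h>
#include <stddef.h>

#define NVOPS	4			/* toy number of VOP slots */

typedef int (*vop_t)(void *);

/* Hypothetical per-filesystem vector with a single default entry. */
struct vopvec_sketch {
	vop_t vop[NVOPS];	/* NULL where the FS has no routine */
	vop_t vop_default;	/* catches every unimplemented VOP */
};

/* Dispatch: use the FS routine if present, else the default. */
static int
vop_call(const struct vopvec_sketch *v, int op, void *ap)
{
	vop_t fn = (v->vop[op] != NULL) ? v->vop[op] : v->vop_default;

	return (fn(ap));
}

/* Error-only default: unimplemented VOPs fail. */
static int
default_error(void *ap)
{
	(void)ap;
	return (EOPNOTSUPP);
}

/* Functional default: stands in for generic code that does the work. */
static int
default_generic(void *ap)
{
	(void)ap;
	return (0);
}
```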

Go back and review the diffs from when I did this, and my other
argument why this is a good idea should be obvious.

>I doubt we need more than 64 bit times. 2^63 seconds works out to
>292,279,025,208 years, or 292 (american) billion years. Current theories
>put the age of the universe at I think 12 to 16 billion years. So 64-bit
>signed times in seconds will cover from before the big bang to way past
>any time we'll be caring about. :-)

But we cannot do time in seconds resolution, we need to resolve at least
the cpu clock frequency, which right now is approaching 1GHz (30bit!)
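
The bit count is easy to check.  A sketch in C, with a hypothetical
helper name, that returns how many bits are needed to count n
distinct ticks per second:

```c
#include <stdint.h>

/* Smallest b such that 2^b >= n, i.e. bits needed for 0..n-1. */
static unsigned
bits_needed(uint64_t n)
{
	unsigned b = 0;

	while (b < 64 && (UINT64_C(1) << b) < n)
		b++;
	return (b);
}
```

For 1,000,000,000 ticks (1GHz) this gives 30, since 2^29 is about 537
million and 2^30 is about 1.07 billion.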

--
Poul-Henning Kamp             FreeBSD coreteam member
phk@FreeBSD.ORG               "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!




From owner-freebsd-fs  Wed Aug 18 10:55:58 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17])
	by hub.freebsd.org (Postfix) with ESMTP
	id F1C6C14D15; Wed, 18 Aug 1999 10:55:54 -0700 (PDT)
	(envelope-from wrstuden@marcy.nas.nasa.gov)
Received: from localhost (wrstuden@localhost)
	by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id KAA19003;
	Wed, 18 Aug 1999 10:56:27 -0700 (PDT)
Date: Wed, 18 Aug 1999 10:56:27 -0700 (PDT)
From: Bill Studenmund <wrstuden@nas.nasa.gov>
To: Poul-Henning Kamp <phk@critter.freebsd.dk>
Cc: Terry Lambert <tlambert@primenet.com>, Hackers@FreeBSD.ORG,
	fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite 
In-Reply-To: <1240.934997782@critter.freebsd.dk>
Message-ID: <Pine.SOL.3.96.990818104932.14430D-100000@marcy.nas.nasa.gov>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Wed, 18 Aug 1999, Poul-Henning Kamp wrote:

> In message <Pine.SOL.3.96.990818101005.14430B-100000@marcy.nas.nasa.gov>, Bill Studenmund writes:
> 
> >Whew! That's reassuring. I agree there are things which need fixing. It'd
> >be nice if both NetBSD and FreeBSD could fix things in the same way.
> 
> Well, >that< still remains to be seen...

:-)

> >I doubt we need more than 64 bit times. 2^63 seconds works out to
> >292,279,025,208 years, or 292 (american) billion years. Current theories
> >put the age of the universe at I think 12 to 16 billion years. So 64-bit
> >signed times in seconds will cover from before the big bang to way past
> >any time we'll be caring about. :-)

I was unclear. I was referring to the seconds side of things. Sub-second
resolution would need other bits.

Take care,

Bill





From owner-freebsd-fs  Wed Aug 18 11: 6:14 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131])
	by hub.freebsd.org (Postfix) with ESMTP
	id 0FF5C14F5D; Wed, 18 Aug 1999 11:06:09 -0700 (PDT)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id UAA01364;
	Wed, 18 Aug 1999 20:04:49 +0200 (CEST)
	(envelope-from phk@critter.freebsd.dk)
To: Bill Studenmund <wrstuden@nas.nasa.gov>
Cc: Terry Lambert <tlambert@primenet.com>, Hackers@FreeBSD.ORG,
	fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite 
In-reply-to: Your message of "Wed, 18 Aug 1999 10:56:27 PDT."
             <Pine.SOL.3.96.990818104932.14430D-100000@marcy.nas.nasa.gov> 
Date: Wed, 18 Aug 1999 20:04:49 +0200
Message-ID: <1362.934999489@critter.freebsd.dk>
From: Poul-Henning Kamp <phk@critter.freebsd.dk>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

In message <Pine.SOL.3.96.990818104932.14430D-100000@marcy.nas.nasa.gov>, Bill Studenmund writes:

>> >I doubt we need more than 64 bit times. 2^63 seconds works out to
>> >292,279,025,208 years, or 292 (american) billion years. Current theories
>> >put the age of the universe at I think 12 to 16 billion years. So 64-bit
>> >signed times in seconds will cover from before the big bang to way past
>> >any time we'll be caring about. :-)
>
>I was unclear. I was referring to the seconds side of things. Sub-second
>resolution would need other bits.

Yes, but we need subsecond in the filesystems.  Think about make(1) on
a blinding fast machine...

--
Poul-Henning Kamp             FreeBSD coreteam member
phk@FreeBSD.ORG               "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!




From owner-freebsd-fs  Wed Aug 18 11: 8: 2 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from alpo.whistle.com (alpo.whistle.com [207.76.204.38])
	by hub.freebsd.org (Postfix) with ESMTP
	id 13FD81505D; Wed, 18 Aug 1999 11:07:53 -0700 (PDT)
	(envelope-from julian@whistle.com)
Received: from current1.whistle.com (current1.whistle.com [207.76.205.22])
	by alpo.whistle.com (8.9.1a/8.9.1) with SMTP id LAA09356;
	Wed, 18 Aug 1999 11:00:47 -0700 (PDT)
Date: Wed, 18 Aug 1999 11:01:58 -0700 (PDT)
From: Julian Elischer <julian@whistle.com>
To: Poul-Henning Kamp <phk@critter.freebsd.dk>
Cc: Bill Studenmund <wrstuden@nas.nasa.gov>,
	Terry Lambert <tlambert@primenet.com>,
	Alton Matthew <Matthew.Alton@anheuser-busch.com>,
	Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite 
In-Reply-To: <830.934961572@critter.freebsd.dk>
Message-ID: <Pine.BSF.3.95.990818105716.12306A-100000@current1.whistle.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org



On Wed, 18 Aug 1999, Poul-Henning Kamp wrote:

> Matt doesn't represent the FreeBSD project, and even if he rewrites
> the VFS subsystem so he can understand it, his rewrite would face
> considerable resistance on its way into FreeBSD.  I don't think
> there is reason to rewrite it, but there certainly are areas
> that need fixing.

You are misinformed as far as I know. From discussions I saw, the
main architect of a VFS rewrite would be Kirk, and Matt would be acting as
Kirk's right-hand-man.

> 
> >> 	The use of the "vfs_default" to make unimplemented VOP's
> >> 	fall through to code which implements function, while well
> >> 	intentioned, is misguided.
> 
> I beg to differ.  The only difference is that we pass through
> multiple layers before we hit the bottom of the stack.  There is
> no loss of functionality but significant gain of clarity and
> modularity.

Well I believe that Kirk considers them misguided too, but he stated that
he wasn't going to remove them without serious thought about the alternatives.
 





From owner-freebsd-fs  Wed Aug 18 11:16: 1 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131])
	by hub.freebsd.org (Postfix) with ESMTP
	id 68E981505D; Wed, 18 Aug 1999 11:15:54 -0700 (PDT)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id UAA01443;
	Wed, 18 Aug 1999 20:15:59 +0200 (CEST)
	(envelope-from phk@critter.freebsd.dk)
To: Julian Elischer <julian@whistle.com>
Cc: Bill Studenmund <wrstuden@nas.nasa.gov>,
	Terry Lambert <tlambert@primenet.com>,
	Alton Matthew <Matthew.Alton@anheuser-busch.com>,
	Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite 
In-reply-to: Your message of "Wed, 18 Aug 1999 11:01:58 PDT."
             <Pine.BSF.3.95.990818105716.12306A-100000@current1.whistle.com> 
Date: Wed, 18 Aug 1999 20:15:59 +0200
Message-ID: <1441.935000159@critter.freebsd.dk>
From: Poul-Henning Kamp <phk@critter.freebsd.dk>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

In message <Pine.BSF.3.95.990818105716.12306A-100000@current1.whistle.com>, Julian Elischer writes:
>On Wed, 18 Aug 1999, Poul-Henning Kamp wrote:
>
>> Matt doesn't represent the FreeBSD project, and even if he rewrites
>> the VFS subsystem so he can understand it, his rewrite would face
>> considerable resistance on its way into FreeBSD.  I don't think
>> there is reason to rewrite it, but there certainly are areas
>> that need fixing.
>
>You are misinformed as far as I know. From discussions I saw, the
>main architect of a VFS rewrite would be Kirk, and Matt would be acting as
>Kirk's right-hand-man.

I bet that Matt and Kirk use "rewrite" for two very different
concepts.  The resulting reviews will be equally different.

>> >> 	The use of the "vfs_default" to make unimplemented VOP's
>> >> 	fall through to code which implements function, while well
>> >> 	intentioned, is misguided.
>> 
>> I beg to differ.  The only difference is that we pass through
>> multiple layers before we hit the bottom of the stack.  There is
>> no loss of functionality but significant gain of clarity and
>> modularity.
>
>Well I believe that Kirk considers them misguided too, but he stated that
>he wasn't going to remove them without serious thought about the alternatives.

I'll be more than ready to discuss this with Kirk.

--
Poul-Henning Kamp             FreeBSD coreteam member
phk@FreeBSD.ORG               "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!




From owner-freebsd-fs  Wed Aug 18 11:20:35 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134])
	by hub.freebsd.org (Postfix) with ESMTP
	id C51B81595D; Wed, 18 Aug 1999 11:20:18 -0700 (PDT)
	(envelope-from tlambert@usr02.primenet.com)
Received: (from daemon@localhost)
	by smtp04.primenet.com (8.9.3/8.9.3) id LAA04794;
	Wed, 18 Aug 1999 11:19:40 -0700 (MST)
Received: from usr02.primenet.com(206.165.6.202)
 via SMTP by smtp04.primenet.com, id smtpdAAAFFaOvj; Wed Aug 18 11:19:33 1999
Received: (from tlambert@localhost)
	by usr02.primenet.com (8.8.5/8.8.5) id LAA14096;
	Wed, 18 Aug 1999 11:19:43 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199908181819.LAA14096@usr02.primenet.com>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
To: wrstuden@nas.nasa.gov
Date: Wed, 18 Aug 1999 18:19:42 +0000 (GMT)
Cc: tlambert@primenet.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
In-Reply-To: <Pine.SOL.3.96.990817092121.6014C-100000@marcy.nas.nasa.gov> from "Bill Studenmund" at Aug 17, 99 01:44:34 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> > > > > > 2.	Advisory locks are hung off private backing objects.
> > > I'm not sure. The struct lock * is only used by layered filesystems, so
> > > they can keep track both of the underlying vnode lock, and if needed their
> > > own vnode lock. For advisory locks, would we want to keep track both of
> > > locks on our layer and the layer below? Don't we want either one or the
> > > other? i.e. layers bypass to the one below, or deal with it all
> > > themselves.
> > 
> > I think you want the lock on the intermediate layer: basically, on
> > every vnode that has data associated with it that is unique to a
> > layer.  Let's not forget, also, that you can expose a layer into
> > the namespace in one place, and expose it covered under another
> > layer, at another.  If you locked down to the backing object, then
> > the only issue you would be left with is one or more intermediate
> > backing objects.
> 
> Right. That exported struct lock * makes locking down to the lowest-level
> file easy - you just feed it to the lock manager, and you're locking the
> same lock the lowest level fs uses. You then lock all vnodes stacked over
> this one at the same time. Otherwise, you just call VOP_LOCK below and
> then lock yourself.

I think this defeats the purpose of the stacking architecture; I
think that if you look at an unadulterated NULLFS, you'll see what I
mean.

Intermediate FS's should not trap VOP's that are not applicable
to them.

One of the purposes of doing a VOP_LOCK on intermediate vnodes
that aren't backing objects is to deal with the global vnode
pool management.  I'd really like FS's to own their vnode pools,
but even without that, you don't need the locking, since you
only need to flush data on vnodes that are backing objects.

If we look at a stack of FS's with intermediate exposure into the
namespace, then it's clear that the issue is really only applicable
to objects that act as a backing store:


----------------------	----------------------	--------------------
FS			Exposed in hierarchy	Backing object
----------------------	----------------------	--------------------
top			yes			no
intermediate_1		no			no
intermediate_2		no			yes
intermediate_3		yes			no
bottom			no			yes
----------------------	----------------------	--------------------

So when we lock "top", we only lock in intermediate_2 and in bottom.

Then we attempt to lock in intermediate_3, but it fails: not because
there is a lock on the vnode in intermediate_3, but because there is
a lock in bottom.

It's unnecessary to lock the vnodes in the intermediate path, or
even at the exposure level, unless they are vnodes that have an
associated backing store.
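
The table and the two lock attempts can be modelled in a few lines of
C.  This is only a sketch: struct layer_sketch and stack_lock() are
hypothetical, and the "lock" is a plain flag rather than a real vnode
lock.  Only layers that own a backing object are touched:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stacking layer. */
struct layer_sketch {
	const char *name;
	bool has_backing;		/* layer owns a backing object */
	bool locked;
	struct layer_sketch *lower;	/* next layer down, or NULL */
};

/*
 * Lock from `top` downward, touching only layers with a backing
 * object.  If one is already locked, release what this call took and
 * report failure.
 */
static bool
stack_lock(struct layer_sketch *top)
{
	for (struct layer_sketch *l = top; l != NULL; l = l->lower) {
		if (!l->has_backing)
			continue;
		if (l->locked) {
			/* Undo the locks taken above the conflict. */
			for (struct layer_sketch *u = top; u != l;
			    u = u->lower) {
				if (u->has_backing)
					u->locked = false;
			}
			return (false);
		}
		l->locked = true;
	}
	return (true);
}
```

Locking through "top" marks only intermediate_2 and bottom; a later
attempt through intermediate_3 then fails at bottom, without any lock
ever being placed on intermediate_3 itself.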

The need to lock in intermediate_2 exists because it is a translation
layer or a namespace escape.  It deals with compression, or it deals
with file-as-a-directory folding, or it deals with file-hiding
(perhaps for a quota file), etc..  If it didn't, it wouldn't need
backing store (and therefore wouldn't need to be locked).


> > For a layer with an intermediate backing object, I'm prepared to
> > declare it "special", and proxy the operation down to any inferior
> > backing object (e.g. a union FS that adds files from two FS's
> > together, rather than just directory entry lists).  I think such
> > layers are the exception, not the rule.
> 
> Actually isn't the only problem when you have vnode fan-in (union FS)? 
> i.e.  a plain compressing layer should not introduce vnode locking
> problems. 

If it's a block compression layer, it will.  Also a translation layer;
consider a pure Unicode system that wants to remotely mount an FS
from a legacy system.  To do this, it needs to expand the pages from
the legacy system [only it can, since the legacy system doesn't know
about Unicode] in a 2:1 ratio.  Now consider doing a byte-range lock
on a file on such a system.  To propagate the lock, you have to do
an arithmetic conversion at the translation layer.  This gets worse
if the lower end FS is exposed in the namespace as well.

You could make the same arguments for other types of translation or
namespace escapes.
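
The arithmetic conversion for the 2:1 case can be sketched as follows.
This assumes a fixed expansion ratio and byte offsets; the function
name `upper_to_lower_range` and the choice to round the range outward
are assumptions made for illustration, not something specified above.

```python
def upper_to_lower_range(offset, length, ratio=2):
    """Map a byte range on the expanded (upper) file to the legacy (lower) file.

    The range is rounded outward, so a lock that covers only part of an
    expanded character still covers the whole legacy byte behind it.
    Returns (lower_offset, lower_length).
    """
    if offset < 0 or length < 0 or ratio < 1:
        raise ValueError("offset and length must be >= 0 and ratio >= 1")
    start = offset // ratio
    end = -(-(offset + length) // ratio)   # ceiling division
    return start, end - start
```

For example, a lock on upper bytes 3..4 (offset 3, length 2) touches
legacy bytes 1 and 2, so the lower layer has to lock offset 1, length 2.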


> > I think that export policies are the realm of /etc/exports.
> > 
> > The problem with each FS implementing its own policy, is that this
> > is another place that copyinstr() gets called, when it shouldn't.
> 
> Well, my thought was that, like with current code, most every fs would
> just call vfs_export() when it's presented an export operation. But by
> retaining the option of having the fs do its own thing, we can support
> different export semantics if desired.

I think this bears down on whether the NFS server VFS consumer is
allowed access to the VFS stack at the particular intermediate
layer.  I think this is really an administrative policy decision,
and not an option for the VFS.

I think it would be bad if a given VFS could refuse to participate
in a stacking operation because it didn't like who was stacking.

If we insist on the ability for a VFS to refuse stacking, then
we should generalize the idea, such that an intermediate VFS could
refuse exposure into the filesystem namespace accessible to users.

Consider the case of a VFS without quota support, stacked under a
VFS layer that provided quota support by hiding a file in the top
level directory ("quota") and then folding the directory closed by
rerooting in a subdirectory of the top level directory ("root/").

It's reasonable to assume that most admins that want to enforce
quotas would *not* want the possibility of exposing the VFS without
quota support in the user accessible namespace.  Should the VFS
without quotas refuse such exposure?

I think the answer is "no", and that it is an administrative
control issue, not a VFS's preference issue.  Administrators enforce
this by protecting the path to exposure points, or by mounting
stacks over top of exposure points, which results in the exposure
being hidden under another mount.  Using the QUOTAFS example, you
mount the FS to be quota-enforced on /home, and then you mount
the QUOTAFS over top of it, and have it cover "/home" itself,
hiding the underlying FS from exposure.


> > I would resolve this by passing a standard option to the mount code
> > in user space.  For root mounts, a vnode is passed down.  For other
> > mounts, the vnode is parsed and passed if the option is specified.
> 
> Or maybe add a field to vfsops. This info says what the mount call will
> expect (I want a block device, a regular file, a directory, etc), so it
> fits. :-)

This is actually an elegant solution to the problem.  Much of the
time, we don't consider data interfaces when they are appropriate
because of their widespread use in inappropriate ways (e.g. "ps").


> Also, if we leave it to userland, what happens if someone writes a
> program which calls sys_mount with something the fs doesn't expect. :-)

Well, that gets to another grail of mine: when a device containing
a filesystem "arrives", I believe it should trigger a mount into
the list of mounted filesystems.

I don't necessarily mean that it should also be exported into the
filesystem hierarchy at that point (but it's an option, using the
"last mounted on" information).


> > I think that you will only be able to find rare examples of FS's
> > that don't take device names as arguments.  But for those, you
> > don't specify the option, and it gets "NULL", and whatever local
> > options you specify.
> 
> I agree I can't see a leaf fs not taking a device node. But layered
> fs's certainly will want something else. :-)

I think they want a vnode of an already mounted FS.  The trick is
to enforce the "already mounted" part of that.  I'm comfortable with
doing this by saying "it's not already mounted until you can look
up a vnode on it".


> > The point is that, for FS's that can be both root and sub-root,
> > the mount code doesn't have to make the decision, it can be punted
> > to higher level code, in one place, where the code can be centrally
> > maintained and kept from getting "stale" when things change out
> > from under it.
> 
> True.
> 
> And with good comments we can catch the times when the centrally located
> code changes & breaks an assumption made by the fs. :-)

8-).


> > > Except for a minor buglet with device nodes, stacking works in NetBSD at
> > > present. :-)
> > 
> > Have you tried Heidemann's student's stacking layers?  There is one
> > encryption, and one per-file compression with namespace hiding, that
> > I think it would be hard pressed to keep up with.  But I'll give it
> > the benefit of the doubt.  8-).
> 
> Nope. The problem is that while stacking (null, umap, and overlay fs's)
> work, we don't have the coherency issues worked out so that upper layers
> can cache data. i.e. so that the lower fs knows it has to ask the upper
> layers to give pages back. :-) But multiple ls -lR's work fine. :-)

With UVM in NetBSD, this is (supposedly) not an issue.

You could actually think of it this way, as well: only FS's that
contain vnodes that provide backing should implement VOP_GETPAGES
and VOP_PUTPAGES, and all I/O should be done through paging.


> > > I agree it's ugly, but it has the advantage that it doesn't grow the
> > > on-disk inode. A lot of folks have designs on the remaining 64 bits free.
> > > :-)
> > 
> > Well, so long as we can resolve the issue for a long, long time;
> > I plan on being around to have to put up with the bugs, if I can
> > wrangle it... 8-).
> 
> :-)
> 
> I bet by then (559447 AD) we won't be using ffs, so the problem will be
> moot. :-)

Unless I'm the curator of a computer museum... 8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Wed Aug 18 11:23:21 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [209.157.86.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 7878C1597A; Wed, 18 Aug 1999 11:23:10 -0700 (PDT)
	(envelope-from dillon@apollo.backplane.com)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.9.3/8.9.1) id LAA48344;
	Wed, 18 Aug 1999 11:22:20 -0700 (PDT)
	(envelope-from dillon)
Date: Wed, 18 Aug 1999 11:22:20 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <199908181822.LAA48344@apollo.backplane.com>
To: Julian Elischer <julian@whistle.com>
Cc: Poul-Henning Kamp <phk@critter.freebsd.dk>,
	Bill Studenmund <wrstuden@nas.nasa.gov>,
	Terry Lambert <tlambert@primenet.com>,
	Alton Matthew <Matthew.Alton@anheuser-busch.com>,
	Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite 
References:  <Pine.BSF.3.95.990818105716.12306A-100000@current1.whistle.com>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

:On Wed, 18 Aug 1999, Poul-Henning Kamp wrote:
:
:> Matt doesn't represent the FreeBSD project, and even if he rewrites
:> the VFS subsystem so he can understand it, his rewrite would face
:> considerable resistance on its way into FreeBSD.  I don't think
:> there is reason to rewrite it, but there certainly are areas
:> that need fixing.
:
:You are misinformed as far as I know.  From discussions I saw, the
:main architect of a VFS rewrite would be Kirk, and Matt would be acting as
:Kirk's right-hand-man.

    Yes, this is correct.  Kirk is going to be the main architect.  I have
    been heavily involved and will continue to be.

:> >> 	The use of the "vfs_default" to make unimplemented VOP's
:
:> I beg to differ.  The only difference is that we pass through
:> multiple layers before we hit the bottom of the stack.  There is
:...
:Well I believe that Kirk considers them misguided too, but he stated that
:he wasn't going to remove them without serious thought about the alternatives.

    The vfs op callout layering has not been on the radar screen.  There
    are far too many other, more serious problems.  I really doubt that any
    changes will be made to this piece any time in the next year or even two,
    if at all.

    The main items on the radar screen are related to buffer management
    (struct buf stuff; for example, preventing VM blockages due to pages
    being wired by write I/O's), VFS locking and reference count issues 
    (for example, namei lookups, blockages in the pager and syncer due to
    vnode locks held by blocked processes, etc...), and interactions 
    between VFS and VM (for example: moving away from VOP_READ/VOP_WRITE 
    and moving more towards a getpages/putpages model).

    None of the items have been set in stone yet.  We're waiting for Kirk
    to get back from vacation and get back into the groove.

						-Matt





From owner-freebsd-fs  Wed Aug 18 11:48: 4 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132])
	by hub.freebsd.org (Postfix) with ESMTP
	id 9761B14F8A; Wed, 18 Aug 1999 11:47:54 -0700 (PDT)
	(envelope-from tlambert@usr02.primenet.com)
Received: (from daemon@localhost)
	by smtp02.primenet.com (8.8.8/8.8.8) id LAA08775;
	Wed, 18 Aug 1999 11:48:06 -0700 (MST)
Received: from usr02.primenet.com(206.165.6.202)
 via SMTP by smtp02.primenet.com, id smtpd008709; Wed Aug 18 11:48:03 1999
Received: (from tlambert@localhost)
	by usr02.primenet.com (8.8.5/8.8.5) id LAA14960;
	Wed, 18 Aug 1999 11:48:01 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199908181848.LAA14960@usr02.primenet.com>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
To: phk@critter.freebsd.dk (Poul-Henning Kamp)
Date: Wed, 18 Aug 1999 18:48:01 +0000 (GMT)
Cc: tlambert@primenet.com, michaelh@cet.co.jp, wrstuden@nas.nasa.gov,
	Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
In-Reply-To: <1169.934997044@critter.freebsd.dk> from "Poul-Henning Kamp" at Aug 18, 99 07:24:04 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> >> > > I'm not familiar with the VFS_default stuff. All the vop_default_desc
> >> > > routines in NetBSD point to error routines.
> >> > 
> >> > In FreeBSD, they now point to default routines that are *not* error
> >> > routines.  This is the problem.  I admit the change was very well
> >> > intentioned, since it made the code a hell of a lot more readable,
> >> > but choosing between readable and additional function, I take function
> >> > over form (I think the way I would have "fixed" the readability is by
> >> > making the operations that result in the descriptor set for a mounted
> >> > FS instance be both discrete, and named for their specific function).
> >> 
> >> As I recall most of FBSD's default routines are also error routines, if
> >> the exceptions were a problem it would be trivial to fix.
> >
> >You would have to de-collapse several VOP lists that have been
> >pre-collapsed.
> 
> You are talking gibberish here.  Please show code where this is
> a problem.

When you write a proxy stacking layer, such as John Heidemann's
network proxy stacking layer (an NFS alternative), VOP's which
would normally be handled by vfs_default have to be handled on
the other end of the proxy, instead, in the same way that they
would be handled by the vfs_default stuff.

Some VOP's, like advisory locking, need both local assertion and
remote proxy of the VOP to avoid introducing race windows.

The result of this is that, if you rely on the vfs_default stuff,
then you can't proxy those VOP's into a different address space,
either on another machine, or to a user space VFS stacking layer
development environment.

This is the same problem that embedding VM references directly
into any FS causes, and that vm_object_t aliases would exacerbate.

John has, in the past, sent me a number of stacking layers done
by various people, with the requirement that I not redistribute
them, as they are not what he would consider to be properly
representative of finished work.

Since John himself did the network proxy, you could perhaps get
him to send you a copy, so you could have direct access to code
where this was a problem.

Make sure that the system you are talking to over the proxy is
not assumed to be a FreeBSD system (e.g. don't assume that the
vfs_default stuff exists on the other end of the proxy, or that
it would be functional).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.




From owner-freebsd-fs  Wed Aug 18 11:57:27 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131])
	by hub.freebsd.org (Postfix) with ESMTP
	id DC46A158DC; Wed, 18 Aug 1999 11:57:22 -0700 (PDT)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id UAA01776;
	Wed, 18 Aug 1999 20:56:58 +0200 (CEST)
	(envelope-from phk@critter.freebsd.dk)
To: Terry Lambert <tlambert@primenet.com>
Cc: michaelh@cet.co.jp, wrstuden@nas.nasa.gov,
	Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite 
In-reply-to: Your message of "Wed, 18 Aug 1999 18:48:01 -0000."
             <199908181848.LAA14960@usr02.primenet.com> 
Date: Wed, 18 Aug 1999 20:56:58 +0200
Message-ID: <1774.935002618@critter.freebsd.dk>
From: Poul-Henning Kamp <phk@critter.freebsd.dk>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

In message <199908181848.LAA14960@usr02.primenet.com>, Terry Lambert writes:

>> >You would have to de-collapse several VOP lists that have been
>> >pre-collapsed.
>> 
>> You are talking gibberish here.  Please show code where this is
>> a problem.
>
>When you write a proxy stacking layer, such as John Heidemann's
>network proxy stacking layer (an NFS alternative), VOP's which
>would normally be handled by vfs_default have to be handled on
>the other end of the proxy, instead, in the same way that they
>would be handled by the vfs_default stuff.

And what prevents you from taking over the default op ?

--
Poul-Henning Kamp             FreeBSD coreteam member
phk@FreeBSD.ORG               "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!




From owner-freebsd-fs  Wed Aug 18 11:59:16 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17])
	by hub.freebsd.org (Postfix) with ESMTP
	id 18C031513F; Wed, 18 Aug 1999 11:59:06 -0700 (PDT)
	(envelope-from wrstuden@marcy.nas.nasa.gov)
Received: from localhost (wrstuden@localhost)
	by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id LAA23974;
	Wed, 18 Aug 1999 11:59:01 -0700 (PDT)
Date: Wed, 18 Aug 1999 11:59:01 -0700 (PDT)
From: Bill Studenmund <wrstuden@nas.nasa.gov>
Reply-To: Bill Studenmund <wrstuden@nas.nasa.gov>
To: Terry Lambert <tlambert@primenet.com>
Cc: Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
In-Reply-To: <199908181819.LAA14096@usr02.primenet.com>
Message-ID: <Pine.SOL.3.96.990818112953.14430G-100000@marcy.nas.nasa.gov>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Wed, 18 Aug 1999, Terry Lambert wrote:

> > Right. That exported struct lock * makes locking down to the lowest-level
> > file easy - you just feed it to the lock manager, and you're locking the
> > same lock the lowest level fs uses. You then lock all vnodes stacked over
> > this one at the same time. Otherwise, you just call VOP_LOCK below and
> > then lock yourself.
> 
> I think this defeats the purpose of the stacking architecture; I
> think that if you look at an unadulterated NULLFS, you'll see what I
> mean.

Please be more precise. I have looked at an unadulterated NULLFS, and
found it lacking. I don't see how this change breaks stacking.

> Intermediate FS's should not trap VOP's that are not applicable
> to them.

True. But VOP_LOCK is applicable to layered fs's. :-)

> One of the purposes of doing a VOP_LOCK on intermediate vnodes
> that aren't backing objects is to deal with the global vnode
> pool management.  I'd really like FS's to own their vnode pools,
> but even without that, you don't need the locking, since you
> only need to flush data on vnodes that are backing objects.
> 
> If we look at a stack of FS's with intermediate exposure into the
> namespace, then it's clear that the issue is really only applicable
> to objects that act as a backing store:
> 
> 
> ----------------------	----------------------	--------------------
> FS			Exposed in hierarchy	Backing object
> ----------------------	----------------------	--------------------
> top			yes			no
> intermediate_1		no			no
> intermediate_2		no			yes
> intermediate_3		yes			no
> bottom			no			yes
> ----------------------	----------------------	--------------------
> 
> So when we lock "top", we only lock in intermediate_2 and in bottom.

No. One of the things Heidemann notes in his dissertation is that to
prevent deadlock, you have to lock the whole stack of vnodes at once, not
bit by bit.

i.e. there is one lock for the whole thing.
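
A rough sketch of "one lock for the whole thing", assuming each
layered vnode simply reuses the lock exported by the vnode beneath it.
`Vnode`, `lock_stack` and `unlock_stack` are hypothetical names, and
a Python `threading.Lock` stands in for the kernel's struct lock.

```python
import threading


class Vnode:
    """Toy vnode: a leaf owns a lock; a layered vnode reuses the lock below."""

    def __init__(self, name, lower=None):
        self.name = name
        self.lower = lower
        # A layered vnode borrows the exported lock of the vnode under it,
        # so every vnode in the stack ends up pointing at the leaf's lock.
        self.lock = lower.lock if lower is not None else threading.Lock()


def lock_stack(vnode, blocking=True):
    """Lock the whole stack in one step by taking the single shared lock."""
    return vnode.lock.acquire(blocking)


def unlock_stack(vnode):
    """Release the shared lock, which unlocks every vnode in the stack."""
    vnode.lock.release()
```

Because every layer holds a reference to the same lock object, locking
through any vnode in the stack excludes lockers entering at any other
layer, and there is no window in which only part of the stack is locked.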

> > Actually isn't the only problem when you have vnode fan-in (union FS)? 
> > i.e.  a plain compressing layer should not introduce vnode locking
> > problems. 
> 
> If it's a block compression layer, it will.  Also a translation layer;
> consider a pure Unicode system that wants to remotely mount an FS
> from a legacy system.  To do this, it needs to expand the pages from
> the legacy system [only it can, since the legacy system doesn't know
> about Unicode] in a 2:1 ratio.  Now consider doing a byte-range lock
> on a file on such a system.  To propagate the lock, you have to do
> an arithmetic conversion at the translation layer.  This gets worse
> if the lower end FS is exposed in the namespace as well.

Wait. byte-range locking is different from vnode locking. I've been
talking about vnode locking, which is different from the byte-range
locking you're discussing above.

> > Nope. The problem is that while stacking (null, umap, and overlay fs's)
> > work, we don't have the coherency issues worked out so that upper layers
> > can cache data. i.e. so that the lower fs knows it has to ask the upper
> > layers to give pages back. :-) But multiple ls -lR's work fine. :-)
> 
> With UVM in NetBSD, this is (supposedly) not an issue.

UBC. UVM is a new memory manager. UBC unifies the buffer cache with the VM
system.

> You could actually think of it this way, as well: only FS's that
> contain vnodes that provide backing should implement VOP_GETPAGES
> and VOP_PUTPAGES, and all I/O should be done through paging.

Right. That's part of UBC. :-)

Take care,

Bill





From owner-freebsd-fs  Wed Aug 18 12: 8:34 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from marcy.nas.nasa.gov (marcy.nas.nasa.gov [129.99.113.17])
	by hub.freebsd.org (Postfix) with ESMTP
	id CABFA15870; Wed, 18 Aug 1999 12:08:30 -0700 (PDT)
	(envelope-from wrstuden@marcy.nas.nasa.gov)
Received: from localhost (wrstuden@localhost)
	by marcy.nas.nasa.gov (8.9.3/NAS8.8.7) with SMTP id MAA25525;
	Wed, 18 Aug 1999 12:08:22 -0700 (PDT)
Date: Wed, 18 Aug 1999 12:08:22 -0700 (PDT)
From: Bill Studenmund <wrstuden@nas.nasa.gov>
Reply-To: Bill Studenmund <wrstuden@nas.nasa.gov>
To: Poul-Henning Kamp <phk@critter.freebsd.dk>
Cc: Terry Lambert <tlambert@primenet.com>, Hackers@FreeBSD.ORG,
	fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite 
In-Reply-To: <1362.934999489@critter.freebsd.dk>
Message-ID: <Pine.SOL.3.96.990818110645.14430F-100000@marcy.nas.nasa.gov>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Wed, 18 Aug 1999, Poul-Henning Kamp wrote:

> Yes, but we need subsecond in the filesystems.  Think about make(1) on
> a blinding fast machine...

Oh yes, I realize that. :-) It's just that I thought you were at one point
suggesting having 128 bits to the left of the decimal point (128 bits
worth of seconds). I was trying to say that'd be a bit much. :-)

Take care,

Bill





From owner-freebsd-fs  Wed Aug 18 13:44:48 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135])
	by hub.freebsd.org (Postfix) with ESMTP
	id EA4F215D9F; Wed, 18 Aug 1999 13:43:47 -0700 (PDT)
	(envelope-from tlambert@usr06.primenet.com)
Received: (from daemon@localhost)
	by smtp05.primenet.com (8.9.1/8.9.1) id NAA113206;
	Wed, 18 Aug 1999 13:43:27 -0700
Received: from usr06.primenet.com(206.165.6.206)
 via SMTP by smtp05.primenet.com, id smtpdDReHUa; Wed Aug 18 13:43:17 1999
Received: (from tlambert@localhost)
	by usr06.primenet.com (8.8.5/8.8.5) id NAA28863;
	Wed, 18 Aug 1999 13:43:14 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199908182043.NAA28863@usr06.primenet.com>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
To: wrstuden@nas.nasa.gov
Date: Wed, 18 Aug 1999 20:43:14 +0000 (GMT)
Cc: tlambert@primenet.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
In-Reply-To: <Pine.SOL.3.96.990818112953.14430G-100000@marcy.nas.nasa.gov> from "Bill Studenmund" at Aug 18, 99 11:59:01 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> > > Right. That exported struct lock * makes locking down to the lowest-level
> > > file easy - you just feed it to the lock manager, and you're locking the
> > > same lock the lowest level fs uses. You then lock all vnodes stacked over
> > > this one at the same time. Otherwise, you just call VOP_LOCK below and
> > > then lock yourself.
> > 
> > I think this defeats the purpose of the stacking architecture; I
> > think that if you look at an unadulterated NULLFS, you'll see what I
> > mean.
> 
> Please be more precise. I have looked at an unadulterated NULLFS, and
> found it lacking. I don't see how this change breaks stacking.


OK, there's the concept of "collapse" of stacking layer.  This was
first introduced in the Rosenthal stacking vnode architecture, out
of Sun Microsystems.

Rosenthal was concerned that, when you stack 500 putatively "null"
NULLFS's, the amount of function call overhead should not increase
proportionally.

To resolve this, he introduced the concept of a "collapsed" VFS
stack.  That is, the array of function vectors is a one dimensional
projection of a two dimensional stack: the visible entry for each
VOP comes from the first layer, on the way down the stack, that
implements that VOP.

We can visualize this like so:

			    VOPs
Layer |	VOP1	VOP2	VOP3	VOP4	VOP5	VOP6	...
-----------------------------------------------------------
L1	-	-	-	imp	-	-	...
L2	imp	-	-	imp	-	imp	...
L3	imp	-	-	imp	imp	-	...
L4	-	-	imp	-	-	-	...
L5	imp	imp	imp	imp	imp	imp	...

The resulting "collapsed" array of entry vectors looks like so:

	L2VOP1	L5VOP2	L4VOP3	L1VOP4	L3VOP5	L2VOP6	...

There is an implicit assumption here that most stacks will not be
randomly staggered like this example.  The idea behind this
assumption is that additional layers will most frequently add
functionality, rather than replacing it.
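
The collapse in the example above can be sketched in a few lines.
This is a toy model under the assumption that each layer is just a set
of implemented VOP names; `VOP_TABLE`, `VOP_NAMES` and `collapse` are
hypothetical names used only for illustration.

```python
# Which layers implement which VOPs, listed from the top of the stack (L1)
# to the bottom (L5), matching the table above.
VOP_TABLE = [
    ("L1", {"VOP4"}),
    ("L2", {"VOP1", "VOP4", "VOP6"}),
    ("L3", {"VOP1", "VOP4", "VOP5"}),
    ("L4", {"VOP3"}),
    ("L5", {"VOP1", "VOP2", "VOP3", "VOP4", "VOP5", "VOP6"}),
]
VOP_NAMES = ["VOP1", "VOP2", "VOP3", "VOP4", "VOP5", "VOP6"]


def collapse(stack, vops):
    """Project the two dimensional stack onto a single vector.

    For each VOP, pick the first layer on the way down that implements
    it; a VOP that no layer implements is left as None.
    """
    vector = {}
    for vop in vops:
        vector[vop] = None
        for layer, implemented in stack:
            if vop in implemented:
                vector[vop] = layer
                break
    return vector
```

Running `collapse(VOP_TABLE, VOP_NAMES)` picks L2 for VOP1, L5 for
VOP2, L4 for VOP3, L1 for VOP4, L3 for VOP5 and L2 for VOP6, which is
the collapsed vector shown above.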

Heidemann carried this idea over into his architecture, to be
employed at the point that a VFS stack is first instanced.

The BSD4.4 implementation of this is partially flawed.  There is
an implicit implementation of this for the UFS/FFS "stack" of
layers, in the VOP's descriptor array exported by the combination
of the two being hard coded as being a precollapsed stack.  This
is actually antithetical to the design.

The second place this flaw is apparent is in the inability to
add VOP's into an existing kernel, since the entry point vector
is a fixed size, and is not expanded implicitly by the act of
adding a VFS layer containing a new VOP.

For the use of non-error vfs_defaults, this is also flawed for
proxies, but not for the consumer of the VFS stack, only for the
producer end on the other side of the proxy, which although it
does not implement a particular VOP, needs to _NOT_ use the
local vfs_default for the VOP, but instead needs to proxy the
VOP over to the other side for remote processing.

The act of getting a vfs_default VOP after a collapse, instead
of having a NULL entry point that the descriptor call mechanism
treats as a call failure, damages the ability to proxy unknown
VOP's.
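
To make the difference concrete, here is a small sketch, assuming a
dispatch vector modelled as a dictionary and a proxy that forwards only
when it finds an empty slot.  `build_vector`, `dispatch` and the
`forward` callback are hypothetical; the VOP names are used purely as
labels.

```python
def build_vector(implemented, known_vops, default=None):
    """Build a dispatch vector; unimplemented slots are filled with `default`."""
    return {vop: implemented.get(vop, default) for vop in known_vops}


def dispatch(vector, vop, forward):
    """Call the local entry if there is one, otherwise hand the VOP to `forward`."""
    entry = vector.get(vop)      # None plays the role of a NULL entry point
    if entry is None:
        return forward(vop)
    return entry()


# A layer that implements only VOP_READ locally.
known = ["VOP_READ", "VOP_ADVLOCK"]
local_ops = {"VOP_READ": lambda: "read locally"}


def forward_to_remote(vop):
    return "forwarded " + vop


null_filled = build_vector(local_ops, known)
default_filled = build_vector(local_ops, known, default=lambda: "local default")
```

With the NULL-filled vector, `dispatch(null_filled, "VOP_ADVLOCK",
forward_to_remote)` reaches the remote side; with the default-filled
vector the same call is answered by the local default and the proxy
never sees it.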


> > Intermediate FS's should not trap VOP's that are not applicable
> > to them.
> 
> True. But VOP_LOCK is applicable to layered fs's. :-)

Only for translation layers that require local backing store.  I'm
prepared to make an exception for them, and require that they
explicitly call the VOP in the underlying vnode over which they are
stacked.  This is the same compromise that both Rosenthal and
Heidemann consciously chose.


> > One of the purposes of doing a VOP_LOCK on intermediate vnodes
> > that aren't backing objects is to deal with the global vnode
> > pool management.  I'd really like FS's to own their vnode pools,
> > but even without that, you don't need the locking, since you
> > only need to flush data on vnodes that are backing objects.
> > 
> > If we look at a stack of FS's with intermediate exposure into the
> > namespace, then it's clear that the issue is really only applicable
> > to objects that act as a backing store:
> > 
> > 
> > ----------------------	----------------------	--------------------
> > FS			Exposed in hierarchy	Backing object
> > ----------------------	----------------------	--------------------
> > top			yes			no
> > intermediate_1		no			no
> > intermediate_2		no			yes
> > intermediate_3		yes			no
> > bottom			no			yes
> > ----------------------	----------------------	--------------------
> > 
> > So when we lock "top", we only lock in intermediate_2 and in bottom.
> 
> No. One of the things Heidemann notes in his dissertation is that to
> prevent deadlock, you have to lock the whole stack of vnodes at once, not
> bit by bit.
> 
> i.e. there is one lock for the whole thing.

This is not true for a unified VM and buffer cache environment,
and a significant reduction in overhead can be achieved thereby.

Heidemann did his work on SVR4, which does not have a unified VM
and buffer cache.  The deadlock discussion in his dissertation is
only applicable to systems where the coherency model is such that
each and every vnode has buffers associated with it.  That is, it
applies to vnodes which act as backing store (buffer cache object
references).

If you separate the concept, such that you don't have to deal with
vnodes that do not have coherency issues, then you can drastically
reduce the number of coherency operations required (locking is a
coherency operation).

In addition to this, you can effectively obtain what neither the
Rosenthal nor the SVR4 version of the Heidemann stacking framework
can otherwise obtain: intermediate VFS layer NULL VOP call collapse.
The way you obtain this is by caching the vnode of the backing object
in the intermediate layer, and dereferencing it to get at its VOP
vector directly.

This means that a functional layer that shadows an underlying VOP,
separated by 1,000 NULLFS layers, does not result in a 1,000 function
call overhead.
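
A toy illustration of the saving, assuming each layer can remember
which lower layer actually implements a given VOP.  `Layer`,
`call_walking`, `call_cached` and `build_stack` are hypothetical names,
and cache invalidation when the stack changes is deliberately ignored.

```python
class Layer:
    def __init__(self, name, lower=None, ops=None):
        self.name = name
        self.lower = lower
        self.ops = ops or {}     # VOP name -> callable
        self.target = {}         # per-VOP cache of the implementing layer


def call_walking(layer, vop):
    """Pass the VOP down one layer at a time; return (result, layers visited)."""
    visited = 1
    while vop not in layer.ops:
        if layer.lower is None:
            raise NotImplementedError(vop)
        layer = layer.lower
        visited += 1
    return layer.ops[vop](), visited


def call_cached(layer, vop):
    """Resolve the implementing layer once, then jump straight to it."""
    target = layer.target.get(vop)
    if target is None:
        target = layer
        while vop not in target.ops:
            if target.lower is None:
                raise NotImplementedError(vop)
            target = target.lower
        layer.target[vop] = target
    return target.ops[vop]()


def build_stack(null_layers):
    """Build top -> N null layers -> bottom; only bottom implements "read"."""
    bottom = Layer("bottom", ops={"read": lambda: "data from bottom"})
    layer = bottom
    for i in range(null_layers):
        layer = Layer(f"null_{i}", lower=layer)
    return Layer("top", lower=layer), bottom
```

With 1,000 null layers, `call_walking` touches every layer on each
call, while `call_cached` walks the stack once and afterwards goes
directly from the top layer to the cached target.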


> > > Actually isn't the only problem when you have vnode fan-in (union FS)? 
> > > i.e.  a plain compressing layer should not introduce vnode locking
> > > problems. 
> > 
> > If it's a block compression layer, it will.  Also a translation layer;
> > consider a pure Unicode system that wants to remotely mount an FS
> > from a legacy system.  To do this, it needs to expand the pages from
> > the legacy system [only it can, since the legacy system doesn't know
> > about Unicode] in a 2:1 ratio.  Now consider doing a byte-range lock
> > on a file on such a system.  To propagate the lock, you have to do
> > an arithmetic conversion at the translation layer.  This gets worse
> > if the lower end FS is exposed in the namespace as well.
> 
> Wait. byte-range locking is different from vnode locking. I've been
> talking about vnode locking, which is different from the byte-range
> locking you're discussing above.

Conceptually, they're not really different at all.  You want to
apply an operation against a stack of vnodes, and only involve
the relevant vnodes when you do it.


> > > Nope. The problem is that while stacking (null, umap, and overlay fs's)
> > > work, we don't have the coherency issues worked out so that upper layers
> > > can cache data. i.e. so that the lower fs knows it has to ask the upper
> > > layers to give pages back. :-) But multiple ls -lR's work fine. :-)
> > 
> > With UVM in NetBSD, this is (supposedly) not an issue.
> 
> UBC. UVM is a new memory manager. UBC unifies the buffer cache with the VM
> system.

I was under the impression that the "U" in "UVM" was for "Unified".

Does NetBSD not have a unified VM and buffer cache?  Is the "U" in
"UVM" referring not to buffer cache unification, but to platform
unification?

It was my understanding from John Dyson, who had to work on NetBSD
for NCI, that the new NetBSD stuff actually unified the VM and the
buffer cache.

If this isn't the case, then, yes, you will need to lock all the way
up and down, and eat the copy overhead for the concurrency for the
intermediate vnodes.  8-(.


> > You could actually think of it this way, as well: only FS's that
> > contain vnodes that provide backing should implement VOP_GETPAGES
> > and VOP_PUTPAGES, and all I/O should be done through paging.
> 
> Right. That's part of UBC. :-)

Yep.  Again, if NetBSD doesn't have this, it's really important
that it obtain it.  8-(.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.




From owner-freebsd-fs  Wed Aug 18 14: 2:47 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135])
	by hub.freebsd.org (Postfix) with ESMTP
	id 472E2151E1; Wed, 18 Aug 1999 14:02:05 -0700 (PDT)
	(envelope-from tlambert@usr06.primenet.com)
Received: (from daemon@localhost)
	by smtp05.primenet.com (8.9.1/8.9.1) id OAA435022;
	Wed, 18 Aug 1999 14:02:07 -0700
Received: from usr06.primenet.com(206.165.6.206)
 via SMTP by smtp05.primenet.com, id smtpdaToVMa; Wed Aug 18 14:01:59 1999
Received: (from tlambert@localhost)
	by usr06.primenet.com (8.8.5/8.8.5) id OAA29646;
	Wed, 18 Aug 1999 14:01:54 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199908182101.OAA29646@usr06.primenet.com>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
To: phk@critter.freebsd.dk (Poul-Henning Kamp)
Date: Wed, 18 Aug 1999 21:01:53 +0000 (GMT)
Cc: tlambert@primenet.com, michaelh@cet.co.jp, wrstuden@nas.nasa.gov,
	Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
In-Reply-To: <1774.935002618@critter.freebsd.dk> from "Poul-Henning Kamp" at Aug 18, 99 08:56:58 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> >> >You would have to de-collapse several VOP lists that have been
> >> >pre-collapsed.
> >> 
> >> You are talking gibberish here.  Please show code where this is
> >> a problem.
> >
> >When you write a proxy stacking layer, such as John Heidemann's
> >network proxy stacking layer (an NFS alternative), VOP's which
> >would normally be handled by vfs_default have to be handled on
> >the other end of the proxy, instead, in the same way that they
> >would be handled by the vfs_default stuff.
> 
> And what prevents you from taking over the default op ?

It needs to be NULL, not taken over.


machine 1		machine2		machine 3

vfs consumer
upper proxy <---------> lower proxy
			vfs stacking layer
			upper proxy <---------> lower proxy
						vfs producer

How do I get a VOP, unknown to machine 2, from the vfs consumer
on machine 1 that does know about it, to the vfs producer on
machine 3 that also knows about it?

My understanding is that it is very hard, given vfs_default:

On machine 1, since the upper proxy doesn't know from VOP's, it
wants to locally satisfy it from vfs_default on machine 1.  Taking
over the default op doesn't really help me; I have to do surgery
to the in core dispatch vector instance to do the job properly
(e.g. zapping it out, not taking it over).

On machine 2, it is out of range, but still needs to be passed
through the stacking layer, from the lower proxy to the upper
proxy (and the response, back).
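
A toy model of the concern described above, as a sketch only.  The op
numbers, struct layout and function names here are hypothetical and are
not FreeBSD's VFS code; the model simply assumes that a NULL slot in a
layer's dispatch vector means "pass the op down the stack", while a
slot filled with a default op is satisfied locally.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not FreeBSD's VFS API. */
enum { OP_LOOKUP, OP_READ, OP_NEWOP, OP_MAX };

enum { HANDLED_BY_DEFAULT = 1, HANDLED_BY_PRODUCER = 2 };

typedef int (*op_fn)(void);

struct layer {
    op_fn         vec[OP_MAX];  /* in-core dispatch vector */
    struct layer *lower;        /* next layer down, or NULL */
};

static int default_op(void)  { return HANDLED_BY_DEFAULT; }
static int producer_op(void) { return HANDLED_BY_PRODUCER; }

/* Fill every empty slot with the default op ("pre-collapsing"). */
static void collapse_defaults(struct layer *l)
{
    for (int i = 0; i < OP_MAX; i++)
        if (l->vec[i] == NULL)
            l->vec[i] = default_op;
}

/* Dispatch: a NULL slot means "not mine, pass it down the stack". */
static int dispatch(const struct layer *l, int op)
{
    assert(op >= 0 && op < OP_MAX);
    for (; l != NULL; l = l->lower)
        if (l->vec[op] != NULL)
            return l->vec[op]();
    return -1;  /* nobody in the stack implements this op */
}

/* Build a two-layer stack and report who ends up handling OP_NEWOP. */
static int who_handles_newop(int collapse_proxy)
{
    struct layer producer = { .vec = { [OP_NEWOP] = producer_op } };
    struct layer proxy    = { .lower = &producer };  /* knows no ops */

    if (collapse_proxy)
        collapse_defaults(&proxy);
    return dispatch(&proxy, OP_NEWOP);
}
```

In this model, who_handles_newop(0) reaches the producer because the
proxy's slot is NULL, while who_handles_newop(1) is answered by the
proxy's own default op and never travels further.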


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Wed Aug 18 14:17:57 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from critter.freebsd.dk (critter.freebsd.dk [212.242.40.131])
	by hub.freebsd.org (Postfix) with ESMTP
	id 9C92214ED3; Wed, 18 Aug 1999 14:17:30 -0700 (PDT)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.9.3/8.9.2) with ESMTP id XAA02921;
	Wed, 18 Aug 1999 23:15:49 +0200 (CEST)
	(envelope-from phk@critter.freebsd.dk)
To: Terry Lambert <tlambert@primenet.com>
Cc: michaelh@cet.co.jp, wrstuden@nas.nasa.gov,
	Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite 
In-reply-to: Your message of "Wed, 18 Aug 1999 21:01:53 -0000."
             <199908182101.OAA29646@usr06.primenet.com> 
Date: Wed, 18 Aug 1999 23:15:49 +0200
Message-ID: <2919.935010949@critter.freebsd.dk>
From: Poul-Henning Kamp <phk@critter.freebsd.dk>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


Terry,

This example is all very well, but I'm not even going to bother
much with it for several reasons, most of which you can find codified
in the development rules for X11 which you can find in Scheifler's
book.

But for the record: your example would get even shorter on
the code we had before I started using the default op sensibly
because all the layers tended to shunt things they didn't 
understand to errno rather than pass them through, so in
fact my change took us closer to being able to handle the
rather lofty example you have here.

Once you show me an actual implementation which has a problem
with it, I will look at it again; until then, I think pretty
much everything else is more important (Scheifler's 1st rule :-)

Poul-Henning

>> And what prevents you from taking over the default op ?
>
>It needs to be NULL, not taken over.
>
>
>machine 1		machine2		machine 3
>
>vfs consumer
>upper proxy <---------> lower proxy
>			vfs stacking layer
>			upper proxy <---------> lower proxy
>						vfs producer
>
>How do I get a VOP, unknown to machine 2, from the vfs consumer
>on machine 1 that does know about it, to the vfs producer on
>machine 3 that also knows about it?

--
Poul-Henning Kamp             FreeBSD coreteam member
phk@FreeBSD.ORG               "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Wed Aug 18 17:18:55 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133])
	by hub.freebsd.org (Postfix) with ESMTP
	id 6F08615987; Wed, 18 Aug 1999 17:18:51 -0700 (PDT)
	(envelope-from tlambert@usr06.primenet.com)
Received: (from daemon@localhost)
	by smtp03.primenet.com (8.9.3/8.9.3) id RAA11206;
	Wed, 18 Aug 1999 17:18:41 -0700 (MST)
Received: from usr06.primenet.com(206.165.6.206)
 via SMTP by smtp03.primenet.com, id smtpdAAA4ka42v; Wed Aug 18 17:18:36 1999
Received: (from tlambert@localhost)
	by usr06.primenet.com (8.8.5/8.8.5) id RAA09816;
	Wed, 18 Aug 1999 17:18:41 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199908190018.RAA09816@usr06.primenet.com>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
To: phk@critter.freebsd.dk (Poul-Henning Kamp)
Date: Thu, 19 Aug 1999 00:18:41 +0000 (GMT)
Cc: tlambert@primenet.com, michaelh@cet.co.jp, wrstuden@nas.nasa.gov,
	Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
In-Reply-To: <2919.935010949@critter.freebsd.dk> from "Poul-Henning Kamp" at Aug 18, 99 11:15:49 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> Terry,
> 
> This example is all very well, but I'm not even going to bother
> much with it for several reasons, most of which you can find codified
> in the development rules for X11 which you can find in Scheifler's
> book.
> 
> But for the record: your example would get even shorter on
> the code we had before I started using the default op sensibly
> because all the layers tended to shunt things they didn't 
> understand to errno rather than pass them through, so in
> fact my change took us closer to being able to handle the
> rather lofty example you have here.
> 
> Once you show me an actual implementation which has a problem
> with it, I will look at it again; until then, I think pretty
> much everything else is more important (Scheifler's 1st rule :-)
> 
> Poul-Henning


That's a fair requirement.  I have some of Heidemann's code that
runs into the problem, but I don't have any that I can redistribute.

Would it be OK if I asked John to send you his code as well, if
you will abide by the non-redistribution requirement?

I understand the prioritization process, and FWIW, I agree with
it, in a resource-starved situation (e.g. FreeBSD).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Wed Aug 18 22:41:27 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from peach.ocn.ne.jp (peach.ocn.ne.jp [210.145.254.87])
	by hub.freebsd.org (Postfix) with ESMTP
	id 7C26E14F15; Wed, 18 Aug 1999 22:41:17 -0700 (PDT)
	(envelope-from dcs@newsguy.com)
Received: from newsguy.com by peach.ocn.ne.jp (8.9.1a/OCN) id OAA11981; Thu, 19 Aug 1999 14:39:39 +0900 (JST)
Message-ID: <37BB88F3.7184305@newsguy.com>
Date: Thu, 19 Aug 1999 13:32:51 +0900
From: "Daniel C. Sobral" <dcs@newsguy.com>
X-Mailer: Mozilla 4.6 [en] (Win98; I)
X-Accept-Language: en,pt-BR,ja
MIME-Version: 1.0
To: Terry Lambert <tlambert@primenet.com>
Cc: Poul-Henning Kamp <phk@critter.freebsd.dk>, michaelh@cet.co.jp,
	wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com,
	Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
References: <199908181848.LAA14960@usr02.primenet.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Terry Lambert wrote:
> 
> Make sure that the system you are talking to over the proxy is
> not assumed to be a FreeBSD system (e.g. don't assume that the
> vfs_default stuff exists on the other end of the proxy, or that
> it would be functional).

Now, Terry, that is ridiculous. One has to assume that both ends
play by the same rules. That is not only a reasonable expectation,
it's a minimum requirement for any protocol to work.

--
Daniel C. Sobral			(8-DCS)
dcs@newsguy.com
dcs@freebsd.org

	- Can I speak to your superior?
	- There's some religious debate on that question.




To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Aug 19  7:21:25 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from lupo.thebarn.com (x101-182-203.unreg.umn.edu [128.101.182.203])
	by hub.freebsd.org (Postfix) with ESMTP
	id 6599C150F5; Thu, 19 Aug 1999 07:21:15 -0700 (PDT)
	(envelope-from cattelan@thebarn.com)
Received: from thebarn.com ([128.101.182.201])
	by lupo.thebarn.com (8.9.3/8.9.1) with ESMTP id BAA86916;
	Thu, 19 Aug 1999 01:01:15 -0500 (CDT)
Message-ID: <37BB9DAB.E7F0FED0@thebarn.com>
Date: Thu, 19 Aug 1999 01:01:15 -0500
From: Russell Cattelan <cattelan@thebarn.com>
X-Mailer: Mozilla 4.61 [en] (X11; I; FreeBSD 4.0-CURRENT i386)
X-Accept-Language: en
MIME-Version: 1.0
To: "Alton, Matthew" <Matthew.Alton@anheuser-busch.com>
Cc: "'Hackers@FreeBSD.ORG'" <Hackers@FreeBSD.ORG>,
	"'fs@FreeBSD.ORG'" <fs@FreeBSD.ORG>
Subject: Re: BSD-XFS Update
References: <0740CBD1D149D31193EB0008C7C56836EB8B05@STLABCEXG012>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

"Alton, Matthew" wrote:

> SGI has released a portion of the XFS source code under the GPL:
>
> http://oss.sgi.com/projects/xfs/download/
>
> the source file is xfs_log.tar.gz.
>
> Of greater interest at this stage are the documents in:
>
> http://oss.sgi.com/projects/xfs/design_docs/
>
> I am currently researching methods for implementing the 64-bit
> syscalls stat64(), fstat64(), lseek64() &etc.  delineated in the
> SGI design doc _64 Bit File Access_  by Adam Sweeney.

The xxxx64 calls are no longer an issue: as of IRIX 6.(something, 2 I
think) all the standard calls were converted to use 64-bit types directly.

I have a better one for you to research: find out whether buffers can be
pinned, and if not, what it is going to take to fix that.

>
> The BSD-XFS port will be made available as a patch to the RELEASE
> FreeBSD kernels.

Given the size of XFS it might be easier to make FreeBSD a patch to XFS.
<- major humor here.
:-) :-)

>
>
> Matthew Alton
> Computer Services - UNIX Systems Administration
> (314)632-6644   matthew.alton@anheuser-busch.com
>                 alton@plantnet.com
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-hackers" in the body of the message

--
Russell Cattelan
cattelan@thebarn.com





To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Aug 19  7:21:37 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from lupo.thebarn.com (x101-182-203.unreg.umn.edu [128.101.182.203])
	by hub.freebsd.org (Postfix) with ESMTP
	id AA72D1511C; Thu, 19 Aug 1999 07:21:15 -0700 (PDT)
	(envelope-from cattelan@thebarn.com)
Received: from thebarn.com ([128.101.182.201])
	by lupo.thebarn.com (8.9.3/8.9.1) with ESMTP id AAA86822;
	Thu, 19 Aug 1999 00:41:29 -0500 (CDT)
Message-ID: <37BB9909.53D356FE@thebarn.com>
Date: Thu, 19 Aug 1999 00:41:29 -0500
From: Russell Cattelan <cattelan@thebarn.com>
X-Mailer: Mozilla 4.61 [en] (X11; I; FreeBSD 4.0-CURRENT i386)
X-Accept-Language: en
MIME-Version: 1.0
To: "Alton, Matthew" <Matthew.Alton@anheuser-busch.com>
Cc: "'Hackers@FreeBSD.ORG'" <Hackers@FreeBSD.ORG>,
	"'fs@FreeBSD.ORG'" <fs@FreeBSD.ORG>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
References: <0740CBD1D149D31193EB0008C7C56836EB8AFC@STLABCEXG012>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Glad to hear somebody is willing to dive in to XFS.


Right now I am one of three people working on the XFS-to-Linux port, so I
have a pretty good view of what is currently happening.

When is it going to be ready?
Don't hold your breath. Officially SGI has said by the end of the year;
technically... whew, frankly I can't even guess. I would hope that within
a month or so we will have the basics of a FS.

There are a lot of hurdles to overcome. XFS is a very, very complex file
system that relies on some of the more advanced features of IRIX. The
buffer cache and the chunk cache (chunking buffers together to do large
I/O) are two examples that come to mind. SGI is rewriting the buffer cache
(calling it the page cache) so that it will be able to support XFS.
Chunk cache... ? Not sure what we are going to do with that.

We have been having several discussions about the best way to "interface".
IRIX uses VFS, VNODE, BEHAVIOR, which is similar to the BSDs' interface
but of course very IRIX specific. Linux's vfs/vnode is different from
either. Realizing this, a lot of our discussions have been around how to
go about making a new interface layer, or modifying an existing one, that
might be more "universal", i.e. not specific to IRIX, Linux, BSD, etc.

In reading Terry's & Bill's comments, it seems there is a lot of room for
improvement.

Initially we are trying to make as few changes as possible to XFS to get
an initial implementation running on Linux. After we get things running we
will start to analyze where the problems exist, and decide what direction
to take in terms of interface at that time.

I would like any constructive input people have on this matter. I have a
pretty good chance of setting design direction.
Be warned: SGI at the moment is committed to Linux; development directions
will favor that platform. They are not against other OSes being
XFS'atized, but SGI is in the business of selling hardware and solutions
based on that hardware, and Linux is one of the OSes they have decided to
use for their Intel-based boxes.

Also, as far as the GPL issue goes, get over it! I understand the issues
and agree with many of the points.
My suggestion: let's find a way to work with the GPL (i.e. loadable kernel
module / softupdates model).
If somebody has a very, very good argument/solution to the licensing
debate, let me know; I can present it to the people dealing with the
lawyers. The license issue has slowed the release of the actual code more
than anything else, and will not be revisited without great pain.


> I am currently conducting a thorough study of the VFS subsystem
> in preparation for an all-out effort to port SGI's XFS filesystem to
> FreeBSD 4.x at such time as SGI gives up the code.  Matt Dillon
> has written in hackers- that the VFS subsystem is presently not
> well understood by any of the active kernel code contributers and
> that it will be rewritten later this year.  This is obviously of great
> concern to me in this port.  I greatly appreciate all assistance in
> answering the following questions:
>
> 1)  What are the perceived problems with the current VFS?
> 2)  What options are available to us as remedies?
> 3)  To what extent will existing FS code require revision in order
>      to be useful after the rewrite?
> 4)  Will Chapters 6,7,8 & 9 of "The Design and Implementation of
>      the 4.4BSD Operating System" still pertain after the rewrite?
> 5)  How important are questions 3 & 4 in the design of the new
>      VFS?
>
> I believe that the VFS is conceptually sound and that the existing
> semantics should be strictly retained in the new code.  Any new
> functionality should be added in the form of entirely new kernel
> routines and system calls, or possibly by such means as
> converting the existing routines to the vararg format &etc.
>
> Does anyone know when SGI will release XFS?
>
>

--
Russell Cattelan
cattelan@thebarn.com





To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Aug 19  8:47: 1 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from gatewaya.anheuser-busch.com (gatewaya.anheuser-busch.com [151.145.250.252])
	by hub.freebsd.org (Postfix) with SMTP
	id 06922150B0; Thu, 19 Aug 1999 08:46:42 -0700 (PDT)
	(envelope-from Matthew.Alton@anheuser-busch.com)
Received: by gatewaya.anheuser-busch.com; id KAA02280; Thu, 19 Aug 1999 10:47:29 -0500
Received: from stlexggtw002-pozzoli.fw-users.busch.com(151.145.101.130) by gatewaya.anheuser-busch.com via smap (V5.0)
	id xma002157; Thu, 19 Aug 99 10:46:42 -0500
Received: from stlabcexg006.anheuser-busch.com ([151.145.101.161]) by 151.145.101.130
  (Norton AntiVirus for Internet Email Gateways 1.0) ;
  Thu, 19 Aug 1999 15:44:28 0000 (GMT)
Received: by stlabcexg006.anheuser-busch.com with Internet Mail Service (5.5.2448.0)
	id <RGNX6V8Z>; Thu, 19 Aug 1999 10:44:06 -0500
Message-ID: <0740CBD1D149D31193EB0008C7C56836EB8B14@STLABCEXG012>
From: "Alton, Matthew" <Matthew.Alton@anheuser-busch.com>
To: "'Russell Cattelan'" <cattelan@thebarn.com>
Cc: "'Hackers@FreeBSD.ORG'" <Hackers@FreeBSD.ORG>,
	"'fs@FreeBSD.ORG'" <fs@FreeBSD.ORG>
Subject: RE: BSD XFS Port & BSD VFS Rewrite
Date: Thu, 19 Aug 1999 10:44:27 -0500
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2448.0)
Content-Type: text/plain;
	charset="iso-8859-1"
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Do you have access to more of the code than is currently posted on SGI's
web page?  I am willing to sign an NDA in order to get access to all
relevant source.  I would like to assist in porting XFS to Linux also.  I would
very much like to see SGI succeed by using open source software in the 
commercial realm.  As for licensing issues, I am purely agnostic -- I trust that
any legal issues can be worked out after the fact by the proper people.

> -----Original Message-----
> From:	Russell Cattelan [SMTP:cattelan@thebarn.com]
> Sent:	Thursday, August 19, 1999 12:41 AM
> To:	Alton, Matthew
> Cc:	'Hackers@FreeBSD.ORG'; 'fs@FreeBSD.ORG'
> Subject:	Re: BSD XFS Port & BSD VFS Rewrite
> 
> Glad to hear somebody is willing to dive in to XFS.
> 
> 
> Right now I am one of three people working on the XFS-to-Linux port, so I
> have a pretty good view of what is currently happening.
> 
> When is it going to be ready?
> Don't hold your breath. Officially SGI has said by the end of the year;
> technically... whew, frankly I can't even guess. I would hope that within
> a month or so we will have the basics of a FS.
> 
> There are a lot of hurdles to overcome. XFS is a very, very complex file
> system that relies on some of the more advanced features of IRIX. The
> buffer cache and the chunk cache (chunking buffers together to do large
> I/O) are two examples that come to mind. SGI is rewriting the buffer cache
> (calling it the page cache) so that it will be able to support XFS.
> Chunk cache... ? Not sure what we are going to do with that.
> 
> We have been having several discussions about the best way to "interface".
> IRIX uses VFS, VNODE, BEHAVIOR, which is similar to the BSDs' interface
> but of course very IRIX specific. Linux's vfs/vnode is different from
> either. Realizing this, a lot of our discussions have been around how to
> go about making a new interface layer, or modifying an existing one, that
> might be more "universal", i.e. not specific to IRIX, Linux, BSD, etc.
> 
> In reading Terry's & Bill's comments, it seems there is a lot of room for
> improvement.
> 
> Initially we are trying to make as few changes as possible to XFS to get
> an initial implementation running on Linux. After we get things running we
> will start to analyze where the problems exist, and decide what direction
> to take in terms of interface at that time.
> 
> I would like any constructive input people have on this matter. I have a
> pretty good chance of setting design direction.
> Be warned: SGI at the moment is committed to Linux; development directions
> will favor that platform. They are not against other OSes being
> XFS'atized, but SGI is in the business of selling hardware and solutions
> based on that hardware, and Linux is one of the OSes they have decided to
> use for their Intel-based boxes.
> 
> Also, as far as the GPL issue goes, get over it! I understand the issues
> and agree with many of the points.
> My suggestion: let's find a way to work with the GPL (i.e. loadable kernel
> module / softupdates model).
> If somebody has a very, very good argument/solution to the licensing
> debate, let me know; I can present it to the people dealing with the
> lawyers. The license issue has slowed the release of the actual code more
> than anything else, and will not be revisited without great pain.
> 
> 
> > I am currently conducting a thorough study of the VFS subsystem
> > in preparation for an all-out effort to port SGI's XFS filesystem to
> > FreeBSD 4.x at such time as SGI gives up the code.  Matt Dillon
> > has written in hackers- that the VFS subsystem is presently not
> > well understood by any of the active kernel code contributers and
> > that it will be rewritten later this year.  This is obviously of great
> > concern to me in this port.  I greatly appreciate all assistance in
> > answering the following questions:
> >
> > 1)  What are the perceived problems with the current VFS?
> > 2)  What options are available to us as remedies?
> > 3)  To what extent will existing FS code require revision in order
> >      to be useful after the rewrite?
> > 4)  Will Chapters 6,7,8 & 9 of "The Design and Implementation of
> >      the 4.4BSD Operating System" still pertain after the rewrite?
> > 5)  How important are questions 3 & 4 in the design of the new
> >      VFS?
> >
> > I believe that the VFS is conceptually sound and that the existing
> > semantics should be strictly retained in the new code.  Any new
> > functionality should be added in the form of entirely new kernel
> > routines and system calls, or possibly by such means as
> > converting the existing routines to the vararg format &etc.
> >
> > Does anyone know when SGI will release XFS?
> >
> >
> 
> --
> Russell Cattelan
> cattelan@thebarn.com
> 
> 
> 
> 
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-hackers" in the body of the message



To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Aug 19  8:56: 6 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from gatewaya.anheuser-busch.com (gatewaya.anheuser-busch.com [151.145.250.252])
	by hub.freebsd.org (Postfix) with SMTP
	id 0B16914C37; Thu, 19 Aug 1999 08:56:00 -0700 (PDT)
	(envelope-from Matthew.Alton@anheuser-busch.com)
Received: by gatewaya.anheuser-busch.com; id KAA04634; Thu, 19 Aug 1999 10:57:39 -0500
Received: from stlexggtw002-pozzoli.fw-users.busch.com(151.145.101.130) by gatewaya.anheuser-busch.com via smap (V5.0)
	id xma004561; Thu, 19 Aug 99 10:57:18 -0500
Received: from stlabcexg004.anheuser-busch.com ([151.145.101.160]) by 151.145.101.130
  (Norton AntiVirus for Internet Email Gateways 1.0) ;
  Thu, 19 Aug 1999 15:55:03 0000 (GMT)
Received: by stlabcexg004.anheuser-busch.com with Internet Mail Service (5.5.2448.0)
	id <RGNYBWLK>; Thu, 19 Aug 1999 10:54:54 -0500
Message-ID: <0740CBD1D149D31193EB0008C7C56836EB8B15@STLABCEXG012>
From: "Alton, Matthew" <Matthew.Alton@anheuser-busch.com>
To: "'Russell Cattelan'" <cattelan@thebarn.com>
Cc: "'Hackers@FreeBSD.ORG'" <Hackers@FreeBSD.ORG>,
	"'fs@FreeBSD.ORG'" <fs@FreeBSD.ORG>
Subject: RE: BSD-XFS Update
Date: Thu, 19 Aug 1999 10:55:11 -0500
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2448.0)
Content-Type: text/plain
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Pinned in the AIX-style "pinned memory" sense?  Succinctly, AIX
allows userland programs to tag memory pages so as to guarantee that
they will not be swapped to backing store.  Portions of the _KERNEL_
are paged out instead if necessary.

I assume that the pinning is of the AIX sort and that it is desirable, if
not necessary, for the realtime throughput guarantee policy.  N'est-ce pas?
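
For comparison, a minimal userland sketch of page pinning using the
POSIX mlock(2) call, which is the closest portable analogue to the
userland tagging described above.  This is an illustration under
assumptions: the helper names alloc_pinned and release_buffer are
hypothetical, and whether userland pinning is the sense of "pinned
buffers" that the XFS port needs is not established in this thread.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Allocate a page-aligned buffer and try to pin it with mlock(2).
 * Returns the buffer (release it with release_buffer) or NULL on
 * allocation failure.  *pinned is set to 1 if the pages were locked,
 * 0 if mlock() was refused (e.g. RLIMIT_MEMLOCK or missing privilege).
 */
static void *alloc_pinned(size_t len, int *pinned)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    void *buf = NULL;

    assert(pinned != NULL && len > 0);
    if (pagesz <= 0 || posix_memalign(&buf, (size_t)pagesz, len) != 0)
        return NULL;
    memset(buf, 0, len);               /* touch the pages */
    *pinned = (mlock(buf, len) == 0);  /* keep them resident if allowed */
    return buf;
}

/* Unlock (if it was locked) and free a buffer from alloc_pinned(). */
static void release_buffer(void *buf, size_t len, int pinned)
{
    if (pinned)
        munlock(buf, len);
    free(buf);
}
```

A caller would check the pinned flag rather than assume success, since
the system may refuse the lock; the buffer is still usable either way,
just not guaranteed to stay resident.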

> -----Original Message-----
> From:	Russell Cattelan [SMTP:cattelan@thebarn.com]
> Sent:	Thursday, August 19, 1999 1:01 AM
> To:	Alton, Matthew
> Cc:	'Hackers@FreeBSD.ORG'; 'fs@FreeBSD.ORG'
> Subject:	Re: BSD-XFS Update
> 
> "Alton, Matthew" wrote:
> 
> > SGI has released a portion of the XFS source code under the GPL:
> >
> > http://oss.sgi.com/projects/xfs/download/
> >
> > the source file is xfs_log.tar.gz.
> >
> > Of greater interest at this stage are the documents in:
> >
> > http://oss.sgi.com/projects/xfs/design_docs/
> >
> > I am currently researching methods for implementing the 64-bit
> > syscalls stat64(), fstat64(), lseek64() &etc.  delineated in the
> > SGI design doc _64 Bit File Access_  by Adam Sweeney.
> 
> The xxxx64 calls are no longer an issue: as of IRIX 6.(something, 2 I
> think) all the standard calls were converted to use 64-bit types directly.
> 
> I have a better one for you to research: find out whether buffers can be
> pinned, and if not, what it is going to take to fix that.
> 
> >
> > The BSD-XFS port will be made available as a patch to the RELEASE
> > FreeBSD kernels.
> 
> Given the size of XFS it might be easier to make FreeBSD a patch to XFS.
> <- major humor here.
> :-) :-)
> 
> >
> >
> > Matthew Alton
> > Computer Services - UNIX Systems Administration
> > (314)632-6644   matthew.alton@anheuser-busch.com
> >                 alton@plantnet.com
> >
> > To Unsubscribe: send mail to majordomo@FreeBSD.org
> > with "unsubscribe freebsd-hackers" in the body of the message
> 
> --
> Russell Cattelan
> cattelan@thebarn.com
> 
> 
> 
> 
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-fs" in the body of the message



To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Aug 19 11: 3:35 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134])
	by hub.freebsd.org (Postfix) with ESMTP
	id 4D249159DE; Thu, 19 Aug 1999 11:03:25 -0700 (PDT)
	(envelope-from tlambert@usr06.primenet.com)
Received: (from daemon@localhost)
	by smtp04.primenet.com (8.9.3/8.9.3) id LAA13024;
	Thu, 19 Aug 1999 11:02:27 -0700 (MST)
Received: from usr06.primenet.com(206.165.6.206)
 via SMTP by smtp04.primenet.com, id smtpdAAAH8aWyz; Thu Aug 19 11:02:23 1999
Received: (from tlambert@localhost)
	by usr06.primenet.com (8.8.5/8.8.5) id LAA25563;
	Thu, 19 Aug 1999 11:02:34 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199908191802.LAA25563@usr06.primenet.com>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
To: dcs@newsguy.com (Daniel C. Sobral)
Date: Thu, 19 Aug 1999 18:02:34 +0000 (GMT)
Cc: tlambert@primenet.com, phk@critter.freebsd.dk,
	michaelh@cet.co.jp, wrstuden@nas.nasa.gov,
	Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
In-Reply-To: <37BB88F3.7184305@newsguy.com> from "Daniel C. Sobral" at Aug 19, 99 01:32:51 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> Terry Lambert wrote:
> > 
> > Make sure that the system you are talking to over the proxy is
> > not assumed to be a FreeBSD system (e.g. don't assume that the
> > vfs_default stuff exists on the other end of the proxy, or that
> > it would be functional).
> 
> Now, Terry, that is ridiculous. One has to assume that both ends
> play by the same rules. That is not only a reasonable expectation,
> it's a minimum requirement for any protocol to work.

That's kind of the point.  No other VFS stacking system out there
plays by FreeBSD's revamped rules.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Aug 19 12:18:14 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from oceana.nlanr.net (oceana.sdsc.edu [132.249.40.200])
	by hub.freebsd.org (Postfix) with ESMTP id DCD94151D0
	for <freebsd-fs@freebsd.org>; Thu, 19 Aug 1999 12:18:00 -0700 (PDT)
	(envelope-from tshansen@oceana.nlanr.net)
Received: from localhost (tshansen@localhost)
	by oceana.nlanr.net (8.8.6/8.8.6) with SMTP id MAA29339;
	Thu, 19 Aug 1999 12:16:28 -0700 (PDT)
Date: Thu, 19 Aug 1999 12:16:28 -0700 (PDT)
From: Todd Hansen <tshansen@oceana.nlanr.net>
To: freebsd-fs@freebsd.org
Cc: Tony McGregor <tonym@oceana.nlanr.net>
Subject: turning off filesystem caching for specific filesystems
Message-ID: <Pine.OSF.3.94.990819121227.23165E-100000@oceana.nlanr.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

I was wondering if there is some hidden method in the kernel
configuration or in sysctl that would allow me to turn off kernel-level
filesystem caching for a specific filesystem? The reason I want to do this
is that I have one very large ccd0c filesystem that is accessed
randomly but at a very high frequency (both reads and writes). I
also have system disks holding the programs that are run in order to
process the data on the ccd filesystem. The problem is that while running
these programs I am seeing .5 MB/s of access to the system
disk even though I am only calling one or two sub-programs. I
believe that is because the ccd0c filesystem is being used so much that it
is exhausting the cache. Thanks in advance for your help.
	-todd



To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Aug 19 14:46:43 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from alpo.whistle.com (alpo.whistle.com [207.76.204.38])
	by hub.freebsd.org (Postfix) with ESMTP id 24FF71535C
	for <fs@freebsd.org>; Thu, 19 Aug 1999 14:46:38 -0700 (PDT)
	(envelope-from julian@whistle.com)
Received: from current1.whistle.com (current1.whistle.com [207.76.205.22])
	by alpo.whistle.com (8.9.1a/8.9.1) with SMTP id OAA70240
	for <fs@freebsd.org>; Thu, 19 Aug 1999 14:44:44 -0700 (PDT)
Date: Thu, 19 Aug 1999 14:46:01 -0700 (PDT)
From: Julian Elischer <julian@whistle.com>
To: fs@freebsd.org
Subject: BUG in 3.2 fsck! (fwd)
Message-ID: <Pine.BSF.3.95.990819144407.13522H-100000@current1.whistle.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

FS types..

thoughts?

An ex-colleague writes to me:

-------------------------------------------
I have created and am testing an fsck version which will make lost+found much
larger (until it fills the first indirect disk page) and which has the ability
to suppress output for errors fixed by preen (which is not exactly what I
proposed before but which achieves the same end with less code and less risk).

It isn't really time yet to discuss merging these features into FreeBSD, but I
think that day will come.  What this mail is really about is a bug in fsck.  I
am not currently competent to submit the fix or to even be positive that the
bug is current.  I think this is a potentially serious bug in the current
sources.  Are you interested?

BUG IN FSCK:

When the 3.2 version of fsck has to create the lost+found directory, it may
fail to flag the appropriate inode busy!

Patch: mkdir lost+found in the root directory of all your file systems.

Discussion: fsck allocates only enough space to keep track of the first inodes
in each cylinder group.  This is clever and good - inode usage tends to occur
at the front of the cylinder group and this saves space.  Unfortunately, it
does not work out well when a directory is created which increases the highest
inode number for the cylinder group - the inode usage doesn't get recorded in
the right place and the inode will be flagged available during pass 5.

Fix: The code change causes fsck to check the cylinder group allocation when
adding an inode and expand the inode list for the cylinder group if necessary.

In inode.c::allocino (near line 605):
    for (ino = request; ino < maxino; ino++)
        if (inoinfo(ino)->ino_state == USTATE)
            break;
    if (ino == maxino)
        return (0);
    inoallocinfo(ino);    /* the one new line of code */

In fsck.h, add the prototype for the new function inoallocinfo.
In utility.c (near line 138), replace the function inoinfo with the following:

static struct inostat unallocated = { USTATE, 0, 0 };
/*
 * Look up state information for an inode.
 */
struct inostat *
inoinfo(inum)
    ino_t inum;
{
    struct inostatlist *ilp;
    int iloff;

    if (inum > maxino)
        errx(EEXIT, "inoinfo: inumber %lu out of range",
            (unsigned long)inum);
    ilp = &inostathead[inum / sblock.fs_ipg];
    iloff = inum % sblock.fs_ipg;
    if (iloff >= ilp->il_numalloced)
        return (&unallocated);
    return (&ilp->il_stat[iloff]);
}

/*
 * Make it safe to allocate this inode: grow the cylinder group's state
 * list, if necessary, so that the inode has an entry of its own.
 */
void
inoallocinfo(inum)
    ino_t inum;
{
    struct inostat *info;
    struct inostatlist *ilp;
    unsigned i, iloff;

    if (inum > maxino)
        errx(EEXIT, "inoallocinfo: inumber %lu out of range",
            (unsigned long)inum);
    ilp = &inostathead[inum / sblock.fs_ipg];
    iloff = inum % sblock.fs_ipg;
    if (iloff >= (unsigned)ilp->il_numalloced) {
        info = calloc(iloff + 1, sizeof *info);
        if (info == NULL)
            errx(EEXIT, "cannot alloc %u bytes for inoinfo",
                (unsigned)(sizeof *info * (iloff + 1)));
        /* Carry over the existing entries, if there are any. */
        if (ilp->il_numalloced > 0)
            memmove(info, ilp->il_stat, ilp->il_numalloced * sizeof *info);
        free(ilp->il_stat);
        ilp->il_stat = info;
        /* Mark the newly added entries as unallocated. */
        for (i = ilp->il_numalloced; i <= iloff; ++i)
            info[i] = unallocated;
        ilp->il_numalloced = iloff + 1;
    }
}
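
As a rough illustration of the fix described above, here is a minimal,
self-contained sketch of a state list that shares one "unallocated"
placeholder for untracked slots and grows on demand before a slot is
written.  It is not the fsck code: the toy_ names, the state values and the
struct layout are all assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy state values; assumptions for this sketch, not fsck's definitions. */
enum { TOY_USTATE = 1, TOY_DSTATE = 3 };

struct toy_inostat { char state; };

/* One list per (hypothetical) cylinder group. */
struct toy_list {
    unsigned numalloced;            /* slots that have private storage */
    struct toy_inostat *stat;       /* NULL while numalloced == 0 */
};

/* Shared placeholder returned for every slot past numalloced. */
static struct toy_inostat toy_unallocated = { TOY_USTATE };

/* Lookup only: callers must not write through the placeholder. */
static struct toy_inostat *
toy_lookup(struct toy_list *lp, unsigned off)
{
    assert(lp != NULL);
    if (off >= lp->numalloced)
        return &toy_unallocated;
    return &lp->stat[off];
}

/*
 * Make sure slot `off` has private storage, growing the list if needed.
 * Returns 0 on success, -1 if memory could not be allocated.
 */
static int
toy_ensure(struct toy_list *lp, unsigned off)
{
    struct toy_inostat *grown;
    unsigned i;

    assert(lp != NULL);
    if (off < lp->numalloced)
        return 0;
    grown = calloc(off + 1, sizeof *grown);
    if (grown == NULL)
        return -1;
    if (lp->numalloced > 0)
        memcpy(grown, lp->stat, lp->numalloced * sizeof *grown);
    free(lp->stat);
    lp->stat = grown;
    for (i = lp->numalloced; i <= off; i++)
        grown[i] = toy_unallocated;     /* new slots start out unused */
    lp->numalloced = off + 1;
    return 0;
}
```

The point of the sketch is the calling order: call toy_ensure() first, then
write the state through toy_lookup().  Writing without growing first would
land in the shared placeholder and be lost, which is the failure mode the
mail describes.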



To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Aug 19 17: 6:30 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from cygnus.rush.net (cygnus.rush.net [209.45.245.133])
	by hub.freebsd.org (Postfix) with ESMTP id 4B7411520D
	for <fs@FreeBSD.ORG>; Thu, 19 Aug 1999 17:06:24 -0700 (PDT)
	(envelope-from bright@rush.net)
Received: from localhost (bright@localhost)
	by cygnus.rush.net (8.9.3/8.9.3) with SMTP id UAA02736;
	Thu, 19 Aug 1999 20:13:25 -0400 (EDT)
Date: Thu, 19 Aug 1999 20:13:24 -0400 (EDT)
From: Alfred Perlstein <bright@rush.net>
To: "Alton, Matthew" <Matthew.Alton@anheuser-busch.com>
Cc: "'Russell Cattelan'" <cattelan@thebarn.com>,
	"'fs@FreeBSD.ORG'" <fs@FreeBSD.ORG>
Subject: RE: BSD-XFS Update
In-Reply-To: <0740CBD1D149D31193EB0008C7C56836EB8B15@STLABCEXG012>
Message-ID: <Pine.BSF.3.96.990819201232.20420P-100000@cygnus.rush.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Thu, 19 Aug 1999, Alton, Matthew wrote:

> Pinned in the AIX-style "pinned memory" sense?  Succinctly, AIX
> allows userland programs to tag memory pages so as to guarantee that
> they will not be swapped to backing store.  Portions of the _KERNEL_
> are paged out instead if necessary.
> 
> I assume that the pinning is of the AIX sort and that it is desirable, if
> not necessary, for the realtime throughput guarantee policy.  N'est-ce pas?

man mlock
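
For readers without the manual page at hand, a minimal sketch of pinning a
single page with mlock(2) might look like the following.  The function name
pin_one_page is made up for this example, and whether the call succeeds
depends on privileges and resource limits on the system in question.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Allocate one page-aligned page, pin it in memory, touch it, then unpin
 * and free it.  Returns 0 on success, or the error number of the call
 * that failed.  (Hypothetical helper, for illustration only.)
 */
static int
pin_one_page(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    int rc;

    assert(pagesize > 0);
    rc = posix_memalign(&buf, (size_t)pagesize, (size_t)pagesize);
    if (rc != 0)
        return rc;
    if (mlock(buf, (size_t)pagesize) != 0) {
        rc = errno;                 /* e.g. EPERM or ENOMEM */
        free(buf);
        return rc;
    }
    memset(buf, 0, (size_t)pagesize);   /* the page stays resident here */
    rc = (munlock(buf, (size_t)pagesize) != 0) ? errno : 0;
    free(buf);
    return rc;
}
```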

-Alfred



To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Aug 19 20:11:20 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from chuq.com (w130.z209220044.sjc-ca.dsl.cnc.net [209.220.44.130])
	by hub.freebsd.org (Postfix) with ESMTP
	id 0BCC614DA0; Thu, 19 Aug 1999 20:11:15 -0700 (PDT)
	(envelope-from chuq@chuq.com)
Received: (from chs@localhost)
	by chuq.com (8.8.8/8.8.8) id UAA02199;
	Thu, 19 Aug 1999 20:10:58 -0700 (PDT)
Date: Thu, 19 Aug 1999 20:10:57 -0700
From: Chuck Silvers <chuq@chuq.com>
To: Terry Lambert <tlambert@primenet.com>
Cc: wrstuden@nas.nasa.gov, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
Message-ID: <19990819201057.A2185@chuq.chuq.com>
References: <Pine.SOL.3.96.990818112953.14430G-100000@marcy.nas.nasa.gov> <199908182043.NAA28863@usr06.primenet.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.95.4us
In-Reply-To: <199908182043.NAA28863@usr06.primenet.com>; from Terry Lambert on Wed, Aug 18, 1999 at 08:43:14PM +0000
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Wed, Aug 18, 1999 at 08:43:14PM +0000, Terry Lambert wrote:
> > > > Nope. The problem is that while stacking (null, umap, and overlay fs's)
> > > > work, we don't have the coherency issues worked out so that upper layers
> > > > can cache data. i.e. so that the lower fs knows it has to ask the upper
> > > > layers to give pages back. :-) But multiple ls -lR's work fine. :-)
> > > 
> > > With UVM in NetBSD, this is (supposedly) not an issue.
> > 
> > UBC. UVM is a new memory manager. UBC unifies the buffer cache with the VM
> > system.
> 
> I was under the impression that the "U" in "UVM" was for "Unified".
> 
> Does NetBSD not have a unified VM and buffer cache?  Is the "U" in
> "UVM" referring not to buffer cache unification, but to platform
> unification?
> 
> It was my understanding from John Dyson, who had to work on NetBSD
> for NCI, that the new NetBSD stuff actually unified the VM and the
> buffer cache.
> 
> If this isn't the case, then, yes, you will need to lock all the way
> up and down, and eat the copy overhead for the concurrency for the
> intermediate vnodes.  8-(.

NetBSD w/UVM currently doesn't have unified caches.  That feature is
what I named UBC, for "unified buffer cache" (a la DEC's UBC).
The U in UVM doesn't actually stand for anything.  :-)


> > > You could actually think of it this way, as well: only FS's that
> > > contain vnodes that provide backing should implement VOP_GETPAGES
> > > and VOP_PUTPAGES, and all I/O should be done through paging.
> > 
> > Right. That's part of UBC. :-)
> 
> Yep.  Again, if NetBSD doesn't have this, it's really important
> that it obtain it.  8-(.

I'm workin' on it... it'll go in soon after the branch for the next release
is created (ie. it won't be in the next release, but the one after that).

-Chuck


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Fri Aug 20 11:16:39 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from alpo.whistle.com (alpo.whistle.com [207.76.204.38])
	by hub.freebsd.org (Postfix) with ESMTP id 6509B15370
	for <fs@freebsd.org>; Fri, 20 Aug 1999 11:16:35 -0700 (PDT)
	(envelope-from julian@whistle.com)
Received: from current1.whistle.com (current1.whistle.com [207.76.205.22])
	by alpo.whistle.com (8.9.1a/8.9.1) with SMTP id LAA03603
	for <fs@freebsd.org>; Fri, 20 Aug 1999 11:11:49 -0700 (PDT)
Date: Fri, 20 Aug 1999 11:13:13 -0700 (PDT)
From: Julian Elischer <julian@whistle.com>
To: fs@freebsd.org
Subject: Re: BUG in 3.2 fsck! (fwd)
Message-ID: <Pine.BSF.3.95.990820111248.1212A-100000@current1.whistle.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

further discussion...

---------- Forwarded message ----------
Date: Fri, 20 Aug 1999 02:57:25 -0700 (PDT)
From: milt <milt@vicor-nb.com>
To: julian@whistle.com, milt@vicor-nb.com
Cc: cayford@vicor-nb.com, conor@vicor-nb.com, davep@vicor-nb.com,
    daver@vicor-nb.com, jrh@vicor-nb.com
Subject: Re: BUG in 3.2 fsck!

HIYA

Well, soft updates sure sounds interesting.  One of our current problems is
that our damn raids don't preserve the disk write order requested by unix. 
I'm fighting that - maybe soft updates will give me enough ammunition to win
that fight.  If not, soft updates won't help us!

I haven't read the soft updates paper yet - I will, but not tonight.  One
question, I am not sure how soft updates are intended to interact with fsck.
Is it your intention that fsck behave differently only in preen mode? If so,
you have misspelled at least one if statement (the LINK COUNT INCREASING
test in dir.c::adjust).

I am writing this now because I want to let you see my current status for fsck
fixes before you install my previous patch.  I will repeat all of this in
later mail (with test instructions yet) once I figure out what the heck I want
to do.  What I have right now is:

BUG 1: When lost+found is allocated on a new highest inode number for a
       cylinder group it ends up without an inoinfo entry and will be flagged
       available during pass 5.

       This is the one in my previous mail.

BUG 2: When an orphan directory happens to start with a parent pointer to
       an inode which will become a newly allocated lost+found, the loop in
       pass 2 will skip the i_dotdot update because it points to a USTATE
       inode, but pass 3 will unwind the update which wasn't done because it
       unwinds i_dotdot for everything it connects!  (The inode isn't USTATE
       anymore because it's now lost+found's inode.)

BUG 3: It has become virtually impossible to learn things from redirected
       output.  Some lines go partially to stderr and partially to stdout with
       disastrous results even when both stdout and stderr are redirected!

       What I am running right now is an fsck that does not mention stderr.
       (2.2.8's fsck mentions stderr only for fatal setup problems - that
       works too, but it requires less thought to just eliminate stderr.)
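
       A minimal sketch of the single-stream idea follows.  The usual cause
       of the scrambling is that a redirected stdout is block buffered
       while stderr is not, so pieces of one line written to both arrive
       out of order.  The names report and report_stream are made up for
       this illustration and are not taken from the fsck sources.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical: the one stream all messages go to, set once at startup. */
static FILE *report_stream;

/*
 * Format a message fragment onto the single report stream.  Because every
 * fragment goes through the same FILE, fragments of one line keep their
 * order no matter how the stream is buffered or redirected.
 * Returns the number of characters written, or a negative value on error.
 */
static int
report(const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(report_stream != NULL);
    va_start(ap, fmt);
    n = vfprintf(report_stream, fmt, ap);
    va_end(ap);
    return n;
}
```

       At startup one would set report_stream = stdout and route every
       message, fatal or not, through report().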

BUG 4: When Milt's new code puts over 32768 files in lost+found it is
       committing a grave error (di_nlink is a signed 16-bit quantity).
       Milt better get his act together before he publishes this.

       NOTE: there is no problem with allocation or extra passes here.  fsck
       has long been allocating disk pages as it extends lost+found.

NEW FEATURE: a q switch which suppresses output for and questions about things
      that would be(/are) fixed in preen mode.  When q is in effect, preen
      mode fixes just happen - no notification to the operator and no
      questions.

      This allows us to get a screen which shows only the interesting errors.
      The preen mode problems get fixed quietly and only the serious stuff
      ends up in the operator's face or on the redirected output file!

      My original intention was to have preen mode keep running after some
      errors, but I now understand why you thought that would be hard.  This
      new switch achieves my goal of seeing only the real problems and is lots
      easier to implement.

DISCUSSION:

As you can deduce from my discovery of bug 4, I really am having lots of fun
testing all this junk.

Current solution to bug 2 is to update the i_dotdot count even in USTATE
inodes during pass2.  That causes lost+found to come out right but precludes
adding inodes in midstream, invalidating my previous patch for bug 1.

Currently, I am pre-allocating one extra inoinfo slot per cylinder group
(which prevents bug 1) and updating USTATE counts (which fixes bug 2).

I realized that bug 4 was out there only a few minutes ago.  Two solutions
occur to me:

a. Switch to a new directory under a different name when lost+found has 32760
   entries.

b. Bag it and claim lost+found is full when it has 32760 files in it.

With 5 to 8 million files/directories in a file system, 32760 isn't very many
so I am not enthused about b.  On the other hand, I can't think of a fix for
bugs 1/2 which is compatible with solution a.  So, I think I'll go to bed!

Hmmm, pondering and rereading this an interesting possibility occurs to me. 
On a bad hardware day, it would help if we put each fsck run in a different
lost+found directory (lost+found.01, lost+found.02, etc.).  fsck would ALWAYS
allocate a new lost+found and if you had multiple crashes on one day it would be
easier to tell which lost+found files should be recovered to where.  (We
really do have tools to recover these beasties and are working on improving
them.)  Which makes the unimplementable solution (a) more interesting!
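
To make solution (a) a little more concrete, here is a minimal sketch of
how the overflow directory name could be chosen.  It assumes the 32760
entry limit mentioned above and borrows the lost+found.NN naming from the
previous paragraph.  The helper lf_dirname and its numbering scheme are
hypothetical; nothing like it exists in fsck as far as this mail shows.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Per-directory entry limit proposed in the mail. */
#define LF_MAX_ENTRIES 32760

/*
 * Write into buf the name of the directory that should receive recovered
 * entry number n (counting from 0).  The first LF_MAX_ENTRIES entries go
 * to "lost+found", the next batch to "lost+found.01", and so on.
 * Returns the length of the name, as reported by snprintf().
 */
static int
lf_dirname(char *buf, size_t len, unsigned long n)
{
    unsigned long idx = n / LF_MAX_ENTRIES;

    assert(buf != NULL && len > 0);
    /* The limit must stay below what a signed 16-bit count can hold. */
    assert(LF_MAX_ENTRIES < INT16_MAX);
    if (idx == 0)
        return snprintf(buf, len, "lost+found");
    return snprintf(buf, len, "lost+found.%02lu", idx);
}
```

This only picks a name.  It does not address the harder part noted above,
namely allocating the extra directory inodes without re-introducing bugs 1
and 2.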




To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Fri Aug 20 21:59:24 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from peach.ocn.ne.jp (peach.ocn.ne.jp [210.145.254.87])
	by hub.freebsd.org (Postfix) with ESMTP
	id 314E514A09; Fri, 20 Aug 1999 21:59:21 -0700 (PDT)
	(envelope-from dcs@newsguy.com)
Received: from newsguy.com by peach.ocn.ne.jp (8.9.1a/OCN) id NAA08836; Sat, 21 Aug 1999 13:57:19 +0900 (JST)
Message-ID: <37BE317E.4B1D7791@newsguy.com>
Date: Sat, 21 Aug 1999 13:56:30 +0900
From: "Daniel C. Sobral" <dcs@newsguy.com>
X-Mailer: Mozilla 4.6 [en] (Win98; I)
X-Accept-Language: en,pt-BR,ja
MIME-Version: 1.0
To: Terry Lambert <tlambert@primenet.com>
Cc: phk@critter.freebsd.dk, michaelh@cet.co.jp,
	wrstuden@nas.nasa.gov, Matthew.Alton@anheuser-busch.com,
	Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
References: <199908191802.LAA25563@usr06.primenet.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Terry Lambert wrote:
> 
> That's kind of the point.  No other VFS stacking system out there
> plays by FreeBSD's revamped rules.

I look around and I see no standards. It is still time to be
experimental.

--
Daniel C. Sobral			(8-DCS)
dcs@newsguy.com
dcs@freebsd.org

	- Can I speak to your superior?
	- There's some religious debate on that question.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Sat Aug 21  2:44: 5 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from peach.ocn.ne.jp (peach.ocn.ne.jp [210.145.254.87])
	by hub.freebsd.org (Postfix) with ESMTP
	id 1BA3E14EBE; Sat, 21 Aug 1999 02:43:56 -0700 (PDT)
	(envelope-from dcs@newsguy.com)
Received: from newsguy.com by peach.ocn.ne.jp (8.9.1a/OCN) id SAA05512; Sat, 21 Aug 1999 18:39:36 +0900 (JST)
Message-ID: <37BE6CE8.D59FF19C@newsguy.com>
Date: Sat, 21 Aug 1999 18:10:00 +0900
From: "Daniel C. Sobral" <dcs@newsguy.com>
X-Mailer: Mozilla 4.6 [en] (Win98; I)
X-Accept-Language: en,pt-BR,ja
MIME-Version: 1.0
To: Terry Lambert <tlambert@primenet.com>, phk@critter.freebsd.dk,
	michaelh@cet.co.jp, wrstuden@nas.nasa.gov,
	Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: BSD XFS Port & BSD VFS Rewrite
References: <199908191802.LAA25563@usr06.primenet.com> <37BE317E.4B1D7791@newsguy.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

"Daniel C. Sobral" wrote:
> 
> Terry Lambert wrote:
> >
> > That's kind of the point.  No other VFS stacking system out there
> > plays by FreeBSD's revamped rules.
> 
> I look around and I see no standards. It is still time to be
> experimental.

Since someone complained of my meekness, let me restate that... :-)

1) BS. That was not your point. Your point, on which you spent many
paragraphs, was that the way FreeBSD presently does its stuff
cannot support passing a method through an intermediate host/fs that
does not know it.

If your "point" was the above, you could just have said "no one else
does it this way, so we won't be able to have non-FreeBSD
intermediate/frontend/backend hosts". Only that does not prove that
"our" way is not right.

2) There is *no* compatibility in the VFS out there. It's a jungle.
If we implemented something compatible with anyone else, it would be
a first. And given that everything out there has its problems, it
would be a huge mistake to adopt someone's standard just for the
sake of being compatible.

And if you disagree with point 2, feel free to argue against it. But
in no way will it justify that absurd comment you made.

Either that paragraph was trying to cover a flaw in your logic, or
you just lost your train of thought. It certainly detracted from the
content of the message. "You must assume that the intermediate host
doesn't play by your rules". Bah.

[not that I don't generally agree with you more often than it would
be prudent to let it be publicly known :-) ]

--
Daniel C. Sobral			(8-DCS)
dcs@newsguy.com
dcs@freebsd.org

	- Can I speak to your superior?
	- There's some religious debate on that question.




To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Sat Aug 21 18:11:27 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from dt011n65.san.rr.com (dt010nb9.san.rr.com [204.210.12.185])
	by hub.freebsd.org (Postfix) with ESMTP id 4E493154B2
	for <freebsd-fs@FreeBSD.ORG>; Sat, 21 Aug 1999 18:11:09 -0700 (PDT)
	(envelope-from Doug@gorean.org)
Received: from gorean.org (master [10.0.0.2])
	by dt011n65.san.rr.com (8.9.3/8.8.8) with ESMTP id SAA97209;
	Sat, 21 Aug 1999 18:09:22 -0700 (PDT)
	(envelope-from Doug@gorean.org)
Message-ID: <37BF4DCB.1E9B7F82@gorean.org>
Date: Sat, 21 Aug 1999 18:09:31 -0700
From: Doug <Doug@gorean.org>
Organization: Triborough Bridge & Tunnel Authority
X-Mailer: Mozilla 4.61 [en] (X11; U; FreeBSD 4.0-CURRENT-0815 i386)
X-Accept-Language: en
MIME-Version: 1.0
To: alk@pobox.com
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: blocking
References: <14250.853.418320.65158@avalon.east>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Anthony Kimball wrote:
> 
> An NFS blocking behaviour which doesn't seem correct to me:
> 
> 1. background a long /bin/cp to /foo from an NFS-mounted file system.
> 2. ls /foo
> 
> note that (2) hangs until (1) completes.  Is this a bug?

	Someone smarter than me will probably respond to tell me that I'm wrong,
but in my nascent understanding of NFS I'd say no, although I can't quite
explain exactly what I'm thinking about it. The best way I can express it
is to say that while one client is already making a change on a file system,
further requests from the same client get queued. I believe that if you were
to do the 'ls' from a different system it would not block. 

	Ok, there's the slow hanging curve, someone else can step up and hit it
out of the park. :)

Doug


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message