Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 28 Mar 2002 17:38:38 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        "Brian F. Feldman" <green@FreeBSD.ORG>
Cc:        Kirk McKusick <mckusick@beastie.mckusick.com>, arch@FreeBSD.ORG
Subject:   Re: vnode::v_op bugfix / PERFORCE change 8574 for review (fwd)
Message-ID:  <3CA3C59E.9791EED9@mindspring.com>
References:  <200203282327.g2SNRog05733@green.bikeshed.org>

next in thread | previous in thread | raw e-mail | index | archive | help
"Brian F. Feldman" wrote:
> Kirk McKusick <mckusick@beastie.mckusick.com> wrote:
> > I concur with your suggestion below that the new patch is a better
> > approach. Your ideal solution below sounds reasonable though I have
> > not thought it through completely.
> 
> I really, really hate the idea that the machine will panic without warning
> if the number of vnode ops to be used becomes greather than the
> statically-defined limit.  Isn't there some truly generic solution?

Yes.

The problem is that the number of VOPs grows, and you need to
find the new VOP, potentially in the old table.

The problem comes from the implied assumption that there are old
and new tables.  If you remove that assumption, all the bogosity
falls out of the code.

This is the same way that the scheduling CPU and process group
affinity crap that Linux puts up with just falls out of the
code, as well, when you go to per CPU run queues (then it's a
decision on migration between run queues, which can be a push
operation, instead of a decision by a scheduler about what needs
to be run next, which is a pull operation requiring locks to be
SMP safe).


The easiest fix is to add a field to the struct vnodeop_desc,
like so:

	/*
	 * This structure describes the vnode operation taking place.
	 */
	struct vnodeop_desc {
		...
		struct vnodeop_desc *vdesc_next;
	};

Then at startup, traverse the vnodeop_descs[] array, like so:

	struct vnodeop_desc	*descp;
	for( descp = vnodeop_descs; descp != <last one>; descp++) {
		descp->vdesc_next = descp + 1;
	}
	descp->vdesc_next = NULL;

Then instead of traversing the list with an index looking at entries
up to some index/list entry marker, follow the next pointer down and
stop when you hit NULL.


When you want to add new VOP descriptors, just link them on the
end, the "hard" way:

    void
    add_new_vop_desc( struct vnodeop_desc *addp)
    {
	struct vnodeop_desc **dpp;
	for( dpp = &vnodeop_descs; *dpp != NULL; dpp = &(*dpp)->vdesc_next))
		continue;

	*dpp = addp;
	addp->vdesc_next = NULL;
    }

...no reallocation required.  No allocation required, either, if
you drain outstanding VOPs of a particular type at the interface
that dispatchd them, and unlink the entry, as part of the unload.
Otherwise, you need to copy an instance so it will still be there
if the module that instanced it falls out from under it due to an
unload or other operation.


THen when you want to dereference a VOP, VOP's which are not
supported by a local media file system are supposed to fall
through to EOPNOTSUPP.

Currently, this doesn't happen because of the default VOPs
supposedly acting as a safety net... what they really are,
is an impediment.

This is because when an unrecognized VOP hits a transition
bounary -- whether it be an underlying VFS in a stack, a
local media FS, or a proxying barrier (for user space VFS
impelmentations, or for VFS proxying over a network, as a
better replacement for NFS) -- it's supposed to get EOPNOTSUPP.
With the vfs_default.c code, what happens instead is that there
has to be an explicit "EOPNOTSUPP" entry in the list of default
entries.  This is incredibly bogus, but it can be fixed
seperately.  For now, just add a lot of additional default
slots to account for EOPNOTSUPP returns for currently undefined
defaults (this is much less than what you are complaining about,
since the overallocation you're talking about has to happen at
each FS instance, and not just in the bottm level proxy boundary
from which all other FS's inherit).


Both the extra dereference added by the first patch... and the
dereference that happened anyway... can be eliminated fairly
trivially:

1)	When you start up the system, for each static FS type,
	sort the VOPs static declarations into the same order
	as the VOPs in the vnodeop_descs[] array.  Do this in
	place in the data segment (a bubble sort works; this is
	not a critical performance path, and it happens once).

2)	Do the same thing for VOP lists when you load modules.

3)	When you instance an FS (at mount time), rather than
	taking a reference to the VOP list -- which requires
	a reverse lookup at instance time, and then a dereference
	at operation time -- instance a full list of VOPs known,
	and do it in the order of the master list.  Insert the
	unknown entries into the instance, rather than relying
	on the extra dereference to find them when they are not
	present.

4)	When you stack a VFS, "collapse" unknown entries to the
	underlying layer's unknown entries.  While this is not
	strictly necessary, it will mean that an N layer nullfs
	ends up being O(1) instead of O(N) on dereferences to do
	the call down (this was suggested by Rosenthal in the
	Sun "Stacking Vnode Architecture" paper).

5)	When you add a VOP... link it onto end of the bas list,
	and link it onto the entry for any VFS' that supports it
	(e.g. the VFS that was loaded that caused the VOP to be
	added in the first place).

...obviously, for this to work, the bottim end has to be treated
as a proxy barrier: not a list of defaults for which entries are
provided.  That means getting rid of the default_vfs.c stuff for
the bottom end implementation, and implementing it to a proxy that
converts all unknown entries at fan-out to calls to a VOP that
returns EOPNOTSUPP.  THis is actually in line with John Heidemann's
original design for the code, as presented in the UCLA FICUS project
papers, and his Master's Thesis (and, incidently, the documentation
for the FreeBSD VFS stacking architecture).

Basically:

o	All VFS's are stacking nodes, and

o	All VFS's inherit from a proxying VFS boundary, rather
	than a default implementation.

o	The bottom end implementation proxy turns all VOPs into
	EOPNOTSUPP returning VOPs (pretend they were proxied
	somehwere, and that somewhere had an implementation that
	returned EOPNOTSUPP).

It's always nice when the code does what its documentation says it
does.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3CA3C59E.9791EED9>