Date:      Sun, 26 Oct 1997 01:06:25 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        gurney_j@resnet.uoregon.edu
Cc:        tlambert@primenet.com, michaelh@cet.co.jp, luigi@labinfo.iet.unipi.it, hackers@FreeBSD.ORG
Subject:   Re: zipfs filesystem anyone ?
Message-ID:  <199710260106.SAA15742@usr04.primenet.com>
In-Reply-To: <19971025004544.21378@hydrogen.nike.efn.org> from "John-Mark Gurney" at Oct 25, 97 00:45:44 am

> > The list terminator is still necessary for the population pass, where
> > the vectors get populated.  The VOP_REPLACEME needs to be separate.
> 
> I can see it for the vnodeopv_desc of the vfs layer, but why does the
> vnodeop_desc array in vnode_if.c need the trailing NULL?  That is what
> vfs_opv_numops is for... (which the population code uses)...

OK.  This code is not obvious.  I will have to refer to the original
kern/vfs_init.c code for this.  In vfs_opv_init(), there is a MALLOC()
where the dynamic vectors are allocated.

The reason is that the per FS descriptor list of the VOPs defined for a
given VFS layer is sparse.

If you look further down in vfs_opv_init(), you will see:

        /*
         * Finally, go back and replace unfilled routines
         * with their default.  (Sigh, an O(n^3) algorithm.  I
         * could make it better, but that'd be work, and n is small.)
         */

So basically, it takes a list like:

	{ A, D, F, J, NULL }

And changes it to:

	{ A, b, c, D, e, F, g, h, i, J, k, l }

The point is that the only way it knows where the per FS descriptor list
ends is the NULL terminator.
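
As a rough sketch (modeled on the loops in vfs_opv_init(); the "opv"
pointer and loop variables are my shorthand, not the literal kernel code),
the two passes look something like this:

	/*
	 * First pass: walk the sparse per FS entry list, stopping at the
	 * NULL terminator, and drop each implementation into its slot.
	 */
	for (opve_descp = opv->opv_desc_ops;
	    opve_descp->opve_op != NULL;
	    opve_descp++) {
		opv_desc_vector[opve_descp->opve_op->vdesc_offset] =
		    opve_descp->opve_impl;
	}

	/*
	 * Second pass: anything still NULL gets the FS's own default entry.
	 */
	for (i = 0; i < vfs_opv_numops; i++) {
		if (opv_desc_vector[i] == NULL)
			opv_desc_vector[i] =
			    opv_desc_vector[VOFFSET(vop_default)];
	}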

This actually bears *directly* on my sorting recommendation: if the
descriptor list is sparse, you will not have a baseline for a sort of
a new descriptor entry that isn't also in the globally declared
descriptor array, because only the descriptors defined in vnode_if.c
will have known offsets.


> so what I'm saying is we make two vars, maxvfs_opv_numops in which we
> allocate all the memory for that many ops... then as a new op is added
> we simply increase vfs_opv_numops to keep track of where we add the next
> op.. when it's equal (or greater), then we simply disallow adding any
> new ops...

Yes, this is just housekeeping, and it doesn't matter how it is
accomplished so long as it doesn't preclude some future use we haven't
thought of yet.  It's a "don't care" state.
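
For what it's worth, the bookkeeping could be as simple as something like
this (a sketch only; maxvfs_opv_numops, the function itself, and the
ENOMEM return are assumptions, not existing code):

	extern struct vnodeop_desc *vfs_op_descs[];	/* from vnode_if.c */
	extern int vfs_opv_numops;
	extern int maxvfs_opv_numops;	/* hypothetical preallocated cap */

	int
	vfs_add_vnodeop(struct vnodeop_desc *newop)
	{
		if (vfs_opv_numops >= maxvfs_opv_numops)
			return (ENOMEM);	/* no preallocated slots left */
		newop->vdesc_offset = vfs_opv_numops;
		vfs_op_descs[vfs_opv_numops++] = newop;
		return (0);
	}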


> > > personally, a possible define that declares I want a possible X extra
> > > vop vectors...  and then have a couple variables that tell you the
> > > maximum number, and the current last offset...  this shouldn't be hard
> > > to do...
> > 
> > They kind of need to be static, unless the vector list becomes a linker
> > set.  This can't happen until the linker set can be aggregated between
> 
> I was talking about the size of the list being variable, but that there
> be an option like:
> options	"EXTRA_VNODEOPS=5"
> 
> then we simply do:
> int maxvfs_opv_numops = (sizeof(vfs_op_descs)/sizeof(struct vnodeop_desc *)) +
> 	EXTRA_VNODEOPS;
> 
> and then the define is like:
> struct vnodeop_desc *vfs_op_descs[num] = {
> };
> 
> and we simply modify the awk script to properly generate num... which
> isn't hard...

See above.  The descriptors have to be unique at the time vnode_if.c is
compiled; different indices pointing to the same address won't cut it.
You will probably need either a special tag value, or a duplicate array
containing only the replacement-descriptor portion of the vfs_op_descs
table.  Or you will need to save them one behind when you add the
descriptor, and remove them one behind.

I would caution *against* saving them one behind, though.  I could easily
see a new VOP being used by two loadable FS's.  To make the thing work,
you would need to keep a reference count -- and that implies some place
to keep it.  I would suggest keeping it in a structure with the VOP
descriptors you replace.
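
Something along these lines, purely as an illustration (the structure and
field names are made up):

	/*
	 * Hypothetical wrapper for a replaced descriptor, so that two
	 * loadable FS's can share a dynamically added VOP safely.
	 */
	struct saved_vnodeop {
		struct vnodeop_desc	*sv_desc;	/* descriptor replaced */
		int			 sv_refcnt;	/* FS's using the new VOP */
	};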

Really, the output (populated) per FS structure, with all entries
filled in with the defaults, should be possible to dereference
directly -- *IFF* it's sorted.  This gets rid of the function stubs
for the VOP_* calls, and saves a lot of call overhead.  Think about
it, anyway...
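
To make that concrete: with a sorted, fully populated vector, a caller can
go straight through the vector by offset, in the style of the VOCALL()/
VCALL() macros in sys/vnode.h (quoted here from memory, so treat the exact
spelling as approximate):

	#define VOCALL(OPSV, OFF, AP)	(( *((OPSV)[(OFF)])) (AP))
	#define VCALL(VP, OFF, AP)	VOCALL((VP)->v_op, (OFF), (AP))

No per VOP stub function is needed once the offset is known.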


> arg, I see what you mean..  ffs depends upon some of ufs's functions..
> (arg, I was assuming that each module was independent)...

Yes.  This kind of screws up the BSDI plans for Soft Updates, since
it means the soft-updates-capable UFS will not be able to be shared
with MFS, LFS, etc.


> right now the only way I can think of preventing this from being a
> problem is to add the MOUNT_xxx to the declaration of the vnodeops,
> and then including in the ffs a dependency of ufs...

No.  If the entries are sorted, and you have a reference, then an init
routine can reference the previously loaded module that defined them (the
symbol table is cumulative).  This at least gets you an address to pass
into the modified vfs_opv_init() function so you can convert the UFS
descriptor list to a set of function pointers.

The trick is to know beforehand that the populated descriptor lists
are in the same order in the UFS and the FFS, and just use the offset
so you don't have to reference every function by symbol.  Hence the
sort.  It can be a simple bubble sort (that's what I used), since you
can sort it in place in the populated vector (dynamic vector is a bad
name for this in the code, IMO).
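
A minimal in-place version of that sort could look like this (a sketch;
"ops" is the per FS entry list and "n" is its length, counted up to the
NULL terminator):

	int i, j;
	struct vnodeopv_entry_desc tmp;

	/* Bubble sort the entry list by descriptor offset, in place. */
	for (i = 0; i < n - 1; i++)
		for (j = 0; j < n - 1 - i; j++)
			if (ops[j].opve_op->vdesc_offset >
			    ops[j + 1].opve_op->vdesc_offset) {
				tmp = ops[j];
				ops[j] = ops[j + 1];
				ops[j + 1] = tmp;
			}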



> > Sorting the list also allows you to reference by index instead of by
> > descriptor.  If you think about it, you won't have a descriptor for
> > a new VOP, if you dynamically load it, in vnode_if.h.  This loses the
> > little glue function.  To fix this, you have to kill the glue functions.
> > And you can do it if you can index into the op vector and get the right
> > function.
> 
> umm... isn't this already what is done? from vfs_opv_init:
>                 opv_desc_vector[opve_descp->opve_op->vdesc_offset] =
>                                 opve_descp->opve_impl;
> the above line will set the appropriate vector by the vdesc_offset..

No.  This requires that the symbol set for the descriptor set be complete;
if you are adding a new VOP, then it won't be (unless you have ELF linker
set aggregation across distinct per kld sections working).  Even then,
the VOP entry may be duplicated, and I don't know how the linker set
could handle that correctly.  It really needs a reference count for each
FS that's using the new VOP.

It becomes a can of worms quickly unless you reduce the descriptor list
to a set of indices into an array of VOP functions.  So again, doing
the physical sort is the right way to go.  Note that if you add a VOP,
you have to go back and repopulate the dynamic entry where the new VOP
came in, if there is a default behaviour specified.  8-(.

You can make the sort part of the vfs_opv_init().  Since the vfs_opv_init()
"knows" the descriptor list, this works.


> then at the end, there is a second pass to fill in the remaining entries
> with the entry from VOFFSET(vop_default) of its own vector..
> 
> hmm... looks like we need to remove this comment:
>         /*
>          * Finally, go back and replace unfilled routines
>          * with their default.  (Sigh, an O(n^3) algorithm.  I
>          * could make it better, but that'd be work, and n is small.)
>          */
> as right now the final routine runs in n as far as I can tell... :)

The O(n^3) is wrong, unless you consider the whole thing, and you
consider it in light of how the UFS is referenced by the FFS.  For that
specific case, O(n^3) is correct.  This is because there is a separate
"dynamic vector" for each reference instance of a "non-dynamic vector".
This is also not very obvious from the code.  You need to look at the
array declarations themselves, and the specfs ops, etc., in context
to see this.

The part of the comment that says "replace unfilled routines" is correct,
but, again, it's descriptor based, which breaks truly dynamic loading
of intermediate VFS's as opposed to complete-to-bottom stacks.

The problem is in telling "NULL" entries from "uninitialized" entries.
You need to know this because you want to collapse the first defined
entry to the top of the stack so that you have the minimum possible
call overhead for any individual stacked VOP.
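
One way to make that distinction visible (an assumption on my part, not
something the current code does) would be a sentinel function instead of
NULL for "never filled in", so a collapse pass can tell the two apart;
"upper_vector" and "lower_vector" here stand for the populated vectors of
two adjacent layers:

	int	vop_unset __P((void *));	/* hypothetical sentinel */

	for (i = 0; i < vfs_opv_numops; i++)
		if (upper_vector[i] == vop_unset)
			/* Collapse: call the lower layer's entry directly. */
			upper_vector[i] = lower_vector[i];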

You can *almost* see how this is supposed to work in the "FFS stacked
on UFS" case.  It's kind of obfuscated by the "struct fileops" crap,
and the fact that the vnode handling is from a global free pool, instead
of the vnode objects being subelements in the per FS object structure
(for UFS, struct inode for the in core inode).  If you ignore that code
and pretend the references are direct instead of indirect, it's a lot
easier to visualize (one of the reasons I think the code should be
changed to make it match the conceptual design).

Right now, there is no real "collapse" taking place.  Mostly because
things with NULL entries are broken and getting more broken the more
people stuff VM code into them.  If you look at the "UFS in FFS VOP
structure" entries, you can see how the vnode pointer is supposed to
be an abstract type in each FS.  The per FS macros "VTOI" and "ITOV"
are supposed to be the only gates where a vnode is not treated as an
opaque type (and that goes back to the global pool implementation; you
*could* use pointer math to make it *truly* opaque if the vnode were
a subelement).
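
For reference, the UFS versions of those gates are (approximately, from
memory) just:

	#define	VTOI(vp)	((struct inode *)(vp)->v_data)
	#define	ITOV(ip)	((ip)->i_vnode)

which is exactly the sort of place pointer math could be substituted if
the vnode were a subelement of the per FS inode.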


It's all very complicated, but you can see it if you take enough time.
Unfortunately, not very many people have taken enough time, since it's
a lot more convoluted than it should be (ideally and IMO, anyway).
But to ascend out of the fourth ring of hell, you first have to visit
the fourth ring of hell... otherwise you can't get there from here.  ;-).

Once enough people understand the code so that a code review is possible
and people can get the warm fuzzies about the new code, it should be
possible to fix all of it to make it easier for people to get their
brains around.  Then it picks up speed as it rolls downhill.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


