Date: Thu, 28 Mar 2002 17:38:38 -0800 From: Terry Lambert <tlambert2@mindspring.com> To: "Brian F. Feldman" <green@FreeBSD.ORG> Cc: Kirk McKusick <mckusick@beastie.mckusick.com>, arch@FreeBSD.ORG Subject: Re: vnode::v_op bugfix / PERFORCE change 8574 for review (fwd) Message-ID: <3CA3C59E.9791EED9@mindspring.com> References: <200203282327.g2SNRog05733@green.bikeshed.org>
next in thread | previous in thread | raw e-mail | index | archive | help
"Brian F. Feldman" wrote: > Kirk McKusick <mckusick@beastie.mckusick.com> wrote: > > I concur with your suggestion below that the new patch is a better > > approach. Your ideal solution below sounds reasonable though I have > > not thought it through completely. > > I really, really hate the idea that the machine will panic without warning > if the number of vnode ops to be used becomes greather than the > statically-defined limit. Isn't there some truly generic solution? Yes. The problem is that the number of VOPs grows, and you need to find the new VOP, potentially in the old table. The problem comes from the implied assumption that there are old and new tables. If you remove that assumption, all the bogosity falls out of the code. This is the same way that the scheduling CPU and process group affinity crap that Linux puts up with just falls out of the code, as well, when you go to per CPU run queues (then it's a decision on migration between run queues, which can be a push operation, instead of a decision by a scheduler about what needs to be run next, which is a pull operation requiring locks to be SMP safe). The easiest fix is to add a field to the struct vnodeop_desc, like so: /* * This structure describes the vnode operation taking place. */ struct vnodeop_desc { ... struct vnodeop_desc *vdesc_next; }; Then at startup, traverse the vnodeop_descs[] array, like so: struct vnodeop_desc *descp; for( descp = vnodeop_descs; descp != <last one>; descp++) { descp->vdesc_next = descp + 1; } descp->vdesc_next = NULL; Then instead of traversing the list with an index looking at entries up to some index/list entry marker, follow the next pointer down and stop when you hit NULL. When you want to add new VOP descriptors, just link them on the end, the "hard" way: void add_new_vop_desc( struct vnodeop_desc *addp) { struct vnodeop_desc **dpp; for( dpp = &vnodeop_descs; *dpp != NULL; dpp = &(*dpp)->vdesc_next)) continue; *dpp = addp; addp->vdesc_next = NULL; } ...no reallocation required. No allocation required, either, if you drain outstanding VOPs of a particular type at the interface that dispatchd them, and unlink the entry, as part of the unload. Otherwise, you need to copy an instance so it will still be there if the module that instanced it falls out from under it due to an unload or other operation. THen when you want to dereference a VOP, VOP's which are not supported by a local media file system are supposed to fall through to EOPNOTSUPP. Currently, this doesn't happen because of the default VOPs supposedly acting as a safety net... what they really are, is an impediment. This is because when an unrecognized VOP hits a transition bounary -- whether it be an underlying VFS in a stack, a local media FS, or a proxying barrier (for user space VFS impelmentations, or for VFS proxying over a network, as a better replacement for NFS) -- it's supposed to get EOPNOTSUPP. With the vfs_default.c code, what happens instead is that there has to be an explicit "EOPNOTSUPP" entry in the list of default entries. This is incredibly bogus, but it can be fixed seperately. For now, just add a lot of additional default slots to account for EOPNOTSUPP returns for currently undefined defaults (this is much less than what you are complaining about, since the overallocation you're talking about has to happen at each FS instance, and not just in the bottm level proxy boundary from which all other FS's inherit). Both the extra dereference added by the first patch... and the dereference that happened anyway... can be eliminated fairly trivially: 1) When you start up the system, for each static FS type, sort the VOPs static declarations into the same order as the VOPs in the vnodeop_descs[] array. Do this in place in the data segment (a bubble sort works; this is not a critical performance path, and it happens once). 2) Do the same thing for VOP lists when you load modules. 3) When you instance an FS (at mount time), rather than taking a reference to the VOP list -- which requires a reverse lookup at instance time, and then a dereference at operation time -- instance a full list of VOPs known, and do it in the order of the master list. Insert the unknown entries into the instance, rather than relying on the extra dereference to find them when they are not present. 4) When you stack a VFS, "collapse" unknown entries to the underlying layer's unknown entries. While this is not strictly necessary, it will mean that an N layer nullfs ends up being O(1) instead of O(N) on dereferences to do the call down (this was suggested by Rosenthal in the Sun "Stacking Vnode Architecture" paper). 5) When you add a VOP... link it onto end of the bas list, and link it onto the entry for any VFS' that supports it (e.g. the VFS that was loaded that caused the VOP to be added in the first place). ...obviously, for this to work, the bottim end has to be treated as a proxy barrier: not a list of defaults for which entries are provided. That means getting rid of the default_vfs.c stuff for the bottom end implementation, and implementing it to a proxy that converts all unknown entries at fan-out to calls to a VOP that returns EOPNOTSUPP. THis is actually in line with John Heidemann's original design for the code, as presented in the UCLA FICUS project papers, and his Master's Thesis (and, incidently, the documentation for the FreeBSD VFS stacking architecture). Basically: o All VFS's are stacking nodes, and o All VFS's inherit from a proxying VFS boundary, rather than a default implementation. o The bottom end implementation proxy turns all VOPs into EOPNOTSUPP returning VOPs (pretend they were proxied somehwere, and that somewhere had an implementation that returned EOPNOTSUPP). It's always nice when the code does what its documentation says it does. -- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3CA3C59E.9791EED9>