From owner-freebsd-hackers  Wed Aug 18 13:44:47 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135])
        by hub.freebsd.org (Postfix) with ESMTP
        id EA4F215D9F; Wed, 18 Aug 1999 13:43:47 -0700 (PDT)
        (envelope-from tlambert@usr06.primenet.com)
Received: (from daemon@localhost)
        by smtp05.primenet.com (8.9.1/8.9.1) id NAA113206;
        Wed, 18 Aug 1999 13:43:27 -0700
Received: from usr06.primenet.com(206.165.6.206) via SMTP
        by smtp05.primenet.com, id smtpdDReHUa; Wed Aug 18 13:43:17 1999
Received: (from tlambert@localhost)
        by usr06.primenet.com (8.8.5/8.8.5) id NAA28863;
        Wed, 18 Aug 1999 13:43:14 -0700 (MST)
From: Terry Lambert
Message-Id: <199908182043.NAA28863@usr06.primenet.com>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
To: wrstuden@nas.nasa.gov
Date: Wed, 18 Aug 1999 20:43:14 +0000 (GMT)
Cc: tlambert@primenet.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
In-Reply-To: from "Bill Studenmund" at Aug 18, 99 11:59:01 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> > > Right.  That exported struct lock * makes locking down to the
> > > lowest-level file easy - you just feed it to the lock manager, and
> > > you're locking the same lock the lowest-level fs uses.  You then
> > > lock all vnodes stacked over this one at the same time.  Otherwise,
> > > you just call VOP_LOCK below and then lock yourself.
> >
> > I think this defeats the purpose of the stacking architecture; I
> > think that if you look at an unadulterated NULLFS, you'll see what I
> > mean.
>
> Please be more precise.  I have looked at an unadulterated NULLFS, and
> found it lacking.  I don't see how this change breaks stacking.

OK, there's the concept of "collapse" of stacking layers.

This was first introduced in the Rosenthal stacking vnode architecture,
out of Sun Microsystems.

Rosenthal wanted to ensure that, when you stack 500 putatively "null"
NULLFSs, the function call overhead does not increase proportionally.

To resolve this, he introduced the concept of a "collapsed" VFS stack.
That is, the array of function vectors actually called is a
one-dimensional projection of the two-dimensional stack: the visible
entry for each VOP is that of the first layer on the way down the
stack that implements it.

We can visualize this like so:

        |                      VOPs
Layer   |  VOP1   VOP2   VOP3   VOP4   VOP5   VOP6   ...
--------+------------------------------------------------
L1      |  -      -      -      imp    -      -      ...
L2      |  imp    -      -      imp    -      imp    ...
L3      |  imp    -      -      imp    imp    -      ...
L4      |  -      -      imp    -      -      -      ...
L5      |  imp    imp    imp    imp    imp    imp    ...

The resulting "collapsed" array of entry vectors looks like so:

           L2VOP1 L5VOP2 L4VOP3 L1VOP4 L3VOP5 L2VOP6 ...

There is an implicit assumption here that most stacks will not be
randomly staggered like this example.  The idea behind this assumption
is that additional layers will most frequently add functionality,
rather than replace it.

Heidemann carried this idea over into his architecture, to be employed
at the point that a VFS stack is first instanced.

The 4.4BSD implementation of this is partially flawed.  The first place
the flaw is apparent is the UFS/FFS "stack" of layers: the VOP
descriptor array exported by the combination of the two is hard-coded
as a precollapsed stack.  This is actually antithetical to the design.
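
To make the collapse concrete, here is a minimal sketch of the
operation; the types and sizes (vop_t, struct vfs_layer, NVOPS) are
hypothetical stand-ins, not the real 4.4BSD vnodeop descriptor
machinery:

/*
 * A sketch of the collapse operation.  The types and sizes here are
 * hypothetical, not the real 4.4BSD vnodeop descriptor machinery.
 */
#include <stddef.h>

#define NVOPS   64                      /* hypothetical VOP count */

typedef int (*vop_t)(void *args);

struct vfs_layer {
        vop_t   ops[NVOPS];             /* NULL = "not implemented here" */
};

/*
 * Collapse a stack of nlayers layers (layers[0] is the top) into a
 * single dispatch vector; done once, when the stack is first instanced.
 */
void
vfs_collapse(struct vfs_layer **layers, int nlayers, vop_t collapsed[])
{
        int vop, i;

        for (vop = 0; vop < NVOPS; vop++) {
                collapsed[vop] = NULL;
                for (i = 0; i < nlayers; i++) {
                        if (layers[i]->ops[vop] != NULL) {
                                /* first implementor on the way down wins */
                                collapsed[vop] = layers[i]->ops[vop];
                                break;
                        }
                }
                /*
                 * A slot nobody implements stays NULL, which the
                 * descriptor call mechanism can treat as a call
                 * failure; wiring in a non-error default here is what
                 * breaks proxying of unknown VOPs (see below).
                 */
        }
}

A call through the collapsed vector then costs a single indirect
function call, no matter how many null layers sit between the caller
and the layer that actually implements the operation.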

The second place the flaw is apparent is in the inability to add VOPs
to an existing kernel: the entry point vector is a fixed size, and is
not expanded implicitly by the act of adding a VFS layer that contains
a new VOP.

The use of non-error vfs_defaults is also flawed for proxies.  The flaw
is not on the consumer side of the VFS stack, but on the producer end
on the other side of the proxy: although the producer does not
implement a particular VOP, it needs to _NOT_ use the local vfs_default
for that VOP; instead, it needs to proxy the VOP over to the other side
for remote processing.  Getting a vfs_default VOP after a collapse,
instead of a NULL entry point that the descriptor call mechanism treats
as a call failure, damages the ability to proxy unknown VOPs.

> > Intermediate FS's should not trap VOP's that are not applicable
> > to them.
>
> True. But VOP_LOCK is applicable to layered fs's. :-)

Only for translation layers that require local backing store.  I'm
prepared to make an exception for them, and require that they
explicitly call the VOP in the underlying vnode over which they are
stacked.  This is the same compromise that both Rosenthal and
Heidemann consciously chose.

> > One of the purposes of doing a VOP_LOCK on intermediate vnodes
> > that aren't backing objects is to deal with the global vnode
> > pool management.  I'd really like FS's to own their vnode pools,
> > but even without that, you don't need the locking, since you
> > only need to flush data on vnodes that are backing objects.
> >
> > If we look at a stack of FS's with intermediate exposure into the
> > namespace, then it's clear that the issue is really only applicable
> > to objects that act as a backing store:
> >
> > ----------------------  --------------------  --------------------
> > FS                      Exposed in hierarchy  Backing object
> > ----------------------  --------------------  --------------------
> > top                     yes                   no
> > intermediate_1          no                    no
> > intermediate_2          no                    yes
> > intermediate_3          yes                   no
> > bottom                  no                    yes
> > ----------------------  --------------------  --------------------
> >
> > So when we lock "top", we only lock in intermediate_2 and in bottom.
>
> No. One of the things Heidemann notes in his dissertation is that to
> prevent deadlock, you have to lock the whole stack of vnodes at once,
> not bit by bit.
>
> i.e. there is one lock for the whole thing.

This is not true for a unified VM and buffer cache environment, and a
significant reduction in overhead can be achieved thereby.

Heidemann did his work on SVR4, which does not have a unified VM and
buffer cache.  The deadlock discussion in his dissertation is only
applicable to systems where the coherency model is such that each and
every vnode has buffers associated with it.  That is, it applies to
vnodes which act as backing store (buffer cache object references).

If you separate the concepts, such that you don't have to deal with
vnodes that have no coherency issues, then you can drastically reduce
the number of coherency operations required (locking is a coherency
operation).

In addition to this, you can effectively obtain what neither the
Rosenthal nor the SVR4 version of the Heidemann stacking framework can
otherwise obtain: intermediate VFS layer NULL VOP call collapse.

The way you obtain this is by caching the vnode of the backing object
in the intermediate layer, and dereferencing it to get at its VOP
vector directly.  This means that a functional layer that shadows an
underlying VOP, separated from it by 1,000 NULLFS layers, does not
incur a 1,000-function-call overhead.
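
Here is a minimal sketch of the idea; the structures and names
(struct xvnode, XVOP_PUTPAGES, and so on) are hypothetical stand-ins,
not the actual BSD vnode or VOP code:

/*
 * A sketch of caching the backing-object vnode in an intermediate
 * layer.  The structures and names are hypothetical stand-ins, not
 * the actual BSD vnode/VOP code.
 */
#include <stddef.h>

struct xvnode;                          /* forward declaration */
typedef int (*xvop_t)(struct xvnode *vp, void *args);

enum { XVOP_LOCK, XVOP_PUTPAGES, XVOP_COUNT };  /* hypothetical VOP slots */

struct xvnode {
        xvop_t           v_ops[XVOP_COUNT];     /* this vnode's VOP vector */
        struct xvnode   *v_lower;               /* next layer down, or NULL */
        struct xvnode   *v_backing;             /* cached backing-object vnode */
        int              v_isbacking;           /* nonzero if vnode has pages */
};

/* Resolve and cache the backing object once, when the stack is instanced. */
static void
xv_cache_backing(struct xvnode *vp)
{
        struct xvnode *bp = vp;

        while (bp != NULL && !bp->v_isbacking)
                bp = bp->v_lower;
        vp->v_backing = bp;     /* NULL if nothing below has backing store */
}

/*
 * A coherency operation (flush, lock, ...) dispatches through the
 * cached backing vnode's vector directly, skipping every null layer
 * in between.
 */
static int
xv_putpages(struct xvnode *vp, void *args)
{
        struct xvnode *bp = vp->v_backing;

        if (bp == NULL)
                return (0);     /* no backing object: nothing to flush */
        return (bp->v_ops[XVOP_PUTPAGES](bp, args));
}

The backing vnode is resolved once, when the stack is instanced, so a
coherency operation pays for one indirect call regardless of how many
null layers sit in between.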

> > > Actually isn't the only problem when you have vnode fan-in (union
> > > FS)?  i.e. a plain compressing layer should not introduce vnode
> > > locking problems.
> >
> > If it's a block compression layer, it will.  Also a translation
> > layer; consider a pure Unicode system that wants to remotely mount
> > an FS from a legacy system.  To do this, it needs to expand the
> > pages from the legacy system [only it can, since the legacy system
> > doesn't know about Unicode] in a 2:1 ratio.  Now consider doing a
> > byte-range lock on a file on such a system.  To propagate the lock,
> > you have to do an arithmetic conversion at the translation layer.
> > This gets worse if the lower-end FS is exposed in the namespace as
> > well.
>
> Wait.  byte-range locking is different from vnode locking.  I've been
> talking about vnode locking, which is different from the byte-range
> locking you're discussing above.

Conceptually, they're not really different at all.  You want to apply
an operation against a stack of vnodes, and only involve the relevant
vnodes when you do it.

> > > Nope. The problem is that while stacking (null, umap, and overlay
> > > fs's) work, we don't have the coherency issues worked out so that
> > > upper layers can cache data.  i.e. so that the lower fs knows it
> > > has to ask the upper layers to give pages back. :-)  But multiple
> > > ls -lR's work fine. :-)
> >
> > With UVM in NetBSD, this is (supposedly) not an issue.
>
> UBC. UVM is a new memory manager. UBC unifies the buffer cache with
> the VM system.

I was under the impression that the "U" in "UVM" was for "Unified".

Does NetBSD not have a unified VM and buffer cache?  Is the "U" in
"UVM" referring not to buffer cache unification, but to platform
unification?

It was my understanding from John Dyson, who had to work on NetBSD for
NCI, that the new NetBSD stuff actually unified the VM and the buffer
cache.

If this isn't the case, then, yes, you will need to lock all the way
up and down, and eat the copy overhead for the concurrency for the
intermediate vnodes.  8-(.

> > You could actually think of it this way, as well: only FS's that
> > contain vnodes that provide backing should implement VOP_GETPAGES
> > and VOP_PUTPAGES, and all I/O should be done through paging.
>
> Right. That's part of UBC. :-)

Yep.  Again, if NetBSD doesn't have this, it's really important that
it obtain it.  8-(.


                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message