Date:        Fri, 8 Jan 1999 02:11:30 +0000 (GMT)
From:        Terry Lambert <tlambert@primenet.com>
To:          dillon@apollo.backplane.com (Matthew Dillon)
Cc:          tlambert@primenet.com, dyson@iquest.net, pfgiffun@bachue.usc.unal.edu.co, freebsd-hackers@FreeBSD.ORG
Subject:     Re: questions/problems with vm_fault() in Stable
Message-ID:  <199901080211.TAA00869@usr01.primenet.com>
In-Reply-To: <199901072306.PAA35328@apollo.backplane.com> from "Matthew Dillon" at Jan 7, 99 03:06:21 pm
OK, now on to the non-MFS alias issues:

> :There's *no* memory waste, if you don't instance incoherent copies
> :of pages in the first place.
>
>     You are ignoring the point.  I mmap() a file.  I mmap() the block
>     device underlying the filesystem the file resides on.  I start
>     accessing pages.  Poof... no coherency, plus lots of wasted memory.

No.  You mmap the file.  This instances a vnode, and a VM object using
the vnode as a swap store.  The VM object using the vnode as a swap
store makes VOP_GETPAGES and VOP_PUTPAGES calls through the VFS
interface in order to satisfy page read and write faults, respectively.

For a 386, which does not support write faulting in kernel mode, this
is a problem.  You have to unmap the page, and handle the translation
lookaside in the fault handler (welcome to the 386 complications to the
copyin/copyout routines since time immemorial).

The underlying VFS makes VM calls, and these go against the real (VM)
backing object.

It would be a mistake to try to tunnel the VM through the VFS to alias
these objects to the same object.  For one thing, they may not be on
the same machine.  I don't really understand how you expect to be able
to use a file on an NFS as a swap store for a program image without a
separate local and remote copy of the object.

> You are assuming that VFS devices can be collapsed together such
> that the inner layers are not independently accessible, and thus
> cannot be independently accessed without going through the upper
> VFS layers.  This is an extremely restrictive view which fails
> utterly in a number of already existing cases and fails even
> worse when you try to extend the model across a network.

No, I'm not allowing aliases in the case of adjacent local layers.

Consider the case of an ACL VFS stacking layer stacked on top of an
NFS client and under a system call layer.  If you expose the ACL layer
separately from the layer on which it is mounted, then you can access
the file in two places in the directory hierarchy.
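The vnode-as-pager arrangement above can be modeled in a few lines.
This is a hedged, user-space sketch, not kernel code: the structure
names, the single function pointer standing in for the VOP_GETPAGES
dispatch, and "fakefs" are all hypothetical, chosen only to show the
read-fault path from a VM object down through the VFS.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 16	/* toy page size for the model */

struct vnode {
	/* A real vnode dispatches through its vop vector; this one
	 * function pointer stands in for VOP_GETPAGES here. */
	int (*v_getpages)(struct vnode *vp, char *page, int pindex);
	const char *v_data;	/* fake in-core "file" contents */
};

struct vm_object {
	struct vnode *backing_vp;	/* vnode used as the swap store */
};

/* Read fault: the VM object asks the filesystem to fill the page. */
static int
vm_fault_read(struct vm_object *obj, char *page, int pindex)
{
	return (obj->backing_vp->v_getpages(obj->backing_vp, page, pindex));
}

/* A trivial "filesystem" that copies from an in-core buffer. */
static int
fakefs_getpages(struct vnode *vp, char *page, int pindex)
{
	memcpy(page, vp->v_data + pindex * PAGE_SIZE, PAGE_SIZE);
	return (0);
}
```

The point of the shape is that the VM side never touches the file data
directly; every page in or out is a call through the VFS interface.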
The underlying NFS has vnodes that point to nfsnodes.  These have VM
objects associated with them.

The ACL FS stacks on top of this VFS.  It *also* has vnodes.  These
vnodes are used as abstract credential holders, and point to the
underlying vnode in the NFS.  These vnodes DO *NOT* have VM objects
associated with them.

When someone reads or writes a page in the buffer cache from a user
space program, then the fault results in a VOP_GETPAGES or
VOP_PUTPAGES, respectively.

For the NFS, the information is directly accessed (with lease controls
-- otherwise known as opportunistic locks -- for cache coherency) from
the underlying vnode's VM objects.

For the ACL FS stacked on the NFS, the VOP_GETPAGES and VOP_PUTPAGES
*ARE PASSED DOWN*.  The entire purpose of the ACLFS is to impose
access semantics on the underlying vnode objects.  The way it does
this is by a namespace escape of an access control file that it itself
accesses by way of direct calls to the underlying NFS.  A (really)
hidden file, in other words.

If we stack a QUOTA FS on top of this, and expose it in another
location in the directory hierarchy, the same rules apply.  It
probably even uses a similar namespace escape to hide a quota file at
the root of the directory hierarchy for the mounted FS.

If we stack a RESOURCE FS that turns file creations into directory
creations, and supports VOP_LOOKUP based inherited flagging semantics
on top of this, the same rules apply, but it's a little more complex.
When a file is created, it actually creates a directory with the file
name, with an internal file named something like "filedata" (leaving
the 4 character upper case namespace for resources in the "resource
fork", or the "_xxx" namespace with an underscore prefix for OS/2
extended attribute forks for the file).

In each case, however, the requests to obtain the page for reading and
writing go down to the cnode object off of which it is hung.

OK, that's most FS's for which metadata is tunneled; that's pretty
easy to understand.
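The "passed down" behavior of a pure semantic layer like the ACLFS can
be sketched as follows.  This is an illustrative model under assumed
names (aclfs_node, access_allowed, the function-pointer dispatch), not
the actual FreeBSD stacking interface: the upper vnode holds only
credential state and a pointer to the lower vnode, and paging requests
are forwarded rather than satisfied from a VM object of its own.

```c
#include <assert.h>
#include <string.h>

struct vnode;
typedef int getpages_t(struct vnode *vp, char *page, int pindex);

struct vnode {
	getpages_t *v_getpages;	/* stand-in for the vop vector */
	void *v_private;	/* layer-specific data */
};

struct aclfs_node {
	struct vnode *lowervp;	/* vnode in the underlying (e.g. NFS) FS */
	int access_allowed;	/* stand-in for the ACL file lookup result */
};

/* The ACL layer imposes access semantics, then passes the request
 * down; no page is cached or copied at this layer. */
static int
aclfs_getpages(struct vnode *vp, char *page, int pindex)
{
	struct aclfs_node *ap = vp->v_private;

	if (!ap->access_allowed)
		return (13);	/* EACCES */
	return (ap->lowervp->v_getpages(ap->lowervp, page, pindex));
}

/* Trivial bottom layer that owns the actual data. */
static int
lowerfs_getpages(struct vnode *vp, char *page, int pindex)
{
	(void)pindex;
	memcpy(page, vp->v_private, 4);
	return (0);
}
```

Because the upper vnode never holds pages, there is nothing at that
layer to fall out of coherency with the lower layer.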
And in the last case, it doesn't really make sense to expose the FS
hierarchy underlying the RESOURCE FS layer, because it exposes you to
non-cache based namespace coherency problems (e.g., how do you handle
someone CD'ing into a directory that's a file, and removing the
"filedata" file containing the file data fork, without also removing
the associated resource fork or extended attribute?  ...  You can't,
and the VM cache coherency protocol you propose won't handle this
non-VM coherency problem, either).

Part II:  The case where cache coherency is a real issue

So now we build a cryptographic FS.  It uses any CDROM as a one time
pad, and does duplicate elimination on the CDROM data so that runs of
identical data, especially 0's, are not adjacent, which would
otherwise permit statistical analysis.  At the same time, it XOR's in
a repeating password so that pattern data is not differentially
analyzable (we could fix this by using peephole techniques to
machine-eliminate repeating patterns, and deal with common phrase
elimination -- English speakers' CDROMs probably contain English text,
etc. -- but we're going to be lazy about the implementation).

The OTPFS (One Time Pad FS) has vnode objects that stack on top of the
underlying vnode objects.

Now we have two problems:

(1)	The coherency issue can not be dealt with via aliases, because
	the data in the decrypted form of a page can not be used.  In
	other words, there is no direct alias between one page and the
	other.  They are *procedurally* related, but not
	content-identical (we used a OTP to get around the issue of
	N:M byte relationships where N != M).  Nevertheless, we must
	deal with read and write faults, and update the upper pages on
	the former and the lower pages on the latter.

(2)	Because this is sensitive data, it should not be written to
	persistent storage.  That means that the anonymous pages can't
	be that anonymous.  The pages can't be backed by persistent
	storage, only memory, and only memory that is protected from
	view by other processes.
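The *procedural* relationship between the upper and lower pages in the
OTPFS can be made concrete.  The sketch below assumes a pad buffer
already read from the CDROM at a recorded offset, plus the repeating
password; the function name and signature are hypothetical.  The only
real content is the transform itself: XOR with the pad and password.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Transform one page between its encrypted (lower) and decrypted
 * (upper) forms.  XOR is its own inverse, so applying the transform
 * twice with the same pad and password recovers the original bytes;
 * that is exactly why the two pages can never be simple aliases of
 * one another -- their contents differ byte for byte. */
static void
otp_transform(unsigned char *page, const unsigned char *pad,
    const unsigned char *pw, size_t pwlen, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		page[i] ^= pad[i] ^ pw[i % pwlen];
}
```

A read fault on the upper object must run this over the lower page; a
write fault must run it in the other direction before the data can be
pushed down, consuming fresh pad bytes as described below.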
So you can't use a file or swap as the backing store for any dirty
unencrypted pages; instead, you must reencrypt dirty pages and store
them out.  Since you are using a OTP, if you used the same offset on
the CDROM to do this, then you would compromise the pad.  Therefore,
you must store metadata with the offset into the pad, as well.
Probably, you want to *not* write dirty data for as long as possible,
since each write of a page will eat another 4k of your pad.

Neither of these is amenable to the standard VM object/VM object
alias solution.

The only way to deal with the fault issue is procedurally.  If an
access is done at the intermediate layer (say because you don't want
to send cleartext over the net between a remote accessor and the
machine on which the data is stored), then the "getpages" needs to
operate on the underlying object.  In other words, a cached copy must
be invalidated.

Luckily, we do not keep a true cached copy.  We merely need to check
the underlying page against the upper page for timestamp when a
getpages occurs on the upper level page.  If the lower level page (and
pad offset) has changed out from under it, then the page is updated
from the lower level page.

Again, we see that we must provide procedural access for page
contents.  The cached copy is dependent on a lower page reference,
even if it does not result in a pad translation.

We could simplify this considerably by requiring a pad translation for
each unencrypted page reference.  We don't do this for two reasons:
first, because it's overhead; second, because we need to prove to
ourselves that we can handle the coherency issue for multiple
accessors at multiple levels, without resorting to VM page aliases.

Part III:  Where do we need aliases?

Aliases would be useful if we wanted to tunnel page mappings between
an underlying VM object and an upper level VM object using the FS
which owns the underlying object as a file store object for
potentially non-adjacent physical blocks.
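The timestamp revalidation step can be sketched in isolation.  This is
a toy model under assumed names (the structures, the mtime field, the
rebuild counter are all illustrative): on each upper-level getpages,
the cached cleartext is checked against the lower page's timestamp and
re-derived only when the lower page has changed out from under it.

```c
#include <assert.h>

struct lower_page {
	long mtime;		/* lower page modification stamp */
	unsigned char data;	/* one byte stands in for a page */
};

struct upper_page {
	long lower_mtime;	/* stamp observed at last rebuild */
	unsigned char clear;	/* cached decrypted contents */
};

static int rebuilds;		/* counts pad translations performed */

/* Upper-level getpages: revalidate, and rebuild only if stale. */
static void
otp_getpages(struct upper_page *up, const struct lower_page *lp,
    unsigned char pad)
{
	if (up->lower_mtime != lp->mtime) {
		up->clear = lp->data ^ pad;	/* re-derive cleartext */
		up->lower_mtime = lp->mtime;
		rebuilds++;
	}
}
```

This is the sense in which no true cached copy is kept: every upper
reference is dependent on a lower page reference, but a pad
translation is only paid when the stamp has moved.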
In other words, it's a VN device (file as block device) optimization
that isn't strictly necessary, and, given the potential non-adjacency
of the underlying blocks, probably not a useful one.  The mapping
maintenance overhead will drive the cost up to the point that the
optimization has no value in all but special (contiguous) cases.

Aliases would also be useful if one vnode directly represented the
blocks of some underlying vnode.  But the savings here are minimal; it
would be trivial to cache the FINALVP (the vnode pointer that has the
underlying VM object association) as a vnode pointer instead of a VM
object alias.  Dereferencing a vnode to get an alias object, and
dereferencing the alias object to get the actual VM object, is no less
expensive than dereferencing a vnode to get a vnode, and then
dereferencing that vnode to get the actual VM object.

Moreover, this type of optimization assumes a great depth of stacked
vnodes.  While this might occur in some specialized cases, this type
of optimization is best left as an option for the VFS implementor.
Indeed, one could easily envision an "ALIASFS" layer, whose sole
reason for existence was to provide a vnode that cached the underlying
vnode that contained the VM object, many layers below itself.  A much
saner implementation than introducing aliases everywhere in the
expectation of a performance win of a double dereference over a stack
traversal.

Part IV:  Conclusion

So we don't need aliases in almost any cases.  In the general case of
an object that aliases an object, a special caching layer, or per
layer caching, can be employed.  Those cases are rare.

In transformational layers, such as our OTPFS, the aliases are useless
because the data is not the same, and, in fact, increased anonymity of
VM resources is counterproductive to the purposes of the FS.  It
damages their ability to do what they are intended to do.
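The FINALVP caching idea behind the hypothetical "ALIASFS" amounts to
very little code.  In this sketch (all names assumed, not a real
interface), a vnode either carries a cached pointer to the bottom-most
vnode that owns the VM object, or the stack is walked to find it; the
cached case costs exactly one dereference, which is the entire claimed
win over object aliases.

```c
#include <assert.h>
#include <stddef.h>

struct vnode {
	struct vnode *v_lower;	/* next layer down; NULL at the bottom */
	struct vnode *v_finalvp;	/* cached bottom vnode, or NULL */
};

/* Find the vnode that has the underlying VM object association. */
static struct vnode *
final_vp(struct vnode *vp)
{
	if (vp->v_finalvp != NULL)	/* cached: one dereference */
		return (vp->v_finalvp);
	while (vp->v_lower != NULL)	/* else traverse the stack */
		vp = vp->v_lower;
	return (vp);
}
```

An ALIASFS layer would simply populate v_finalvp once for the layers
above it, leaving every other layer free of aliasing machinery.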
In pure semantic layers, and even in semantic layers that tunnel their
information, like our ACLFS, our QUOTAFS, and our RESOURCEFS, it's
useless because it would be counterproductive to use real vnodes at
these layers in the first place.  Using real vnodes in these layers
would, in fact, add needless translational complexity (as we see in
the 1992 code in /sys/miscfs/nullfs/nullfs_vnops.c to support the ugly
nullfs_bypass() VOP -- something not necessary on other platforms
because of more correct paging architecture).

In other words, we don't need aliases, except in cases where we
introduce unnecessary abstraction and complexity, for the apparent
purpose of requiring aliases.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message