From owner-freebsd-fs Wed Dec 16 13:18:28 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id NAA13293 for freebsd-fs-outgoing; Wed, 16 Dec 1998 13:18:28 -0800 (PST) (envelope-from owner-freebsd-fs@FreeBSD.ORG)
Received: from pail.scd.ucar.edu (pail.scd.ucar.edu [128.117.28.5]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id NAA13284 for ; Wed, 16 Dec 1998 13:18:25 -0800 (PST) (envelope-from rousskov@nlanr.net)
Received: from localhost (rousskov@localhost) by pail.scd.ucar.edu (8.8.7/8.8.7) with SMTP id OAA29627 for ; Wed, 16 Dec 1998 14:17:57 -0700 (MST) (envelope-from rousskov@nlanr.net)
X-Authentication-Warning: pail.scd.ucar.edu: rousskov owned process doing -bs
Date: Wed, 16 Dec 1998 14:17:57 -0700 (MST)
From: Alex Rousskov
X-Sender: rousskov@pail.scd.ucar.edu
Reply-To: Alex Rousskov
To: FreeBSD-FS@FreeBSD.ORG
Subject: Fast/slow close and open
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Hi there,

I have a strange problem that is probably FS related. Any help is appreciated.

Background:
----------

- A small program ("player") replays a trace from Squid Web proxy running under Web Polygraph benchmark
- The trace consists of ~ 300K of open(2)/write(2)/close(2) file system calls
- The calls may interleave (e.g.
open(#5), write(#3), write(#5), close(#3), write(#5)) but all FS calls are blocking (no threads and such)
- No artificial delays between the calls are introduced
- Most files are "small": 11K mean, 7.5K median
- No other activity on the system
- FreeBSD 2.2.7-RELEASE
- 256 MB RAM; Pentium II (267.27-MHz 686-class CPU)
- kern.update is set to 43200 (12 hours) (for these tests I do not care about FS consistency)
- 9GB disk(s) ["IBM DDRS-39130W S92A" type 0 fixed SCSI 2; Direct-Access 8715MB] with one partition per disk
- newfs -o time; mount options: rw,noauto
- each disk has /cache?/??/???/ directories pre-created (1x16x128 directories); no other data on the disk
- Squid (and hence player) fill leaf directories with files one leaf directory at a time (until there are 128 files in the directory)
- Disk space utilization starts at 0% and is 20% at the end of each experiment

The problem:
-----------

The player measures open/write/close delays using gettimeofday() calls wrapped around file system calls. I observe sudden peaks and dives in the throughput of close and open calls:

	http://ircache.nlanr.net/Polygraph/tmp/

During peaks, close(2) response time DEcreases from a mean of 17 msec to tens of usec(!) and open(2) response time INcreases from 20 to 27 msec or so. During dives, both response times increase 50-400%. Write(2) response time is very fast (e.g., 300 usec mean) and is virtually not affected by the peaks and dives.

There are no peaks for 3- and 4-disk experiments. Dives are present on 1-, 2-, 3-, and 4-disk runs.

There are up to three bursts within a ~2-3 hour experiment so it is hard to say if they occur at "close-to-regular" intervals. Usually the bursts are 40-80 minutes apart. Each burst lasts 5-10 minutes (9K-18K open/close calls) so it is not a "random" thing.

The same behavior was measured on Squid. The player program is very small and simple and confirms that those oddities are not caused by something in Squid or in the network.
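
[Archive note: the measurement method described above, gettimeofday() calls wrapped around each file system call, can be sketched roughly as follows. This is not the player's source, which is not part of this thread. The names op_times, elapsed_usec and time_one_file are hypothetical, and the sketch handles one file at a time, whereas the real trace interleaves calls on several files.]

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Per-call delays in microseconds (hypothetical struct, not from the player). */
struct op_times {
    long open_usec;
    long write_usec;
    long close_usec;
};

/* Microseconds elapsed between two gettimeofday() samples. */
long elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    return (long)(end->tv_sec - start->tv_sec) * 1000000L +
           (long)(end->tv_usec - start->tv_usec);
}

/*
 * Create `path`, write `nbytes` bytes to it, close it, and record how long
 * each of the three calls took. The file is removed afterwards.
 * Returns 0 on success and -1 on any failure.
 */
int time_one_file(const char *path, size_t nbytes, struct op_times *out)
{
    struct timeval t0, t1, t2, t3;
    ssize_t written;
    char *buf;
    int fd, rc;

    assert(path != NULL && out != NULL && nbytes > 0);

    buf = malloc(nbytes);
    if (buf == NULL)
        return -1;
    memset(buf, 'x', nbytes);

    /* Sample the clock immediately before and after each call. */
    gettimeofday(&t0, NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    gettimeofday(&t1, NULL);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    written = write(fd, buf, nbytes);
    gettimeofday(&t2, NULL);
    free(buf);

    rc = close(fd);
    gettimeofday(&t3, NULL);
    unlink(path);

    if (written != (ssize_t)nbytes || rc != 0)
        return -1;

    out->open_usec = elapsed_usec(&t0, &t1);
    out->write_usec = elapsed_usec(&t1, &t2);
    out->close_usec = elapsed_usec(&t2, &t3);
    return 0;
}
```

[Calling time_one_file() in a loop with sizes near the 11K mean and logging the three fields per file would give the kind of per-call series in which the peaks and dives show up. gettimeofday() reports wall-clock time, so a clock adjustment during a run would distort the numbers.]
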
Question:
--------

What kind of system activity can suddenly and significantly speed up or slow down open/close calls? Since it happens once in 40-80 minutes and lasts for several minutes (with 30-50 open_calls/sec rate) it is probably something major. I am especially confused by close calls being almost zero-overhead in the middle of a big experiment. It looks like some write-behind buffers suddenly appear out of nowhere, get used (speeding up close calls), and then disappear. Dives are also very disturbing as they hurt overall performance...

Any clues?

Thank you,

Alex.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message

From owner-freebsd-fs Fri Dec 18 09:53:40 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id JAA05687 for freebsd-fs-outgoing; Fri, 18 Dec 1998 09:53:40 -0800 (PST) (envelope-from owner-freebsd-fs@FreeBSD.ORG)
Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id JAA05682 for ; Fri, 18 Dec 1998 09:53:38 -0800 (PST) (envelope-from ezk@shekel.mcl.cs.columbia.edu)
Received: from shekel.mcl.cs.columbia.edu (shekel.mcl.cs.columbia.edu [128.59.18.15]) by cs.columbia.edu (8.9.1/8.9.1) with ESMTP id MAA15280 for ; Fri, 18 Dec 1998 12:53:28 -0500 (EST)
Received: (from ezk@localhost) by shekel.mcl.cs.columbia.edu (8.9.1/8.9.1) id MAA05461; Fri, 18 Dec 1998 12:53:27 -0500 (EST)
Date: Fri, 18 Dec 1998 12:53:27 -0500 (EST)
Message-Id: <199812181753.MAA05461@shekel.mcl.cs.columbia.edu>
X-Authentication-Warning: shekel.mcl.cs.columbia.edu: ezk set sender to ezk@shekel.mcl.cs.columbia.edu using -f
From: Erez Zadok
To: freebsd-fs@FreeBSD.ORG
Subject: nullfs bugs
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Hello all. As this message is my first on this list, it unfortunately has to be long. My apologies in advance. Before I go into details, I'll give a quick overview.
* Brief overview: My research involves stackable file systems. I've written several stackable file systems for a few unix platforms (freebsd, linux, and solaris). I fixed nullfs in freebsd 3.0, but the fixes are only workarounds to more serious bugs. I'm seeking help from this list in finding the real bugs in freebsd and solving them correctly, to eventually include in an official freebsd distribution. Now on to the details. * Introduction My name is Erez Zadok, and I'm a PhD student at Columbia University, studying Comp. Sci. You may have heard my name as the maintainer of am-utils (aka amd.) I've worked with file systems for 9 years now. I've worked with freebsd kernels for 3+ years, but have only recently joined freebsd-{fs,announce}. My research involves generating stackable file systems out of a higher level description language. One key component is a template file system I call wrapfs (wrapper file system). Wrapfs includes hooks users can use to modify file data, names, and their attributes. Wrapfs is similar to lofs/nullfs, but it also copies data/pages/names between the upper and lower layers, includes hooks for a code generator, and more. I started writing Wrapfs in Solaris 2.x, based on their lofs. Then I moved on to Linux 2.0 using a reference implementation of an lofs someone had written. After that I ported wrapfs to freebsd 3.0 using nullfs as a starting point, and finally ported wrapfs to Linux 2.1. Once I had wrapfs for each platform, I wrote actual file systems using it. I wrote a simple encryption f/s called rot13fs, and then a stronger one called cryptfs (using Blowfish.) I wrote a few of other file systems based on wrapfs, all of which are described in a few papers I've written and the sources I've released (see below for URLs). * nullfs for FreeBSD 3.0 When I started with nullfs on freebsd 3.0 (the May 98 snapshot) I found out that it was not a complete file system. Some VFS operations were left unimplemented, most notably the MMAP ones. 
I could mount nullfs, but trying to do any MMAP operation (such as executing a binary), and the kernel panics. So I added the missing functionality to a point where you could do all operations. As a test I usually configure and build am-utils inside the new f/s (those who've built am-utils know it has a rather lengthy configure and build process, which makes it a good file system exerciser.) ** Bugs in Nullfs I fixed two major bugs in nullfs: (1) Asynchronous writes: The vanilla nullfs has a serious bug where if you write a large file (3MB or more) through it, several pages of the file are written as zeros to the lower f/s. I tried various machines running freebsd 3.0, and different disks and CPU speeds. In all cases I got the same data corruption. The best "fix" I could find was to force the underlying write to happen synchronously: error = VOP_WRITE(lower_vp, &temp_uio, (ioflag | IO_SYNC), cr); That solved the problem, but obviously it hurts write performance since now all writes through nullfs have to be done synchronously, even for writing one byte. My best guess for the reason for this bug is that there might be a race condition b/t the file system and the buffer cache or even the MMU, and that some sort of locking/synchronization is needed to avoid the race. I'm familiar with the f/s code in freebsd, and have become very familiar with the vfs/fs code in linux and solaris --- enough to know that this freebsd bug is likely not the fault of my code. Alas, there are vast areas of the rest of the kernel I'm not familiar with. I want to fix the bug correctly if possible, and allow nullfs to write asynchronously, but I'm not sure where to look at. If anyone has any ideas how to go about finding and fixing the bug, I'll be happy to work w/ them to fix the problem and eventually submit it for inclusion in a future freebsd release. (2) Getpages/Putpages: The second bug is even stranger. 
Initially, I had the implementation of getpages and putpages call the same VOP on lowervp, with newly allocated pages. But then under heavy loads I got obscure problems that seem to come from deep inside UFS. It sometimes will return from ffs_getpages() (in ufs_readwrite.c) with an invalid page, or one that's marked as deadc0de. I tried to make sense of that ufs/ffs code, and I think that somewhere either nullfs or the higher level vfs aren't locking or synchronizing something they should be. I "fixed" the problem with getpages, by implementing it using read(), so now it works reliably, but with a suboptimal data access interface. Having implemented getpages() using read() forced me to implement writepages() using write(), b/c otherwise the getpages and putpages didn't seem to work well together (possibly b/c of interaction b/t [buffer] caches, MMU, etc.) But recall that in order to solve bug #1, I made write() synchronous. So now all putpages() have become synchronous as well. Like I said before, these fixes of mine are but workarounds. Some might consider them hacks. But they do make nullfs fully functional at least. If anyone has any idea how to fix this MMAP related bug, please let me know. Frankly, I have a feeling that the two bugs I'm reporting here may be related, and that fixing bug #1 would be easier and may impact the solution to bug #2. * URLs Here's some info to those who want to read more about the subject. Stackable f/s software for freebsd, solaris, and linux: http://www.cs.columbia.edu/~ezk/research/software/ Papers I've written about some of the f/s in the s/w page: http://www.cs.columbia.edu/~ezk/research/wip.html Thanks, Erez Zadok. --- Columbia University Department of Computer Science. 
EMail: ezk@cs.columbia.edu Web: http://www.cs.columbia.edu/~ezk To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Dec 18 13:42:28 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id NAA04953 for freebsd-fs-outgoing; Fri, 18 Dec 1998 13:42:28 -0800 (PST) (envelope-from owner-freebsd-fs@FreeBSD.ORG) Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id NAA04948 for ; Fri, 18 Dec 1998 13:42:27 -0800 (PST) (envelope-from tlambert@usr09.primenet.com) Received: (from daemon@localhost) by smtp04.primenet.com (8.8.8/8.8.8) id OAA27859; Fri, 18 Dec 1998 14:42:14 -0700 (MST) Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp04.primenet.com, id smtpd027680; Fri Dec 18 14:42:06 1998 Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5) id OAA11441; Fri, 18 Dec 1998 14:41:55 -0700 (MST) From: Terry Lambert Message-Id: <199812182141.OAA11441@usr09.primenet.com> Subject: Re: nullfs bugs To: ezk@cs.columbia.edu (Erez Zadok) Date: Fri, 18 Dec 1998 21:41:55 +0000 (GMT) Cc: freebsd-fs@FreeBSD.ORG In-Reply-To: <199812181753.MAA05461@shekel.mcl.cs.columbia.edu> from "Erez Zadok" at Dec 18, 98 12:53:27 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > * nullfs for FreeBSD 3.0 > > When I started with nullfs on freebsd 3.0 (the May 98 snapshot) I found out > that it was not a complete file system. Some VFS operations were left > unimplemented, most notably the MMAP ones. I could mount nullfs, but trying > to do any MMAP operation (such as executing a binary), and the kernel > panics. Right. Here's the scoop. Right now in FreeBSD, a vnode is treated as a backing object, and a backing object is a mapping. 
This is a consequence of a unified VM and buffer cache. When you have a vnode stacked on another vnode, you have an aliasing problem to resolve: which vnode has the correct page information hung off of it? > ** Bugs in Nullfs [ ... in reverse order ... ] > (2) Getpages/Putpages: > > The second bug is even stranger. Initially, I had the implementation of > getpages and putpages call the same VOP on lowervp, with newly allocated > pages. But then under heavy loads I got obscure problems that seem to come > from deep inside UFS. It sometimes will return from ffs_getpages() (in > ufs_readwrite.c) with an invalid page, or one that's marked as deadc0de. I > tried to make sense of that ufs/ffs code, and I think that somewhere either > nullfs or the higher level vfs aren't locking or synchronizing something > they should be. Right. This is confusion about the backing object, per the above. > I "fixed" the problem with getpages, by implementing it using read(), so now > it works reliably, but with a suboptimal data access interface. > > Having implemented getpages() using read() forced me to implement > writepages() using write(), b/c otherwise the getpages and putpages didn't > seem to work well together (possibly b/c of interaction b/t [buffer] caches, > MMU, etc.) But recall that in order to solve bug #1, I made write() > synchronous. So now all putpages() have become synchronous as well. > > Like I said before, these fixes of mine are but workarounds. Some might > consider them hacks. But they do make nullfs fully functional at least. If > anyone has any idea how to fix this MMAP related bug, please let me know. These fixes will actually only work for a stack that is exactly one layer deep. This is because the lower_vp is the object off of which the pages are actually hung. If you were to use this on a nullfs on top of a nullfs, then you would probably see some errors (unless you implemented read in terms of VOP_GETPAGES). 
The reason for this is that your read is creating a copy of the data that is hung off the lower_vp, and then returning it to a user buffer.

The problem here is that the top layer is going to issue a similar read to the middle layer, and it's going to fail because there is no backing object in the middle layer (only in the bottom layer).

This can be brute-forced to work (I believe Tor Egge is the one who did this at one time?) by instancing a backing object in the intermediate layers.

The reason this works with the read/write and not with the getpages and putpages is that you establish a copy instead of an alias. Using copies like this introduces cache coherency problems similar to those in a non-unified VM and buffer cache, and given the unification in FreeBSD, FreeBSD is pretty much totally unprepared to deal with maintaining coherency at this level, especially if a namespace is exposed to the user both above and below a stacking layer (e.g., with an ACL or cryptographic FS).

The general solution to this, which has been discussed by John Heidemann, John Dyson, Michael Hancock, Eivind Ecklund, Kirk McKusick, and myself at various times in the past, is to get rid of the aliases.

The only way to effectively do that is to provide a mechanism for an upper layer to ask for the vp of the backing object that's actually backing the vm, instead of the top level object. The main one that has been discussed is called VOP_GETFINALVP, or, more correctly, VOP_GETBACKINGVP.

This can actually be implemented at low cost, since the only layer that really cares about doing the call is a layer with a VFS interface on both the top and the bottom. So it doesn't affect NFS client code (a VFS provider), the FFS code (a VFS provider, like all local media file systems), the NFS server code (a VFS consumer), or the system call layer (another VFS consumer).
So basically, only the stacking layers take this hit, and then only in the case that they are doing data translation (crypto/compression) or object proxying.

This is probably the best way to resolve this problem, since it hides the details of the VM implementation from the stacking layers. Even if you were to use a non-unified VM and buffer cache (e.g. SVR4), you would want to isolate the dependency on VM and buffer cache interaction so as to reduce the amount of system dependency in the code. So this is a win either way.

> (1) Asynchronous writes:
>
> The vanilla nullfs has a serious bug where if you write a large file (3MB or
> more) through it, several pages of the file are written as zeros to the
> lower f/s. I tried various machines running freebsd 3.0, and different
> disks and CPU speeds. In all cases I got the same data corruption.

Yes. This is an alias problem, where the coherence between the upper and lower level objects is not being maintained. This happens because there is no read-before-write, as there would be with a normal FS block on FS blocksize boundaries.

To confirm this, verify the size and offset of the corrupted extents (this should be a pretty trivial exercise).

> The best "fix" I could find was to force the underlying write to happen
> synchronously:
>
> error = VOP_WRITE(lower_vp, &temp_uio, (ioflag | IO_SYNC), cr);
>
> That solved the problem, but obviously it hurts write performance since now
> all writes through nullfs have to be done synchronously, even for writing
> one byte.

Yeah. This is an explicit synchronization, which happens to ensure cache coherency between the two backing objects, when there should only be one backing object.

> My best guess for the reason for this bug is that there might be a race
> condition b/t the file system and the buffer cache or even the MMU, and that
> some sort of locking/synchronization is needed to avoid the race.
Again, the answer is to avoid everything by explicit coherency, and the way to do it is to eliminate the aliases, and, in this particular case, the cached copies of partial data.

> I'm familiar with the f/s code in freebsd, and have become very familiar
> with the vfs/fs code in linux and solaris --- enough to know that this
> freebsd bug is likely not the fault of my code. Alas, there are vast areas
> of the rest of the kernel I'm not familiar with. I want to fix the bug
> correctly if possible, and allow nullfs to write asynchronously, but I'm not
> sure where to look at.

Well, then you have to know that the FreeBSD code is a hell of a lot more flexible and useful, if done right. 8-).

These issues are pretty well understood, but there needs to be an architectural pass over the code with a view toward stacking. This has actually been my own pet hobby horse for at least a number of (3) years now. It's to the point that enough people understand the issues and the problems that this is becoming a political possibility.

> Frankly, I have a feeling that the two bugs I'm reporting here may be
> related, and that fixing bug #1 would be easier and may impact the solution
> to bug #2.

Actually, #2 would be easiest, and would result in #1 being fixed as well, by eliminating the potential coherency race that comes from using the fault handler instead of an explicit copy (read).

I'm going to be intentionally incommunicado for a while, as I'm going on vacation, but I'll probably break down and read my email once or twice, so if you have something needing immediate clarification, you can send me email, but I may not respond before the first of the year.

Other people to contact who appear to be actively interested in solving these issues are Eivind Ecklund and Michael Hancock, so they may be good bets as well.

Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present or previous employers.
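
[Archive note: the check suggested in the message above, looking at the size and offset of the corrupted extents, could be done along these lines. This is only a sketch. The names extent, find_zero_extents and extent_is_aligned are hypothetical, the caller is assumed to have read the file written through nullfs into memory already, and the alignment unit (a page size or an FS block size) is something the caller must choose. If the known-good data contains zeros next to a corrupted region, the reported run will be longer than the damage.]

```c
#include <assert.h>
#include <stddef.h>

/* One run of zero bytes: where it starts and how long it is. */
struct extent {
    size_t offset;
    size_t length;
};

/*
 * Scan `data` for runs of at least `min_run` consecutive zero bytes.
 * Up to `max_out` runs are stored in `out`; the return value is the total
 * number of qualifying runs found, which may exceed `max_out`.
 */
size_t find_zero_extents(const unsigned char *data, size_t len, size_t min_run,
                         struct extent *out, size_t max_out)
{
    size_t found = 0;
    size_t i = 0;

    assert(data != NULL || len == 0);
    assert(out != NULL || max_out == 0);

    while (i < len) {
        size_t start;

        if (data[i] != 0) {
            i++;
            continue;
        }
        /* Walk to the end of this zero run. */
        start = i;
        while (i < len && data[i] == 0)
            i++;
        if (i - start >= min_run) {
            if (found < max_out) {
                out[found].offset = start;
                out[found].length = i - start;
            }
            found++;
        }
    }
    return found;
}

/* Nonzero if the extent starts on, and spans a whole multiple of, `unit` bytes. */
int extent_is_aligned(const struct extent *e, size_t unit)
{
    return unit != 0 && e->offset % unit == 0 && e->length % unit == 0;
}
```

[Writing a test file that contains no zero bytes at all makes every reported extent a corruption candidate. Extents that are consistently aligned to the chosen unit would be in line with the alias explanation given above; irregular offsets would point elsewhere.]
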
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Dec 18 14:19:06 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id OAA09242 for freebsd-fs-outgoing; Fri, 18 Dec 1998 14:19:06 -0800 (PST) (envelope-from owner-freebsd-fs@FreeBSD.ORG) Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id OAA09237 for ; Fri, 18 Dec 1998 14:19:03 -0800 (PST) (envelope-from ezk@shekel.mcl.cs.columbia.edu) Received: from shekel.mcl.cs.columbia.edu (shekel.mcl.cs.columbia.edu [128.59.18.15]) by cs.columbia.edu (8.9.1/8.9.1) with ESMTP id RAA01964; Fri, 18 Dec 1998 17:18:52 -0500 (EST) Received: (from ezk@localhost) by shekel.mcl.cs.columbia.edu (8.9.1/8.9.1) id RAA12135; Fri, 18 Dec 1998 17:18:52 -0500 (EST) Date: Fri, 18 Dec 1998 17:18:52 -0500 (EST) Message-Id: <199812182218.RAA12135@shekel.mcl.cs.columbia.edu> X-Authentication-Warning: shekel.mcl.cs.columbia.edu: ezk set sender to ezk@shekel.mcl.cs.columbia.edu using -f From: Erez Zadok To: Terry Lambert Cc: ezk@cs.columbia.edu (Erez Zadok), freebsd-fs@FreeBSD.ORG Subject: Re: nullfs bugs In-reply-to: Your message of "Fri, 18 Dec 1998 21:41:55 GMT." <199812182141.OAA11441@usr09.primenet.com> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Thanks, there's a lot of info in your message that I have to digest. I'll probably have to re-read it several times while keeping freebsd sources close at hand. When I'm done you'd probably be back from vacation... :-) I'll try to comment on the rest of the message later. I agree that we should have a mini-design pass before seriously implementing anything of the sort. But I may still take a stab at it, at least to see how complicated the work is and outline potential trouble spots. In all of the ports I've done, I tried very hard to avoid changing the rest of the OS, esp. 
in a way that would require making changes to other file systems. I was able to have a wrapper file system (and a crypto f/s) on freebsd and solaris w/o changing them, and on linux only had one small change that didn't affect anything else. Being new here, let me ask this beginner question. How receptive are the freebsd developers to accepting such fixes, given that the changes won't be trivial. In particular, is there a chance they'd be incorporated into 3.0 for a near-future release? I'm asking this b/c I don't wish to have to maintain a different set of kernel sources for too long. Erez. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Dec 18 20:20:10 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id UAA18885 for freebsd-fs-outgoing; Fri, 18 Dec 1998 20:20:10 -0800 (PST) (envelope-from owner-freebsd-fs@FreeBSD.ORG) Received: from gatekeeper.tsc.tdk.com (gatekeeper.tsc.tdk.com [207.113.159.21]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id UAA18880 for ; Fri, 18 Dec 1998 20:20:07 -0800 (PST) (envelope-from gdonl@tsc.tdk.com) Received: from sunrise.gv.tsc.tdk.com (root@sunrise.gv.tsc.tdk.com [192.168.241.191]) by gatekeeper.tsc.tdk.com (8.8.8/8.8.8) with ESMTP id UAA25520; Fri, 18 Dec 1998 20:19:45 -0800 (PST) (envelope-from gdonl@tsc.tdk.com) Received: from salsa.gv.tsc.tdk.com (salsa.gv.tsc.tdk.com [192.168.241.194]) by sunrise.gv.tsc.tdk.com (8.8.5/8.8.5) with ESMTP id UAA22374; Fri, 18 Dec 1998 20:19:44 -0800 (PST) Received: (from gdonl@localhost) by salsa.gv.tsc.tdk.com (8.8.5/8.8.5) id UAA11408; Fri, 18 Dec 1998 20:19:42 -0800 (PST) From: Don Lewis Message-Id: <199812190419.UAA11408@salsa.gv.tsc.tdk.com> Date: Fri, 18 Dec 1998 20:19:42 -0800 In-Reply-To: Terry Lambert "Re: nullfs bugs" (Dec 18, 9:41pm) X-Mailer: Mail User's Shell (7.2.6 alpha(3) 7/19/95) To: Terry Lambert , ezk@cs.columbia.edu (Erez Zadok) Subject: Re: nullfs bugs Cc: 
freebsd-fs@FreeBSD.ORG Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Dec 18, 9:41pm, Terry Lambert wrote: } Subject: Re: nullfs bugs } Right now in FreeBSD, a vnode is treated as a backing object, and a } backing object is a mapping. } } This is a consequence of a unified VM and buffer cache. } > I "fixed" the problem with getpages, by implementing it using read(), so now } > it works reliably, but with a suboptimal data access interface. } > } > Having implemented getpages() using read() forced me to implement } > writepages() using write(), b/c otherwise the getpages and putpages didn't } > seem to work well together (possibly b/c of interaction b/t [buffer] caches, } > MMU, etc.) But recall that in order to solve bug #1, I made write() } > synchronous. So now all putpages() have become synchronous as well. } > } > Like I said before, these fixes of mine are but workarounds. Some might } > consider them hacks. But they do make nullfs fully functional at least. If } > anyone has any idea how to fix this MMAP related bug, please let me know. } } These fixes will actually only work for a stack that is exactly one } layer deep. This is because the lower_vp is the object off of which } the pages are actually hung. } } If you were to use this on a nullfs on top of a nullfs, then you } would probably see some errors (unless you implemented read in } terms of VOP_GETPAGES). } } The reason for this is that your read is creating a copy of the data } that is hung off the lower_vp, and then returning it to a user buffer. I did something similar when I was hacking nullfs to somewhat work in a private version of 2.1.x. It worked to some extent, but had cache coherence problems. } The problem here is that the top layer is going to issue a similar } read to the middle layer, and it's going to fail because there is } no backing object in the middle layer (only in the bottom layer). 
}
} This can be brute-forced to work (I believe Tor Egge is the one who
} did this at one time?) by instancing a backing object in the intermediate
} layers.

Eivind has some patches that work something like this.

} The general solution to this, which has been discussed by John
} Heidemann, John Dyson, Michael Hancock, Eivind Ecklund, Kirk McKusick,
} and myself at various times in the past is to get rid of the aliases.
}
}
} The only way to effectively do that is to provide a mechanism for
} an upper layer to ask for the vp of the backing object that's
} actually backing the vm, instead of the top level object. The
} main one that has been discussed is called VOP_GETFINALVP, or, more
} correctly, VOP_GETBACKINGVP.

I implemented one of these a while back (though I don't even recall which name I used). The problem I ran into was that there are a number of references to vp->v_object scattered about. Eivind's patches fix those by turning them into a VOP_ (I would have used a function call that called VOP_GETwhateverVP).

I had some time to read a little more of Heidemann's paper while I was travelling a few weeks ago, and it appears that Heidemann took a somewhat different approach in his SunOS implementation. It looks like he also passes the backing vp into the VOP calls that need to access the backing object. See Appendix B of his paper. I haven't had time to look at how this would fit into the FreeBSD implementation.

} I'm going to be intentionally incommunicado for a while, as I'm going
} on vacation, but I'll probably break down and read my email once
} or twice, so if you have something needing immediate clarification,
} you can send me email, but I may not respond before the first of the
} year.
}
} Other people to contact who appear to be actively interested in
} solving these issues are Eivind Ecklund and Michael Hancock, so
} they may be good bets as well.

You can add my name to the list as well.
I need at least a somewhat working nullfs for certain applications. I'll be away from my email until the 4th, and then it will take me a few days to dig through the backlog.
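
[Archive note: the idea discussed in this thread, asking a stacked vnode for the vnode whose object really backs the pages and going through a helper instead of touching vp->v_object directly, can be illustrated with a user-space toy model. Nothing below is FreeBSD kernel code. The types toy_vnode and toy_object and the functions toy_getbackingvp and toy_vnode_object are hypothetical stand-ins, and the model assumes only the bottom layer owns an object.]

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the VM object that pages hang off. */
struct toy_object {
    int id;
};

/* Stand-in for a vnode in a stack of layers. */
struct toy_vnode {
    struct toy_vnode  *lower;   /* next layer down, NULL at the bottom */
    struct toy_object *object;  /* set only on the bottom (backing) vnode */
};

/*
 * Walk down the stack and return the vnode that actually backs the data.
 * A stack of any depth resolves to the same bottom vnode, so no
 * intermediate layer needs an aliased object of its own.
 */
struct toy_vnode *toy_getbackingvp(struct toy_vnode *vp)
{
    assert(vp != NULL);
    while (vp->lower != NULL)
        vp = vp->lower;
    return vp;
}

/*
 * Helper to use in place of a direct vp->object reference: every caller
 * gets the single object owned by the backing vnode.
 */
struct toy_object *toy_vnode_object(struct toy_vnode *vp)
{
    return toy_getbackingvp(vp)->object;
}
```

[In this model a nullfs-on-nullfs-on-UFS stack is three toy_vnode structures chained through lower, and toy_vnode_object() returns the same object for all three, which is the property the thread wants: one backing object, no aliases and no copies to keep coherent. A real implementation would also have to deal with locking and with layers that transform data, neither of which is modelled here.]
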