From owner-freebsd-fs  Wed Dec 16 13:18:28 1998
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id NAA13293
          for freebsd-fs-outgoing; Wed, 16 Dec 1998 13:18:28 -0800 (PST)
          (envelope-from owner-freebsd-fs@FreeBSD.ORG)
Received: from pail.scd.ucar.edu (pail.scd.ucar.edu [128.117.28.5])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id NAA13284
          for <FreeBSD-FS@FreeBsd.org>; Wed, 16 Dec 1998 13:18:25 -0800 (PST)
          (envelope-from rousskov@nlanr.net)
Received: from localhost (rousskov@localhost)
	by pail.scd.ucar.edu (8.8.7/8.8.7) with SMTP id OAA29627
	for <FreeBSD-FS@FreeBsd.org>; Wed, 16 Dec 1998 14:17:57 -0700 (MST)
	(envelope-from rousskov@nlanr.net)
X-Authentication-Warning: pail.scd.ucar.edu: rousskov owned process doing -bs
Date: Wed, 16 Dec 1998 14:17:57 -0700 (MST)
From: Alex Rousskov <rousskov@nlanr.net>
X-Sender: rousskov@pail.scd.ucar.edu
Reply-To: Alex Rousskov <rousskov@nlanr.net>
To: FreeBSD-FS@FreeBSD.ORG
Subject: Fast/slow close and open
Message-ID: <Pine.BSF.3.96.981216115141.29208E-100000@pail.scd.ucar.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Hi there,

	I have a strange problem that is probably FS related. Any help is
appreciated. 

Background:
----------

	- A small program ("player") replays a trace from the Squid Web proxy
	  running under the Web Polygraph benchmark
	- The trace consists of ~300K open(2)/write(2)/close(2) file
	  system calls
	- The calls may interleave
	  (e.g. open(#5), write(#3), write(#5), close(#3), write(#5))
	  but all FS calls are blocking (no threads and such)
	- No artificial delays between the calls are introduced
	- Most files are "small": 11K mean, 7.5K median
	- No other activity on the system

	- FreeBSD 2.2.7-RELEASE
	- 256 MB RAM; Pentium II (267.27-MHz 686-class CPU)
	- kern.update is set to 43200  (12 hours)
	  (for these tests I do not care about FS consistency)
	- 9GB disk(s) 
	  ["IBM DDRS-39130W S92A" type 0 fixed SCSI 2; Direct-Access 8715MB] 
	  with one partition per disk
	- newfs -o time
	  mount options: rw,noauto
	- each disk has /cache?/??/???/ directories pre-created 
	  (1x16x128 directories); no other data on the disk
	- Squid (and hence player) fill leaf directories with files 
	  one leaf directory at a time 
	  (until there are 128 files in the directory)
	- Disk space utilization starts at 0% and reaches 20%
	  by the end of each experiment


The problem:
-----------

	The player measures open/write/close delays using gettimeofday()
calls wrapped around file system calls. I see sudden peaks and dives in
the throughput of close and open calls:

	http://ircache.nlanr.net/Polygraph/tmp/

	During peaks, close(2) response time DEcreases from a mean of 17 msec
to tens of usec(!) and open(2) response time INcreases from 20 to 27 msec or
so. During dives, both response times increase 50-400%. Write(2) response
time is very fast (e.g., 300 usec mean) and is virtually not affected by the
peaks and dives. 

	There are no peaks in the 3- and 4-disk experiments. Dives are present
on 1-, 2-, 3-, and 4-disk runs. 

	There are up to three bursts within a ~2-3 hour experiment so it is
hard to say if they occur at "close-to-regular" intervals. Usually the bursts
are 40-80 minutes apart. Each burst lasts 5-10 minutes (9K-18K open/close
calls) so it is not a "random" thing.

	The same behavior was measured on Squid. The player program is very
small and simple and confirms that those oddities are not caused by something
in Squid or in the network.
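
	For reference, the measurement can be sketched as below. This is a
minimal illustration, not the player's actual source: timed_store(),
elapsed_usec(), and struct delays are made-up names, and the open(2) flags
and mode are assumptions.

```c
#include <fcntl.h>
#include <sys/time.h>
#include <unistd.h>

/* Microseconds elapsed between two gettimeofday() samples. */
static long
elapsed_usec(const struct timeval *start, const struct timeval *end)
{
	return (end->tv_sec - start->tv_sec) * 1000000L +
	    (end->tv_usec - start->tv_usec);
}

/* Per-file delays, in microseconds, for the three timed calls. */
struct delays {
	long open_usec;
	long write_usec;
	long close_usec;
};

/*
 * Create `path`, write `len` bytes from `buf`, and close the file,
 * timing each call separately.  Returns 0 on success, -1 on failure.
 * (Hypothetical helper; flags and mode are illustrative.)
 */
static int
timed_store(const char *path, const void *buf, size_t len, struct delays *d)
{
	struct timeval t0, t1;
	ssize_t n;
	int fd, rc;

	gettimeofday(&t0, NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	gettimeofday(&t1, NULL);
	if (fd < 0)
		return -1;
	d->open_usec = elapsed_usec(&t0, &t1);

	gettimeofday(&t0, NULL);
	n = write(fd, buf, len);
	gettimeofday(&t1, NULL);
	d->write_usec = elapsed_usec(&t0, &t1);

	gettimeofday(&t0, NULL);
	rc = close(fd);
	gettimeofday(&t1, NULL);
	d->close_usec = elapsed_usec(&t0, &t1);

	return (n == (ssize_t)len && rc == 0) ? 0 : -1;
}
```

	Calling timed_store() once per traced file and logging the three
fields gives per-call delays of the kind summarized above. Note that
gettimeofday() is wall-clock time, so a clock adjustment during a run would
distort a sample.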


Question:
--------

	What kind of system activity can suddenly and significantly speed up or
slow down open/close calls? Since it happens once every 40-80 minutes and lasts
for several minutes (at a rate of 30-50 open calls/sec), it is probably something
major.

I am especially confused by close calls being almost zero-overhead in the
middle of a big experiment. Looks like some write-behind buffers suddenly
appear out of nowhere, get used (speeding up close calls), and then
disappear. Dives are also very disturbing as they hurt overall performance...

Any clues?

Thank you,

Alex.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message

From owner-freebsd-fs  Fri Dec 18 09:53:40 1998
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id JAA05687
          for freebsd-fs-outgoing; Fri, 18 Dec 1998 09:53:40 -0800 (PST)
          (envelope-from owner-freebsd-fs@FreeBSD.ORG)
Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id JAA05682
          for <freebsd-fs@freebsd.org>; Fri, 18 Dec 1998 09:53:38 -0800 (PST)
          (envelope-from ezk@shekel.mcl.cs.columbia.edu)
Received: from shekel.mcl.cs.columbia.edu (shekel.mcl.cs.columbia.edu [128.59.18.15])
	by cs.columbia.edu (8.9.1/8.9.1) with ESMTP id MAA15280
	for <freebsd-fs@freebsd.org>; Fri, 18 Dec 1998 12:53:28 -0500 (EST)
Received: (from ezk@localhost)
	by shekel.mcl.cs.columbia.edu (8.9.1/8.9.1) id MAA05461;
	Fri, 18 Dec 1998 12:53:27 -0500 (EST)
Date: Fri, 18 Dec 1998 12:53:27 -0500 (EST)
Message-Id: <199812181753.MAA05461@shekel.mcl.cs.columbia.edu>
X-Authentication-Warning: shekel.mcl.cs.columbia.edu: ezk set sender to ezk@shekel.mcl.cs.columbia.edu using -f
From: Erez Zadok <ezk@cs.columbia.edu>
To: freebsd-fs@FreeBSD.ORG
Subject: nullfs bugs
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Hello all.  As this message is my first on this list, it unfortunately has
to be long.  My apologies in advance.  Before I go into details, I'll give a
quick overview.

* Brief overview:

My research involves stackable file systems.  I've written several stackable
file systems for a few unix platforms (freebsd, linux, and solaris).  I
fixed nullfs in freebsd 3.0, but the fixes are only workarounds to more
serious bugs.  I'm seeking help from this list in finding the real bugs in
freebsd and solving them correctly, to eventually include in an official
freebsd distribution.


Now on to the details.

* Introduction

My name is Erez Zadok, and I'm a PhD student at Columbia University,
studying Comp. Sci.  You may have heard my name as the maintainer of
am-utils (aka amd.)  I've worked with file systems for 9 years now.  I've
worked with freebsd kernels for 3+ years, but have only recently joined
freebsd-{fs,announce}.

My research involves generating stackable file systems out of a higher level
description language.  One key component is a template file system I call
wrapfs (wrapper file system).  Wrapfs includes hooks users can use to modify
file data, names, and their attributes.  Wrapfs is similar to lofs/nullfs,
but it also copies data/pages/names between the upper and lower layers,
includes hooks for a code generator, and more.

I started writing Wrapfs in Solaris 2.x, based on their lofs.  Then I moved
on to Linux 2.0 using a reference implementation of an lofs someone had
written.  After that I ported wrapfs to freebsd 3.0 using nullfs as a
starting point, and finally ported wrapfs to Linux 2.1.  Once I had wrapfs
for each platform, I wrote actual file systems using it.  I wrote a simple
encryption f/s called rot13fs, and then a stronger one called cryptfs (using
Blowfish.)  I wrote a few other file systems based on wrapfs, all of
which are described in a few papers I've written and the sources I've
released (see below for URLs).

* nullfs for FreeBSD 3.0

When I started with nullfs on freebsd 3.0 (the May 98 snapshot) I found out
that it was not a complete file system.  Some VFS operations were left
unimplemented, most notably the MMAP ones.  I could mount nullfs, but trying
to do any MMAP operation (such as executing a binary) made the kernel
panic.

So I added the missing functionality to a point where you could do all
operations.  As a test I usually configure and build am-utils inside the new
f/s (those who've built am-utils know it has a rather lengthy configure and
build process, which makes it a good file system exerciser.)

** Bugs in Nullfs

I fixed two major bugs in nullfs:

(1) Asynchronous writes:

The vanilla nullfs has a serious bug where if you write a large file (3MB or
more) through it, several pages of the file are written as zeros to the
lower f/s.  I tried various machines running freebsd 3.0, and different
disks and CPU speeds.  In all cases I got the same data corruption.
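
A reproducer along the following lines should be enough to check for the
corruption.  It is only a sketch under assumptions: write_pattern(),
verify_pattern(), the 8K chunk size, and the byte pattern are illustrative
choices, not taken from the original test.  Every pattern byte is nonzero,
so a page written as zeros shows up as a mismatch.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 8192	/* I/O size per call; an arbitrary choice */

/* Nonzero byte expected at file offset `off`, so zeroed pages stand out. */
static unsigned char
pattern_byte(off_t off)
{
	return (unsigned char)(off % 251 + 1);
}

/* Write `total` pattern bytes to `path`.  Returns 0 on success. */
static int
write_pattern(const char *path, off_t total)
{
	unsigned char buf[CHUNK];
	off_t off = 0;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	while (off < total) {
		size_t n = (total - off < CHUNK) ? (size_t)(total - off) : CHUNK;

		for (size_t i = 0; i < n; i++)
			buf[i] = pattern_byte(off + (off_t)i);
		if (write(fd, buf, n) != (ssize_t)n) {
			close(fd);
			return -1;
		}
		off += (off_t)n;
	}
	return close(fd);
}

/*
 * Read `path` back.  Returns the offset of the first byte that differs
 * from the pattern, `total` if every byte matches, or -1 on an I/O
 * error or a short file.
 */
static off_t
verify_pattern(const char *path, off_t total)
{
	unsigned char buf[CHUNK];
	off_t off = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while (off < total) {
		size_t want = (total - off < CHUNK) ? (size_t)(total - off) : CHUNK;
		ssize_t n = read(fd, buf, want);

		if (n <= 0) {
			close(fd);
			return -1;
		}
		for (ssize_t i = 0; i < n; i++) {
			if (buf[i] != pattern_byte(off + i)) {
				close(fd);
				return off + i;
			}
		}
		off += n;
	}
	close(fd);
	return total;
}
```

Running write_pattern() on a path inside the nullfs mount with a total of
3MB or more, and then verify_pattern() on the same file, would report the
offset of the first bad byte, or the total size if the file is intact.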

The best "fix" I could find was to force the underlying write to happen
synchronously:

	error = VOP_WRITE(lower_vp, &temp_uio, (ioflag | IO_SYNC), cr);

That solved the problem, but obviously it hurts write performance since now
all writes through nullfs have to be done synchronously, even for writing
one byte.

My best guess for the reason for this bug is that there might be a race
condition b/t the file system and the buffer cache or even the MMU, and that
some sort of locking/synchronization is needed to avoid the race.

I'm familiar with the f/s code in freebsd, and have become very familiar
with the vfs/fs code in linux and solaris --- enough to know that this
freebsd bug is likely not the fault of my code.  Alas, there are vast areas
of the rest of the kernel I'm not familiar with.  I want to fix the bug
correctly if possible, and allow nullfs to write asynchronously, but I'm not
sure where to look.

If anyone has any ideas how to go about finding and fixing the bug, I'll be
happy to work w/ them to fix the problem and eventually submit it for
inclusion in a future freebsd release.


(2) Getpages/Putpages:

The second bug is even stranger.  Initially, I had the implementation of
getpages and putpages call the same VOP on lowervp, with newly allocated
pages.  But then under heavy loads I got obscure problems that seem to come
from deep inside UFS.  It sometimes will return from ffs_getpages() (in
ufs_readwrite.c) with an invalid page, or one that's marked as deadc0de.  I
tried to make sense of that ufs/ffs code, and I think that somewhere either
nullfs or the higher level vfs aren't locking or synchronizing something
they should be.

I "fixed" the problem with getpages, by implementing it using read(), so now
it works reliably, but with a suboptimal data access interface.

Having implemented getpages() using read() forced me to implement
putpages() using write(), b/c otherwise the getpages and putpages didn't
seem to work well together (possibly b/c of interaction b/t [buffer] caches,
MMU, etc.)  But recall that in order to solve bug #1, I made write()
synchronous.  So now all putpages() have become synchronous as well.

Like I said before, these fixes of mine are but workarounds.  Some might
consider them hacks.  But they do make nullfs fully functional at least.  If
anyone has any idea how to fix this MMAP related bug, please let me know.

Frankly, I have a feeling that the two bugs I'm reporting here may be
related, and that fixing bug #1 would be easier and may impact the solution
to bug #2.

* URLs

Here's some info for those who want to read more about the subject.

Stackable f/s software for freebsd, solaris, and linux:

	http://www.cs.columbia.edu/~ezk/research/software/

Papers I've written about some of the f/s in the s/w page:

	http://www.cs.columbia.edu/~ezk/research/wip.html

Thanks,
Erez Zadok.
---
Columbia University Department of Computer Science.
EMail: ezk@cs.columbia.edu           Web: http://www.cs.columbia.edu/~ezk


From owner-freebsd-fs  Fri Dec 18 13:42:28 1998
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id NAA04953
          for freebsd-fs-outgoing; Fri, 18 Dec 1998 13:42:28 -0800 (PST)
          (envelope-from owner-freebsd-fs@FreeBSD.ORG)
Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id NAA04948
          for <freebsd-fs@FreeBSD.ORG>; Fri, 18 Dec 1998 13:42:27 -0800 (PST)
          (envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost)
	by smtp04.primenet.com (8.8.8/8.8.8) id OAA27859;
	Fri, 18 Dec 1998 14:42:14 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209)
 via SMTP by smtp04.primenet.com, id smtpd027680; Fri Dec 18 14:42:06 1998
Received: (from tlambert@localhost)
	by usr09.primenet.com (8.8.5/8.8.5) id OAA11441;
	Fri, 18 Dec 1998 14:41:55 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199812182141.OAA11441@usr09.primenet.com>
Subject: Re: nullfs bugs
To: ezk@cs.columbia.edu (Erez Zadok)
Date: Fri, 18 Dec 1998 21:41:55 +0000 (GMT)
Cc: freebsd-fs@FreeBSD.ORG
In-Reply-To: <199812181753.MAA05461@shekel.mcl.cs.columbia.edu> from "Erez Zadok" at Dec 18, 98 12:53:27 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> * nullfs for FreeBSD 3.0
> 
> When I started with nullfs on freebsd 3.0 (the May 98 snapshot) I found out
> that it was not a complete file system.  Some VFS operations were left
> unimplemented, most notably the MMAP ones.  I could mount nullfs, but trying
> to do any MMAP operation (such as executing a binary) made the kernel
> panic.


Right.  Here's the scoop.

Right now in FreeBSD, a vnode is treated as a backing object, and a
backing object is a mapping.

This is a consequence of a unified VM and buffer cache.


When you have a vnode stacked on another vnode, you have an aliasing
problem to resolve: which vnode has the correct page information
hung off of it?


> ** Bugs in Nullfs

[ ... in reverse order ... ]

> (2) Getpages/Putpages:
> 
> The second bug is even stranger.  Initially, I had the implementation of
> getpages and putpages call the same VOP on lowervp, with newly allocated
> pages.  But then under heavy loads I got obscure problems that seem to come
> from deep inside UFS.  It sometimes will return from ffs_getpages() (in
> ufs_readwrite.c) with an invalid page, or one that's marked as deadc0de.  I
> tried to make sense of that ufs/ffs code, and I think that somewhere either
> nullfs or the higher level vfs aren't locking or synchronizing something
> they should be.

Right.  This is confusion about the backing object, per the above.


> I "fixed" the problem with getpages, by implementing it using read(), so now
> it works reliably, but with a suboptimal data access interface.
> 
> Having implemented getpages() using read() forced me to implement
> putpages() using write(), b/c otherwise the getpages and putpages didn't
> seem to work well together (possibly b/c of interaction b/t [buffer] caches,
> MMU, etc.)  But recall that in order to solve bug #1, I made write()
> synchronous.  So now all putpages() have become synchronous as well.
> 
> Like I said before, these fixes of mine are but workarounds.  Some might
> consider them hacks.  But they do make nullfs fully functional at least.  If
> anyone has any idea how to fix this MMAP related bug, please let me know.

These fixes will actually only work for a stack that is exactly one
layer deep.  This is because the lower_vp is the object off of which
the pages are actually hung.

If you were to use this on a nullfs on top of a nullfs, then you
would probably see some errors (unless you implemented read in
terms of VOP_GETPAGES).

The reason for this is that your read is creating a copy of the data
that is hung off the lower_vp, and then returning it to a user buffer.

The problem here is that the top layer is going to issue a similar
read to the middle layer, and it's going to fail because there is
no backing object in the middle layer (only in the bottom layer).

This can be brute-forced to work (I believe Tor Egge is the one who
did this at one time?) by instancing a backing object in the intermediate
layers.

The reason this works with the read/write and not with the getpages
and putpages is that you establish a copy instead of an alias.

Using copies like this introduces cache coherency problems similar
to those in a non-unified VM and buffer cache, and given the unification
in FreeBSD, FreeBSD is pretty much totally unprepared to deal with
maintaining coherency at this level, especially if a namespace is
exposed to the user both above and below a stacking layer (e.g.,
with an ACL or cryptographic FS).


The general solution to this, which has been discussed by John
Heidemann, John Dyson, Michael Hancock, Eivind Ecklund, Kirk McKusick,
and myself at various times in the past is to get rid of the aliases.


The only way to effectively do that is to provide a mechanism for
an upper layer to ask for the vp of the backing object that's
actually backing the vm, instead of the top level object.  The
main one that has been discussed is called VOP_GETFINALVP, or, more
correctly, VOP_GETBACKINGVP.
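
As a rough user-space model of that idea (this is not kernel code;
struct toy_vnode, struct toy_object, and the two *_getbackingvp functions
are hypothetical names, and a real VOP would look different): each stacking
layer forwards the request to the vnode below it, and only the bottom
layer, which owns the pages, answers with itself.

```c
#include <stddef.h>

/* Hypothetical stand-in for the VM object that holds a file's pages. */
struct toy_object {
	int npages;
};

struct toy_vnode;
typedef struct toy_vnode *(*toy_getbackingvp_t)(struct toy_vnode *);

/*
 * Hypothetical stand-in for a vnode.  A stacking layer points at the
 * vnode below it; only the bottom vnode owns a backing object.
 */
struct toy_vnode {
	toy_getbackingvp_t  getbackingvp;	/* per-layer operation */
	struct toy_vnode   *lower;		/* NULL at the bottom */
	struct toy_object  *object;		/* NULL in stacking layers */
};

/* Bottom layer (e.g. a local media file system): it backs itself. */
static struct toy_vnode *
leaf_getbackingvp(struct toy_vnode *vp)
{
	return vp;
}

/* Stacking layer: ask the layer below, never hand out an alias. */
static struct toy_vnode *
stack_getbackingvp(struct toy_vnode *vp)
{
	return vp->lower->getbackingvp(vp->lower);
}
```

With a bottom vnode using leaf_getbackingvp and two layers above it using
stack_getbackingvp, calling getbackingvp on the top vnode yields the bottom
vnode, so every layer ends up operating on the same single object.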


This can actually be implemented at low cost, since the only layer
that really cares about doing the call is a layer with a VFS interface
on both the top and the bottom.  So it doesn't affect NFS client
code (a VFS provider), the FFS code (a VFS provider, like all local
media file systems), the NFS server code (a VFS consumer), or the
system call layer (another VFS consumer).

So basically, only the stacking layers take this hit, and then only
in the case that they are doing data translation (crypto/compression)
or object proxying.


This is probably the best way to resolve this problem, since it hides
the details of the VM implementation from the stacking layers.  Even
if you were to use a non-unified VM and buffer cache (e.g. SVR4),
you would want to isolate the dependency on VM and buffer cache
interaction so as to reduce the amount of system dependency in the
code.  So this is a win either way.


> (1) Asynchronous writes:
> 
> The vanilla nullfs has a serious bug where if you write a large file (3MB or
> more) through it, several pages of the file are written as zeros to the
> lower f/s.  I tried various machines running freebsd 3.0, and different
> disks and CPU speeds.  In all cases I got the same data corruption.

Yes.  This is an alias problem, where the coherence between the upper
and lower level objects is not being maintained.  This happens because
there is no read-before-write, as there would be with a normal FS block
on FS blocksize boundaries.

To confirm this, verify the size and offset of the corrupted extents
(this should be a pretty trivial exercise).
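
A minimal sketch of such a check, assuming the file as written contains no
legitimate long runs of zeros.  The names find_zero_extents() and
extent_is_aligned() are illustrative, not existing tools; the caller is
expected to have read the suspect file (or a window of it) into a buffer.

```c
#include <stddef.h>

/* One run of zero bytes found in a buffer. */
struct zero_extent {
	size_t offset;	/* index of the first zero byte */
	size_t length;	/* number of consecutive zero bytes */
};

/*
 * Record up to `max` runs of at least `min_len` zero bytes found in
 * buf[0..len).  Returns the number of runs recorded in `out`.
 */
static size_t
find_zero_extents(const unsigned char *buf, size_t len, size_t min_len,
    struct zero_extent *out, size_t max)
{
	size_t found = 0, i = 0;

	while (i < len && found < max) {
		size_t start;

		if (buf[i] != 0) {
			i++;
			continue;
		}
		start = i;
		while (i < len && buf[i] == 0)
			i++;
		if (i - start >= min_len) {
			out[found].offset = start;
			out[found].length = i - start;
			found++;
		}
	}
	return found;
}

/* True if the extent starts and ends on multiples of `blksize`. */
static int
extent_is_aligned(const struct zero_extent *e, size_t blksize)
{
	return e->offset % blksize == 0 && e->length % blksize == 0;
}
```

If the extents reported this way line up with the page size or the FS
block size, that would be consistent with the explanation above; extents
at arbitrary offsets would point somewhere else.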


> The best "fix" I could find was to force the underlying write to happen
> synchronously:
> 
> 	error = VOP_WRITE(lower_vp, &temp_uio, (ioflag | IO_SYNC), cr);
> 
> That solved the problem, but obviously it hurts write performance since now
> all writes through nullfs have to be done synchronously, even for writing
> one byte.


Yeah.  This is an explicit synchronization, which happens to ensure
cache coherency between the two backing objects, when there should
only be one backing object.


> My best guess for the reason for this bug is that there might be a race
> condition b/t the file system and the buffer cache or even the MMU, and that
> some sort of locking/synchronization is needed to avoid the race.

Again, the answer is to avoid everything by explicit coherency, and
the way to do it is to eliminate the aliases, and, in this particular
case, the cached copies of partial data.


> I'm familiar with the f/s code in freebsd, and have become very familiar
> with the vfs/fs code in linux and solaris --- enough to know that this
> freebsd bug is likely not the fault of my code.  Alas, there are vast areas
> of the rest of the kernel I'm not familiar with.  I want to fix the bug
> correctly if possible, and allow nullfs to write asynchronously, but I'm not
> sure where to look.

Well, then you have to know that the FreeBSD code is a hell of a
lot more flexible and useful, if done right.  8-).

These issues are pretty well understood, but there needs to be an
architectural pass over the code with a view toward stacking.  This
has actually been my own pet hobby horse for at least a number of
(3) years now.  It's to the point that enough people understand the
issues and the problems that this is becoming a political possibility.


> Frankly, I have a feeling that the two bugs I'm reporting here may be
> related, and that fixing bug #1 would be easier and may impact the solution
> to bug #2.

Actually, #2 would be easiest, and would result in #1 being fixed as
well, by eliminating the potential coherency race that comes from
using the fault handler instead of an explicit copy (read).


I'm going to be intentionally incommunicado for a while, as I'm going
on vacation, but I'll probably break down and read my email once
or twice, so if you have something needing immediate clarification,
you can send me email, but I may not respond before the first of the
year.

Other people to contact who appear to be actively interested in
solving these issues are Eivind Ecklund and Michael Hancock, so
they may be good bets as well.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


From owner-freebsd-fs  Fri Dec 18 14:19:06 1998
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id OAA09242
          for freebsd-fs-outgoing; Fri, 18 Dec 1998 14:19:06 -0800 (PST)
          (envelope-from owner-freebsd-fs@FreeBSD.ORG)
Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id OAA09237
          for <freebsd-fs@FreeBSD.ORG>; Fri, 18 Dec 1998 14:19:03 -0800 (PST)
          (envelope-from ezk@shekel.mcl.cs.columbia.edu)
Received: from shekel.mcl.cs.columbia.edu (shekel.mcl.cs.columbia.edu [128.59.18.15])
	by cs.columbia.edu (8.9.1/8.9.1) with ESMTP id RAA01964;
	Fri, 18 Dec 1998 17:18:52 -0500 (EST)
Received: (from ezk@localhost)
	by shekel.mcl.cs.columbia.edu (8.9.1/8.9.1) id RAA12135;
	Fri, 18 Dec 1998 17:18:52 -0500 (EST)
Date: Fri, 18 Dec 1998 17:18:52 -0500 (EST)
Message-Id: <199812182218.RAA12135@shekel.mcl.cs.columbia.edu>
X-Authentication-Warning: shekel.mcl.cs.columbia.edu: ezk set sender to ezk@shekel.mcl.cs.columbia.edu using -f
From: Erez Zadok <ezk@cs.columbia.edu>
To: Terry Lambert <tlambert@primenet.com>
Cc: ezk@cs.columbia.edu (Erez Zadok), freebsd-fs@FreeBSD.ORG
Subject: Re: nullfs bugs 
In-reply-to: Your message of "Fri, 18 Dec 1998 21:41:55 GMT."
             <199812182141.OAA11441@usr09.primenet.com> 
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Thanks, there's a lot of info in your message that I have to digest.  I'll
probably have to re-read it several times while keeping freebsd sources
close at hand.  When I'm done you'll probably be back from vacation... :-)
I'll try to comment on the rest of the message later.

I agree that we should have a mini-design pass before seriously implementing
anything of the sort.  But I may still take a stab at it, at least to see
how complicated the work is and outline potential trouble spots.

In all of the ports I've done, I tried very hard to avoid changing the rest
of the OS, esp. in a way that would require making changes to other file
systems.  I was able to have a wrapper file system (and a crypto f/s) on
freebsd and solaris w/o changing them, and on linux only had one small
change that didn't affect anything else.

Being new here, let me ask this beginner question.  How receptive are the
freebsd developers to accepting such fixes, given that the changes won't be
trivial?  In particular, is there a chance they'd be incorporated into 3.0
for a near-future release?  I'm asking this b/c I don't wish to have to
maintain a different set of kernel sources for too long.

Erez.


From owner-freebsd-fs  Fri Dec 18 20:20:10 1998
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id UAA18885
          for freebsd-fs-outgoing; Fri, 18 Dec 1998 20:20:10 -0800 (PST)
          (envelope-from owner-freebsd-fs@FreeBSD.ORG)
Received: from gatekeeper.tsc.tdk.com (gatekeeper.tsc.tdk.com [207.113.159.21])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id UAA18880
          for <freebsd-fs@FreeBSD.ORG>; Fri, 18 Dec 1998 20:20:07 -0800 (PST)
          (envelope-from gdonl@tsc.tdk.com)
Received: from sunrise.gv.tsc.tdk.com (root@sunrise.gv.tsc.tdk.com [192.168.241.191])
	by gatekeeper.tsc.tdk.com (8.8.8/8.8.8) with ESMTP id UAA25520;
	Fri, 18 Dec 1998 20:19:45 -0800 (PST)
	(envelope-from gdonl@tsc.tdk.com)
Received: from salsa.gv.tsc.tdk.com (salsa.gv.tsc.tdk.com [192.168.241.194])
	by sunrise.gv.tsc.tdk.com (8.8.5/8.8.5) with ESMTP id UAA22374;
	Fri, 18 Dec 1998 20:19:44 -0800 (PST)
Received: (from gdonl@localhost)
	by salsa.gv.tsc.tdk.com (8.8.5/8.8.5) id UAA11408;
	Fri, 18 Dec 1998 20:19:42 -0800 (PST)
From: Don Lewis <Don.Lewis@tsc.tdk.com>
Message-Id: <199812190419.UAA11408@salsa.gv.tsc.tdk.com>
Date: Fri, 18 Dec 1998 20:19:42 -0800
In-Reply-To: Terry Lambert <tlambert@primenet.com>
       "Re: nullfs bugs" (Dec 18,  9:41pm)
X-Mailer: Mail User's Shell (7.2.6 alpha(3) 7/19/95)
To: Terry Lambert <tlambert@primenet.com>, ezk@cs.columbia.edu (Erez Zadok)
Subject: Re: nullfs bugs
Cc: freebsd-fs@FreeBSD.ORG
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Dec 18,  9:41pm, Terry Lambert wrote:
} Subject: Re: nullfs bugs

} Right now in FreeBSD, a vnode is treated as a backing object, and a
} backing object is a mapping.
} 
} This is a consequence of a unified VM and buffer cache.

} > I "fixed" the problem with getpages, by implementing it using read(), so now
} > it works reliably, but with a suboptimal data access interface.
} > 
} > Having implemented getpages() using read() forced me to implement
} > putpages() using write(), b/c otherwise the getpages and putpages didn't
} > seem to work well together (possibly b/c of interaction b/t [buffer] caches,
} > MMU, etc.)  But recall that in order to solve bug #1, I made write()
} > synchronous.  So now all putpages() have become synchronous as well.
} > 
} > Like I said before, these fixes of mine are but workarounds.  Some might
} > consider them hacks.  But they do make nullfs fully functional at least.  If
} > anyone has any idea how to fix this MMAP related bug, please let me know.
} 
} These fixes will actually only work for a stack that is exactly one
} layer deep.  This is because the lower_vp is the object off of which
} the pages are actually hung.
} 
} If you were to use this on a nullfs on top of a nullfs, then you
} would probably see some errors (unless you implemented read in
} terms of VOP_GETPAGES).
} 
} The reason for this is that your read is creating a copy of the data
} that is hung off the lower_vp, and then returning it to a user buffer.

I did something similar when I was hacking nullfs to somewhat work
in a private version of 2.1.x.  It worked to some extent, but had
cache coherence problems.

} The problem here is that the top layer is going to issue a similar
} read to the middle layer, and it's going to fail because there is
} no backing object in the middle layer (only in the bottom layer).
} 
} This can be brute-forced to work (I believe Tor Egge is the one who
} did this at one time?) by instancing a backing object in the intermediate
} layers.

Eivind has some patches that work something like this.


} The general solution to this, which has been discussed by John
} Heidemann, John Dyson, Michael Hancock, Eivind Ecklund, Kirk McKusick,
} and myself at various times in the past is to get rid of the aliases.
} 
} 
} The only way to effectively do that is to provide a mechanism for
} an upper layer to ask for the vp of the backing object that's
} actually backing the vm, instead of the top level object.  The
} main one that has been discussed is called VOP_GETFINALVP, or, more
} correctly, VOP_GETBACKINGVP.

I implemented one of these a while back (though I don't even recall
which name I used).  The problem I ran into was that there are
a number of references to vp->v_object scattered about.  Eivind's
patches fix those by turning them into a VOP_ (I would have used
a function call that called VOP_GETwhateverVP).

I had some time to read a little more of Heidemann's paper while
I was travelling a few weeks ago, and it appears that Heidemann
took a somewhat different approach in his SunOS implementation.
It looks like he also passes the backing vp into the VOP calls
that need to access the backing object.  See Appendix B of his paper
<ftp://ftp.cs.ucla.edu/tech-report/95-reports/950032.2.ps.Z>.
I haven't had time to look at how this would fit into the FreeBSD
implementation.

} I'm going to be intentionally incommunicado for a while, as I'm going
} on vacation, but I'll probably break down and read my email once
} or twice, so if you have something needing immediate clarification,
} you can send me email, but I may not respond before the first of the
} year.
} 
} Other people to contact who appear to be actively interested in
} solving these issues are Eivind Ecklund and Michael Hancock, so
} they may be good bets as well.

You can add my name to the list as well.  I need at least a somewhat
working nullfs for certain applications.  I'll be away from my email
until the 4th, and then it will take me a few days to dig through the
backlog.
