Date:      Fri, 18 Dec 1998 21:41:55 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        ezk@cs.columbia.edu (Erez Zadok)
Cc:        freebsd-fs@FreeBSD.ORG
Subject:   Re: nullfs bugs
Message-ID:  <199812182141.OAA11441@usr09.primenet.com>
In-Reply-To: <199812181753.MAA05461@shekel.mcl.cs.columbia.edu> from "Erez Zadok" at Dec 18, 98 12:53:27 pm

> * nullfs for FreeBSD 3.0
> 
> When I started with nullfs on freebsd 3.0 (the May 98 snapshot) I found out
> that it was not a complete file system.  Some VFS operations were left
> unimplemented, most notably the MMAP ones.  I could mount nullfs, but trying
> to do any MMAP operation (such as executing a binary), and the kernel
> panics.


Right.  Here's the scoop.

Right now in FreeBSD, a vnode is treated as a backing object, and a
backing object is a mapping.

This is a consequence of a unified VM and buffer cache.


When you have a vnode stacked on another vnode, you have an aliasing
problem to resolve: which vnode has the correct page information
hung off of it?


> ** Bugs in Nullfs

[ ... in reverse order ... ]

> (2) Getpages/Putpages:
> 
> The second bug is even stranger.  Initially, I had the implementation of
> getpages and putpages call the same VOP on lowervp, with newly allocated
> pages.  But then under heavy loads I got obscure problems that seem to come
> from deep inside UFS.  It sometimes will return from ffs_getpages() (in
> ufs_readwrite.c) with an invalid page, or one that's marked as deadc0de.  I
> tried to make sense of that ufs/ffs code, and I think that somewhere either
> nullfs or the higher level vfs aren't locking or synchronizing something
> they should be.

Right.  This is confusion about the backing object, per the above.


> I "fixed" the problem with getpages, by implementing it using read(), so now
> it works reliably, but with a suboptimal data access interface.
> 
> Having implemented getpages() using read() forced me to implement
> writepages() using write(), b/c otherwise the getpages and putpages didn't
> seem to work well together (possibly b/c of interaction b/t [buffer] caches,
> MMU, etc.)  But recall that in order to solve bug #1, I made write()
> synchronous.  So now all putpages() have become synchronous as well.
> 
> Like I said before, these fixes of mine are but workarounds.  Some might
> consider them hacks.  But they do make nullfs fully functional at least.  If
> anyone has any idea how to fix this MMAP related bug, please let me know.

These fixes will actually only work for a stack that is exactly one
layer deep.  This is because the lower_vp is the object off of which
the pages are actually hung.

If you were to use this on a nullfs on top of a nullfs, then you
would probably see some errors (unless you implemented read in
terms of VOP_GETPAGES).

The reason for this is that your read is creating a copy of the data
that is hung off the lower_vp, and then returning it to a user buffer.

The problem here is that the top layer is going to issue a similar
read to the middle layer, and it's going to fail because there is
no backing object in the middle layer (only in the bottom layer).

This can be brute-forced to work (I believe Tor Egge is the one who
did this at one time?) by instancing a backing object in the intermediate
layers.

The reason this works with the read/write and not with the getpages
and putpages is that you establish a copy instead of an alias.

Using copies like this introduces cache coherency problems similar
to those in a non-unified VM and buffer cache, and given the unification
in FreeBSD, FreeBSD is pretty much totally unprepared to deal with
maintaining coherency at this level, especially if a namespace is
exposed to the user both above and below a stacking layer (e.g.,
with an ACL or cryptographic FS).
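
The copy-versus-alias distinction can be illustrated with a toy
user-space model.  This is only a sketch: toy_object, toy_vnode and the
toy_* functions are hypothetical names invented for illustration and
are not the kernel's vnode or VM object structures.  An aliasing layer
shares the lower layer's pages, while a copying layer (the read()-based
workaround) snapshots them and goes stale after a later write below it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 16

/* Hypothetical toy "backing object": a single page of file data. */
struct toy_object {
	char page[PAGE_SZ];
};

/* Hypothetical toy vnode: records which object its pages hang off. */
struct toy_vnode {
	struct toy_object *obj;   /* pages seen through this vnode */
	struct toy_vnode  *lower; /* NULL for the bottom layer */
};

/* Alias: the upper layer shares the lower layer's object. */
static void
toy_stack_alias(struct toy_vnode *upper, struct toy_vnode *lower)
{
	upper->lower = lower;
	upper->obj = lower->obj;
}

/* Copy: the upper layer gets a private snapshot, as a read() would. */
static void
toy_stack_copy(struct toy_vnode *upper, struct toy_vnode *lower,
    struct toy_object *priv)
{
	memcpy(priv->page, lower->obj->page, PAGE_SZ);
	upper->lower = lower;
	upper->obj = priv;
}

/* Write through a vnode: only that vnode's object is modified. */
static void
toy_write(struct toy_vnode *vp, const char *data)
{
	size_t n = strlen(data);

	assert(n < PAGE_SZ);
	memcpy(vp->obj->page, data, n + 1);
}
```

After a toy_write() on the lower vnode, the aliasing upper vnode sees
the new data and the copying one still holds the old data, which is
the coherency problem in miniature.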


The general solution to this, which has been discussed by John
Heidemann, John Dyson, Michael Hancock, Eivind Ecklund, Kirk McKusick,
and myself at various times in the past, is to get rid of the aliases.


The only way to effectively do that is to provide a mechanism for
an upper layer to ask for the vp of the backing object that's
actually backing the vm, instead of the top level object.  The
main one that has been discussed is called VOP_GETFINALVP, or, more
correctly, VOP_GETBACKINGVP.
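
A rough user-space sketch of how such an operation might dispatch is
given here.  Nothing in it is an existing FreeBSD interface: sk_vnode,
sk_leaf_getbackingvp and sk_null_getbackingvp are hypothetical names,
and the real operation would go through the normal VOP vector.  The
assumption is that a bottom (provider) layer answers with itself and a
stacking layer simply forwards the question downward.

```c
#include <assert.h>
#include <stddef.h>

struct sk_vnode;
typedef struct sk_vnode *(*sk_getbackingvp_t)(struct sk_vnode *);

/* Hypothetical toy vnode with a single per-layer operation. */
struct sk_vnode {
	sk_getbackingvp_t getbackingvp; /* per-layer implementation */
	struct sk_vnode  *lower;        /* NULL for the bottom layer */
	int               has_object;   /* nonzero if pages hang here */
};

/* Provider layer (e.g. a local media FS): it backs its own pages. */
static struct sk_vnode *
sk_leaf_getbackingvp(struct sk_vnode *vp)
{
	return vp;
}

/* Stacking layer: forward the question to the layer below. */
static struct sk_vnode *
sk_null_getbackingvp(struct sk_vnode *vp)
{
	assert(vp->lower != NULL);
	return vp->lower->getbackingvp(vp->lower);
}
```

With a three-layer stack built from these, asking the top vnode yields
the bottom vnode regardless of how many stacking layers sit between
them, so the depth of the stack stops mattering to the caller.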


This can actually be implemented at low cost, since the only layer
that really cares about doing the call is a layer with a VFS interface
on both the top and the bottom.  So it doesn't affect NFS client
code (a VFS provider), the FFS code (a VFS provider, like all local
media file systems), the NFS server code (a VFS consumer), or the
system call layer (another VFS consumer).

So basically, only the stacking layers take this hit, and then only
in the case that they are doing data translation (crypto/compression)
or object proxying.


This is probably the best way to resolve this problem, since it hides
the details of the VM implementation from the stacking layers.  Even
if you were to use a non-unified VM and buffer cache (e.g. SVR4),
you would want to isolate the dependency on VM and buffer cache
interaction so as to reduce the amount of system dependency in the
code.  So this is a win either way.


> (1) Asynchronous writes:
> 
> The vanilla nullfs has a serious bug where if you write a large file (3MB or
> more) through it, several pages of the file are written as zeros to the
> lower f/s.  I tried various machines running freebsd 3.0, and different
> disks and CPU speeds.  In all cases I got the same data corruption.

Yes.  This is an alias problem, where the coherence between the upper
and lower level objects is not being maintained.  This happens because
there is no read-before-write, as there would be with a normal FS block
on FS blocksize boundaries.
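
The effect of a missing read-before-write can be shown with a toy
model.  This is a sketch of the hypothesis only, not of the actual
nullfs or FFS code path: toy_block_write and BLK are hypothetical, and
a small char array stands in for one FS block.  A partial write staged
in a fresh buffer overwrites the rest of the block with zeros unless
the existing contents are read in first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLK 8	/* toy block size, standing in for the FS block size */

/*
 * Write len bytes of src at offset off within one block of backing
 * store.  The data is staged in a zeroed buffer; if read_first is
 * nonzero the existing block contents are read into the buffer before
 * the new bytes are merged in.  The whole buffer is then flushed.
 */
static void
toy_block_write(char *backing, const char *src, size_t off, size_t len,
    int read_first)
{
	char buf[BLK];

	assert(off <= BLK && len <= BLK - off);
	memset(buf, 0, BLK);
	if (read_first)
		memcpy(buf, backing, BLK);
	memcpy(buf + off, src, len);
	memcpy(backing, buf, BLK);
}
```

With read_first set, the bytes around the written range survive.
Without it they come back as zeros, which is the same shape as the
corruption described in the report.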

To confirm this, verify the size and offset of the corrupted extents
(this should be a pretty trivial exercise).
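
One possible way to do that check, sketched in user space: write a
test file known to contain no zero bytes, read it back into memory,
and scan for zero runs.  The names zero_extent, find_zero_extent and
extent_is_aligned are hypothetical helpers, and the unit to test
against (page size or FS block size) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one run of zero bytes in a file image. */
struct zero_extent {
	size_t off;	/* byte offset of the first zero */
	size_t len;	/* length of the run in bytes */
};

/*
 * Find the first run of at least min_len zero bytes that begins at or
 * after start.  Returns 1 and fills *out if one is found, else 0.
 */
static int
find_zero_extent(const unsigned char *buf, size_t size, size_t start,
    size_t min_len, struct zero_extent *out)
{
	size_t i = start;
	size_t run;

	assert(min_len > 0);
	while (i < size) {
		if (buf[i] != 0) {
			i++;
			continue;
		}
		run = i;
		while (i < size && buf[i] == 0)
			i++;
		if (i - run >= min_len) {
			out->off = run;
			out->len = i - run;
			return 1;
		}
	}
	return 0;
}

/* True if both offset and length are multiples of unit. */
static int
extent_is_aligned(const struct zero_extent *e, size_t unit)
{
	return unit != 0 && e->off % unit == 0 && e->len % unit == 0;
}
```

If every extent found this way is aligned to the page size rather than
the FS block size (or the reverse), that points at which layer is
losing the data.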


> The best "fix" I could find was to force the underlying write to happen
> synchronously:
> 
> 	error = VOP_WRITE(lower_vp, &temp_uio, (ioflag | IO_SYNC), cr);
> 
> That solved the problem, but obviously it hurts write performance since now
> all writes through nullfs have to be done synchronously, even for writing
> one byte.


Yeah.  This is an explicit synchronization, which happens to ensure
cache coherency between the two backing objects, when there should
only be one backing object.


> My best guess for the reason for this bug is that there might be a race
> condition b/t the file system and the buffer cache or even the MMU, and that
> some sort of locking/synchronization is needed to avoid the race.

Again, the answer is to avoid the whole problem by making coherency
explicit, and the way to do it is to eliminate the aliases, and, in
this particular case, the cached copies of partial data.


> I'm familiar with the f/s code in freebsd, and have become very familiar
> with the vfs/fs code in linux and solaris --- enough to know that this
> freebsd bug is likely not the fault of my code.  Alas, there are vast areas
> of the rest of the kernel I'm not familiar with.  I want to fix the bug
> correctly if possible, and allow nullfs to write asynchronously, but I'm not
> sure where to look at.

Well, then you have to know that the FreeBSD code is a hell of a
lot more flexible and useful, if done right.  8-).

These issues are pretty well understood, but there needs to be an
architectural pass over the code with a view toward stacking.  This
has actually been my own pet hobby horse for at least a number of
(3) years now.  It's to the point that enough people understand the
issues and the problems that this is becoming a political possibility.


> Frankly, I have a feeling that the two bugs I'm reporting here may be
> related, and that fixing bug #1 would be easier and may impact the solution
> to bug #2.

Actually, #2 would be easiest, and would result in #1 being fixed as
well, by eliminating the potential coherency race that comes from
using the fault handler instead of an explicit copy (read).


I'm going to be intentionally incommunicado for a while, as I'm going
on vacation, but I'll probably break down and read my email once
or twice, so if you have something needing immediate clarification,
you can send me email, but I may not respond before the first of the
year.

Other people to contact who appear to be actively interested in
solving these issues are Eivind Ecklund and Michael Hancock, so
they may be good bets as well.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


