Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 7 Feb 1998 04:29:33 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        kato@migmatite.eps.nagoya-u.ac.jp (KATO Takenori)
Cc:        tlambert@primenet.com, current@FreeBSD.ORG
Subject:   Re: unionfs clobbers a file
Message-ID:  <199802070429.VAA25084@usr05.primenet.com>
In-Reply-To: <19980207110757B.kato@gneiss.eps.nagoya-u.ac.jp> from "KATO Takenori" at Feb 7, 98 11:07:57 am

next in thread | previous in thread | raw e-mail | index | archive | help
> > The unionfs may still fail because of VOP_LOCK, depending on how
> > it is implemented this week.  If it's still using the lockmgr code,
> > it will definitely fail, because that code projects a three dimensional
> > geodesic into a two dimensional space.  I can explain how to fix
> > this, if you are interested.
> 
> Please explain it.
> 
> I think lock/unlock system in unionfs is somethink like ropewalking
> and current implemetation is a result of compromise.


OK, this delves into the "What if I already had an implementation
that used VOP_FINALVP to avoid alias problems" case.


To understand it, you first need to know why advisory locks have to
be veto-based to work for an agregating FS (one that presents the
buffers of two vnodes as a an agregate that looks like one set of
buffers on one vnode).  This is a simpler analogy to the VOP_LOCK
in unionfs case because it demonstrates the dimensionality and
reversibility problem without needing me to draw a two color
perspective drawing in ASCII art.  8-).

This is because, for the purposes of VOP_LOCK, any FS which implements
a multiplexer will agregate the vnode locks under some circumstances;
the union FS is one such FS.  The mount point traversal code in a
translucent FS mounted into the same directory as exports its root
is another case.


When you agregate two objects (be they block lists or be they directory
block lists) under a single vnode pointer, you end up with a case
where the operation must be idempotent but can not be atomic (this
means that you must either do everything or nothing, and that you
can't, by definition, do everything at once).

Say I have a new FS type, called "AGFS".  It's purpose is to replace
CCD, and do transparent file concatenation.  When one FS runs out
of room writing the file, the next write starts a new file in the
next FS.

Now, after I've written this file, I have something that looks like:


                            vp[1] (AGFS, size 100 blocks)
                            /  \
  (FFS, size 80 blocks) vp[2]   vp[3] (FFS, size 20 blocks)


I come back later to open this file.  So far, so good.  Now I want
to set an advisory lock on 20 blocks: 70 through 90.  This advisory
lock needs to span two underlying objects (regardless of whether I
implement the VOP_ADVLOCK chain off the vp or the ip in the underlying
implementation, this is true).

So I assert a lock on vp[1].  The AGFS asserts the lock on both
vp[2] and vp[3] in the underlying FS's.  Now we have problems:

	IF !VOP_ADVLOCK(vp[2])
	  FAIL			/* this is OK; we can fail back*/
	IF !VOP_ADVLOCK(vp[3])
	  FAIL			/* this is bad; recovery is indeterminate*/

Why can't we recover from the second failure?

The reason we can't recover is that the first operation has succeeded,
and as a result, we have coelesced the lock list for that vp.  This
may have resulted in promotions or demotions. of the lock state.

When can this happen in real life?

It can happen when we start with something like this:

		[              SSS   ][ XXX                ]
		[              111   ][ 222                ]

(process 1 holds a shared lock and process 2 holds an exclusive lock),
and we want to assert a new lock:

		[               XXXXX][XX                  ]
		[               11111][11                  ]

After the first statement, we have:

				**
		[              SXXXXX][ XXX                ]
		[              111111][ 222                ]

(* lock promoted from S to X, previous state lost)

The second statement fails, because the last part of the lock would
conflict with the lock from process 2.

Now let's look at the VOP_LOCK case for a directory traversal, and
specifically look at the race condition for the parent unlock when
traversing to the child (this is a well known race, which is
handled by generation numbers in the flat stack case).  When the
traversal crosses the agregation boundry, the race is greatly
expanded.  This is less of a problem if one of the FS's is mounted
read-only... but only if you don't permit white-outs.

In the non-read-only case, we must hold the parent (upper level)
lock twice: there are two parent spaces: one for the stack
relationship, and one for the underlying parent/child relationship.

You can illustrate the same problem with a translucent fs, where
you have /a/b/c and you remount c in /a/b/d.  The d/c traversal
means that the lookup prior needs to do a b->b transition.

You could probably kludge all of these border cases using a recursion
flag, and specific equality testing, if you had to.

Or you could delay the coelesce, buying you an additional dimension,
and then make the call down to the second one, and decide to abort
the coelesce instead, when the second call fails.

A delayed coelesce only buys you a single dimensionality, however;
if you did something like:

               vp (union)
              /  \
             vp   vp (union)
                 /  \
                vp   vp

Then the problem just comes back; only now you can't solve it.  The
fix is to make both the VOP_LOCK and VOP_ADVLOCK use veto interfaces;
then you can ask the layer if the lock can proceed, and if it can,
then you run it to completion.

The one limitation here is that you can only have a single FINALVP
that needs a proxy mechanism, unless the proxy coelesce is also
reversible (an NFS client FS employs a proxy mechanism where it
must actually remotely assert the lock to determine if the lock
can be asserted.  This is a flay in the NFS protocol, as is the
inability to proxy VOP_IOCTL requests).

In reality, if you are not referencing the backing object for the
vnode (the in-core inode, nfsnode, or whatever), then you can apply
a single lock to the backing vnodes alone.  You are still left
with the fact that if you assert one NFS lock, it will be remotely
coelesced, and then if you assert another NFS lock, and the
assertion fails, you can't back the first out.  One way out of this
bind is to maintain a lock list locally, and manufacture PID's for
the lock requests so that a coelesce does not take place on the
remote server.  This means you must keep an interference graph
locally to match them up.  The NFS rpc.lockd needs to have similar
code.  In any case, if I can only have a single dimensionality
from a delayed coelesce, I'd rather spend it on a remote FS, and
make multiple stacks work.  Right now, you don't even have one
extra dimensionality.

To drag myself back on the specific subject, though, a union FS
has a loss of dimensionality, and you end up with something that
looks like:

               vp1'
              /  \
             vp2' vp1
                 /  \
                vp2  vp3

Where vp1' shadows vp1 and vp2' shadows vp2 in the flat space
managed by the lockmgr() code (visualize vp1' floating over vp1,
and if there is both a vp2' and a vp2, they float over each other
and result in a conflict because you don't know which one you are
supposed to be going to when you traverse down from vp1'/vp1).

The vp1'/vp1 shadowing is not a problem (you can special case "self").
The vp2'/vp2 shadowing does not always occur in practice (mostly it's
vp4' off of vp1', so you don't have a space conflict), but when it
does, then you lock recursively and get a "locking against myself" error.

This shadowing is the same shadowing that's currently causing your
panic's in the buffer alias case.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199802070429.VAA25084>