Date:      Thu, 10 Mar 2005 06:27:44 -0500 (EST)
From:      Jeff Roberson <jroberson@chesapeake.net>
To:        Matthew Dillon <dillon@apollo.backplane.com>
Cc:        pete@isilon.com
Subject:   Re: Cleaning up vgone.
Message-ID:  <20050310062109.L20708@mail.chesapeake.net>
In-Reply-To: <200503101026.j2AAQYHa022048@apollo.backplane.com>
References:  <20050310034922.Y20708@mail.chesapeake.net> <200503101026.j2AAQYHa022048@apollo.backplane.com>

On Thu, 10 Mar 2005, Matthew Dillon wrote:

> :I've run into a few vclean races and some related problems with VOP_CLOSE
> :not using locks.  I've made some fairly major changes to the way vfs
> :handles vnode teardown in the process of fixing this.  I'll summarize what
> :I've done here.
> :
> :The main problem with teardown was the two stage locking scheme involving
> :the XLOCK.  I got rid of the XLOCK and simply require the vnode lock
> :throughout the whole operation.  To accommodate this, VOP_INACTIVE,
> :VOP_RECLAIM, VOP_CLOSE, and VOP_REVOKE all require the vnode lock.  As
> :does vgone().
> :
> :Prior to this, vgone() would set XLOCK and then do a LK_DRAIN to make
> :sure there were no callers waiting in VOP_LOCK so that they would always
> :see the VI_XLOCK and know that the vnode had changed identities.  Now,
> :vgone sets VI_DOOMED, and all lockers who use vget() and vn_lock() check for
> :VI_DOOMED before and after acquiring the vnode lock.  To wait for the
> :transition to complete, you simply wait on the vnode lock.
> :
> :This really only required minor changes to the filesystems in the tree.
> :Most only required the removal of a VOP_UNLOCK in VOP_INACTIVE, and a few
> :acquired the lock in VOP_CLOSE to do operations which they otherwise could
> :not.  There is one change to ffs and coda, which inspect v_data in their
> :vop_lock routines.  Inspecting v_data is only safe with the interlock
> :held; before, the XLOCK would have protected v_data in the one case that
> :could lead to a panic.
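
To make the check concrete, here is a minimal user-space sketch of the
vget() path.  The pthread mutexes are stand-ins for the vnode interlock
and the vnode lock, and vget_sketch() is an invented name, not the code
in the patch:

#include <errno.h>
#include <pthread.h>

#define VI_DOOMED       0x0001  /* vnode identity is being torn down */

struct vnode {
        pthread_mutex_t v_interlock;    /* protects v_iflag */
        pthread_mutex_t v_lock;         /* stand-in for the vnode lock */
        int             v_iflag;
};

/*
 * Enter with v_interlock held.  Return 0 with the vnode locked, or
 * ENOENT if the vnode changed identities while we slept on the lock.
 */
int
vget_sketch(struct vnode *vp)
{
        if (vp->v_iflag & VI_DOOMED) {          /* check before the lock */
                pthread_mutex_unlock(&vp->v_interlock);
                return (ENOENT);
        }
        pthread_mutex_unlock(&vp->v_interlock);
        pthread_mutex_lock(&vp->v_lock);        /* may sleep through vgone() */
        pthread_mutex_lock(&vp->v_interlock);
        if (vp->v_iflag & VI_DOOMED) {          /* check again after the lock */
                pthread_mutex_unlock(&vp->v_interlock);
                pthread_mutex_unlock(&vp->v_lock);
                return (ENOENT);
        }
        pthread_mutex_unlock(&vp->v_interlock);
        return (0);
}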
> :
> :The patch is available at http://www.chesapeake.net/~jroberson/vgone.diff
> :
> :Cheers,
> :Jeff
>
>     I did basically the same thing in DragonFly.  I can only caution that
>     you have to be very, *very* careful to ensure that there are no races
>     between entities trying to ffs_vget() through the inode versus the
>     related vnode undergoing a vgone().  Be aware of the fact that there
>     is a major non-atomicity problem among the destruction of the vnode,
>     the clearing of the inode's bit in the bitmap, and the destruction of
>     the inode in the inode hash table.

I haven't looked at DragonFly, so I don't know why you had this problem.
In FreeBSD 5, you will get ENOENT back from the vget() in ufs_ihashget()
if the vnode is in the process of being torn down.  Interlocking between
the vnode interlock and the ihash_mtx ensures that the hash data is still
valid when you attempt to acquire the lock.  If it is invalidated
afterwards you are guaranteed to get ENOENT and eventually pick up the
correct inode.
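
Roughly, the lookup looks like this.  This is a simplified sketch with
invented names, reusing vget_sketch() from the sketch above; the path
that reads the inode from disk on a hash miss is elided:

#include <sys/types.h>
#include <errno.h>
#include <pthread.h>

struct vnode {                          /* same layout as the sketch above */
        pthread_mutex_t v_interlock;
        pthread_mutex_t v_lock;
        int             v_iflag;
};
int vget_sketch(struct vnode *vp);      /* enters with v_interlock held */

#define NIHASH          256
#define INOHASH(ino)    ((ino) % NIHASH)

struct inode {
        struct inode    *i_next;        /* hash chain link */
        ino_t            i_number;
        struct vnode    *i_vnode;       /* back pointer to the vnode */
};

static pthread_mutex_t ihash_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct inode *ihashtbl[NIHASH];

struct vnode *
ihashget_sketch(ino_t ino)
{
        struct inode *ip;
        struct vnode *vp;

loop:
        pthread_mutex_lock(&ihash_mtx);
        for (ip = ihashtbl[INOHASH(ino)]; ip != NULL; ip = ip->i_next) {
                if (ip->i_number != ino)
                        continue;
                vp = ip->i_vnode;
                /*
                 * Take the vnode interlock while ihash_mtx still pins
                 * the hash entry, then drop ihash_mtx.  vget_sketch()
                 * blocks on the vnode lock, so an ENOENT means the old
                 * identity finished dying; just rescan.
                 */
                pthread_mutex_lock(&vp->v_interlock);
                pthread_mutex_unlock(&ihash_mtx);
                if (vget_sketch(vp) == ENOENT)
                        goto loop;
                return (vp);
        }
        pthread_mutex_unlock(&ihash_mtx);
        return (NULL);  /* caller reads the inode from disk and hashes it */
}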

>
>     In DragonFly there were races in the inode-reuse case where an inode
>     would be reused before it was entirely disconnected from the vnode.
>     This case can occur because the inode's bitmap bit is cleared before
>     the vnode is unlocked (and you have to do it that way, you can't clear
>     the inode's bitmap bit after the vnode has been unlocked without creating
>     yet more issues).
>
>     This means that a totally unrelated process creating a file or directory
>     could allocate the SAME inode number out of the inode bitmap and call
>     FFS_VGET() *BEFORE* the original vnode controlling that inode number
>     finishes being reclaimed.  In fact, the original inode is still pointing
>     at the vnode undergoing destruction and the vnode's v_data is still
>     pointing at the inode.  The result:  bad things happen.
>
>     It got so hairy in DFly that I wound up reordering the code as follows:
>
> 	(1) lock vnode
> 	(2) free the inode in the bitmap (UFS_IFREE() equivalent)
> 	(3) Set v_data to NULL
> 	(4) Remove inode from the inode hash table (clear back pointer)
> 		[inode now fully disassociated]
> 	(5) unlock vnode.
>
> 	*** Modify the bitmap allocation code for inodes to check whether the
> 	    candidate inode is still present in the inode hash and SKIP THAT
> 	    INODE if it is.  (This is in the UFS code.)
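
In C, the ordering above reads roughly as follows.  The helpers are
invented names for the sketch, not the actual DragonFly code:

struct inode;

struct vnode {
        struct inode    *v_data;        /* NULL once disassociated */
};

/* Invented helpers standing in for the real routines. */
void vn_lock(struct vnode *vp);
void vn_unlock(struct vnode *vp);
void inode_bitmap_free(struct inode *ip);       /* UFS_IFREE() equivalent */
void ihash_remove(struct inode *ip);            /* also clears back pointer */

void
reclaim_sketch(struct vnode *vp)
{
        struct inode *ip = vp->v_data;

        vn_lock(vp);                    /* (1) lock vnode */
        inode_bitmap_free(ip);          /* (2) free the bit in the bitmap */
        vp->v_data = NULL;              /* (3) vnode no longer sees inode */
        ihash_remove(ip);               /* (4) inode fully disassociated */
        vn_unlock(vp);                  /* (5) only now drop the lock */
}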
>
>     If you do not do that you wind up with a case where the filesystem cannot
>     differentiate between someone trying to lock the original vnode and
>     someone trying to vget() the vnode related to a newly created file whose
>     inode was just allocated out of the inode bitmap.
>
>     The word 'nasty' doesn't even begin to describe the problem.  There are
>     *THREE* races here:  The vnode lock, the inode hash table, AND the
>     inode bitmap.  Due to disk I/O it is possible for the system to block
>     in between any of the related operations so you have to make sure that
>     races against all three points are handled properly.  My solution was
>     to end-run around the points by leaving the inode hash table intact until
>     the very end.  In DragonFly I don't call ufs_ihashrem() until the inode
>     has been completely divorced from the vnode and that gives me a way
>     to check for the race (by checking whether the inode is present in the
>     inode hash table or not).
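
So the allocator-side check amounts to something like this sketch,
again with invented helper names:

#include <sys/types.h>

#define INO_NONE        ((ino_t)0)

/* Invented helpers standing in for the real routines. */
ino_t inode_bitmap_findnext(ino_t start);  /* next clear bit, or INO_NONE */
void  inode_bitmap_set(ino_t ino);
int   ihash_present(ino_t ino);            /* nonzero if ino still hashed */

ino_t
ialloc_sketch(void)
{
        ino_t ino;

        for (ino = inode_bitmap_findnext(1); ino != INO_NONE;
             ino = inode_bitmap_findnext(ino + 1)) {
                /*
                 * A candidate still present in the hash belongs to a
                 * vnode that has not finished reclaiming; skip it so
                 * the old and new identities can never collide.
                 */
                if (ihash_present(ino))
                        continue;
                inode_bitmap_set(ino);
                return (ino);
        }
        return (INO_NONE);
}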
>
>     Another thing I did in DragonFly was to get rid of the crazy handling of
>     the vnode's reference count during the reclaim.  I don't know what
>     FreeBSD-HEAD is doing now, but FreeBSD-4 dropped the ref count to 0
>     and then did the crazy VXLOCK stuff.  In DragonFly I changed that so
>     the ref count is left at 1 (not 0) during the reclaim.  This required
>     fixing up a few cases that were checking the refcount against an absolute
>     0 or 1 (I forget the cases), but it made the code a whole lot easier to
>     understand.
>
> 					-Matt
> 					Matthew Dillon
> 					<dillon@backplane.com>
>


