Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 20 Dec 2013 14:44:05 +0100
From:      Pawel Jakub Dawidek <pjd@FreeBSD.org>
To:        krichy@tvnetwork.hu
Cc:        freebsd-fs@freebsd.org
Subject:   Re: kern/184677 / ZFS snapshot handling deadlocks
Message-ID:  <20131220134405.GE1658@garage.freebsd.pl>
In-Reply-To: <alpine.DEB.2.10.1312191629370.4344@krichy.tvnetwork.hu>
References:  <alpine.DEB.2.10.1312161647410.11599@krichy.tvnetwork.hu> <alpine.DEB.2.10.1312171326520.7714@krichy.tvnetwork.hu> <alpine.DEB.2.10.1312191629370.4344@krichy.tvnetwork.hu>

next in thread | previous in thread | raw e-mail | index | archive | help

--LKTjZJSUETSlgu2t
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Dec 19, 2013 at 04:46:11PM +0100, krichy@tvnetwork.hu wrote:
> Dear devs,
>=20
> I am attaching a more clear patch/fix for my snapshot handling issues=20
> (0002), but I would be happy if some ZFS expert would comment it. I am=20
> trying to solve it at least for two weeks now, and an ACK or a NACK would=
=20
> be nice from someone. Also a commit is reverted since it also caused=20
> deadlocks. I've read its comments, which also eliminates deadlocks, but I=
=20
> did not find any reference how to produce that deadlock. In my view=20
> reverting that makes my issues disappear, but I dont know what new cases=
=20
> will it rise.

Richard,

I won't be able to analyse it myself anytime soon, because of other
obligations, but I forwarded you e-mail to the zfs-devel@ mailing list
(it is closed, but gathers FreeBSD ZFS devs). Hopefully someone from
there will be able to help you.

> I've rewritten traverse() to make more like upstream, added two extra=20
> VN_HOLD()s to snapdir_lookup() when traverse returned the same vnode what=
=20
> was passed to it (I dont know even that upon creating a snapshot vnode wh=
y=20
> is that extra two holds needed for the vnode.) And unfortunately, due to=
=20
> FreeBSD calls vop_inactive callbacks with vnodes locked, that could also=
=20
> cause deadlocks, so zfsctl_snapshot_inactive() and=20
> zfsctl_snapshot_vptocnp() has been rewritten to work that around.
>=20
> After this, one may also get a deadlock, when a simple access would call=
=20
> into zfsctl_snapshot_lookup(). The documentation says, that those vnodes=
=20
> should always be covered, but after some stress test, sometimes we hit=20
> that call, and that can cause again deadlocks. In our environment I've=20
> just uncommented that callback, which returns ENODIR on some calls, but a=
t=20
> least not a deadlock.
>=20
> The attached script can be used to reproduce my cases (would one confirm=
=20
> that?), and after the patches applied, they disappear (confirm?).
>=20
> Thanks,
>=20
>=20
> Kojedzinszky Richard
> Euronet Magyarorszag Informatikai Zrt.
>=20
> On Tue, 17 Dec 2013, krichy@tvnetwork.hu wrote:
>=20
> > Date: Tue, 17 Dec 2013 14:50:16 +0100 (CET)
> > From: krichy@tvnetwork.hu
> > To: pjd@freebsd.org
> > Cc: freebsd-fs@freebsd.org
> > Subject: Re: kern/184677 (fwd)
> >=20
> > Dear devs,
> >
> > I will sum up my experience regarding the issue:
> >
> > The sympton is that a concurrent 'zfs send -R' and some activity on the=
=20
> > snapshot dir (or in the snapshot) may cause a deadlock.
> >
> > After investigating the problem, I found that zfs send umounts the snap=
shots,=20
> > and that causes the deadlock, so later I tested only with concurrent um=
ount=20
> > and the "activity". More later I found that listing the snapshots in=20
> > .zfs/snapshot/ and unounting them can cause the found deadlock, so I us=
ed=20
> > them for the tests. But for my surprise, instead of a deadlock, a recur=
sive=20
> > lock panic has arised.
> >
> > The vnode for the ".zfs/snapshot/" directory contains zfs's zfsctl_snap=
dir_t=20
> > structure (sdp). This contains a tree of mounted snapshots, and each en=
try=20
> > (sep) contains the vnode of entry on which the snapshot is mounted on t=
op=20
> > (se_root). The strange is that the se_root member does not hold a refer=
ence=20
> > for the vnode, just a simple pointer to it.
> >
> > Upon entry lookup (zfsctl_snapdir_lookup()) the "snapshot" vnode is loc=
ked,=20
> > the zfsctl_snapdir_t's tree is locked, and searched for the mount if it=
=20
> > exists already. If it founds no entry, does the mount. In the case of a=
n=20
> > entry was found, the se_root member contains the vnode which the snapsh=
ot is=20
> > mounted on. Thus, a reference is taken for it, and the traverse() call =
will=20
> > resolve to the real root vnode of the mounted snapshot, returning it as=
=20
> > locked. (Examining the traverse() code I've found that it did not follo=
w=20
> > FreeBSD's lock order recommendation described in sys/kern/vfs_subr.c.)
> >
> > On the other way, when an umount is issued, the se_root vnode looses it=
s last=20
> > reference (as only the mountpoint holds one for it), it goes through th=
e=20
> > vinactive() path, to zfsctl_snapshot_inactive(). In FreeBSD this is cal=
led=20
> > with a locked vnode, so this is a deadlock race condition. While=20
> > zfsctl_snapdir_lookup() holds the mutex for the sdp tree, and traverse(=
)=20
> > tries to acquire the se_root, zfsctl_snapshot_inactive() holds the lock=
 on=20
> > se_root while tries to access the sdp lock.
> >
> > The zfsctl_snapshot_inactive() has an if statement checking the v_useco=
unt,=20
> > which is incremented in zfsctl_snapdir_lookup(), but in that context it=
 is=20
> > not covered by VI_LOCK. And it seems to me that FreeBSD's vinactive() p=
ath=20
> > assumes that the vnode remains inactive (as opposed to illumos, at leas=
t how=20
> > i read the code). So zfsctl_snapshot_inactive() must free resources whi=
le in=20
> > a locked state. I was a bit confused, and probably that is why the prev=
iously=20
> > posted patch is as is.
> >
> > Maybe if I had some clues on the directions of this problem, I could ha=
ve=20
> > worked more for a nicer, shorter solution.
> >
> > Please someone comment on my post.
> >
> > Regards,
> >
> >
> >
> > Kojedzinszky Richard
> > Euronet Magyarorszag Informatikai Zrt.
> >
> > On Mon, 16 Dec 2013, krichy@tvnetwork.hu wrote:
> >
> >> Date: Mon, 16 Dec 2013 16:52:16 +0100 (CET)
> >> From: krichy@tvnetwork.hu
> >> To: pjd@freebsd.org
> >> Cc: freebsd-fs@freebsd.org
> >> Subject: Re: kern/184677 (fwd)
> >>=20
> >> Dear PJD,
> >>=20
> >> I am a happy FreeBSD user, I am sure you've read my previous posts=20
> >> regarding some issues in ZFS. Please give some advice for me, where to=
 look=20
> >> for solutions, or how could I help to resolve those issues.
> >>=20
> >> Regards,
> >> Kojedzinszky Richard
> >> Euronet Magyarorszag Informatikai Zrt.
> >>=20
> >> ---------- Forwarded message ----------
> >> Date: Mon, 16 Dec 2013 15:23:06 +0100 (CET)
> >> From: krichy@tvnetwork.hu
> >> To: freebsd-fs@freebsd.org
> >> Subject: Re: kern/184677
> >>=20
> >>=20
> >> Seems that pjd did a change which eliminated the zfsdev_state_lock on =
Fri=20
> >> Aug 12 07:04:16 2011 +0000, which might introduced a new deadlock=20
> >> situation. Any comments on this?
> >>=20
> >>=20
> >> Kojedzinszky Richard
> >> Euronet Magyarorszag Informatikai Zrt.
> >>=20
> >> On Mon, 16 Dec 2013, krichy@tvnetwork.hu wrote:
> >>=20
> >>> Date: Mon, 16 Dec 2013 11:08:11 +0100 (CET)
> >>> From: krichy@tvnetwork.hu
> >>> To: freebsd-fs@freebsd.org
> >>> Subject: kern/184677
> >>>=20
> >>> Dear devs,
> >>>=20
> >>> I've attached a patch, which makes the recursive lockmgr disappear, a=
nd=20
> >>> makes the reported bug to disappear. I dont know if I followed any=20
> >>> guidelines well, or not, but at least it works for me. Please some=20
> >>> ZFS/FreeBSD fs expert review it, and fix it where it needed.
> >>>=20
> >>> But unfortunately, my original problem is still not solved, maybe the=
 same=20
> >>> as Ryan's:=20
> >>> http://lists.freebsd.org/pipermail/freebsd-fs/2013-December/018707.ht=
ml
> >>>=20
> >>> Tracing the problem down is that zfsctl_snapdir_lookup() tries to acq=
uire=20
> >>> spa_namespace_lock while when finishing a zfs send -R does a=20
> >>> zfsdev_close(), and that also holds the same mutex. And this causes a=
=20
> >>> deadlock scenario. I looked at illumos's code, and for some reason th=
ey=20
> >>> use another mutex on zfsdev_close(), which therefore may not deadlock=
 with=20
> >>> zfsctl_snapdir_lookup(). But I am still investigating the problem.
> >>>=20
> >>> I would like to help making ZFS more stable on freebsd also with its =
whole=20
> >>> functionality. I would be very thankful if some expert would give som=
e=20
> >>> advice, how to solve these bugs. PJD, Steven, Xin?
> >>>=20
> >>> Thanks in advance,
> >>>=20
> >>>=20
> >>> Kojedzinszky Richard
> >>> Euronet Magyarorszag Informatikai Zrt.
> >> _______________________________________________
> >> freebsd-fs@freebsd.org mailing list
> >> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> >> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
> >>=20
> >


> From 39298da838d006ad225e41529d7b7f240fccfe73 Mon Sep 17 00:00:00 2001
> From: Richard Kojedzinszky <krichy@cflinux.hu>
> Date: Mon, 16 Dec 2013 15:39:11 +0100
> Subject: [PATCH 1/2] Revert "Eliminate the zfsdev_state_lock entirely and
>  replace it with the"
>=20
> This reverts commit 1d8972b3f353f986eb5b85bc108b1c0d946d3218.
>=20
> Conflicts:
> 	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
> ---
>  .../opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h  |   1 +
>  .../opensolaris/uts/common/fs/zfs/vdev_geom.c      |  14 ++-
>  .../opensolaris/uts/common/fs/zfs/zfs_ioctl.c      |  16 +--
>  .../contrib/opensolaris/uts/common/fs/zfs/zvol.c   | 119 +++++++++------=
------
>  4 files changed, 70 insertions(+), 80 deletions(-)
>=20
> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl=
=2Eh b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h
> index af2def2..8272c4d 100644
> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h
> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h
> @@ -383,6 +383,7 @@ extern void *zfsdev_get_soft_state(minor_t minor,
>  extern minor_t zfsdev_minor_alloc(void);
> =20
>  extern void *zfsdev_state;
> +extern kmutex_t zfsdev_state_lock;
> =20
>  #endif	/* _KERNEL */
> =20
> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c b=
/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
> index 15685a5..5c3e9f3 100644
> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
> @@ -581,7 +581,7 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t =
*max_psize,
>  	struct g_provider *pp;
>  	struct g_consumer *cp;
>  	size_t bufsize;
> -	int error;
> +	int error, lock;
> =20
>  	/*
>  	 * We must have a pathname, and it must be absolute.
> @@ -593,6 +593,12 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t=
 *max_psize,
> =20
>  	vd->vdev_tsd =3D NULL;
> =20
> +	if (mutex_owned(&spa_namespace_lock)) {
> +		mutex_exit(&spa_namespace_lock);
> +		lock =3D 1;
> +	} else {
> +		lock =3D 0;
> +	}
>  	DROP_GIANT();
>  	g_topology_lock();
>  	error =3D 0;
> @@ -624,7 +630,11 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t=
 *max_psize,
>  	    !ISP2(cp->provider->sectorsize)) {
>  		ZFS_LOG(1, "Provider %s has unsupported sectorsize.",
>  		    vd->vdev_path);
> +
> +		g_topology_lock();
>  		vdev_geom_detach(cp, 0);
> +		g_topology_unlock();
> +
>  		error =3D EINVAL;
>  		cp =3D NULL;
>  	} else if (cp->acw =3D=3D 0 && (spa_mode(vd->vdev_spa) & FWRITE) !=3D 0=
) {
> @@ -647,6 +657,8 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t =
*max_psize,
>  	}
>  	g_topology_unlock();
>  	PICKUP_GIANT();
> +	if (lock)
> +		mutex_enter(&spa_namespace_lock);
>  	if (cp =3D=3D NULL) {
>  		vd->vdev_stat.vs_aux =3D VDEV_AUX_OPEN_FAILED;
>  		return (error);
> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c b=
/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
> index e9fba26..91becde 100644
> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
> @@ -5635,7 +5635,7 @@ zfsdev_minor_alloc(void)
>  	static minor_t last_minor;
>  	minor_t m;
> =20
> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
> =20
>  	for (m =3D last_minor + 1; m !=3D last_minor; m++) {
>  		if (m > ZFSDEV_MAX_MINOR)
> @@ -5655,7 +5655,7 @@ zfs_ctldev_init(struct cdev *devp)
>  	minor_t minor;
>  	zfs_soft_state_t *zs;
> =20
> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
> =20
>  	minor =3D zfsdev_minor_alloc();
>  	if (minor =3D=3D 0)
> @@ -5676,7 +5676,7 @@ zfs_ctldev_init(struct cdev *devp)
>  static void
>  zfs_ctldev_destroy(zfs_onexit_t *zo, minor_t minor)
>  {
> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
> =20
>  	zfs_onexit_destroy(zo);
>  	ddi_soft_state_free(zfsdev_state, minor);
> @@ -5706,9 +5706,9 @@ zfsdev_open(struct cdev *devp, int flag, int mode, =
struct thread *td)
> =20
>  	/* This is the control device. Allocate a new minor if requested. */
>  	if (flag & FEXCL) {
> -		mutex_enter(&spa_namespace_lock);
> +		mutex_enter(&zfsdev_state_lock);
>  		error =3D zfs_ctldev_init(devp);
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  	}
> =20
>  	return (error);
> @@ -5723,14 +5723,14 @@ zfsdev_close(void *data)
>  	if (minor =3D=3D 0)
>  		return;
> =20
> -	mutex_enter(&spa_namespace_lock);
> +	mutex_enter(&zfsdev_state_lock);
>  	zo =3D zfsdev_get_soft_state(minor, ZSST_CTLDEV);
>  	if (zo =3D=3D NULL) {
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		return;
>  	}
>  	zfs_ctldev_destroy(zo, minor);
> -	mutex_exit(&spa_namespace_lock);
> +	mutex_exit(&zfsdev_state_lock);
>  }
> =20
>  static int
> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c b/sys/=
cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
> index 72d4502..aec5219 100644
> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
> @@ -104,11 +104,12 @@ static char *zvol_tag =3D "zvol_tag";
>  #define	ZVOL_DUMPSIZE		"dumpsize"
> =20
>  /*
> - * The spa_namespace_lock protects the zfsdev_state structure from being
> - * modified while it's being used, e.g. an open that comes in before a
> - * create finishes.  It also protects temporary opens of the dataset so =
that,
> + * This lock protects the zfsdev_state structure from being modified
> + * while it's being used, e.g. an open that comes in before a create
> + * finishes.  It also protects temporary opens of the dataset so that,
>   * e.g., an open doesn't get a spurious EBUSY.
>   */
> +kmutex_t zfsdev_state_lock;
>  static uint32_t zvol_minors;
> =20
>  typedef struct zvol_extent {
> @@ -249,7 +250,7 @@ zvol_minor_lookup(const char *name)
>  	struct g_geom *gp;
>  	zvol_state_t *zv =3D NULL;
> =20
> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
> =20
>  	g_topology_lock();
>  	LIST_FOREACH(gp, &zfs_zvol_class.geom, geom) {
> @@ -465,11 +466,11 @@ zvol_name2minor(const char *name, minor_t *minor)
>  {
>  	zvol_state_t *zv;
> =20
> -	mutex_enter(&spa_namespace_lock);
> +	mutex_enter(&zfsdev_state_lock);
>  	zv =3D zvol_minor_lookup(name);
>  	if (minor && zv)
>  		*minor =3D zv->zv_minor;
> -	mutex_exit(&spa_namespace_lock);
> +	mutex_exit(&zfsdev_state_lock);
>  	return (zv ? 0 : -1);
>  }
>  #endif	/* sun */
> @@ -489,10 +490,10 @@ zvol_create_minor(const char *name)
> =20
>  	ZFS_LOG(1, "Creating ZVOL %s...", name);
> =20
> -	mutex_enter(&spa_namespace_lock);
> +	mutex_enter(&zfsdev_state_lock);
> =20
>  	if (zvol_minor_lookup(name) !=3D NULL) {
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		return (SET_ERROR(EEXIST));
>  	}
> =20
> @@ -500,20 +501,20 @@ zvol_create_minor(const char *name)
>  	error =3D dmu_objset_own(name, DMU_OST_ZVOL, B_TRUE, FTAG, &os);
> =20
>  	if (error) {
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		return (error);
>  	}
> =20
>  #ifdef sun
>  	if ((minor =3D zfsdev_minor_alloc()) =3D=3D 0) {
>  		dmu_objset_disown(os, FTAG);
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		return (SET_ERROR(ENXIO));
>  	}
> =20
>  	if (ddi_soft_state_zalloc(zfsdev_state, minor) !=3D DDI_SUCCESS) {
>  		dmu_objset_disown(os, FTAG);
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		return (SET_ERROR(EAGAIN));
>  	}
>  	(void) ddi_prop_update_string(minor, zfs_dip, ZVOL_PROP_NAME,
> @@ -525,7 +526,7 @@ zvol_create_minor(const char *name)
>  	    minor, DDI_PSEUDO, 0) =3D=3D DDI_FAILURE) {
>  		ddi_soft_state_free(zfsdev_state, minor);
>  		dmu_objset_disown(os, FTAG);
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		return (SET_ERROR(EAGAIN));
>  	}
> =20
> @@ -536,7 +537,7 @@ zvol_create_minor(const char *name)
>  		ddi_remove_minor_node(zfs_dip, chrbuf);
>  		ddi_soft_state_free(zfsdev_state, minor);
>  		dmu_objset_disown(os, FTAG);
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		return (SET_ERROR(EAGAIN));
>  	}
> =20
> @@ -587,7 +588,7 @@ zvol_create_minor(const char *name)
> =20
>  	zvol_minors++;
> =20
> -	mutex_exit(&spa_namespace_lock);
> +	mutex_exit(&zfsdev_state_lock);
> =20
>  	zvol_geom_run(zv);
> =20
> @@ -609,7 +610,7 @@ zvol_remove_zv(zvol_state_t *zv)
>  	minor_t minor =3D zv->zv_minor;
>  #endif
> =20
> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>  	if (zv->zv_total_opens !=3D 0)
>  		return (SET_ERROR(EBUSY));
> =20
> @@ -635,15 +636,15 @@ zvol_remove_minor(const char *name)
>  	zvol_state_t *zv;
>  	int rc;
> =20
> -	mutex_enter(&spa_namespace_lock);
> +	mutex_enter(&zfsdev_state_lock);
>  	if ((zv =3D zvol_minor_lookup(name)) =3D=3D NULL) {
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		return (SET_ERROR(ENXIO));
>  	}
>  	g_topology_lock();
>  	rc =3D zvol_remove_zv(zv);
>  	g_topology_unlock();
> -	mutex_exit(&spa_namespace_lock);
> +	mutex_exit(&zfsdev_state_lock);
>  	return (rc);
>  }
> =20
> @@ -755,7 +756,7 @@ zvol_update_volsize(objset_t *os, uint64_t volsize)
>  	dmu_tx_t *tx;
>  	int error;
> =20
> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
> =20
>  	tx =3D dmu_tx_create(os);
>  	dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL);
> @@ -786,7 +787,7 @@ zvol_remove_minors(const char *name)
>  	namelen =3D strlen(name);
> =20
>  	DROP_GIANT();
> -	mutex_enter(&spa_namespace_lock);
> +	mutex_enter(&zfsdev_state_lock);
>  	g_topology_lock();
> =20
>  	LIST_FOREACH_SAFE(gp, &zfs_zvol_class.geom, geom, gptmp) {
> @@ -804,7 +805,7 @@ zvol_remove_minors(const char *name)
>  	}
> =20
>  	g_topology_unlock();
> -	mutex_exit(&spa_namespace_lock);
> +	mutex_exit(&zfsdev_state_lock);
>  	PICKUP_GIANT();
>  }
> =20
> @@ -818,10 +819,10 @@ zvol_set_volsize(const char *name, major_t maj, uin=
t64_t volsize)
>  	uint64_t old_volsize =3D 0ULL;
>  	uint64_t readonly;
> =20
> -	mutex_enter(&spa_namespace_lock);
> +	mutex_enter(&zfsdev_state_lock);
>  	zv =3D zvol_minor_lookup(name);
>  	if ((error =3D dmu_objset_hold(name, FTAG, &os)) !=3D 0) {
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		return (error);
>  	}
> =20
> @@ -888,7 +889,7 @@ zvol_set_volsize(const char *name, major_t maj, uint6=
4_t volsize)
>  out:
>  	dmu_objset_rele(os, FTAG);
> =20
> -	mutex_exit(&spa_namespace_lock);
> +	mutex_exit(&zfsdev_state_lock);
> =20
>  	return (error);
>  }
> @@ -899,36 +900,19 @@ zvol_open(struct g_provider *pp, int flag, int coun=
t)
>  {
>  	zvol_state_t *zv;
>  	int err =3D 0;
> -	boolean_t locked =3D B_FALSE;
> =20
> -	/*
> -	 * Protect against recursively entering spa_namespace_lock
> -	 * when spa_open() is used for a pool on a (local) ZVOL(s).
> -	 * This is needed since we replaced upstream zfsdev_state_lock
> -	 * with spa_namespace_lock in the ZVOL code.
> -	 * We are using the same trick as spa_open().
> -	 * Note that calls in zvol_first_open which need to resolve
> -	 * pool name to a spa object will enter spa_open()
> -	 * recursively, but that function already has all the
> -	 * necessary protection.
> -	 */
> -	if (!MUTEX_HELD(&spa_namespace_lock)) {
> -		mutex_enter(&spa_namespace_lock);
> -		locked =3D B_TRUE;
> -	}
> +	mutex_enter(&zfsdev_state_lock);
> =20
>  	zv =3D pp->private;
>  	if (zv =3D=3D NULL) {
> -		if (locked)
> -			mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		return (SET_ERROR(ENXIO));
>  	}
> =20
>  	if (zv->zv_total_opens =3D=3D 0)
>  		err =3D zvol_first_open(zv);
>  	if (err) {
> -		if (locked)
> -			mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		return (err);
>  	}
>  	if ((flag & FWRITE) && (zv->zv_flags & ZVOL_RDONLY)) {
> @@ -950,15 +934,13 @@ zvol_open(struct g_provider *pp, int flag, int coun=
t)
>  #endif
> =20
>  	zv->zv_total_opens +=3D count;
> -	if (locked)
> -		mutex_exit(&spa_namespace_lock);
> +	mutex_exit(&zfsdev_state_lock);
> =20
>  	return (err);
>  out:
>  	if (zv->zv_total_opens =3D=3D 0)
>  		zvol_last_close(zv);
> -	if (locked)
> -		mutex_exit(&spa_namespace_lock);
> +	mutex_exit(&zfsdev_state_lock);
>  	return (err);
>  }
> =20
> @@ -968,18 +950,12 @@ zvol_close(struct g_provider *pp, int flag, int cou=
nt)
>  {
>  	zvol_state_t *zv;
>  	int error =3D 0;
> -	boolean_t locked =3D B_FALSE;
> =20
> -	/* See comment in zvol_open(). */
> -	if (!MUTEX_HELD(&spa_namespace_lock)) {
> -		mutex_enter(&spa_namespace_lock);
> -		locked =3D B_TRUE;
> -	}
> +	mutex_enter(&zfsdev_state_lock);
> =20
>  	zv =3D pp->private;
>  	if (zv =3D=3D NULL) {
> -		if (locked)
> -			mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		return (SET_ERROR(ENXIO));
>  	}
> =20
> @@ -1002,8 +978,7 @@ zvol_close(struct g_provider *pp, int flag, int coun=
t)
>  	if (zv->zv_total_opens =3D=3D 0)
>  		zvol_last_close(zv);
> =20
> -	if (locked)
> -		mutex_exit(&spa_namespace_lock);
> +	mutex_exit(&zfsdev_state_lock);
>  	return (error);
>  }
> =20
> @@ -1658,12 +1633,12 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int =
flag, cred_t *cr, int *rvalp)
>  	int error =3D 0;
>  	rl_t *rl;
> =20
> -	mutex_enter(&spa_namespace_lock);
> +	mutex_enter(&zfsdev_state_lock);
> =20
>  	zv =3D zfsdev_get_soft_state(getminor(dev), ZSST_ZVOL);
> =20
>  	if (zv =3D=3D NULL) {
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		return (SET_ERROR(ENXIO));
>  	}
>  	ASSERT(zv->zv_total_opens > 0);
> @@ -1677,7 +1652,7 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int fl=
ag, cred_t *cr, int *rvalp)
>  		dki.dki_ctype =3D DKC_UNKNOWN;
>  		dki.dki_unit =3D getminor(dev);
>  		dki.dki_maxtransfer =3D 1 << (SPA_MAXBLOCKSHIFT - zv->zv_min_bs);
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		if (ddi_copyout(&dki, (void *)arg, sizeof (dki), flag))
>  			error =3D SET_ERROR(EFAULT);
>  		return (error);
> @@ -1687,7 +1662,7 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int fl=
ag, cred_t *cr, int *rvalp)
>  		dkm.dki_lbsize =3D 1U << zv->zv_min_bs;
>  		dkm.dki_capacity =3D zv->zv_volsize >> zv->zv_min_bs;
>  		dkm.dki_media_type =3D DK_UNKNOWN;
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		if (ddi_copyout(&dkm, (void *)arg, sizeof (dkm), flag))
>  			error =3D SET_ERROR(EFAULT);
>  		return (error);
> @@ -1697,14 +1672,14 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int =
flag, cred_t *cr, int *rvalp)
>  			uint64_t vs =3D zv->zv_volsize;
>  			uint8_t bs =3D zv->zv_min_bs;
> =20
> -			mutex_exit(&spa_namespace_lock);
> +			mutex_exit(&zfsdev_state_lock);
>  			error =3D zvol_getefi((void *)arg, flag, vs, bs);
>  			return (error);
>  		}
> =20
>  	case DKIOCFLUSHWRITECACHE:
>  		dkc =3D (struct dk_callback *)arg;
> -		mutex_exit(&spa_namespace_lock);
> +		mutex_exit(&zfsdev_state_lock);
>  		zil_commit(zv->zv_zilog, ZVOL_OBJ);
>  		if ((flag & FKIOCTL) && dkc !=3D NULL && dkc->dkc_callback) {
>  			(*dkc->dkc_callback)(dkc->dkc_cookie, error);
> @@ -1730,10 +1705,10 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int =
flag, cred_t *cr, int *rvalp)
>  			}
>  			if (wce) {
>  				zv->zv_flags |=3D ZVOL_WCE;
> -				mutex_exit(&spa_namespace_lock);
> +				mutex_exit(&zfsdev_state_lock);
>  			} else {
>  				zv->zv_flags &=3D ~ZVOL_WCE;
> -				mutex_exit(&spa_namespace_lock);
> +				mutex_exit(&zfsdev_state_lock);
>  				zil_commit(zv->zv_zilog, ZVOL_OBJ);
>  			}
>  			return (0);
> @@ -1828,7 +1803,7 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int fl=
ag, cred_t *cr, int *rvalp)
>  		break;
> =20
>  	}
> -	mutex_exit(&spa_namespace_lock);
> +	mutex_exit(&zfsdev_state_lock);
>  	return (error);
>  }
>  #endif	/* sun */
> @@ -1844,12 +1819,14 @@ zvol_init(void)
>  {
>  	VERIFY(ddi_soft_state_init(&zfsdev_state, sizeof (zfs_soft_state_t),
>  	    1) =3D=3D 0);
> +	mutex_init(&zfsdev_state_lock, NULL, MUTEX_DEFAULT, NULL);
>  	ZFS_LOG(1, "ZVOL Initialized.");
>  }
> =20
>  void
>  zvol_fini(void)
>  {
> +	mutex_destroy(&zfsdev_state_lock);
>  	ddi_soft_state_fini(&zfsdev_state);
>  	ZFS_LOG(1, "ZVOL Deinitialized.");
>  }
> @@ -1889,7 +1866,7 @@ zvol_dump_init(zvol_state_t *zv, boolean_t resize)
>  	uint64_t version =3D spa_version(spa);
>  	enum zio_checksum checksum;
> =20
> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>  	ASSERT(vd->vdev_ops =3D=3D &vdev_root_ops);
> =20
>  	error =3D dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, 0,
> @@ -2437,7 +2414,7 @@ zvol_rename_minor(struct g_geom *gp, const char *ne=
wname)
>  	struct g_provider *pp;
>  	zvol_state_t *zv;
> =20
> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>  	g_topology_assert();
> =20
>  	pp =3D LIST_FIRST(&gp->provider);
> @@ -2471,7 +2448,7 @@ zvol_rename_minors(const char *oldname, const char =
*newname)
>  	newnamelen =3D strlen(newname);
> =20
>  	DROP_GIANT();
> -	mutex_enter(&spa_namespace_lock);
> +	mutex_enter(&zfsdev_state_lock);
>  	g_topology_lock();
> =20
>  	LIST_FOREACH(gp, &zfs_zvol_class.geom, geom) {
> @@ -2494,6 +2471,6 @@ zvol_rename_minors(const char *oldname, const char =
*newname)
>  	}
> =20
>  	g_topology_unlock();
> -	mutex_exit(&spa_namespace_lock);
> +	mutex_exit(&zfsdev_state_lock);
>  	PICKUP_GIANT();
>  }
> --=20
> 1.8.4.2
>=20

> From 57d5a383b585c32c77af54e8cdacaddf8ce7584f Mon Sep 17 00:00:00 2001
> From: Richard Kojedzinszky <krichy@cflinux.hu>
> Date: Wed, 18 Dec 2013 22:11:21 +0100
> Subject: [PATCH 2/2] ZFS snapshot handling fix
>=20
> ---
>  .../compat/opensolaris/kern/opensolaris_lookup.c   | 13 +++---
>  .../opensolaris/uts/common/fs/zfs/zfs_ctldir.c     | 53 +++++++++++++++-=
------
>  2 files changed, 42 insertions(+), 24 deletions(-)
>=20
> diff --git a/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c b/sys/=
cddl/compat/opensolaris/kern/opensolaris_lookup.c
> index 94383d6..4cac053 100644
> --- a/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c
> +++ b/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c
> @@ -81,6 +81,8 @@ traverse(vnode_t **cvpp, int lktype)
>  	 * progress on this vnode.
>  	 */
> =20
> +	vn_lock(cvp, lktype);
> +
>  	for (;;) {
>  		/*
>  		 * Reached the end of the mount chain?
> @@ -89,13 +91,7 @@ traverse(vnode_t **cvpp, int lktype)
>  		if (vfsp =3D=3D NULL)
>  			break;
>  		error =3D vfs_busy(vfsp, 0);
> -		/*
> -		 * tvp is NULL for *cvpp vnode, which we can't unlock.
> -		 */
> -		if (tvp !=3D NULL)
> -			vput(cvp);
> -		else
> -			vrele(cvp);
> +		VOP_UNLOCK(cvp, 0);
>  		if (error)
>  			return (error);
> =20
> @@ -107,6 +103,9 @@ traverse(vnode_t **cvpp, int lktype)
>  		vfs_unbusy(vfsp);
>  		if (error !=3D 0)
>  			return (error);
> +
> +		VN_RELE(cvp);
> +
>  		cvp =3D tvp;
>  	}
> =20
> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c =
b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
> index 28ab1fa..d3464b4 100644
> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
> @@ -112,6 +112,25 @@ snapentry_compare(const void *a, const void *b)
>  		return (0);
>  }
> =20
> +/* Return the zfsctl_snapdir_t object from current vnode, following
> + * the lock orders in zfsctl_snapdir_lookup() to avoid deadlocks.
> + * On return the passed in vp is unlocked */
> +static zfsctl_snapdir_t *
> +zfsctl_snapshot_get_snapdir(vnode_t *vp, vnode_t **dvpp)
> +{
> +	gfs_dir_t *dp =3D vp->v_data;
> +	*dvpp =3D dp->gfsd_file.gfs_parent;
> +	zfsctl_snapdir_t *sdp;
> +
> +	VN_HOLD(*dvpp);
> +	VOP_UNLOCK(vp, 0);
> +	vn_lock(*dvpp, LK_SHARED | LK_RETRY | LK_CANRECURSE);
> +	sdp =3D (*dvpp)->v_data;
> +	VOP_UNLOCK(*dvpp, 0);
> +
> +	return (sdp);
> +}
> +
>  #ifdef sun
>  vnodeops_t *zfsctl_ops_root;
>  vnodeops_t *zfsctl_ops_snapdir;
> @@ -1013,6 +1032,8 @@ zfsctl_snapdir_lookup(ap)
>  			 * The snapshot was unmounted behind our backs,
>  			 * try to remount it.
>  			 */
> +			VOP_UNLOCK(*vpp, 0);
> +			VN_HOLD(*vpp);
>  			VERIFY(zfsctl_snapshot_zname(dvp, nm, MAXNAMELEN, snapname) =3D=3D 0);
>  			goto domount;
>  		} else {
> @@ -1064,7 +1085,6 @@ zfsctl_snapdir_lookup(ap)
>  	sep->se_name =3D kmem_alloc(strlen(nm) + 1, KM_SLEEP);
>  	(void) strcpy(sep->se_name, nm);
>  	*vpp =3D sep->se_root =3D zfsctl_snapshot_mknode(dvp, dmu_objset_id(sna=
p));
> -	VN_HOLD(*vpp);
>  	avl_insert(&sdp->sd_snaps, sep, where);
> =20
>  	dmu_objset_rele(snap, FTAG);
> @@ -1075,6 +1095,7 @@ domount:
>  	(void) snprintf(mountpoint, mountpoint_len,
>  	    "%s/" ZFS_CTLDIR_NAME "/snapshot/%s",
>  	    dvp->v_vfsp->mnt_stat.f_mntonname, nm);
> +	VN_HOLD(*vpp);
>  	err =3D mount_snapshot(curthread, vpp, "zfs", mountpoint, snapname, 0);
>  	kmem_free(mountpoint, mountpoint_len);
>  	if (err =3D=3D 0) {
> @@ -1464,16 +1485,18 @@ zfsctl_snapshot_inactive(ap)
>  	int locked;
>  	vnode_t *dvp;
> =20
> -	if (vp->v_count > 0)
> -		goto end;
> -
> -	VERIFY(gfs_dir_lookup(vp, "..", &dvp, cr, 0, NULL, NULL) =3D=3D 0);
> -	sdp =3D dvp->v_data;
> -	VOP_UNLOCK(dvp, 0);
> +	sdp =3D zfsctl_snapshot_get_snapdir(vp, &dvp);
> =20
>  	if (!(locked =3D MUTEX_HELD(&sdp->sd_lock)))
>  		mutex_enter(&sdp->sd_lock);
> =20
> +	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
> +
> +	if (vp->v_count > 0) {
> +		mutex_exit(&sdp->sd_lock);
> +		return (0);
> +	}
> +
>  	ASSERT(!vn_ismntpt(vp));
> =20
>  	sep =3D avl_first(&sdp->sd_snaps);
> @@ -1494,7 +1517,6 @@ zfsctl_snapshot_inactive(ap)
>  		mutex_exit(&sdp->sd_lock);
>  	VN_RELE(dvp);
> =20
> -end:
>  	/*
>  	 * Dispose of the vnode for the snapshot mount point.
>  	 * This is safe to do because once this entry has been removed
> @@ -1595,20 +1617,17 @@ zfsctl_snapshot_lookup(ap)
>  static int
>  zfsctl_snapshot_vptocnp(struct vop_vptocnp_args *ap)
>  {
> -	zfsvfs_t *zfsvfs =3D ap->a_vp->v_vfsp->vfs_data;
> -	vnode_t *dvp, *vp;
> +	vnode_t *dvp, *vp =3D ap->a_vp;
>  	zfsctl_snapdir_t *sdp;
>  	zfs_snapentry_t *sep;
> -	int error;
> +	int error =3D 0;
> =20
> -	ASSERT(zfsvfs->z_ctldir !=3D NULL);
> -	error =3D zfsctl_root_lookup(zfsvfs->z_ctldir, "snapshot", &dvp,
> -	    NULL, 0, NULL, kcred, NULL, NULL, NULL);
> -	if (error !=3D 0)
> -		return (error);
> -	sdp =3D dvp->v_data;
> +	sdp =3D zfsctl_snapshot_get_snapdir(vp, &dvp);
> =20
>  	mutex_enter(&sdp->sd_lock);
> +
> +	vn_lock(vp, LK_SHARED | LK_RETRY);
> +
>  	sep =3D avl_first(&sdp->sd_snaps);
>  	while (sep !=3D NULL) {
>  		vp =3D sep->se_root;
> --=20
> 1.8.4.2
>=20

--=20
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://mobter.com

--LKTjZJSUETSlgu2t
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (FreeBSD)

iEYEARECAAYFAlK0SaUACgkQForvXbEpPzSzLACfddI+1gydBrna/vXLdDwR4+DW
M2EAnROvev3FqMsIlPHznalQ1EyeeVXw
=v+Ou
-----END PGP SIGNATURE-----

--LKTjZJSUETSlgu2t--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20131220134405.GE1658>