Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 17 Jul 2013 14:03:25 +0300
From:      Andriy Gapon <avg@FreeBSD.org>
To:        zfs-devel@FreeBSD.org, freebsd-fs@FreeBSD.org
Subject:   zfs_rename: another zfs+vfs deadlock
Message-ID:  <51E679FD.3040306@FreeBSD.org>

next in thread | raw e-mail | index | archive | help

I received a report about what looked like a deadlock involving ZFS.
Interesting bits from the report are:

Thread 1156 (Thread 2038380):
#0  sched_switch (td=0xfffffe01a9e56460, newtd=0xfffffe001c3ff000,
flags=Variable "flags" is not available.
) at /usr/src/sys/kern/sched_ule.c:1860
#1  0xffffffff808ab51a in mi_switch (flags=260, newtd=0x0) at
/usr/src/sys/kern/kern_synch.c:466
#2  0xffffffff808e3e63 in sleepq_switch (wchan=0xfffffe06d9df2310, pri=96) at
/usr/src/sys/kern/subr_sleepqueue.c:538
#3  0xffffffff808e4a9d in sleepq_wait (wchan=0xfffffe06d9df2310, pri=96) at
/usr/src/sys/kern/subr_sleepqueue.c:617
#4  0xffffffff8088aebb in __lockmgr_args (lk=0xfffffe06d9df2310, flags=524544,
ilk=0xfffffe06d9df23d8, wmesg=Variable "wmesg" is not available.
) at /usr/src/sys/kern/kern_lock.c:214
#5  0xffffffff8092d349 in vop_stdlock (ap=Variable "ap" is not available.
) at lockmgr.h:97
#6  0xffffffff80bd62ab in VOP_LOCK1_APV (vop=0xffffffff8111cc80,
a=0xffffff90729ee6e0) at vnode_if.c:1988
#7  0xffffffff8094cfa7 in _vn_lock (vp=0xfffffe06d9df2278, flags=524288,
file=Variable "file" is not available.
) at vnode_if.h:859
#8  0xffffffff80942220 in vputx (vp=0xfffffe06d9df2278, func=1) at
/usr/src/sys/kern/vfs_subr.c:2279
#9  0xffffffff816e75a4 in zfs_rename_unlock (zlpp=0xffffff90729ee878) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:3501
#10 0xffffffff816e8df4 in zfs_freebsd_rename (ap=Variable "ap" is not available.
) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:3927
#11 0xffffffff80bd67cb in VOP_RENAME_APV (vop=0xffffffff8175b900,
a=0xffffff90729eeaa0) at vnode_if.c:1474
#12 0xffffffff80947844 in kern_renameat (td=Variable "td" is not available.
) at vnode_if.h:636
#13 0xffffffff80b4eff2 in amd64_syscall (td=0xfffffe01a9e56460, traced=0) at
subr_syscall.c:135
#14 0xffffffff80b39b97 in Xfast_syscall () at
/usr/src/sys/amd64/amd64/exception.S:387

fr 11
p *a
$1 = {a_gen = {a_desc = 0xffffffff811594c0}, a_fdvp = 0xfffffe05fb094278, a_fvp
= 0xfffffe04b7b62278, a_fcnp = 0xffffff90729eea58, a_tdvp = 0xfffffe0514137768,
a_tvp = 0x0, a_tcnp = 0xffffff90729ee9a8}

Thread 1158 (Thread 4174978):
#0  sched_switch (td=0xfffffe088cbef000, newtd=0xfffffe001c40e000,
flags=Variable "flags" is not available.
) at /usr/src/sys/kern/sched_ule.c:1860
#1  0xffffffff808ab51a in mi_switch (flags=260, newtd=0x0) at
/usr/src/sys/kern/kern_synch.c:466
#2  0xffffffff808e3e63 in sleepq_switch (wchan=0xfffffe0514137800, pri=96) at
/usr/src/sys/kern/subr_sleepqueue.c:538
#3  0xffffffff808e4a9d in sleepq_wait (wchan=0xfffffe0514137800, pri=96) at
/usr/src/sys/kern/subr_sleepqueue.c:617
#4  0xffffffff8088b4e0 in __lockmgr_args (lk=0xfffffe0514137800, flags=2097152,
ilk=0xfffffe05141378c8, wmesg=Variable "wmesg" is not available.
) at /usr/src/sys/kern/kern_lock.c:214
#5  0xffffffff8092d349 in vop_stdlock (ap=Variable "ap" is not available.
) at lockmgr.h:97
#6  0xffffffff80bd62ab in VOP_LOCK1_APV (vop=0xffffffff8111cc80,
a=0xffffff9072813470) at vnode_if.c:1988
#7  0xffffffff8094cfa7 in _vn_lock (vp=0xfffffe0514137768, flags=2097152,
file=Variable "file" is not available.
) at vnode_if.h:859
#8  0xffffffff816e5bdd in zfs_vnode_lock (vp=0xfffffe0514137768, flags=Variable
"flags" is not available.
) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1704
#9  0xffffffff816e6d70 in zfs_lookup (dvp=0xfffffe06d9df2278,
nm=0xffffff90728135b0 "toBeDeleted", vpp=0xffffff9072813930,
cnp=0xffffff9072813958, nameiop=0, cr=0xfffffe0ba89b0a00, td=0xfffffe088cbef000,
flags=0)
    at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:1433
#10 0xffffffff816e7511 in zfs_freebsd_lookup (ap=0xffffff9072813710) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:5758
#11 0xffffffff80bd5d3f in VOP_CACHEDLOOKUP_APV (vop=0xffffffff8175b900,
a=0xffffff9072813710) at vnode_if.c:187
#12 0xffffffff8092b103 in vfs_cache_lookup (ap=Variable "ap" is not available.
) at vnode_if.h:80
#13 0xffffffff80bd7187 in VOP_LOOKUP_APV (vop=0xffffffff8175b900,
a=0xffffff90728137d0) at vnode_if.c:123
#14 0xffffffff8093260a in lookup (ndp=0xffffff90728138f0) at vnode_if.h:54
#15 0xffffffff8093354e in namei (ndp=0xffffff90728138f0) at
/usr/src/sys/kern/vfs_lookup.c:297
#16 0xffffffff80944213 in kern_statat_vnhook (td=0xfffffe088cbef000,
flag=Variable "flag" is not available.
) at /usr/src/sys/kern/vfs_syscalls.c:2432
#17 0xffffffff809443b5 in kern_statat (td=Variable "td" is not available.
) at /usr/src/sys/kern/vfs_syscalls.c:2413
#18 0xffffffff8094455a in sys_stat (td=Variable "td" is not available.
) at /usr/src/sys/kern/vfs_syscalls.c:2374
#19 0xffffffff80b4eff2 in amd64_syscall (td=0xfffffe088cbef000, traced=0) at
subr_syscall.c:135
#20 0xffffffff80b39b97 in Xfast_syscall () at
/usr/src/sys/amd64/amd64/exception.S:387

As far as I understand the code, I think that zfs_rename_lock (called from
zfs_rename) iterates up ancestor chain of target directory (tdzp) and obtains a
reference on each of the ancestors via zfs_zget.
zfs_rename_unlock does the opposite - it iterates in the reverse order and
VN_RELE-s the ancestor znodes.
As you can see above, on FreeBSD VN_RELE translates to vputx, which internally
needs to obtain a vnode lock.

The problem seems to be is that VOP_RENAME -> zfs_freebsd_rename is called with
locked tdvp (and perhaps non-NULL and thus locked tvp).  tdvp's vnode lock is
released at the very end of zfs_freebsd_rename and so it is held over
zfs_rename_unlock.
And that means that vnode locks of tvp's ancestors can be acquired while tdvp's
vnode lock is held.
That violates the VFS lock ordering where a descendant's lock must always be
acquired after an ancestor's lock.  So that could lead to a deadlock with
another VFS operation that acquires locks in the proper order.

In the above snippet 0xfffffe06d9df2278 is a directory/ancestor of tdvp and
0xfffffe0514137768 is tdvp.  VOP_LOOKUP -> zfs_lookup acquires the locks in the
correct order (dvp is the ancestor while vp is the tdvp) while zfs_rename does
it in the opposite order.

A scenario to reproduce this bug could be like this.
mkdir a
mkdir a/b
mv some-file a/b/ (in parallel with) stat a/b
Of course it would have to be repeated many times to hit the right timing
window.  Also, namecache could interfere with this scenario, but I am not sure.

-- 
Andriy Gapon



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?51E679FD.3040306>