From: Andriy Gapon <avg@FreeBSD.org>
Date: Mon, 24 Nov 2014 23:12:33 +0200
To: sbruno@FreeBSD.org, freebsd-current@FreeBSD.org, freebsd-fs
Subject: zfs locking vs vnode locking [Was: zfs/vfs lockups, via poudriere]
Message-ID: <54739F41.8030407@FreeBSD.org>
In-Reply-To: <1416761846.1186.0.camel@bruno>
References: <1416684021.7423.77.camel@bruno> <547109A2.9010506@FreeBSD.org> <1416761846.1186.0.camel@bruno>

On 23/11/2014 18:57, Sean Bruno wrote:
> 31071 100995 rm - mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x18d __lockmgr_args+0x9ab vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 zfs_lookup+0x45d zfs_freebsd_lookup+0x6d VOP_CACHEDLOOKUP_APV+0xa1 vfs_cache_lookup+0xd6 VOP_LOOKUP_APV+0xa1 lookup+0x5a1 namei+0x534 kern_rmdirat+0x8d amd64_syscall+0x3fb Xfast_syscall+0xfb
> 31075 100693 mv - mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x18d __lockmgr_args+0xd5d vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 vputx+0x28a zfs_rename_unlock+0x3e zfs_freebsd_rename+0xe39 VOP_RENAME_APV+0xab kern_renameat+0x4a6 amd64_syscall+0x3fb Xfast_syscall+0xfb

The stack traces alone are not sufficient to analyze the problem; the relevant vnodes and vnode locks would have to be examined as well. But I believe that I have seen reports about this kind of problem before, and I think that I understand what is going on. As part of my job I tried to develop a fix [*] for this problem and got some positive feedback on it. The fix is not just a few lines of changes, though; it is a lot of modifications to a lot of files. Besides, my changes alter quite a bit more code than the bare minimum required to fix the problem, although even that minimum would be a sizable amount of changes.

So, right now I would like to describe the problem as I understand it. Some general information about the FreeBSD VFS and its differences from the Solaris VFS [**] can be useful, but is not really required.

I'll try to explain by example. If we look at any mature and "native" FreeBSD filesystem with read-write support - ffs, maybe tmpfs - then we can make the following observations. In most of the vnode operation implementations there are no calls to vnode locking functions. E.g. for an operation like vop_remove the two vnodes in question are already locked at the VFS layer.
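To make that concrete, here is a hypothetical, heavily simplified sketch of what such a "native" remove VOP looks like. The examplefs names are made up and this is not actual ffs or tmpfs code; the point is only that the function contains no vnode locking calls at all, because the VFS layer has already locked both vnodes before calling it:

    #include <sys/param.h>
    #include <sys/vnode.h>

    /*
     * Sketch of a "native" remove VOP: both a_dvp (the directory) and
     * a_vp (the file being removed) arrive exclusively locked, courtesy
     * of the VFS layer, so no vn_lock()/VOP_UNLOCK() calls appear here.
     */
    static int
    examplefs_remove(struct vop_remove_args *ap)
    {
            struct vnode *dvp = ap->a_dvp;  /* locked by the VFS */
            struct vnode *vp = ap->a_vp;    /* locked by the VFS */

            ASSERT_VOP_ELOCKED(dvp, "examplefs_remove");
            ASSERT_VOP_ELOCKED(vp, "examplefs_remove");

            /*
             * Purely filesystem-internal work; examplefs_dirremove() is
             * a made-up helper that removes the directory entry.
             */
            return (examplefs_dirremove(dvp, vp, ap->a_cnp));
    }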
In some cases VOPs do locking, but it is very trivial: e.g. in vop_create the newly created vnode must be returned locked.

Naturally, if we look at the VFS code we see a lot of vnode locking for various purposes: locking the vnodes for a vop_remove call, or locking vnodes during their life cycle management so that, for example, a vnode is not destroyed while there is an ongoing operation on it. We can also see locking in the VFS namei / lookup implementation, where we need to hold onto a directory vnode while looking up a child vnode by name.

But there are two vnode operation implementations where we can see a non-trivial vnode locking "dance": vop_rename and vop_lookup. Anyone is welcome to take a cursory look at the first hundred or so lines of ufs_rename(); a very rough sketch of that kind of locking dance is appended at the end of this message.

The point of the above observations is that both the VFS and a filesystem driver do vnode locking, and thus both the VFS and the driver must cooperate by using the same locking protocol.

Now, if we look at the ZFS ZPL code, and most prominently at zfs_rename(), we see that there is quite a bit of locking going on there (e.g. see zfs_rename_lock), but the locks in question are all ZFS-internal locks. We do not see the vnode locks. From this comes a suspicion, or even a conclusion, that ZFS currently does not use the vnode locking protocol that is expected from any filesystem driver.

There is a weird form of redundancy between the fine-grained ZFS locks that got ported over and the FreeBSD vnode locks. In some cases the ZFS locks are always uncontested, because the vnode locks held at the VFS level ensure serialized access. In other cases there is no protection at all, because one thread is in VFS code, which uses the vnode locks, while another thread is in ZFS code, which uses the ZFS locks, so there is no real synchronization between those threads.

My solution to this problem was to completely eliminate (at least) the following ZFS locks:

    kmutex_t        z_lock;         /* znode modification lock */
    krwlock_t       z_parent_lock;  /* parent lock for directories */
    krwlock_t       z_name_lock;    /* "master" lock for dirent locks */
    zfs_dirlock_t   *z_dirlocks;    /* directory entry lock list */

and to ensure that the proper vnode locking protocol is followed. That required substantial changes to the rename and lookup code.

Finally, this is not really a suggestion to test or discuss my changes, but rather a call to discuss the problem and other possible ways to fix it. I do not preclude any options, including making changes to our VFS (and thus to all the filesystems) :-)

[*] https://github.com/avg-I/freebsd/compare/wip/hc/zfs-fbsd-vfs
[**] https://clusterhq.com/blog/complexity-freebsd-vfs-using-zfs-example-part-1-2/
     https://clusterhq.com/blog/complexity-freebsd-vfs-using-zfs-example-part-2/

--
Andriy Gapon
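P.S. For illustration only, here is a heavily simplified, hypothetical sketch of the kind of vnode locking "dance" I mean for rename: lock the first vnode blocking, try-lock the remaining ones, and on contention back everything out and start over. The example_rename_lock_all() name is made up; error handling is omitted and, importantly, so is the re-lookup of the vnodes after the locks have been dropped, which a real implementation such as ufs_rename() has to perform because the directories may have changed in the meantime:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/vnode.h>

    /*
     * Hypothetical helper: take exclusive locks on up to four vnodes
     * involved in a rename without risking a deadlock against another
     * thread that locks them in a different order.
     */
    static void
    example_rename_lock_all(struct vnode *fdvp, struct vnode *fvp,
        struct vnode *tdvp, struct vnode *tvp)
    {
            struct vnode *vps[4] = { fdvp, fvp, tdvp, tvp };
            int locked[4];
            int i, j;

    restart:
            for (i = 0; i < 4; i++)
                    locked[i] = 0;
            for (i = 0; i < 4; i++) {
                    if (vps[i] == NULL)
                            continue;       /* e.g. no target vnode */
                    /* Skip vnodes we already hold (e.g. fdvp == tdvp). */
                    for (j = 0; j < i; j++)
                            if (locked[j] && vps[j] == vps[i])
                                    break;
                    if (j < i)
                            continue;
                    if (i == 0) {
                            /* The first lock may be taken blocking. */
                            vn_lock(vps[i], LK_EXCLUSIVE | LK_RETRY);
                    } else if (vn_lock(vps[i], LK_EXCLUSIVE | LK_NOWAIT) != 0) {
                            /*
                             * Contention: back out and retry instead of
                             * sleeping while other locks are held.
                             */
                            for (j = 0; j < i; j++)
                                    if (locked[j])
                                            VOP_UNLOCK(vps[j], 0);
                            goto restart;
                    }
                    locked[i] = 1;
            }
    }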