From owner-freebsd-bugs@FreeBSD.ORG Thu Mar 9 07:20:04 2006
Message-Id: <200603090715.k297Fhff035879@wattres.watt.com>
Date: Wed, 8 Mar 2006 23:15:43 -0800 (PST)
From: Steve Watt
To: FreeBSD-gnats-submit@FreeBSD.org
Subject: kern/94261: [ufs2] deadlock between ffs_mksnap and msync

>Number:         94261
>Category:       kern
>Synopsis:       [ufs2] deadlock between ffs_mksnap and msync
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Mar 09 07:20:03 GMT 2006
>Closed-Date:
>Last-Modified:
>Originator:     Steve Watt
>Release:        FreeBSD 5.5-PRERELEASE i386
>Organization:   Watt Consultants
>Environment:
System: FreeBSD wattres.Watt.COM 5.5-PRERELEASE FreeBSD 5.5-PRERELEASE #11:
Wed Mar 1 10:54:09 PST 2006
root@wattres.Watt.COM:/usr/obj/usr/src/sys/WATTRES i386
>Description:
Per a thread on -hackers back in May '05 titled "snapshots and innds", I'm
finally getting around (sorry) to filing the PR.

I got a kernel core dump that I can do useful stuff with. The system was
mostly operating normally, except that any attempt to access the /news
partition (which has articles, tradspool.map, overviews, and
incoming/outgoing data) would get stuck in "suspfs". So I forced a dump
from ddb. The mount point does (as one would expect) have MNTK_SUSPEND set.
I see mksnap_ffs sitting waiting for "ufs" (really vnode 0xc19af318), which
it got to via:

(kgdb) info stack
#0  sched_switch (td=0xc1ede780, newtd=0xc146f480, flags=1)
    at /usr/src/sys/kern/sched_4bsd.c:882
#1  0xc0662ad0 in mi_switch (flags=1, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:355
#2  0xc067a9e4 in sleepq_switch (wchan=0x0)
    at /usr/src/sys/kern/subr_sleepqueue.c:406
#3  0xc067ab9e in sleepq_wait (wchan=0x0)
    at /usr/src/sys/kern/subr_sleepqueue.c:518
#4  0xc06627b6 in msleep (ident=0xc19af3c4, mtx=0xc095e4cc, priority=80,
    wmesg=0xc08a3f13 "ufs", timo=0) at /usr/src/sys/kern/kern_synch.c:228
#5  0xc06505d6 in acquire (lkpp=0xd02df680, extflags=16777280, wanted=1536)
    at /usr/src/sys/kern/kern_lock.c:161
#6  0xc0650a14 in lockmgr (lkp=0xc19af3c4, flags=16842754, interlkp=0x0,
    td=0xc1ede780) at /usr/src/sys/kern/kern_lock.c:389
#7  0xc07bd6e3 in ufs_lock (ap=0xd02df6bc)
    at /usr/src/sys/ufs/ufs/ufs_vnops.c:2007
#8  0xc07be380 in ufs_vnoperate (ap=0x0)
    at /usr/src/sys/ufs/ufs/ufs_vnops.c:2828
#9  0xc06c0501 in vn_lock (vp=0xc19af318, flags=65538, td=0xc1ede780)
    at vnode_if.h:1013
#10 0xc06b4195 in vget (vp=0xc19af318, flags=65538, td=0x0)
    at /usr/src/sys/kern/vfs_subr.c:2028
#11 0xc07af408 in ffs_sync (mp=0xc15e5c00, waitfor=1, cred=0xc2953080,
    td=0xc1ede780) at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1151
#12 0xc06c0840 in vfs_write_suspend (mp=0xc15e5c00)
    at /usr/src/sys/kern/vfs_vnops.c:1084
#13 0xc079db18 in ffs_snapshot (mp=0xc15e5c00, snapfile=0xbfbfef1b
    ) at /usr/src/sys/ufs/ffs/ffs_snapshot.c:317
#14 0xc07ad5d8 in ffs_omount (mp=0xc15e5c00, path=0xc2a8c380 "/news",
    data=0x0, td=0xc1ede780) at /usr/src/sys/ufs/ffs/ffs_vfsops.c:313
#15 0xc06af787 in vfs_domount (td=0xc1ede780, fstype=0xc1eea730 "ffs",
    fspath=0xc2a8c380 "/news", fsflags=18944000, fsdata=0xbfbfe7d4, compat=1)
    at /usr/src/sys/kern/vfs_mount.c:861
#16 0xc06aef16 in mount (td=0x0, uap=0xd02dfd04)
    at /usr/src/sys/kern/vfs_mount.c:620
#17 0xc0828553 in syscall (frame=
[ snip ]

And inn is sitting waiting for the suspended filesystem:

(kgdb) info stack
#0  sched_switch (td=0xc1c16c00, newtd=0xc1ede780, flags=1)
    at /usr/src/sys/kern/sched_4bsd.c:882
#1  0xc0662ad0 in mi_switch (flags=1, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:355
#2  0xc067a9e4 in sleepq_switch (wchan=0x0)
    at /usr/src/sys/kern/subr_sleepqueue.c:406
#3  0xc067ab9e in sleepq_wait (wchan=0x0)
    at /usr/src/sys/kern/subr_sleepqueue.c:518
#4  0xc06627b6 in msleep (ident=0xc15e5c6c, mtx=0x0, priority=159,
    wmesg=0xc08a5c44 "suspfs", timo=0) at /usr/src/sys/kern/kern_synch.c:228
#5  0xc06c06b8 in vn_start_write (vp=0x0, mpp=0xc15e5c00, flags=1)
    at /usr/src/sys/kern/vfs_vnops.c:995
#6  0xc07daeb4 in vnode_pager_putpages (object=0xc1990b58, m=0x0, count=4096,
    sync=12, rtvals=0x0) at /usr/src/sys/vm/vnode_pager.c:969
#7  0xc07d2f42 in vm_pageout_flush (mc=0xd022bac0, count=1, flags=12)
    at vm_pager.h:147
#8  0xc07cd84b in vm_object_page_collect_flush (object=0xc1990b58,
    p=0xc10ddc40, curgeneration=1, pagerflags=12)
    at /usr/src/sys/vm/vm_object.c:958
#9  0xc07cd2a3 in vm_object_page_clean (object=0xc1990b58, start=628,
    end=Unhandled dwarf expression opcode 0x93
    ) at /usr/src/sys/vm/vm_object.c:761
#10 0xc07cda77 in vm_object_sync (object=0xc1990b58, offset=2572288,
    size=4096, syncio=0, invalidate=0) at /usr/src/sys/vm/vm_object.c:1026
#11 0xc07c9267 in vm_map_sync (map=0xc147a384, start=676794368,
    end=676798464, syncio=0, invalidate=0) at /usr/src/sys/vm/vm_map.c:2149
#12 0xc07cb4e1 in msync (td=0x0, uap=0x0) at /usr/src/sys/vm/vm_mmap.c:465
#13 0xc0828553 in syscall (frame=
[ snip ]

(Interesting kgdb buglet in frame #9 of that trace, by the way.)

Based on a bit more snooping, it's a basic deadlock: the sync() is trying to
lock the vnode at 0xc19af318, that vnode is already locked by innd, and innd
is waiting for the suspended filesystem, so it won't release the vnode.
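In other words, innd just happens to be the process that closes the loop
here: anything that msync()s a dirty mapping on this filesystem while a
snapshot is being created should be able to trigger the same cycle. For
anyone with a scratch box, something along these lines ought to provoke it
(a rough, untested sketch; the /news/msync-test path and the sizes are
arbitrary):

/*
 * Illustrative only, and untested: mmap a file on the filesystem being
 * snapshotted and hammer it with synchronous msync() calls.
 */
#include <sys/types.h>
#include <sys/mman.h>

#include <err.h>
#include <fcntl.h>
#include <unistd.h>

#define	MAPLEN	(1024 * 1024)

int
main(void)
{
	char *p;
	size_t off;
	int fd;

	if ((fd = open("/news/msync-test", O_RDWR | O_CREAT, 0644)) == -1)
		err(1, "open");
	if (ftruncate(fd, MAPLEN) == -1)
		err(1, "ftruncate");
	p = mmap(NULL, MAPLEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	for (;;) {
		/* Dirty every page, then push the pages out synchronously. */
		for (off = 0; off < MAPLEN; off += 4096)
			p[off]++;
		if (msync(p, MAPLEN, MS_SYNC) == -1)
			err(1, "msync");
	}
	/* NOTREACHED */
}

Run that against the filesystem being snapshotted while something like
"mksnap_ffs /news /news/.snap/snaptest" is run (and the snapshot removed
again) in a loop from another shell.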
[chasing down owner of the vnode]

(kgdb) print vp
$7 = (struct vnode *) 0xc19af318
(kgdb) print *vp
$8 = {v_interlock = {mtx_object = {lo_class = 0xc08f945c,
      lo_name = 0xc08a5792 "vnode interlock",
      lo_type = 0xc08a5792 "vnode interlock", lo_flags = 196608,
      lo_list = {tqe_next = 0xc19af210, tqe_prev = 0xc19af430},
      lo_witness = 0x0}, mtx_lock = 4, mtx_recurse = 0},
  v_iflag = 1536, v_usecount = 3, v_numoutput = 0, v_vxthread = 0x0,
  v_holdcnt = 4, v_cleanblkhd = {tqh_first = 0x0, tqh_last = 0xc19af350},
  v_cleanblkroot = 0x0, v_cleanbufcnt = 0,
  v_dirtyblkhd = {tqh_first = 0xc5b36570, tqh_last = 0xc5bb2634},
  v_dirtyblkroot = 0xc5b36570, v_dirtybufcnt = 4, v_vflag = 8,
  v_writecount = 1, v_object = 0xc1990b58, v_lastw = 152, v_cstart = 152,
  v_lasta = 2275200, v_clen = 7,
  v_un = {vu_mountedhere = 0x0, vu_socket = 0x0,
    vu_spec = {vu_cdev = 0x0, vu_specnext = {sle_next = 0x0}},
    vu_fifoinfo = 0x0},
  v_freelist = {tqe_next = 0x0, tqe_prev = 0x0},
  v_nmntvnodes = {tqe_next = 0xc1a67420, tqe_prev = 0xc19af4b0},
  v_synclist = {le_next = 0xc1901528, le_prev = 0xc19fd6c8},
  v_type = VREG, v_tag = 0xc08a3f13 "ufs", v_data = 0xc199108c,
  v_lock = {lk_interlock = 0xc095e4cc, lk_flags = 18875456,
    lk_sharecount = 0, lk_waitcount = 1, lk_exclusivecount = 1,
    lk_prio = 80, lk_wmesg = 0xc08a3f13 "ufs", lk_timo = 6,
    lk_lockholder = 0xc1c16c00, lk_newlock = 0x0},
  v_vnlock = 0xc19af3c4, v_op = 0xc160c500, v_mount = 0xc15e5c00,
  v_cache_src = {lh_first = 0x0},
  v_cache_dst = {tqh_first = 0xc1986e9c, tqh_last = 0xc1986eac},
  v_id = 1413, v_dd = 0xc19af318, v_ddid = 0, v_pollinfo = 0x0,
  v_label = 0x0, v_cachedfs = 1041, v_cachedid = 141320, v_bsize = 16384}
(kgdb) print vp->v_lock.lk_lockholder
$9 = (struct thread *) 0xc1c16c00
(kgdb) print *vp->v_lock.lk_lockholder
$10 = {td_proc = 0xc1c4e1c4, td_ksegrp = 0xc1a1d5b0,
  td_plist = {tqe_next = 0x0, tqe_prev = 0xc1c4e1d4},
  td_kglist = {tqe_next = 0x0, tqe_prev = 0xc1a1d5bc},
  td_slpq = {tqe_next = 0xc2b55c00, tqe_prev = 0xc18b1b00},
  td_lockq = {tqe_next = 0x0, tqe_prev = 0x0},
  td_runq = {tqe_next = 0x0, tqe_prev = 0xc08fb5ac},
  td_selq = {tqh_first = 0x0, tqh_last = 0xc1c16c30},
  td_sleepqueue = 0x0, td_turnstile = 0xc1824a80, td_tid = 100188,
  td_flags = 0, td_inhibitors = 2, td_pflags = 0, td_dupfd = 0,
  td_wchan = 0xc15e5c6c, td_wmesg = 0xc08a5c44 "suspfs",
  td_lastcpu = 0 '\0', td_oncpu = 255 'ÿ', td_locks = 5360,
  td_blocked = 0x0, td_ithd = 0x0, td_lockname = 0x0,
  td_contested = {lh_first = 0x0}, td_sleeplocks = 0x0,
  td_intr_nesting_level = 0, td_pinned = 0, td_mailbox = 0x0,
  td_ucred = 0xc241cb00, td_standin = 0x0, td_prticks = 0,
  td_upcall = 0x0, td_sticks = 42112, td_uuticks = 0, td_usticks = 0,
  td_intrval = 0, td_oldsigmask = {__bits = {0, 0, 0, 0}},
  td_sigmask = {__bits = {0, 0, 0, 0}}, td_siglist = {__bits = {0, 0, 0, 0}},
  td_umtx = {tqe_next = 0x0, tqe_prev = 0x0}, td_generation = 2863987,
  td_sigstk = {ss_sp = 0x0, ss_size = 0, ss_flags = 4}, td_kflags = 0,
  td_xsig = 0, td_profil_addr = 0, td_profil_ticks = 0,
  td_base_pri = 159 '\237', td_priority = 159 '\237', td_pcb = 0xd022bd90,
  td_state = TDS_INHIBITED, td_retval = {0, 1},
  td_slpcallout = {c_links = {sle = {sle_next = 0x0},
      tqe = {tqe_next = 0x0, tqe_prev = 0xc5ad0718}}, c_time = 112874447,
    c_arg = 0xc1c16c00, c_func = 0, c_flags = 8},
  td_frame = 0xd022bd38, td_kstack_obj = 0xc1b0d108,
  td_kstack = 3491930112, td_kstack_pages = 2, td_altkstack_obj = 0x0,
  td_altkstack = 0, td_altkstack_pages = 0, td_critnest = 1,
  td_md = {md_savecrit = 582}, td_sched = 0xc1c16d50}
(kgdb) print *vp->v_lock.lk_lockholder->td_proc
$11 =
{p_list = {le_next = 0xc1c1354c, le_prev = 0xc195e8d4},
  p_ksegrps = {tqh_first = 0xc1a1d5b0, tqh_last = 0xc1a1d5b4},
  p_threads = {tqh_first = 0xc1c16c00, tqh_last = 0xc1c16c08},
  p_suspended = {tqh_first = 0x0, tqh_last = 0xc1c4e1dc},
  p_ucred = 0xc241cb00, p_fd = 0xc1fe3300, p_fdtol = 0x0,
  p_stats = 0xc1fd9500, p_limit = 0xc1fd7700, p_unused1 = 0x0,
  p_sigacts = 0xc1a85000, p_flag = 1, p_sflag = 1, p_state = PRS_NORMAL,
  p_pid = 97648, p_hash = {le_next = 0x0, le_prev = 0xc2cd3ca8},
  p_pglist = {le_next = 0xc1c4e54c, le_prev = 0xc2a00c48},
  p_pptr = 0xc146ee20, p_sibling = {le_next = 0xc1c1354c, le_prev = 0xc195e934},
  p_children = {lh_first = 0xc1c4e54c},
  p_mtx = {mtx_object = {lo_class = 0xc08f945c,
      lo_name = 0xc089e844 "process lock", lo_type = 0xc089e844 "process lock",
      lo_flags = 4390912, lo_list = {tqe_next = 0xc1c4e3f4,
        tqe_prev = 0xc1c4e07c}, lo_witness = 0x0}, mtx_lock = 4,
    mtx_recurse = 0},
  p_oppid = 0, p_vmspace = 0xc147a384, p_swtime = 10816,
  p_realtimer = {it_interval = {tv_sec = 0, tv_usec = 0},
    it_value = {tv_sec = 0, tv_usec = 0}},
  p_runtime = {sec = 457, frac = 14633034536159281822},
  p_uu = 128520517, p_su = 329272740, p_iu = 1, p_uticks = 16437,
  p_sticks = 42112, p_iticks = 0, p_profthreads = 0, p_maxthrwaits = 0,
  p_traceflag = 0, p_tracevp = 0x0, p_tracecred = 0x0, p_textvp = 0xc19aa210,
  p_siglist = {__bits = {0, 0, 0, 0}}, p_lock = 0 '\0',
  p_sigiolst = {slh_first = 0x0}, p_sigparent = 20, p_sig = 0, p_code = 0,
  p_stops = 0, p_stype = 0, p_step = 0 '\0', p_pfsflags = 0 '\0',
  p_nlminfo = 0x0, p_aioinfo = 0x0, p_singlethread = 0x0, p_suspcount = 0,
  p_xthread = 0x0, p_boundary_count = 0, p_magic = 3203398350,
  p_comm = "innd\000tart\000ic\000\000\000\000\000\000\000",
  p_pgrp = 0xc2a00c40, p_sysent = 0xc093a320, p_args = 0xc2a09380,
  p_cpulimit = 9223372036854775807, p_nice = 0 '\0', p_xstat = 0,
  p_klist = {kl_lock = 0xc1c4e230, kl_list = {slh_first = 0x0}},
  p_numthreads = 1, p_numksegrps = 1, p_md = {md_ldt = 0xc1c569a0},
  p_itcallout = {c_links = {sle = {sle_next = 0x0},
      tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, c_arg = 0x0,
    c_func = 0, c_flags = 8},
  p_unused2 = 0x0, p_acflag = 1, p_ru = 0x0, p_peers = 0x0,
  p_leader = 0xc1c4e1c4, p_emuldata = 0x0, p_label = 0x0,
  p_sched = 0xc1c4e388}

OK, so why is inn holding a locked vnode while it's sleeping for the
suspend? That's pretty clear: there's a vn_lock() of vp, which happens to
be the vnode in question (0xc19af318), in vm_object_sync(), line 1022.

Now for the really hard question: fixes. The obvious (but not so elegant)
fix is for the snapshot creation code to run under Giant, which would hold
vm_object_sync() out at the correct place. But that's *adding* Giant, which
seems like a bad idea. Also, it seems like this bug should be easy to
reproduce by having one process do a lot of msync() while another creates
snapshots. I don't have a box that's scratch enough to try running that on,
though.

>How-To-Repeat:
This is somewhat difficult to repeat. Running innd on a system that gets
backed up daily causes the deadlock roughly once every few weeks.

>Fix:
Kirk wrote:

Excellent detective work on your part. The invariant that is being broken
here is that you are never supposed to hold a vnode locked when you call
vn_start_write. The call to vn_start_write should be done in vm_object_sync
before acquiring the vnode lock rather than later in vnode_pager_putpages.
Of course, moving the vn_start_write out of vnode_pager_putpages means that
we have to track down every other caller of vnode_pager_putpages to make
sure that they have also done the vn_start_write call.

Jeff Roberson has come up with a much cleaner way of dealing with the
suspension code that I believe he is using in the -current tree. It puts a
hook in the ufs_lock code that tracks the number of locks held in each
filesystem. To do a suspend, it blocks all new lock requests on that
filesystem by any thread that does not already hold a lock and waits for
all the existing locks to be released. This obviates the need for the
vn_start_write calls sprinkled all through the system.

I have copied Jeff on this email so that he can comment further on this
issue, as he is much more up to speed on it at the moment than I am.

Then I wrote:

I'm trying to decide which is the simpler hack-around -- do the Giant
thing, or pull the lock up a level and grep. Opinions?

To which Kirk replied:

I would definitely advocate the "pull the lock up a level and grep" option,
as that is the "correct" solution. If for whatever reason Jeff decides to
stick with the vn_start_write approach, it will be a solved problem rather
than one that needs fixing. Plus, we really don't want to regress on Giant
if we can help it.

Naturally, I haven't found the time to get my pull-up working, thanks to a
bad combination of job changes and the minor detail that the vnode code is
a bit ... shall we say ... complex.

>Release-Note:
>Audit-Trail:
>Unformatted:
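An aside on the suspension scheme Kirk describes above: the core of it is a
per-filesystem counting gate in the lock path. Here is a rough userland
sketch of just that counting idea -- the names (fs_gate, gate_lock,
gate_unlock, gate_suspend, gate_resume) are invented for the sketch, and
this is not the actual ufs_lock or -current code:

/*
 * Toy model of a per-filesystem suspension gate: lock requests from
 * threads holding no locks on the filesystem wait while a suspend is
 * in progress, and suspend waits for the outstanding count to drain.
 */
#include <pthread.h>

struct fs_gate {
	pthread_mutex_t	mtx;
	pthread_cond_t	cv;
	int		nlocks;		/* locks currently held on this fs */
	int		suspending;	/* a suspend is in progress */
};

#define	FS_GATE_INITIALIZER \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

/*
 * Called on the way into a vnode lock.  "already_holds" says whether the
 * calling thread already holds some lock on the same filesystem; if so it
 * is let through, otherwise it waits out any suspend in progress.
 */
static void
gate_lock(struct fs_gate *g, int already_holds)
{
	pthread_mutex_lock(&g->mtx);
	if (!already_holds)
		while (g->suspending)
			pthread_cond_wait(&g->cv, &g->mtx);
	g->nlocks++;
	pthread_mutex_unlock(&g->mtx);
}

/* Called when a vnode lock is released. */
static void
gate_unlock(struct fs_gate *g)
{
	pthread_mutex_lock(&g->mtx);
	if (--g->nlocks == 0)
		pthread_cond_broadcast(&g->cv);	/* wake a waiting suspend */
	pthread_mutex_unlock(&g->mtx);
}

/* Begin a suspend: stop new lockers, then wait for existing locks to drain. */
static void
gate_suspend(struct fs_gate *g)
{
	pthread_mutex_lock(&g->mtx);
	g->suspending = 1;
	while (g->nlocks > 0)
		pthread_cond_wait(&g->cv, &g->mtx);
	pthread_mutex_unlock(&g->mtx);
}

/* End the suspend and let blocked lockers through again. */
static void
gate_resume(struct fs_gate *g)
{
	pthread_mutex_lock(&g->mtx);
	g->suspending = 0;
	pthread_cond_broadcast(&g->cv);
	pthread_mutex_unlock(&g->mtx);
}

int
main(void)
{
	static struct fs_gate g = FS_GATE_INITIALIZER;

	gate_lock(&g, 0);	/* ordinary locker, holds nothing yet */
	gate_unlock(&g);
	gate_suspend(&g);	/* snapshot code: close gate, drain locks */
	gate_resume(&g);
	return (0);
}

Once every vnode lock acquisition on a filesystem funnels through a gate
like this, a suspend can simply close the gate and wait for the count to
drain, instead of depending on vn_start_write calls being sprinkled at
every write entry point.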