From owner-freebsd-stable@FreeBSD.ORG Fri Aug 23 13:43:59 2013 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 238EC8DC for ; Fri, 23 Aug 2013 13:43:59 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id AEA612813 for ; Fri, 23 Aug 2013 13:43:58 +0000 (UTC) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AqMEAHtmF1KDaFve/2dsb2JhbABagzxRgxu8Y4E0dIIkAQEBAwEBAQEgKyALBRYYAgINGQIpAQkmBggHBAEIFASHaQYMpEySOoEpjgSBBTQHgmiBLgOVHoN1kC2DOyAygQQ5 X-IronPort-AV: E=Sophos;i="4.89,941,1367985600"; d="scan'208";a="46233941" Received: from muskoka.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.222]) by esa-annu.net.uoguelph.ca with ESMTP; 23 Aug 2013 09:43:57 -0400 Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 74986B4032; Fri, 23 Aug 2013 09:43:57 -0400 (EDT) Date: Fri, 23 Aug 2013 09:43:57 -0400 (EDT) From: Rick Macklem To: J David Message-ID: <158996730.12825374.1377265437471.JavaMail.root@uoguelph.ca> In-Reply-To: Subject: Re: NFS deadlock on 9.2-Beta1 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [172.17.91.201] X-Mailer: Zimbra 7.2.1_GA_2790 (ZimbraWebClient - FF3.0 (Win)/7.2.1_GA_2790) Cc: freebsd-stable X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 23 Aug 2013 13:43:59 -0000 J. David wrote: > One deadlocked process cropped up overnight, but I managed to panic > the box before getting too much debugging info. :( > > The process was in state T instead of D, which I guess must be a side > effect of some of the debugging code compiled in. > > Here are the details I was able to capture: > > db> show proc 7692 > Process 7692 (httpd) at 0xfffffe0158793000: > state: NORMAL > uid: 25000 gids: 25000 > parent: pid 1 at 0xfffffe00039c3950 > ABI: FreeBSD ELF64 > arguments: /nfsn/apps/tapache22/bin/httpd > threads: 3 > 100674 D newnfs 0xfffffe021cdd9848 httpd > 100597 D pgrbwt 0xfffffe02fda788b8 httpd > 100910 s httpd > > db> show thread 100674 > Thread 100674 at 0xfffffe0108c79480: > proc (pid 7692): 0xfffffe0158793000 > name: httpd > stack: 0xffffff834c80f000-0xffffff834c812fff > flags: 0x2a804 pflags: 0 > state: INHIBITED: {SLEEPING} > wmesg: newnfs wchan: 0xfffffe021cdd9848 > priority: 96 > container lock: sleepq chain (0xffffffff813c03c8) > > db> tr 100674 > Tracing pid 7692 tid 100674 td 0xfffffe0108c79480 > sched_switch() at sched_switch+0x234/frame 0xffffff834c812360 > mi_switch() at mi_switch+0x15c/frame 0xffffff834c8123a0 > sleepq_switch() at sleepq_switch+0x17d/frame 0xffffff834c8123e0 > sleepq_wait() at sleepq_wait+0x43/frame 0xffffff834c812410 > sleeplk() at sleeplk+0x11a/frame 0xffffff834c812460 > __lockmgr_args() at __lockmgr_args+0x9a9/frame 0xffffff834c812580 > nfs_lock1() at nfs_lock1+0x87/frame 0xffffff834c8125b0 > VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xffffff834c8125e0 > _vn_lock() at _vn_lock+0x63/frame 0xffffff834c812640 > ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame > 0xffffff834c812670 > ncl_bioread() at ncl_bioread+0x195/frame 0xffffff834c8127e0 > VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xffffff834c812810 > vn_rdwr() at vn_rdwr+0x2bc/frame 0xffffff834c8128d0 > kern_sendfile() at kern_sendfile+0xa90/frame 0xffffff834c812ac0 > do_sendfile() at do_sendfile+0x92/frame 0xffffff834c812b20 > amd64_syscall() at amd64_syscall+0x259/frame 0xffffff834c812c30 > Xfast_syscall() at Xfast_syscall+0xfb/frame 0xffffff834c812c30 > --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c, > rsp > = 0x7ffffe9f43c8, rbp = 0x7ffffe9f4700 --- > > db> show lockchain 100674 > thread 100674 (pid 7692, httpd) inhibited > > db> show thread 100597 > Thread 100597 at 0xfffffe021c976000: > proc (pid 7692): 0xfffffe0158793000 > name: httpd > stack: 0xffffff834c80a000-0xffffff834c80dfff > flags: 0x28804 pflags: 0 > state: INHIBITED: {SLEEPING} > wmesg: pgrbwt wchan: 0xfffffe02fda788b8 > priority: 84 > container lock: sleepq chain (0xffffffff813c0148) > > db> tr 100597 > Tracing pid 7692 tid 100597 td 0xfffffe021c976000 > sched_switch() at sched_switch+0x234/frame 0xffffff834c80d750 > mi_switch() at mi_switch+0x15c/frame 0xffffff834c80d790 > sleepq_switch() at sleepq_switch+0x17d/frame 0xffffff834c80d7d0 > sleepq_wait() at sleepq_wait+0x43/frame 0xffffff834c80d800 > _sleep() at _sleep+0x30f/frame 0xffffff834c80d890 > vm_page_grab() at vm_page_grab+0x120/frame 0xffffff834c80d8d0 > kern_sendfile() at kern_sendfile+0x992/frame 0xffffff834c80dac0 > do_sendfile() at do_sendfile+0x92/frame 0xffffff834c80db20 > amd64_syscall() at amd64_syscall+0x259/frame 0xffffff834c80dc30 > Xfast_syscall() at Xfast_syscall+0xfb/frame 0xffffff834c80dc30 > --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c, > rsp > = 0x7ffffebf53c8, rbp = 0x7ffffebf5700 --- > > db> show lockchain 100597 > thread 100597 (pid 7692, httpd) inhibited > > The "inhibited" is not something I'm familiar with and didn't match > the example output; I thought that maybe the T state was overpowering > the locks, and that maybe I should continue the system and then -CONT > the process. However, a few seconds after I issued "c" at the DDB > prompt, the system panicked in the console driver ("mtx_lock_spin: > recursed on non-recursive mutex cnputs_mtx @ > /usr/src/sys/kern/kern_cons.c:500"), so I guess that's not a thing to > do. :( > > Sorry my stupidity and ignorance is dragging this out. :( This is > all > well outside my comfort zone, but next time I'll get it for sure. > No problem. Thanks for trying to capture this stuff. Unfortunately, what you have above doesn't tell me anything more about the problem. The main question to me is "Why is the thread stuck in "pgrbwt" permanently?". To figure this out, we need the info on all threads on the system. In particular, the status (the output of "ps axHl" would be a start, before going into the debugger) of the "nfsiod" threads might point to the cause, although it may involve other threads as well. If you are running a serial console, just start "script" and then type the commands "ps axHl" followed by going into the debugger and doing the commands here: (basically everything with "all"): http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html (I don't know why your console crashed. Hopefully you can get this done before the crash the next time.) If you do get it all, please include kib@freebsd.org in the cc list as well as me. Thanks for trying to do this, rick > Thanks! > _______________________________________________ > freebsd-stable@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-stable > To unsubscribe, send any mail to > "freebsd-stable-unsubscribe@freebsd.org" >