From owner-freebsd-fs@FreeBSD.ORG Tue Oct 16 15:16:46 2012
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 55860A1C;
	Tue, 16 Oct 2012 15:16:46 +0000 (UTC)
	(envelope-from dg17@penx.com)
Received: from btw.pki2.com (btw.pki2.com [IPv6:2001:470:a:6fd::2])
	by mx1.freebsd.org (Postfix) with ESMTP id F11078FC0A;
	Tue, 16 Oct 2012 15:16:44 +0000 (UTC)
Received: from [127.0.0.1] (localhost [127.0.0.1])
	by btw.pki2.com (8.14.5/8.14.5) with ESMTP id q9GFGbt3041969;
	Tue, 16 Oct 2012 08:16:37 -0700 (PDT)
	(envelope-from dg17@penx.com)
Subject: Re: I have a DDB session open to a crashed ZFS server
From: Dennis Glatting
To: John Baldwin
Cc: freebsd-fs@freebsd.org
Reply-To: dg17@penx.com
In-Reply-To: <201210160844.41042.jhb@freebsd.org>
References: <1350317019.71982.50.camel@btw.pki2.com>
	<201210160844.41042.jhb@freebsd.org>
Date: Tue, 16 Oct 2012 08:16:37 -0700
Message-ID: <1350400597.72003.32.camel@btw.pki2.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Mailer: Evolution 2.32.1 FreeBSD GNOME Team Port
List-Id: Filesystems
X-List-Received-Date: Tue, 16 Oct 2012 15:16:46 -0000

On Tue, 2012-10-16 at 08:44 -0400, John Baldwin wrote:
> On Monday, October 15, 2012 12:03:39 pm Dennis Glatting wrote:
> > FreeBSD/amd64 (mc) (ttyu0)
> >
> > login: NMI ... going to debugger
> > [ thread pid 11 tid 100003 ]
>
> You got an NMI, not a crash. What happens if you just continue
> ('c' command) from DDB?
>

I hit the NMI button to get into DDB because of the "crash" -- which
was a poor choice of words on my part.
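For reference, the NMI-to-DDB behavior John mentions is controlled by
the machdep.kdb_on_nmi sysctl. A minimal sketch of checking and
disabling it, assuming a FreeBSD root shell (the /etc/sysctl.conf line
is the usual way to persist it, shown here as an illustration):

```shell
# Show whether an NMI currently drops the kernel into DDB (1 = yes).
sysctl machdep.kdb_on_nmi

# Disable it on the running kernel if spurious NMIs are suspected.
sysctl machdep.kdb_on_nmi=0

# Persist the setting across reboots.
echo 'machdep.kdb_on_nmi=0' >> /etc/sysctl.conf
```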
The problem I am having is with ZFS: under load, the file systems go
dormant within 24 hours. Specifically, the processes are still alive,
stuck in disk wait, and there is no disk I/O. I have had this problem
across four machines for months. The network and console still work,
but if I enter a command that requires a read from disk, nothing comes
back.

> I have heard of machines sending spurious NMIs in the past. If that
> is what you are seeing, there is a sysctl to disable dropping into
> DDB due to an NMI:
>
> machdep.kdb_on_nmi: 1
>
> If you keep getting NMIs, try setting that to 0.
>

The DDB session is still open, but I don't see why this system is
stuck. I have been looking at locked processes (two below), but I'm
not familiar with the code. Maybe a deadlock? Maybe a missed
interrupt? Maybe an unsupported controller? I dunno.

0xfffffe0b989803f0:
    tag zfs, type VDIR
    usecount 0, writecount 0, refcount 2 mountedhere 0
    flags (VI_DOINGINACT|VI(0x200))
    v_object 0xfffffe0b87af7488 ref 0 pages 0
    lock type zfs: EXCL by thread 0xfffffe09b48fe900 (pid 70646)

db> show thread 0xfffffe09b48fe900
Thread 104609 at 0xfffffe09b48fe900:
 proc (pid 70646): 0xfffffe0af7e79940
 name: find
 stack: 0xffffffa3329b2000-0xffffffa3329b5fff
 flags: 0x4  pflags: 0
 state: INHIBITED: {SLEEPING}
 wmesg: tx->tx_quiesce_done_cv  wchan: 0xfffffe0059b60240
 priority: 120
 container lock: sleepq chain (0xffffffff8126d498)

db> sh proc 70646
Process 70646 (find) at 0xfffffe0af7e79940:
 state: NORMAL
 uid: 0  gids: 0, 5
 parent: pid 70645 at 0xfffffe006e4974a0
 ABI: FreeBSD ELF64
 arguments: find
 threads: 1
 104609  D  tx->tx_q  0xfffffe0059b60240  find

db> tr 104609
Tracing pid 70646 tid 104609 td 0xfffffe09b48fe900
sched_switch() at sched_switch+0x28b
mi_switch() at mi_switch+0xdf
sleepq_wait() at sleepq_wait+0x3a
_cv_wait() at _cv_wait+0x164
txg_wait_open() at txg_wait_open+0x85
dmu_tx_assign() at dmu_tx_assign+0x38
zfs_inactive() at zfs_inactive+0x8e
zfs_freebsd_inactive() at zfs_freebsd_inactive+0xd
VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0x5d
vinactive() at vinactive+0xef
vputx() at vputx+0x244
sys_fchdir() at sys_fchdir+0x3f0
amd64_syscall() at amd64_syscall+0x334
Xfast_syscall() at Xfast_syscall+0xfb
--- syscall (13, FreeBSD ELF64, sys_fchdir), rip = 0x80088396c, rsp = 0x7fffffffd998, rbp = 0x7fffffffda40 ---

0xfffffe005c8855e8:
    tag syncer, type VNON
    usecount 1, writecount 0, refcount 2 mountedhere 0
    flags (VI(0x200))
    lock type syncer: EXCL by thread 0xfffffe0039147480 (pid 20)

db> sh thread 0xfffffe0039147480
Thread 100242 at 0xfffffe0039147480:
 proc (pid 20): 0xfffffe003f4bc940
 name: syncer
 stack: 0xffffffa2fc73b000-0xffffffa2fc73efff
 flags: 0x4  pflags: 0x240800
 state: INHIBITED: {SLEEPING}
 wmesg: zio->io_cv  wchan: 0xfffffe01187c5320
 priority: 116
 container lock: sleepq chain (0xffffffff8126e140)

db> sh proc 20
Process 20 (syncer) at 0xfffffe003f4bc940:
 state: NORMAL
 uid: 0  gids: 0
 parent: pid 0 at 0xffffffff812cbdd8
 ABI: null
 threads: 1
 100242  D  zio->io_  0xfffffe01187c5320  [syncer]

db> tr 100242
Tracing pid 20 tid 100242 td 0xfffffe0039147480
sched_switch() at sched_switch+0x28b
mi_switch() at mi_switch+0xdf
sleepq_wait() at sleepq_wait+0x3a
_cv_wait() at _cv_wait+0x164
zio_wait() at zio_wait+0x5b
zil_commit() at zil_commit+0x833
zfs_sync() at zfs_sync+0xaa
sync_fsync() at sync_fsync+0x168
VOP_FSYNC_APV() at VOP_FSYNC_APV+0x5d
sync_vnode() at sync_vnode+0x1b0
sched_sync() at sched_sync+0x29f
fork_exit() at fork_exit+0x9a
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffffa2fc73eb30, rbp = 0 ---