From owner-freebsd-fs@FreeBSD.ORG Tue Oct 16 15:16:46 2012
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 55860A1C;
	Tue, 16 Oct 2012 15:16:46 +0000 (UTC)
	(envelope-from dg17@penx.com)
Received: from btw.pki2.com (btw.pki2.com [IPv6:2001:470:a:6fd::2])
	by mx1.freebsd.org (Postfix) with ESMTP id F11078FC0A;
	Tue, 16 Oct 2012 15:16:44 +0000 (UTC)
Received: from [127.0.0.1] (localhost [127.0.0.1])
	by btw.pki2.com (8.14.5/8.14.5) with ESMTP id q9GFGbt3041969;
	Tue, 16 Oct 2012 08:16:37 -0700 (PDT)
	(envelope-from dg17@penx.com)
Subject: Re: I have a DDB session open to a crashed ZFS server
From: Dennis Glatting
To: John Baldwin
Cc: freebsd-fs@freebsd.org
Reply-To: dg17@penx.com
In-Reply-To: <201210160844.41042.jhb@freebsd.org>
References: <1350317019.71982.50.camel@btw.pki2.com>
	<201210160844.41042.jhb@freebsd.org>
Date: Tue, 16 Oct 2012 08:16:37 -0700
Message-ID: <1350400597.72003.32.camel@btw.pki2.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Mailer: Evolution 2.32.1 FreeBSD GNOME Team Port
List-Id: Filesystems
X-List-Received-Date: Tue, 16 Oct 2012 15:16:46 -0000

On Tue, 2012-10-16 at 08:44 -0400, John Baldwin wrote:
> On Monday, October 15, 2012 12:03:39 pm Dennis Glatting wrote:
> > FreeBSD/amd64 (mc) (ttyu0)
> >
> > login: NMI ... going to debugger
> > [ thread pid 11 tid 100003 ]
>
> You got an NMI, not a crash. What happens if you just continue
> ('c' command) from DDB?
>

I hit the NMI button to get into DDB because of the "crash" -- which
was a poor choice of words on my part.
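For reference, the NMI-to-DDB behavior John mentions is controlled by
the machdep.kdb_on_nmi sysctl. A minimal sketch of checking and
disabling it, assuming a FreeBSD root shell (the /etc/sysctl.conf line
is the usual way to persist it, shown here as an illustration):

```shell
# Show whether an NMI currently drops the kernel into DDB (1 = yes).
sysctl machdep.kdb_on_nmi

# Disable it on the running kernel if spurious NMIs are suspected.
sysctl machdep.kdb_on_nmi=0

# Persist the setting across reboots.
echo 'machdep.kdb_on_nmi=0' >> /etc/sysctl.conf
```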
The problem I am having is with ZFS: under load, the file systems go
dormant within 24 hours. Specifically, the processes are still alive,
stuck in disk wait, and there is no disk I/O. I have had this problem
across four machines for months. The network and console still work,
but if I enter a command that requires a read from disk, nothing comes
back.

> I have heard of machines sending spurious NMIs in the past. If that
> is what you are seeing, there is a sysctl to disable dropping into
> DDB due to an NMI:
>
> machdep.kdb_on_nmi: 1
>
> If you keep getting NMIs, try setting that to 0.
>

The DDB session is still open, but I don't see why this system is
stuck. I have been looking at locked processes (two below), but I'm
not familiar with the code. Maybe a deadlock? Maybe a missed
interrupt? Maybe an unsupported controller? I dunno.

0xfffffe0b989803f0:
    tag zfs, type VDIR
    usecount 0, writecount 0, refcount 2 mountedhere 0
    flags (VI_DOINGINACT|VI(0x200))
    v_object 0xfffffe0b87af7488 ref 0 pages 0
    lock type zfs: EXCL by thread 0xfffffe09b48fe900 (pid 70646)

db> show thread 0xfffffe09b48fe900
Thread 104609 at 0xfffffe09b48fe900:
 proc (pid 70646): 0xfffffe0af7e79940
 name: find
 stack: 0xffffffa3329b2000-0xffffffa3329b5fff
 flags: 0x4  pflags: 0
 state: INHIBITED: {SLEEPING}
 wmesg: tx->tx_quiesce_done_cv  wchan: 0xfffffe0059b60240
 priority: 120
 container lock: sleepq chain (0xffffffff8126d498)

db> sh proc 70646
Process 70646 (find) at 0xfffffe0af7e79940:
 state: NORMAL
 uid: 0  gids: 0, 5
 parent: pid 70645 at 0xfffffe006e4974a0
 ABI: FreeBSD ELF64
 arguments: find
 threads: 1
 104609  D  tx->tx_q  0xfffffe0059b60240  find

db> tr 104609
Tracing pid 70646 tid 104609 td 0xfffffe09b48fe900
sched_switch() at sched_switch+0x28b
mi_switch() at mi_switch+0xdf
sleepq_wait() at sleepq_wait+0x3a
_cv_wait() at _cv_wait+0x164
txg_wait_open() at txg_wait_open+0x85
dmu_tx_assign() at dmu_tx_assign+0x38
zfs_inactive() at zfs_inactive+0x8e
zfs_freebsd_inactive() at zfs_freebsd_inactive+0xd
VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0x5d
vinactive() at vinactive+0xef
vputx() at vputx+0x244
sys_fchdir() at sys_fchdir+0x3f0
amd64_syscall() at amd64_syscall+0x334
Xfast_syscall() at Xfast_syscall+0xfb
--- syscall (13, FreeBSD ELF64, sys_fchdir), rip = 0x80088396c, rsp = 0x7fffffffd998, rbp = 0x7fffffffda40 ---

0xfffffe005c8855e8:
    tag syncer, type VNON
    usecount 1, writecount 0, refcount 2 mountedhere 0
    flags (VI(0x200))
    lock type syncer: EXCL by thread 0xfffffe0039147480 (pid 20)

db> sh thread 0xfffffe0039147480
Thread 100242 at 0xfffffe0039147480:
 proc (pid 20): 0xfffffe003f4bc940
 name: syncer
 stack: 0xffffffa2fc73b000-0xffffffa2fc73efff
 flags: 0x4  pflags: 0x240800
 state: INHIBITED: {SLEEPING}
 wmesg: zio->io_cv  wchan: 0xfffffe01187c5320
 priority: 116
 container lock: sleepq chain (0xffffffff8126e140)

db> sh proc 20
Process 20 (syncer) at 0xfffffe003f4bc940:
 state: NORMAL
 uid: 0  gids: 0
 parent: pid 0 at 0xffffffff812cbdd8
 ABI: null
 threads: 1
 100242  D  zio->io_  0xfffffe01187c5320  [syncer]

db> tr 100242
Tracing pid 20 tid 100242 td 0xfffffe0039147480
sched_switch() at sched_switch+0x28b
mi_switch() at mi_switch+0xdf
sleepq_wait() at sleepq_wait+0x3a
_cv_wait() at _cv_wait+0x164
zio_wait() at zio_wait+0x5b
zil_commit() at zil_commit+0x833
zfs_sync() at zfs_sync+0xaa
sync_fsync() at sync_fsync+0x168
VOP_FSYNC_APV() at VOP_FSYNC_APV+0x5d
sync_vnode() at sync_vnode+0x1b0
sched_sync() at sched_sync+0x29f
fork_exit() at fork_exit+0x9a
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffffa2fc73eb30, rbp = 0 ---