From owner-freebsd-stable@FreeBSD.ORG Tue Dec 16 17:07:50 2014 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 9E7013A3; Tue, 16 Dec 2014 17:07:50 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 58AF4D50; Tue, 16 Dec 2014 17:07:50 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 4A6ADB953; Tue, 16 Dec 2014 12:07:49 -0500 (EST) From: John Baldwin To: freebsd-stable@freebsd.org Subject: Re: Help debugging stable/10 Date: Tue, 16 Dec 2014 11:29:57 -0500 User-Agent: KMail/1.13.5 (FreeBSD/8.4-CBSD-20140415; KDE/4.5.5; amd64; ; ) References: <5488F58D.7060708@ShaneWare.Biz> In-Reply-To: <5488F58D.7060708@ShaneWare.Biz> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201412161129.57704.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 16 Dec 2014 12:07:49 -0500 (EST) Cc: Alexander Motin , hselasky@freebsd.org, mjg@freebsd.org, Shane Ambler X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Dec 2014 17:07:50 -0000 On Wednesday, December 10, 2014 8:38:21 pm Shane Ambler wrote: > Since upgrading to 10.1 (RC2) I have had trouble getting uptimes greater > than 1 day. I have little experience debugging the OS so could use some > help. > > # uname -a > FreeBSD leader.local 10.1-STABLE FreeBSD 10.1-STABLE #0 r275364: Tue Dec > 2 08:13:06 ACDT 2014 > root@leader.local:/usr/obj/usr/src-stable/sys/GENERIC amd64 > > This is on an ASUS P8H61-M LE/USB3 corei5 8GB with 3x 2TB Seagate drives > in raidz. > Full backtraces and dmesg at http://shaneware.biz/freebsddebugdata/ > > The thing that breaks which forces me to reset the machine is that I am > unable to start new processes. Existing processes continue to work I > just can't start new ones. Some simple commands do work but top ps > procstat usbconfig all fail to start. I have been able use script to > get some backtraces from kgdb before restarting. It looks like your processes are hanging on a global lock used by sysctl(3). That would explain hangs in top/ps/procstat as they all use sysctls. For example: Thread 709 (Thread 101172): #0 sched_switch (td=0xfffff8006cccd490, newtd=, flags=) at /usr/src-stable/sys/kern/sched_ule.c:1945 #1 0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493 #2 0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617 #3 0xffffffff8093284a in _sx_xlock_hard (sx=0xffffffff8157adc0, tid=18446735279441892496, opts=, file=0x0, line=0) at /usr/src-stable/sys/kern/kern_sx.c:681 #4 0xffffffff80936004 in userland_sysctl (td=, name=0xfffffe021e889930, namelen=3, old=, oldlenp=, inkernel=, new=, newlen=, retval=, flags=0) at sx.h:152 #5 0xffffffff80935e64 in sys___sysctl (td=0xfffff8006cccd490, uap=0xfffffe021e889a40) at /usr/src-stable/sys/kern/kern_sysctl.c:1536 #6 0xffffffff80d29c51 in amd64_syscall (td=0xfffff8006cccd490, traced=0) at subr_syscall.c:134 #7 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391 #8 0x0000000800b6e64a in ?? () Previous frame inner to this frame (corrupt stack?) ---Type to continue, or q to quit--- To hang like this means that some sysctl handler is running for a long time. Looking in your traces, this appears to be the problematic sysctl: Thread 397 (Thread 100422): #0 sched_switch (td=0xfffff8001049f920, newtd=, flags=) at /usr/src-stable/sys/kern/sched_ule.c:1945 #1 0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493 #2 0xffffffff8097189a in sleepq_timedwait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:652 #3 0xffffffff8093320e in _sleep (ident=, lock=, priority=, wmesg=, sbt=, pr=, flags=) at /usr/src-stable/sys/kern/kern_synch.c:251 #4 0xffffffff80891ae3 in g_waitfor_event (func=, arg=, flag=) at /usr/src-stable/sys/geom/geom_event.c:427 #5 0xffffffff808933d9 in sysctl_kern_geom_conftxt (oidp=, arg1=, arg2=, req=0xfffffe021dc13868) at /usr/src-stable/sys/geom/geom_kern.c:169 #6 0xffffffff80935a9e in sysctl_root (arg1=, arg2=) at /usr/src-stable/sys/kern/kern_sysctl.c:1500 #7 0xffffffff80936078 in userland_sysctl (td=, name=0xfffffe021dc13930, namelen=, old=, oldlenp=, inkernel=, new=, newlen=, retval=, flags=0) at /usr/src-stable/sys/kern/kern_sysctl.c:1610 #8 0xffffffff80935e64 in sys___sysctl (td=0xfffff8001049f920, uap=0xfffffe021dc13a40) at /usr/src-stable/sys/kern/kern_sysctl.c:1536 #9 0xffffffff80d29c51 in amd64_syscall (td=0xfffff8001049f920, traced=0) at subr_syscall.c:134 #10 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391 #11 0x00000008019c864a in ?? () Previous frame inner to this frame (corrupt stack?) (Note that HEAD has some changes by mjg@ (cc'd) to allow concurrent sysctls that would at least make this less painful.) The root issue is why is geom hanging. Hmm, that thread is also stuck on an sx lock: Thread 259 (Thread 100015): #0 sched_switch (td=0xfffff80005170000, newtd=, flags=) at /usr/src-stable/sys/kern/sched_ule.c:1945 #1 0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493 #2 0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617 #3 0xffffffff8093284a in _sx_xlock_hard (sx=0xffffffff816451e0, tid=18446735277701922816, opts=, file=0x0, line=0) at /usr/src-stable/sys/kern/kern_sx.c:681 #4 0xffffffff808912b2 in g_run_events () at sx.h:152 #5 0xffffffff808fad3a in fork_exit (callout=0xffffffff80893240 , arg=0x0, frame=0xfffffe01ef98aac0) at /usr/src-stable/sys/kern/kern_fork.c:996 #6 0xffffffff80d0ef3e in fork_trampoline () at /usr/src-stable/sys/amd64/amd64/exception.S:606 #7 0x0000000000000000 in ?? () That is probably g_topology_lock. If so, it is held by this thread (g_dev_open() grabs it around g_access()): Thread 673 (Thread 101936): #0 sched_switch (td=0xfffff80162200490, newtd=, flags=) at /usr/src-stable/sys/kern/sched_ule.c:1945 #1 0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493 #2 0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617 #3 0xffffffff80933227 in _sleep (ident=, lock=, priority=, wmesg=, sbt=, pr=, flags=) at /usr/src-stable/sys/kern/kern_synch.c:255 #4 0xffffffff8030b9b3 in daopen (dp=) at /usr/src-stable/sys/cam/scsi/scsi_da.c:1295 #5 0xffffffff8089013b in g_disk_access (pp=0xfffff800097f4c00, r=, w=, e=) at /usr/src-stable/sys/geom/geom_disk.c:136 #6 0xffffffff80894f3e in g_access (cp=0xfffff800097f2e00, dcr=1, dcw=0, dce=0) at /usr/src-stable/sys/geom/geom_subr.c:913 #7 0xffffffff8088df12 in g_dev_open (dev=, flags=, fmt=, td=) at /usr/src-stable/sys/geom/geom_dev.c:361 #8 0xffffffff808180f2 in devfs_open (ap=) at /usr/src-stable/sys/fs/devfs/devfs_vnops.c:1086 ---Type to continue, or q to quit--- #9 0xffffffff80e44fd1 in VOP_OPEN_APV (vop=, a=) at vnode_if.c:469 #10 0xffffffff809d9bb4 in vn_open_vnode (vp=0xfffff8017d14d000, fmode=1, cred=0xfffff8000c2e4b00, td=0xfffff80162200490, fp=0xfffff80081060230) at vnode_if.h:196 #11 0xffffffff809d97ac in vn_open_cred (ndp=0xfffffe021e97b880, flagp=0xfffffe021e97b95c, cmode=, vn_open_flags=, cred=0x0, fp=0xfffff80081060230) at /usr/src-stable/sys/kern/vfs_vnops.c:256 #12 0xffffffff809d2def in kern_openat (td=0xfffff80162200490, fd=-100, path=0x7fffffffe99e , pathseg=UIO_USERSPACE, flags=1, mode=) at /usr/src-stable/sys/kern/vfs_syscalls.c:1091 #13 0xffffffff80d29c51 in amd64_syscall (td=0xfffff80162200490, traced=0) at subr_syscall.c:134 #14 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391 #15 0x0000000800f8dcea in ?? () Previous frame inner to this frame (corrupt stack?) Line 1295 of scsi_da.c in a stable/10 checkout of today is this: /* Wait for the disk size update. */ error = cam_periph_sleep(periph, &softc->disk->d_mediasize, PRIBIO, "dareprobe", 0); To be stuck here means that the request dareprobe() queued is stuck behind some other request on the device. Looking in your traces, there are several threads all waiting for an ioctl() to a passX device to complete: Thread 432 (Thread 100542): #0 sched_switch (td=0xfffff801acb8a920, newtd=, flags=) at /usr/src-stable/sys/kern/sched_ule.c:1945 #1 0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493 #2 0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617 #3 0xffffffff80933227 in _sleep (ident=, lock=, priority=, wmesg=, sbt=, pr=, flags=) at /usr/src-stable/sys/kern/kern_synch.c:255 ---Type to continue, or q to quit--- #4 0xffffffff802ddc0e in cam_periph_runccb (ccb=0xfffff80187888000, error_routine=0xffffffff8030d2a0 , camflags=CAM_RETRY_SELTO, sense_flags=18, ds=0xfffff8000984fa20) at /usr/src-stable/sys/cam/cam_periph.c:969 #5 0xffffffff8030d233 in passdoioctl (dev=, cmd=, addr=0xfffff800b50f4800 "", flag=, td=) at /usr/src-stable/sys/cam/scsi/scsi_pass.c:680 #6 0xffffffff8030cf32 in passioctl (dev=0xfffff80009835e00, cmd=3302496258, addr=0xfffff800b50f4800 "", flag=3, td=0xfffff801acb8a920) at /usr/src-stable/sys/cam/scsi/scsi_pass.c:533 #7 0xffffffff80819a99 in devfs_ioctl_f (fp=0xfffff800b5d6f320, com=3302496258, data=0xfffff800b50f4800, cred=, td=0xfffff801acb8a920) at /usr/src-stable/sys/fs/devfs/devfs_vnops.c:759 #8 0xffffffff8097d865 in kern_ioctl (td=0xfffff801acb8a920, fd=, com=0) at file.h:320 #9 0xffffffff8097d560 in sys_ioctl (td=0xfffff801acb8a920, uap=0xfffffe021de9aa40) at /usr/src-stable/sys/kern/sys_generic.c:718 #10 0xffffffff80d29c51 in amd64_syscall (td=0xfffff801acb8a920, traced=0) at subr_syscall.c:134 #11 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391 #12 0x0000000800ff3f7a in ?? () Previous frame inner to this frame (corrupt stack?) I suspect the daopen() is hanging on the disk associated with the given passX device. Are you running smartd or anything else that accesses passX devices? Also, do you have a coredump from this or were you just running kgdb against the live system? If you have a coredump you can run ps against it (or use some kgdb macros I have) to see the command name of the process doing the passX ioctls. We could also see which passX devices were blocking to see if this is related to either AHCI or USB. Also, do you possibly have any errors in your dmesg from the disk devices? -- John Baldwin