Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 16 Dec 2014 11:29:57 -0500
From:      John Baldwin <jhb@freebsd.org>
To:        freebsd-stable@freebsd.org
Cc:        Alexander Motin <mav@freebsd.org>, hselasky@freebsd.org, mjg@freebsd.org, Shane Ambler <FreeBSD@shaneware.biz>
Subject:   Re: Help debugging stable/10
Message-ID:  <201412161129.57704.jhb@freebsd.org>
In-Reply-To: <5488F58D.7060708@ShaneWare.Biz>
References:  <5488F58D.7060708@ShaneWare.Biz>

next in thread | previous in thread | raw e-mail | index | archive | help
On Wednesday, December 10, 2014 8:38:21 pm Shane Ambler wrote:
> Since upgrading to 10.1 (RC2) I have had trouble getting uptimes greater
> than 1 day. I have little experience debugging the OS so could use some
> help.
> 
> # uname -a
> FreeBSD leader.local 10.1-STABLE FreeBSD 10.1-STABLE #0 r275364: Tue Dec
> 2 08:13:06 ACDT 2014 
> root@leader.local:/usr/obj/usr/src-stable/sys/GENERIC amd64
> 
> This is on an ASUS P8H61-M LE/USB3 corei5 8GB with 3x 2TB Seagate drives
> in raidz.
> Full backtraces and dmesg at http://shaneware.biz/freebsddebugdata/
> 
> The thing that breaks which forces me to reset the machine is that I am
> unable to start new processes. Existing processes continue to work I
> just can't start new ones. Some simple commands do work but top ps
> procstat usbconfig all fail to start. I have been able use script to
> get some backtraces from kgdb before restarting.

It looks like your processes are hanging on a global lock used by sysctl(3).
That would explain hangs in top/ps/procstat as they all use sysctls.  For
example:

Thread 709 (Thread 101172):
#0  sched_switch (td=0xfffff8006cccd490, newtd=<value optimized out>, flags=<value optimized out>)
    at /usr/src-stable/sys/kern/sched_ule.c:1945
#1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493
#2  0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617
#3  0xffffffff8093284a in _sx_xlock_hard (sx=0xffffffff8157adc0, tid=18446735279441892496, 
    opts=<value optimized out>, file=0x0, line=0) at /usr/src-stable/sys/kern/kern_sx.c:681
#4  0xffffffff80936004 in userland_sysctl (td=<value optimized out>, name=0xfffffe021e889930, namelen=3, 
    old=<value optimized out>, oldlenp=<value optimized out>, inkernel=<value optimized out>, 
    new=<value optimized out>, newlen=<value optimized out>, retval=<value optimized out>, flags=0) at sx.h:152
#5  0xffffffff80935e64 in sys___sysctl (td=0xfffff8006cccd490, uap=0xfffffe021e889a40)
    at /usr/src-stable/sys/kern/kern_sysctl.c:1536
#6  0xffffffff80d29c51 in amd64_syscall (td=0xfffff8006cccd490, traced=0) at subr_syscall.c:134
#7  0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391
#8  0x0000000800b6e64a in ?? ()
Previous frame inner to this frame (corrupt stack?)
---Type <return> to continue, or q <return> to quit---

To hang like this means that some sysctl handler is running for a long time.  

Looking in your traces, this appears to be the problematic sysctl:

Thread 397 (Thread 100422):
#0  sched_switch (td=0xfffff8001049f920, newtd=<value optimized out>, flags=<value optimized out>)
    at /usr/src-stable/sys/kern/sched_ule.c:1945
#1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493
#2  0xffffffff8097189a in sleepq_timedwait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:652
#3  0xffffffff8093320e in _sleep (ident=<value optimized out>, lock=<value optimized out>, 
    priority=<value optimized out>, wmesg=<value optimized out>, sbt=<value optimized out>, pr=<value optimized out>, 
    flags=<value optimized out>) at /usr/src-stable/sys/kern/kern_synch.c:251
#4  0xffffffff80891ae3 in g_waitfor_event (func=<value optimized out>, arg=<value optimized out>, 
    flag=<value optimized out>) at /usr/src-stable/sys/geom/geom_event.c:427
#5  0xffffffff808933d9 in sysctl_kern_geom_conftxt (oidp=<value optimized out>, arg1=<value optimized out>, 
    arg2=<value optimized out>, req=0xfffffe021dc13868) at /usr/src-stable/sys/geom/geom_kern.c:169
#6  0xffffffff80935a9e in sysctl_root (arg1=<value optimized out>, arg2=<value optimized out>)
    at /usr/src-stable/sys/kern/kern_sysctl.c:1500
#7  0xffffffff80936078 in userland_sysctl (td=<value optimized out>, name=0xfffffe021dc13930, 
    namelen=<value optimized out>, old=<value optimized out>, oldlenp=<value optimized out>, 
    inkernel=<value optimized out>, new=<value optimized out>, newlen=<value optimized out>, 
    retval=<value optimized out>, flags=0) at /usr/src-stable/sys/kern/kern_sysctl.c:1610
#8  0xffffffff80935e64 in sys___sysctl (td=0xfffff8001049f920, uap=0xfffffe021dc13a40)
    at /usr/src-stable/sys/kern/kern_sysctl.c:1536
#9  0xffffffff80d29c51 in amd64_syscall (td=0xfffff8001049f920, traced=0) at subr_syscall.c:134
#10 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391
#11 0x00000008019c864a in ?? ()
Previous frame inner to this frame (corrupt stack?)

(Note that HEAD has some changes by mjg@ (cc'd) to allow concurrent sysctls
that would at least make this less painful.)

The root issue is why is geom hanging.  Hmm, that thread is also stuck on an
sx lock:

Thread 259 (Thread 100015):
#0  sched_switch (td=0xfffff80005170000, newtd=<value optimized out>, flags=<value optimized out>)
    at /usr/src-stable/sys/kern/sched_ule.c:1945
#1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493
#2  0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617
#3  0xffffffff8093284a in _sx_xlock_hard (sx=0xffffffff816451e0, tid=18446735277701922816, 
    opts=<value optimized out>, file=0x0, line=0) at /usr/src-stable/sys/kern/kern_sx.c:681
#4  0xffffffff808912b2 in g_run_events () at sx.h:152
#5  0xffffffff808fad3a in fork_exit (callout=0xffffffff80893240 <g_event_procbody>, arg=0x0, frame=0xfffffe01ef98aac0)
    at /usr/src-stable/sys/kern/kern_fork.c:996
#6  0xffffffff80d0ef3e in fork_trampoline () at /usr/src-stable/sys/amd64/amd64/exception.S:606
#7  0x0000000000000000 in ?? ()

That is probably g_topology_lock.  If so, it is held by this thread
(g_dev_open() grabs it around g_access()):

Thread 673 (Thread 101936):
#0  sched_switch (td=0xfffff80162200490, newtd=<value optimized out>, flags=<value optimized out>)
    at /usr/src-stable/sys/kern/sched_ule.c:1945
#1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493
#2  0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617
#3  0xffffffff80933227 in _sleep (ident=<value optimized out>, lock=<value optimized out>, 
    priority=<value optimized out>, wmesg=<value optimized out>, sbt=<value optimized out>, pr=<value optimized out>, 
    flags=<value optimized out>) at /usr/src-stable/sys/kern/kern_synch.c:255
#4  0xffffffff8030b9b3 in daopen (dp=<value optimized out>) at /usr/src-stable/sys/cam/scsi/scsi_da.c:1295
#5  0xffffffff8089013b in g_disk_access (pp=0xfffff800097f4c00, r=<value optimized out>, w=<value optimized out>, 
    e=<value optimized out>) at /usr/src-stable/sys/geom/geom_disk.c:136
#6  0xffffffff80894f3e in g_access (cp=0xfffff800097f2e00, dcr=1, dcw=0, dce=0)
    at /usr/src-stable/sys/geom/geom_subr.c:913
#7  0xffffffff8088df12 in g_dev_open (dev=<value optimized out>, flags=<value optimized out>, 
    fmt=<value optimized out>, td=<value optimized out>) at /usr/src-stable/sys/geom/geom_dev.c:361
#8  0xffffffff808180f2 in devfs_open (ap=<value optimized out>) at /usr/src-stable/sys/fs/devfs/devfs_vnops.c:1086
---Type <return> to continue, or q <return> to quit---
#9  0xffffffff80e44fd1 in VOP_OPEN_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:469
#10 0xffffffff809d9bb4 in vn_open_vnode (vp=0xfffff8017d14d000, fmode=1, cred=0xfffff8000c2e4b00, 
    td=0xfffff80162200490, fp=0xfffff80081060230) at vnode_if.h:196
#11 0xffffffff809d97ac in vn_open_cred (ndp=0xfffffe021e97b880, flagp=0xfffffe021e97b95c, 
    cmode=<value optimized out>, vn_open_flags=<value optimized out>, cred=0x0, fp=0xfffff80081060230)
    at /usr/src-stable/sys/kern/vfs_vnops.c:256
#12 0xffffffff809d2def in kern_openat (td=0xfffff80162200490, fd=-100, 
    path=0x7fffffffe99e <Error reading address 0x7fffffffe99e: Bad address>, pathseg=UIO_USERSPACE, flags=1, 
    mode=<value optimized out>) at /usr/src-stable/sys/kern/vfs_syscalls.c:1091
#13 0xffffffff80d29c51 in amd64_syscall (td=0xfffff80162200490, traced=0) at subr_syscall.c:134
#14 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391
#15 0x0000000800f8dcea in ?? ()
Previous frame inner to this frame (corrupt stack?)

Line 1295 of scsi_da.c in a stable/10 checkout of today is this:

        /* Wait for the disk size update.  */
        error = cam_periph_sleep(periph, &softc->disk->d_mediasize, PRIBIO,
            "dareprobe", 0);

To be stuck here means that the request dareprobe() queued is stuck behind
some other request on the device.  Looking in your traces, there are several
threads all waiting for an ioctl() to a passX device to complete:

Thread 432 (Thread 100542):
#0  sched_switch (td=0xfffff801acb8a920, newtd=<value optimized out>, flags=<value optimized out>)
    at /usr/src-stable/sys/kern/sched_ule.c:1945
#1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493
#2  0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617
#3  0xffffffff80933227 in _sleep (ident=<value optimized out>, lock=<value optimized out>, 
    priority=<value optimized out>, wmesg=<value optimized out>, sbt=<value optimized out>, pr=<value optimized out>, 
    flags=<value optimized out>) at /usr/src-stable/sys/kern/kern_synch.c:255
---Type <return> to continue, or q <return> to quit---
#4  0xffffffff802ddc0e in cam_periph_runccb (ccb=0xfffff80187888000, error_routine=0xffffffff8030d2a0 <passerror>, 
    camflags=CAM_RETRY_SELTO, sense_flags=18, ds=0xfffff8000984fa20) at /usr/src-stable/sys/cam/cam_periph.c:969
#5  0xffffffff8030d233 in passdoioctl (dev=<value optimized out>, cmd=<value optimized out>, 
    addr=0xfffff800b50f4800 "", flag=<value optimized out>, td=<value optimized out>)
    at /usr/src-stable/sys/cam/scsi/scsi_pass.c:680
#6  0xffffffff8030cf32 in passioctl (dev=0xfffff80009835e00, cmd=3302496258, addr=0xfffff800b50f4800 "", flag=3, 
    td=0xfffff801acb8a920) at /usr/src-stable/sys/cam/scsi/scsi_pass.c:533
#7  0xffffffff80819a99 in devfs_ioctl_f (fp=0xfffff800b5d6f320, com=3302496258, data=0xfffff800b50f4800, 
    cred=<value optimized out>, td=0xfffff801acb8a920) at /usr/src-stable/sys/fs/devfs/devfs_vnops.c:759
#8  0xffffffff8097d865 in kern_ioctl (td=0xfffff801acb8a920, fd=<value optimized out>, com=0) at file.h:320
#9  0xffffffff8097d560 in sys_ioctl (td=0xfffff801acb8a920, uap=0xfffffe021de9aa40)
    at /usr/src-stable/sys/kern/sys_generic.c:718
#10 0xffffffff80d29c51 in amd64_syscall (td=0xfffff801acb8a920, traced=0) at subr_syscall.c:134
#11 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391
#12 0x0000000800ff3f7a in ?? ()
Previous frame inner to this frame (corrupt stack?)

I suspect the daopen() is hanging on the disk associated with the given
passX device.

Are you running smartd or anything else that accesses passX devices?

Also, do you have a coredump from this or were you just running kgdb against
the live system?  If you have a coredump you can run ps against it (or use
some kgdb macros I have) to see the command name of the process doing the
passX ioctls.  We could also see which passX devices were blocking to see if
this is related to either AHCI or USB.  Also, do you possibly have any errors
in your dmesg from the disk devices?

-- 
John Baldwin



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201412161129.57704.jhb>