From: Brandon Gooch
To: d@delphij.net
Cc: Konstantin Belousov, freebsd-current@freebsd.org
Date: Fri, 18 Jan 2013 23:58:14 -0600
Subject: Re: sysctl -a causes kernel trap 12
In-Reply-To: <50F9B70A.5040305@delphij.net>
References: <50EB602F.9050300@delphij.net> <20130108000233.GZ82219@kib.kiev.ua> <50EB63A9.50903@delphij.net> <50EB870D.3020306@delphij.net> <50EF3FEC.60605@delphij.net> <50F9B70A.5040305@delphij.net>
List-Id: Discussions about the use of FreeBSD-current

On Fri, Jan 18, 2013 at 2:56 PM, Xin Li wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
>
> On 01/18/13 12:50, Brandon Gooch wrote:
> > On Thu, Jan 10, 2013 at 4:25 PM, Xin Li
> > wrote:
> >
> > -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256
> >
> > To all: this became more and more hard to replicate lately. I've
> > tried these options and the most important progress is that it's
> > possible to get a crashdump when debug.debugger_on_panic=0 and I
> > managed to get a backtrace which indicates the panic occurs when
> > trying to do mtx_lock(&Giant) -> __mtx_lock_sleep -> turnstile_wait
> > -> propagate_priority, but after I've added some instruments to
> > the surrounding code and enabled INVARIANTS and/or WITNESS, it
> > mysteriously went away.
> >
> > Reverting my instruments code and update to latest svn makes the
> > issue disappear for one day. I've hit it again today but
> > unfortunately didn't get a successful dump and after reboot I can't
> > reproduce it again :(
> >
> > Still trying...
> >
> >
> > Any updates Xin?
>
> No, it mysteriously disappeared for now. According to my
> understanding to recent svn commits, I didn't see anybody committing
> something that fixes it but I can no longer panic my system, with or
> without debugging code :(
>
> > I was actually hitting what I believe to be exactly the same issue
> > as you on one of my systems, and, as you've seen, adding any extra
> > debugging or diagnostics seemed to eliminate the issue.
> >
> > I was able to generate quite a few vmcores and still have these
> > sitting around in my filesystem (along with the kernels that helped
> > produce them).
> >
> > I can recreate this crash on my system by compiling the NVIDIA
> > driver with clang at -O1 and above. Although it's been noted that
> > this issue has been seen in scenarios without an NVIDIA driver in
> > the mix, whatever is happening in the kernel to cause the panic is
> > somehow triggered by this, at least on my system.
>
> I'm not sure if this is the same problem. Could you please try using
> gcc to compile the nVIdia driver and see if that "fixes" the problem?
>
> Cheers,
> - --
> Xin LI https://www.delphij.net/
> FreeBSD - The Power to Serve! Live free or die
>

Indeed, a gcc-compiled NVIDIA module eliminates the issue; sorry if I
hadn't mentioned this earlier.

At first, my system would just hang while booting. I was able to figure
out that the hang occurred during /etc/rc.d/initrandom. I actually got to
the point of removing the call to sysctl -a from 'better_than_nothing()'
in /etc/rc.d/initrandom just to have a booting system.

I finally got a panic instead of a hang by adding SW_WATCHDOG to my
kernel and running watchdogd(8).

For me, this panic would come and go seemingly at random as well, and I
couldn't fumble my way around in the debugger to learn much of anything
when I first started seeing it. So I started modularizing everything I
could in my kernel config, then loading modules one by one and rebooting
over and over until I finally found what appeared to be the problem: the
NVIDIA module compiled with clang.

Oh, another thing: at times it seemed as though the number of loaded
modules mattered, as I could get the hang with 41 modules loaded, but not
with 40 or 42?! I admit that when I was seeing that behavior, I hadn't
yet eliminated the NVIDIA driver from my loaded modules. I need to
revisit the panic situation to confirm this particular strangeness.
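
The one-at-a-time search described above amounts to something like the
following Python sketch. It is only an illustration: first_bad_module and
the boots_ok callback are hypothetical names, and boots_ok stands in for
the manual cycle of rebooting with a given set of modules loaded and
checking whether sysctl -a completes. It also assumes the failure depends
only on which modules are loaded, which the 41-module oddity suggests may
not hold.

```python
from typing import Callable, Optional, Sequence


def first_bad_module(
    modules: Sequence[str],
    boots_ok: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Return the first module whose addition makes the system fail.

    Modules are added cumulatively in the given order. `boots_ok` is a
    hypothetical stand-in for "reboot with exactly these modules loaded
    and check that `sysctl -a` completes"; it is not a real API.
    """
    loaded = []
    for module in modules:
        loaded.append(module)
        # Test the cumulative set, not the module in isolation.
        if not boots_ok(tuple(loaded)):
            return module
    return None  # every cumulative set passed the check


if __name__ == "__main__":
    # Simulated run: pretend any set containing "nvidia" fails.
    # The module list is illustrative, not taken from the actual system.
    candidates = ["if_em", "snd_hda", "nvidia", "zfs"]
    culprit = first_bad_module(candidates, lambda mods: "nvidia" not in mods)
    print(culprit)
```

A linear walk is used here rather than a bisection because a bisection
assumes that every superset of a failing module set also fails, and a
hang that shows up with 41 modules but not 42 would break that
assumption.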

Here's the last panic I had:

Unread portion of the kernel message buffer:
                = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 1175 (sysctl)

(kgdb) bt
#0  doadump (textdump=1694704112) at pcpu.h:229
#1  0xffffffff802fab82 in db_fncall (dummy1=, dummy2=, dummy3=, dummy4=)
    at /usr/src/sys/ddb/db_command.c:578
#2  0xffffffff802fa85a in db_command (last_cmdp=, cmd_table=, dopager=1)
    at /usr/src/sys/ddb/db_command.c:449
#3  0xffffffff802fa612 in db_command_loop ()
    at /usr/src/sys/ddb/db_command.c:502
#4  0xffffffff802fcf60 in db_trap (type=, code=0)
    at /usr/src/sys/ddb/db_main.c:231
#5  0xffffffff804a7b93 in kdb_trap (type=12, code=0, tf=)
    at /usr/src/sys/kern/subr_kdb.c:654
#6  0xffffffff807157c5 in trap_fatal (frame=0xffffff8865032670, eva=)
    at /usr/src/sys/amd64/amd64/trap.c:867
#7  0xffffffff80715adb in trap_pfault (frame=0x0, usermode=0)
    at /usr/src/sys/amd64/amd64/trap.c:698
#8  0xffffffff8071529b in trap (frame=0xffffff8865032670)
    at /usr/src/sys/amd64/amd64/trap.c:463
#9  0xffffffff806ff382 in calltrap () at exception.S:228
#10 0xffffffff8047bd50 in sysctl_sysctl_next_ls (lsp=,
    name=0xffffff8865032a80, namelen=, next=0xffffff8865032898,
    len=0xffffff8865032904, level=3)
    at /usr/src/sys/kern/kern_sysctl.c:759
#11 0xffffffff8047be5e in sysctl_sysctl_next_ls (lsp=0xfffffe000d3f0080,
    name=0xffffff8865032a7c, namelen=, next=0xffffff8865032894,
    len=0xffffff8865032904, level=2)
    at /usr/src/sys/kern/kern_sysctl.c:786
#12 0xffffffff8047be5e in sysctl_sysctl_next_ls (lsp=0xfffffe000d3f0080,
    name=0xffffff8865032a78, namelen=, next=0xffffff8865032890,
    len=0xffffff8865032904, level=1)
    at /usr/src/sys/kern/kern_sysctl.c:786
#13 0xffffffff8047bca3 in sysctl_sysctl_next (oidp=,
    arg1=0xffffff8865032a78, arg2=4, req=0xffffff88650329a8)
    at /usr/src/sys/kern/kern_sysctl.c:808
#14 0xffffffff8047b03f in sysctl_root (arg1=, arg2=)
    at /usr/src/sys/kern/kern_sysctl.c:1513
#15 0xffffffff8047b5d8 in userland_sysctl (td=,
    name=0xffffff8865032a70, namelen=, old=, oldlenp=, inkernel=,
    new=, newlen=, retval=, flags=1694706064)
    at /usr/src/sys/kern/kern_sysctl.c:1623
#16 0xffffffff8047b3c4 in sys___sysctl (td=0xfffffe001e2d4900,
    uap=0xffffff8865032b80)
    at /usr/src/sys/kern/kern_sysctl.c:1549
#17 0xffffffff807160f7 in amd64_syscall (td=0xfffffe001e2d4900, traced=0)
    at subr_syscall.c:135
#18 0xffffffff806ff66b in Xfast_syscall () at exception.S:387
#19 0x000000080093697a in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language: auto; currently minimal

Any ideas on where to look through this vmcore?

-Brandon