From owner-freebsd-current@FreeBSD.ORG Fri Jul 10 19:01:48 2009 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C9FBA106566B; Fri, 10 Jul 2009 19:01:48 +0000 (UTC) (envelope-from serenity@exscape.org) Received: from ch-smtp01.sth.basefarm.net (ch-smtp01.sth.basefarm.net [80.76.149.212]) by mx1.freebsd.org (Postfix) with ESMTP id 464B18FC16; Fri, 10 Jul 2009 19:01:48 +0000 (UTC) (envelope-from serenity@exscape.org) Received: from c83-253-252-234.bredband.comhem.se ([83.253.252.234]:45621 helo=mx.exscape.org) by ch-smtp01.sth.basefarm.net with esmtp (Exim 4.69) (envelope-from ) id 1MPLLl-0007fh-4v; Fri, 10 Jul 2009 21:01:39 +0200 Received: from [192.168.1.5] (macbookpro [192.168.1.5]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by mx.exscape.org (Postfix) with ESMTPSA id 8E192F9973; Fri, 10 Jul 2009 21:01:34 +0200 (CEST) Message-Id: From: Thomas Backman To: FreeBSD current In-Reply-To: <72163521-40BF-4764-8B74-5446A88DFBF8@exscape.org> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Content-Transfer-Encoding: 7bit Mime-Version: 1.0 (Apple Message framework v935.3) Date: Fri, 10 Jul 2009 21:01:33 +0200 References: <72163521-40BF-4764-8B74-5446A88DFBF8@exscape.org> X-Mailer: Apple Mail (2.935.3) X-Originating-IP: 83.253.252.234 X-Scan-Result: No virus found in message 1MPLLl-0007fh-4v. X-Scan-Signature: ch-smtp01.sth.basefarm.net 1MPLLl-0007fh-4v b688837e5904323df4ea64d43091b07b Cc: freebsd-fs@freebsd.org Subject: Reproducible ZFS panic, w/ script (Was: "New" ZFS crash on FS (pool?) unmount/export) X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 10 Jul 2009 19:01:49 -0000 On Jun 20, 2009, at 09:11, Thomas Backman wrote: > I just ran into this tonight. Not sure exactly what triggered it - > the box stopped responding to pings at 02:07AM and it has a cron > backup job using zfs send/recv at 02:00, so I'm guessing it's > related, even though the backup probably should have finished before > then... Hmm. Anyway. > > r194478. > > kernel trap 12 with interrupts disabled > > Fatal trap 12: page fault while in kernel mode > cpuid = 0; apic id = 00 > fault virtual address = 0x288 > fault code = supervisor read data, page not present > instruction pointer = 0x20:0xffffffff805a4989 > stack pointer = 0x28:0xffffff803e8b57e0 > frame pointer = 0x28:0xffffff803e8b5840 > code segment = base 0x0, limit 0xfffff, type 0x1b > = DPL 0, pres 1, long 1, def32 0, gran 1 > processor eflags = resume, IOPL = 0 > current process = 57514 (zpool) > panic: from debugger > cpuid = 0 > Uptime: 10h22m13s > Physical memory: 2027 MB > > (kgdb) bt > ... > #9 0xffffffff80887fb2 in trap (frame=0xffffff803e8b5730) at /usr/ > src/sys/amd64/amd64/trap.c:345 > #10 0xffffffff8086e007 in calltrap () at /usr/src/sys/amd64/amd64/ > exception.S:223 > #11 0xffffffff805a4989 in _sx_xlock_hard (sx=0xffffff0043557d50, > tid=18446742975830720512, opts=Variable "opts" is not available. > ) > at /usr/src/sys/kern/kern_sx.c:575 > #12 0xffffffff805a52fe in _sx_xlock (sx=Variable "sx" is not > available. > ) at sx.h:155 > #13 0xffffffff80fe2995 in zfs_freebsd_reclaim () from /boot/kernel/ > zfs.ko > #14 0xffffffff808cefca in VOP_RECLAIM_APV (vop=0xffffff0043557d38, > a=0xffffff0043557d50) at vnode_if.c:1926 > #15 0xffffffff80626f6e in vgonel (vp=0xffffff00437a7938) at > vnode_if.h:830 > #16 0xffffffff8062b528 in vflush (mp=0xffffff0060f2a000, rootrefs=0, > flags=0, td=0xffffff0061528000) > at /usr/src/sys/kern/vfs_subr.c:2450 > #17 0xffffffff80fdd3a8 in zfs_umount () from /boot/kernel/zfs.ko > #18 0xffffffff8062420a in dounmount (mp=0xffffff0060f2a000, > flags=1626513408, td=Variable "td" is not available. > ) > at /usr/src/sys/kern/vfs_mount.c:1287 > #19 0xffffffff80624975 in unmount (td=0xffffff0061528000, > uap=0xffffff803e8b5c00) > at /usr/src/sys/kern/vfs_mount.c:1172 > #20 0xffffffff8088783f in syscall (frame=0xffffff803e8b5c90) at /usr/ > src/sys/amd64/amd64/trap.c:984 > #21 0xffffffff8086e290 in Xfast_syscall () at /usr/src/sys/amd64/ > amd64/exception.S:364 > #22 0x000000080104e49c in ?? () > Previous frame inner to this frame (corrupt stack?) > > BTW, I got a (one) "force unmount is experimental" on the console. > On regular shutdown I usually get one per filesystem, it seems (at > least 10) and this pool should contain exactly as many filesystems > as the root pool since it's a copy of it. On running the backup > script manually post-crash, though, I didn't get any. OK, I've finally written a script that reproduces this panic for me every time (6-7 tries in a row should be good enough, plus one on another box). It would be great to have a few testers - and if you do test it, PLEASE report your results here - positive or negative! The main aim is, of course, to provide ZFS devs with their own core dumps, DDB consoles and whatnot to possibly resolve this issue. It requires: * bash (or something compatible, for the script itself) * a box you're willing to crash. ;) * ~200MB free on your /root/ partition (or just edit the paths at the top of the ugly hack of a script.) If you use ggatel with silly high numbers (1482 and 1675 - chosen since they're unlikely to be used), again, edit at the top. * The libzfs_sendrecv.c patch - see http://lists.freebsd.org/pipermail/freebsd-current/2009-May/006814.html or fetch the patch from my server (if it's down, it's for less than 2 minutes - my modem restarts every 181 minutes atm...): http://exscape.org/temp/libzfs_sendrecv.patch Here's a oneliner to fetch, patch, compile and install: cd /usr/src && fetch http://exscape.org/temp/libzfs_sendrecv.patch && patch -p0 < libzfs_sendrecv.patch && cd /usr/src/cddl/lib/libzfs && make && make install No reboot required (nor do you need a reboot to revert, see the end of the mail). PLEASE NOTE that without this patch, you'll just get a segfault and then an infinite loop of backups (since send/recv doesn't work in HEAD since at least february, I'm guessing since ZFS v13). Too bad that the patch is needed, since I suspect quite a few testers will bail out there... If not (great!), here's the script to wreck havoc: http://exscape.org/temp/zfs_clone_panic.sh After fetching the above, just run the script like "bash zfs_clone_panic.sh crash" and you should have a panic in about 20 seconds. The other possiblee arguments are mostly there because I'm too lazy to clean it up now that it "works". Back to the panic: The problem appears to be related to clones somehow - the first two times I ran in to this panic (in real use) was when messing with clone/ promote... so that's what this script does. Here's the backtrace it produces, and some info (show lockedvnods, if that helps at all): kernel trap 12 with interrupts disabled Fatal trap 12: page fault while in kernel mode cpuid = 0; apic id = 00 fault virtual address = 0x288 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff8036df39 stack pointer = 0x28:0xffffff803ea6d7e0 frame pointer = 0x28:0xffffff803ea6d840 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = resume, IOPL = 0 current process = 16367 (zpool) 0xffffff000510fce8: tag zfs, type VDIR usecount 1, writecount 0, refcount 1 mountedhere 0xffffff0002cd8bc0 flags () lock type zfs: EXCL by thread 0xffffff0024c5d000 (pid 16367) 0xffffff00050dc3b0: tag zfs, type VDIR usecount 0, writecount 0, refcount 1 mountedhere 0 flags (VI_DOOMED) VI_LOCKed lock type zfs: EXCL by thread 0xffffff0024c5d000 (pid 16367) panic: from debugger ... ... boot, panic, trap etc ... #10 0xffffffff805d36a7 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:223 #11 0xffffffff8036df39 in _sx_xlock_hard (sx=0xffffff005d11a8e8, tid=18446742974814867456, opts=Variable "opts" is not available. ) at /usr/src/sys/kern/kern_sx.c:575 #12 0xffffffff8036e8ae in _sx_xlock (sx=Variable "sx" is not available. ) at sx.h:155 #13 0xffffffff80b889e5 in zfs_freebsd_reclaim () from /boot/kernel/ zfs.ko #14 0xffffffff8062447a in VOP_RECLAIM_APV (vop=0xffffff005d11a8d0, a=0xffffff005d11a8e8) at vnode_if.c:1926 #15 0xffffffff803f3b0e in vgonel (vp=0xffffff00050dc3b0) at vnode_if.h: 830 #16 0xffffffff803f80c8 in vflush (mp=0xffffff0002cd8bc0, rootrefs=0, flags=0, td=0xffffff0024c5d000) at /usr/src/sys/kern/vfs_subr.c:2449 #17 0xffffffff80b833d8 in zfs_umount () from /boot/kernel/zfs.ko #18 0xffffffff803f0d3a in dounmount (mp=0xffffff0002cd8bc0, flags=47025088, td=Variable "td" is not available. ) at /usr/src/sys/kern/vfs_mount.c:1289 #19 0xffffffff803f1568 in unmount (td=0xffffff0024c5d000, uap=0xffffff803ea6dc00) at /usr/src/sys/kern/vfs_mount.c:1174 #20 0xffffffff805ed4cf in syscall (frame=0xffffff803ea6dc90) at /usr/src/sys/amd64/amd64/trap.c:984 #21 0xffffffff805d3930 in Xfast_syscall () at /usr/src/sys/amd64/ amd64/exception.S:364 #22 0x000000080104e9ac in ?? () Previous frame inner to this frame (corrupt stack?) Regards, Thomas PS. A oneliner to get back to the non-patched state, if you're using SVN (if not, sorry): cd /usr/src && svn revert cddl/contrib/opensolaris/lib/libzfs/common/ libzfs_sendrecv.c && cd /usr/src/cddl/lib/libzfs && make && make install DS. Hope this helps track this down, as I spent quite a while on finding the root cause (the clones, in one way or another), writing the script, this mail, etc.