From owner-freebsd-current@FreeBSD.ORG Thu Jul 9 09:02:25 2009 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7FA5A106566C; Thu, 9 Jul 2009 09:02:25 +0000 (UTC) (envelope-from serenity@exscape.org) Received: from ch-smtp01.sth.basefarm.net (ch-smtp01.sth.basefarm.net [80.76.149.212]) by mx1.freebsd.org (Postfix) with ESMTP id EF22C8FC13; Thu, 9 Jul 2009 09:02:24 +0000 (UTC) (envelope-from serenity@exscape.org) Received: from c83-253-252-234.bredband.comhem.se ([83.253.252.234]:48033 helo=mx.exscape.org) by ch-smtp01.sth.basefarm.net with esmtp (Exim 4.69) (envelope-from ) id 1MOpVr-0001TP-5L; Thu, 09 Jul 2009 11:01:57 +0200 Received: from [192.168.1.5] (macbookpro [192.168.1.5]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by mx.exscape.org (Postfix) with ESMTPSA id D98393F629; Thu, 9 Jul 2009 11:01:52 +0200 (CEST) Message-Id: <766FFF07-181A-4180-B020-AA3EE46CF6F8@exscape.org> From: Thomas Backman To: FreeBSD current In-Reply-To: <72163521-40BF-4764-8B74-5446A88DFBF8@exscape.org> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Content-Transfer-Encoding: 7bit Mime-Version: 1.0 (Apple Message framework v935.3) Date: Thu, 9 Jul 2009 11:01:51 +0200 References: <72163521-40BF-4764-8B74-5446A88DFBF8@exscape.org> X-Mailer: Apple Mail (2.935.3) X-Originating-IP: 83.253.252.234 X-Scan-Result: No virus found in message 1MOpVr-0001TP-5L. X-Scan-Signature: ch-smtp01.sth.basefarm.net 1MOpVr-0001TP-5L e2f9b7d864960823e715427db0b86fde Cc: freebsd-fs@freebsd.org Subject: Re: "New" ZFS crash on FS (pool?) unmount/export X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 09 Jul 2009 09:02:26 -0000 On Jun 20, 2009, at 09:11, Thomas Backman wrote: > I just ran into this tonight. Not sure exactly what triggered it - > the box stopped responding to pings at 02:07AM and it has a cron > backup job using zfs send/recv at 02:00, so I'm guessing it's > related, even though the backup probably should have finished before > then... Hmm. Anyway. > > r194478. > > kernel trap 12 with interrupts disabled > > Fatal trap 12: page fault while in kernel mode > cpuid = 0; apic id = 00 > fault virtual address = 0x288 > fault code = supervisor read data, page not present > instruction pointer = 0x20:0xffffffff805a4989 > stack pointer = 0x28:0xffffff803e8b57e0 > frame pointer = 0x28:0xffffff803e8b5840 > code segment = base 0x0, limit 0xfffff, type 0x1b > = DPL 0, pres 1, long 1, def32 0, gran 1 > processor eflags = resume, IOPL = 0 > current process = 57514 (zpool) > panic: from debugger > cpuid = 0 > Uptime: 10h22m13s > Physical memory: 2027 MB > > (kgdb) bt > #0 doadump () at pcpu.h:223 > #1 0xffffffff8059c409 in boot (howto=260) at /usr/src/sys/kern/ > kern_shutdown.c:419 > #2 0xffffffff8059c85c in panic (fmt=Variable "fmt" is not available. > ) at /usr/src/sys/kern/kern_shutdown.c:575 > #3 0xffffffff801f1377 in db_panic (addr=Variable "addr" is not > available. > ) at /usr/src/sys/ddb/db_command.c:478 > #4 0xffffffff801f1781 in db_command (last_cmdp=0xffffffff80c38620, > cmd_table=Variable "cmd_table" is not available. > ) at /usr/src/sys/ddb/db_command.c:445 > #5 0xffffffff801f19d0 in db_command_loop () at /usr/src/sys/ddb/ > db_command.c:498 > #6 0xffffffff801f3969 in db_trap (type=Variable "type" is not > available. > ) at /usr/src/sys/ddb/db_main.c:229 > #7 0xffffffff805ce465 in kdb_trap (type=12, code=0, > tf=0xffffff803e8b5730) at /usr/src/sys/kern/subr_kdb.c:534 > #8 0xffffffff8088715d in trap_fatal (frame=0xffffff803e8b5730, > eva=Variable "eva" is not available. > ) at /usr/src/sys/amd64/amd64/trap.c:847 > #9 0xffffffff80887fb2 in trap (frame=0xffffff803e8b5730) at /usr/ > src/sys/amd64/amd64/trap.c:345 > #10 0xffffffff8086e007 in calltrap () at /usr/src/sys/amd64/amd64/ > exception.S:223 > #11 0xffffffff805a4989 in _sx_xlock_hard (sx=0xffffff0043557d50, > tid=18446742975830720512, opts=Variable "opts" is not available. > ) > at /usr/src/sys/kern/kern_sx.c:575 > #12 0xffffffff805a52fe in _sx_xlock (sx=Variable "sx" is not > available. > ) at sx.h:155 > #13 0xffffffff80fe2995 in zfs_freebsd_reclaim () from /boot/kernel/ > zfs.ko > #14 0xffffffff808cefca in VOP_RECLAIM_APV (vop=0xffffff0043557d38, > a=0xffffff0043557d50) at vnode_if.c:1926 > #15 0xffffffff80626f6e in vgonel (vp=0xffffff00437a7938) at > vnode_if.h:830 > #16 0xffffffff8062b528 in vflush (mp=0xffffff0060f2a000, rootrefs=0, > flags=0, td=0xffffff0061528000) > at /usr/src/sys/kern/vfs_subr.c:2450 > #17 0xffffffff80fdd3a8 in zfs_umount () from /boot/kernel/zfs.ko > #18 0xffffffff8062420a in dounmount (mp=0xffffff0060f2a000, > flags=1626513408, td=Variable "td" is not available. > ) > at /usr/src/sys/kern/vfs_mount.c:1287 > #19 0xffffffff80624975 in unmount (td=0xffffff0061528000, > uap=0xffffff803e8b5c00) > at /usr/src/sys/kern/vfs_mount.c:1172 > #20 0xffffffff8088783f in syscall (frame=0xffffff803e8b5c90) at /usr/ > src/sys/amd64/amd64/trap.c:984 > #21 0xffffffff8086e290 in Xfast_syscall () at /usr/src/sys/amd64/ > amd64/exception.S:364 > #22 0x000000080104e49c in ?? () > Previous frame inner to this frame (corrupt stack?) I might have hit the same thing again... only it didn't *crash* this time, just freeze! I got a "GEOM_GATE: Device ggateX destroyed" on my console, and it stopped responding to pings, keyboard input, etc. (NOTE: The GEOM_GATE message MAY have been an old one. I *think* it was from the night before but can't be sure...) It obviously happened while running my ugly-hack backup script this time too, since it stopped responding to pings ~02:02AM with the script running at 02:00. I'm not sure where it crashed, since snapshots were NOT taken on the "local box". It usually crashes during export, long after taking the snapshots. BTW, current system is BETA1 r195422M (dtrace timestamp patch + libzfs_sendrecv.c patch ( http://lists.freebsd.org/pipermail/freebsd-current/2009-May/006814.html ). Here's the script in its relevant entirety (I removed the "initial backup" part since it never runs using cron anyway). I realize it's an ugly hack (and that my bash-fu could be stronger), but what the heck. Essentially, it creates a GEOM provider of a file, containing a zpool, imports the pool and creates a clone to it, and then exports the pool. The export appears to be what causes all the trouble - usually, not this time around. Every time it crashes it seems to be during or very soon after the export - only this time it didn't even take the snapshots. #!/bin/bash PATH="$PATH:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/ sbin" function die() { echo "$@" 2>&1 zpool export slave 2>&1 > /dev/null ggatel destroy -u 666 2>&1 > /dev/null exit 1 } function mount_unmount { if [ -z "$1" ]; then die 'Invalid argument given to mount_unmount' elif [[ "$1" == "mount" ]]; then zpool list | grep -q slave if [ "$?" = "0" ]; then echo Already mounted return 0 fi echo Creating ggate device ggatel create -u 666 /mnt/backup/chaos/slavefile || die 'Unable to create GEOM provider from file' echo 'Sleeping for 5 seconds...' sleep 5 echo Importing pool zpool import -R /slave slave || die 'Unable to import slave pool' elif [[ "$1" == "unmount" ]]; then echo Exporting pool zpool export slave || die 'Unable to export slave pool' ggatel destroy -u 666 || die 'Unable to destroy GEOM provider' fi } f [ ! -z "$1" ]; then case $1 in mount) mount_unmount mount; exit 0;; unmount) mount_unmount unmount; exit 0;; initial) initial; exit 0;; backup) ;; *) help;; esac else help fi if [ ! -f "/mnt/backup/chaos/slavefile" ]; then echo 'Backup error! slavefile does not exist!' | mail -s "Backup error" root echo 'Slavefile does not exist!' exit 1 fi mount_unmount mount CURR=$(date +"backup-%F-%H%M") echo Taking snapshots zfs snapshot -r tank@$CURR || die 'Unable to create $CURR snapshot' echo Starting backup... LAST=$(cat /root/.last-backup) zfs send -R -I $LAST tank@$CURR | zfs recv -Fvd slave echo $CURR > /root/.last-backup mount_unmount unmount echo Running rsync rsync -av --delete /bootdir/boot exscape::backup-freebsd/chaos rsync -av --delete /root exscape::backup-freebsd/chaos rsync -av --delete ~serenity exscape::backup-freebsd/chaos echo 'All done!' ------------------- So, in *normal* cases, everything runs just fine. This is the case perhaps 90% of the time. In normal *crashes*, it hangs on export with the above backtrace. This time, all I know is that it crashes soon after starting the backup... during import, perhaps? Regards, Thomas