Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 7 Jul 2013 09:25:53 +0200
From:      Andre Albsmeier <Andre.Albsmeier@siemens.com>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>, John Baldwin <jhb@freebsd.org>
Subject:   Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found
Message-ID:  <20130707072553.GA38133@bali>
In-Reply-To: <20130704061550.GI91021@kib.kiev.ua>
References:  <20130531122611.GA6607@bali> <201305311051.03157.jhb@freebsd.org> <20130616063942.GA72803@bali> <201306171530.31208.jhb@freebsd.org> <20130704051409.GA22021@bali> <20130704052440.GG91021@kib.kiev.ua> <20130704052659.GA23398@bali> <20130704061550.GI91021@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, 04-Jul-2013 at 08:15:50 +0200, Konstantin Belousov wrote:
> On Thu, Jul 04, 2013 at 07:27:00AM +0200, Andre Albsmeier wrote:
> > On Thu, 04-Jul-2013 at 07:24:40 +0200, Konstantin Belousov wrote:
> > > On Thu, Jul 04, 2013 at 07:14:09AM +0200, Andre Albsmeier wrote:
> > > > On Mon, 17-Jun-2013 at 21:30:31 +0200, John Baldwin wrote:
> > > > > On Sunday, June 16, 2013 2:39:42 am Andre Albsmeier wrote:
> > > > > > On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote:
> > > > > > > On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
> > > > > > > > Each day at 5:15 we are generating snapshots on various machines.
> > > > > > > > This used to work perfectly under 7-STABLE for years but since
> > > > > > > > we started to use 9.1-STABLE the machine reboots in about 10%
> > > > > > > > of all cases.
> > > > > > > > 
> > > > > > > > After rebooting we find a new snapshot file which is a bit
> > > > > > > > smaller than the good ones and with different permissions
> > > > > > > > It does not succeed a fsck. In this example it is the one
> > > > > > > > whose name is beginning with s3:
> > > > > > > > 
> > > > > > > > -r--r-----   1 root  operator  snapshot 72802894528 29 May 05:15 s2-2013.05.28-03.15.04
> > > > > > > > -r--------   1 root  operator  snapshot 72802893824 29 May 05:15 s3-2013.05.29-03.15.03
> > > > > > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s4-2013.05.23-06.38.44
> > > > > > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s5-2013.05.24-03.15.03
> > > > > > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s6-2013.05.25-03.15.03
> > > > > > > > 
> > > > > > > > After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
> > > > > > > > I see the following LORs (mksnap_ffs starts exactly at 5:15):
> > > > > > > > 
> > > > > > > > May 29 05:15:00 <kern.crit> palveli kernel: lock order reversal:
> > > > > > > > May 29 05:15:00 <kern.crit> palveli kernel: 1st 0xc2371da8 ufs (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240
> > > > > > > > May 29 05:15:00 <kern.crit> palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
> > > > > > > > May 29 05:15:04 <kern.crit> palveli kernel: lock order reversal:
> > > > > > > > May 29 05:15:04 <kern.crit> palveli kernel: 1st 0xc228471c snaplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
> > > > > > > > May 29 05:15:04 <kern.crit> palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
> > > > > > > > 
> > > > > > > > Unfortunatley no corefiles are being generated ;-(.
> > > > > > > > 
> > > > > > > > I have checked and even rebuilt the (UFS1) fs in question
> > > > > > > > from scratch. I have also seen this happen on an UFS2 on
> > > > > > > > another machine and on a third one when running "dump -L"
> > > > > > > > on a root fs.
> > > > > > > > 
> > > > > > > > Any hints of how to proceed?
> > > > > > > 
> > > > > > > Would it be possible to setup a serial console that is logged on this machine
> > > > > > > to see if it is panic'ing but failing to write out a crashdump?
> > > > > > 
> > > > > > Couldn't attach the serial console yet ;-(. But I had people
> > > > > > attach a KVMoverIP switch and enabled the various KDB options
> > > > > > in the kernel. Now we can see a bit more (see below) -- no
> > > > > > crashdump is being generated though.
> > > > > 
> > > > > :(  Unfortunately these LORs don't really help with discerning the cause of
> > > > > the reboot.  If you have remote power access (and still wanted to test this)
> > > > > one option would be to change KDB to drop into the debugger on a panic.
> > > > > Then you could connect over the KVM and take images of the original panic
> > > > > along with a stack trace.
> > > > 
> > > > After a few days of no problems, the box decided to crash
> > > > during mksnap_ffs today ;-(. But now I have a crashdump,
> > > > see below. Unfortunatley, I cannot upload the dump somewhere
> > > > but if you ask me check whatever things I'll be happy to help.
> > > > 
> > > > kgdb /usr/obj/src/src-9/sys/palveli/kernel.debug vmcore.4
> > > > GNU gdb 6.1.1 [FreeBSD]
> > > > Copyright 2004 Free Software Foundation, Inc.
> > > > GDB is free software, covered by the GNU General Public License, and you are
> > > > welcome to change it and/or distribute copies of it under certain conditions.
> > > > Type "show copying" to see the conditions.
> > > > There is absolutely no warranty for GDB.  Type "show warranty" for details.
> > > > This GDB was configured as "i386-marcel-freebsd"...
> > > > 
> > > > Unread portion of the kernel message buffer:
> > > > 
> > > > 
> > > > Fatal trap 12: page fault while in kernel mode
> > > > fault virtual address   = 0xcfb5e000
> > > > fault code              = supervisor write, page not present
> > > > instruction pointer     = 0x20:0xc07cb2fe
> > > > stack pointer           = 0x28:0xd83545d0
> > > > frame pointer           = 0x28:0xd835490c
> > > > code segment            = base 0x0, limit 0xfffff, type 0x1b
> > > >                         = DPL 0, pres 1, def32 1, gran 1
> > > > processor eflags        = interrupt enabled, resume, IOPL = 0
> > > > current process         = 12929 (mksnap_ffs)
> > > > trap number             = 12
> > > > panic: page fault
> > > > KDB: stack backtrace:
> > > > db_trace_self_wrapper(c08207eb,d835441c,c05fdfc9,c081df13,c08a82e0,...) at db_trace_self_wrapper+0x26/frame 0xd83543ec
> > > > kdb_backtrace(c081df13,c08a82e0,c0801bfa,d8354428,d8354428,...) at kdb_backtrace+0x29/frame 0xd83543f8
> > > > panic(c0801bfa,c0845a01,c2bafae4,1,1,...) at panic+0xc9/frame 0xd835441c
> > > > trap_fatal(c0ff6000,cfb5e000,2,0,265abf,...) at trap_fatal+0x353/frame 0xd835445c
> > > > trap_pfault(140da,0,c2baf930,c08b6a40,c282145c,...) at trap_pfault+0x2d7/frame 0xd83544a4
> > > > trap(d8354590) at trap+0x41a/frame 0xd8354584
> > > > calltrap() at calltrap+0x6/frame 0xd8354584
> > > > --- trap 0xc, eip = 0xc07cb2fe, esp = 0xd83545d0, ebp = 0xd835490c ---
> > > > bcopy(c2b36548,c2f194e0,0,0,0,...) at bcopy+0x1a/frame 0xd835490c
> > > > ffs_mount(c2b36548,c2db9000,ff,d8354c08,c2b665e4,...) at ffs_mount+0x15ee/frame 0xd8354a3c
> > > 
> > > From the crash dump in kgdb, do
> > > list *ffs_mount+0x15ee
> > 
> > (kgdb) list *ffs_mount+0x15ee
> > 0xc0748e8e is in ffs_mount (/src/src-9/sys/ufs/ffs/ffs_vfsops.c:483).
> > 478
> > 479                     /*
> > 480                      * If this is a snapshot request, take the snapshot.
> > 481                      */
> > 482                     if (mp->mnt_flag & MNT_SNAPSHOT)
> > 483                             return (ffs_snapshot(mp, fspec));
> > 484             }
> > 485
> > 486             /*
> > 487              * Not an update, or updating the name: look up the name
> 
> It is not useful, bcopy does not create a frame, so the real caller
> of the failing bcopy gets lost.  It could be uncovered with some
> stack digging, but I believe it would be easier just fix bcopy.
> 
> Please apply this patch and reproduce the panic again, then the kgdb
> backtrace should be more useful.

OK, here we go (looks better now):

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...

Unread portion of the kernel message buffer:
dev = stripe/p, block = 592, fs = /palveli
panic: ffs_blkfree_cg: freeing free block
KDB: stack backtrace:
db_trace_self_wrapper(c08207eb,d70fc924,c05fdfc9,c081df13,c08a82e0,...) at db_trace_self_wrapper+0x26/frame 0xd70fc8f4
kdb_backtrace(c081df13,c08a82e0,c0833a0b,d70fc930,d70fc930,...) at kdb_backtrace+0x29/frame 0xd70fc900
panic(c0833a0b,c2aae178,250,0,c2af80d4,...) at panic+0xc9/frame 0xd70fc924
ffs_blkfree_cg(250,0,8000,49f,d70fcad0,...) at ffs_blkfree_cg+0x399/frame 0xd70fc9c8
ffs_blkfree(c2b35100,c2af8000,c2b0d470,250,0,...) at ffs_blkfree+0xad/frame 0xd70fca00
indir_trunc(fffa3ff4,ffffffff,0,8000,0,...) at indir_trunc+0x658/frame 0xd70fcae0
indir_trunc(ffffdff3,ffffffff,c072df0a,c2d68d00,c087abd8,...) at indir_trunc+0x514/frame 0xd70fcbc0
handle_workitem_freeblocks(0,d70fcc4c,2,246,c2ab1000,...) at handle_workitem_freeblocks+0x2dc/frame 0xd70fcc24
process_worklist_item(0,0,0,c086ae78,0,...) at process_worklist_item+0x27a/frame 0xd70fcc6c
softdep_process_worklist(c2b36548,0,54,c0835825,64,...) at softdep_process_worklist+0x91/frame 0xd70fcc9c
softdep_flush(0,d70fcd08,0,c2aac2f0,0,...) at softdep_flush+0x3e4/frame 0xd70fcccc
fork_exit(c0738bb0,0,d70fcd08) at fork_exit+0xa2/frame 0xd70fccf4
fork_trampoline() at fork_trampoline+0x8/frame 0xd70fccf4
--- trap 0, eip = 0, esp = 0xd70fcd40, ebp = 0 ---
Uptime: 2d16h29m37s
Physical memory: 503 MB
Dumping 95 MB: 80 64 48 32 16

No symbol "stopped_cpus" in current context.
No symbol "stoppcbs" in current context.
#0  doadump (textdump=1) at pcpu.h:249
249     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) where
#0  doadump (textdump=1) at pcpu.h:249
#1  0xc05fdddd in kern_reboot (howto=260) at /src/src-9/sys/kern/kern_shutdown.c:449
#2  0xc05fe028 in panic (fmt=<value optimized out>) at /src/src-9/sys/kern/kern_shutdown.c:637
#3  0xc0717899 in ffs_blkfree_cg (ump=0xc2b35100, fs=0xc2af8000, devvp=0xc2b0d470, bno=592, 
    size=32768, inum=1183, dephd=0xd70fcad0) at /src/src-9/sys/ufs/ffs/ffs_alloc.c:2151
#4  0xc0717c8d in ffs_blkfree (ump=0xc2b35100, fs=0xc2af8000, devvp=0xc2b0d470, bno=592, 
    size=32768, inum=1183, vtype=VREG, dephd=0xd70fcad0) at /src/src-9/sys/ufs/ffs/ffs_alloc.c:2280
#5  0xc0730348 in indir_trunc (freework=0xc2f99100, dbn=1642816, lbn=-376844)
    at /src/src-9/sys/ufs/ffs/ffs_softdep.c:7965
#6  0xc0730204 in indir_trunc (freework=0xc2f99100, dbn=1639680, lbn=-8205)
    at /src/src-9/sys/ufs/ffs/ffs_softdep.c:7946
#7  0xc07324bc in handle_workitem_freeblocks (freeblks=0xc2fc1e00, flags=512)
    at /src/src-9/sys/ufs/ffs/ffs_softdep.c:7588
#8  0xc0730dfa in process_worklist_item (mp=0xc2b36548, target=10, flags=512)
    at /src/src-9/sys/ufs/ffs/ffs_softdep.c:1774
#9  0xc07360c1 in softdep_process_worklist (mp=0xc2b36548, full=0)
    at /src/src-9/sys/ufs/ffs/ffs_softdep.c:1558
#10 0xc0738f94 in softdep_flush () at /src/src-9/sys/ufs/ffs/ffs_softdep.c:1414
#11 0xc05d1b82 in fork_exit (callout=0xc0738bb0 <softdep_flush>, arg=0x0, frame=0xd70fcd08)
    at /src/src-9/sys/kern/kern_fork.c:988
#12 0xc07ba904 in fork_trampoline () at /src/src-9/sys/i386/i386/exception.s:279
(kgdb) up 10
#10 0xc0738f94 in softdep_flush () at /src/src-9/sys/ufs/ffs/ffs_softdep.c:1414
1414                            progress += softdep_process_worklist(mp, 0);

	-Andre



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130707072553.GA38133>