Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 4 Jul 2013 16:29:19 +0200
From:      Andre Albsmeier <Andre.Albsmeier@siemens.com>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>, John Baldwin <jhb@freebsd.org>
Subject:   Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found
Message-ID:  <20130704142919.GA1798@bali>
In-Reply-To: <20130704061550.GI91021@kib.kiev.ua>
References:  <20130531122611.GA6607@bali> <201305311051.03157.jhb@freebsd.org> <20130616063942.GA72803@bali> <201306171530.31208.jhb@freebsd.org> <20130704051409.GA22021@bali> <20130704052440.GG91021@kib.kiev.ua> <20130704052659.GA23398@bali> <20130704061550.GI91021@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, 04-Jul-2013 at 08:15:50 +0200, Konstantin Belousov wrote:
> On Thu, Jul 04, 2013 at 07:27:00AM +0200, Andre Albsmeier wrote:
> > On Thu, 04-Jul-2013 at 07:24:40 +0200, Konstantin Belousov wrote:
> > > On Thu, Jul 04, 2013 at 07:14:09AM +0200, Andre Albsmeier wrote:
> > > > On Mon, 17-Jun-2013 at 21:30:31 +0200, John Baldwin wrote:
> > > > > On Sunday, June 16, 2013 2:39:42 am Andre Albsmeier wrote:
> > > > > > On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote:
> > > > > > > On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
> > > > > > > > Each day at 5:15 we are generating snapshots on various machines.
> > > > > > > > This used to work perfectly under 7-STABLE for years but since
> > > > > > > > we started to use 9.1-STABLE the machine reboots in about 10%
> > > > > > > > of all cases.
> > > > > > > > 
> > > > > > > > After rebooting we find a new snapshot file which is a bit
> > > > > > > > smaller than the good ones and with different permissions
> > > > > > > > It does not succeed a fsck. In this example it is the one
> > > > > > > > whose name is beginning with s3:
> > > > > > > > 
> > > > > > > > -r--r-----   1 root  operator  snapshot 72802894528 29 May 05:15 s2-2013.05.28-03.15.04
> > > > > > > > -r--------   1 root  operator  snapshot 72802893824 29 May 05:15 s3-2013.05.29-03.15.03
> > > > > > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s4-2013.05.23-06.38.44
> > > > > > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s5-2013.05.24-03.15.03
> > > > > > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s6-2013.05.25-03.15.03
> > > > > > > > 
> > > > > > > > After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
> > > > > > > > I see the following LORs (mksnap_ffs starts exactly at 5:15):
> > > > > > > > 
> > > > > > > > May 29 05:15:00 <kern.crit> palveli kernel: lock order reversal:
> > > > > > > > May 29 05:15:00 <kern.crit> palveli kernel: 1st 0xc2371da8 ufs (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240
> > > > > > > > May 29 05:15:00 <kern.crit> palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
> > > > > > > > May 29 05:15:04 <kern.crit> palveli kernel: lock order reversal:
> > > > > > > > May 29 05:15:04 <kern.crit> palveli kernel: 1st 0xc228471c snaplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
> > > > > > > > May 29 05:15:04 <kern.crit> palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
> > > > > > > > 
> > > > > > > > Unfortunatley no corefiles are being generated ;-(.
> > > > > > > > 
> > > > > > > > I have checked and even rebuilt the (UFS1) fs in question
> > > > > > > > from scratch. I have also seen this happen on an UFS2 on
> > > > > > > > another machine and on a third one when running "dump -L"
> > > > > > > > on a root fs.
> > > > > > > > 
> > > > > > > > Any hints of how to proceed?
> > > > > > > 
> > > > > > > Would it be possible to setup a serial console that is logged on this machine
> > > > > > > to see if it is panic'ing but failing to write out a crashdump?
> > > > > > 
> > > > > > Couldn't attach the serial console yet ;-(. But I had people
> > > > > > attach a KVMoverIP switch and enabled the various KDB options
> > > > > > in the kernel. Now we can see a bit more (see below) -- no
> > > > > > crashdump is being generated though.
> > > > > 
> > > > > :(  Unfortunately these LORs don't really help with discerning the cause of
> > > > > the reboot.  If you have remote power access (and still wanted to test this)
> > > > > one option would be to change KDB to drop into the debugger on a panic.
> > > > > Then you could connect over the KVM and take images of the original panic
> > > > > along with a stack trace.
> > > > 
> > > > After a few days of no problems, the box decided to crash
> > > > during mksnap_ffs today ;-(. But now I have a crashdump,
> > > > see below. Unfortunatley, I cannot upload the dump somewhere
> > > > but if you ask me check whatever things I'll be happy to help.
> > > > 
> > > > kgdb /usr/obj/src/src-9/sys/palveli/kernel.debug vmcore.4
> > > > GNU gdb 6.1.1 [FreeBSD]
> > > > Copyright 2004 Free Software Foundation, Inc.
> > > > GDB is free software, covered by the GNU General Public License, and you are
> > > > welcome to change it and/or distribute copies of it under certain conditions.
> > > > Type "show copying" to see the conditions.
> > > > There is absolutely no warranty for GDB.  Type "show warranty" for details.
> > > > This GDB was configured as "i386-marcel-freebsd"...
> > > > 
> > > > Unread portion of the kernel message buffer:
> > > > 
> > > > 
> > > > Fatal trap 12: page fault while in kernel mode
> > > > fault virtual address   = 0xcfb5e000
> > > > fault code              = supervisor write, page not present
> > > > instruction pointer     = 0x20:0xc07cb2fe
> > > > stack pointer           = 0x28:0xd83545d0
> > > > frame pointer           = 0x28:0xd835490c
> > > > code segment            = base 0x0, limit 0xfffff, type 0x1b
> > > >                         = DPL 0, pres 1, def32 1, gran 1
> > > > processor eflags        = interrupt enabled, resume, IOPL = 0
> > > > current process         = 12929 (mksnap_ffs)
> > > > trap number             = 12
> > > > panic: page fault
> > > > KDB: stack backtrace:
> > > > db_trace_self_wrapper(c08207eb,d835441c,c05fdfc9,c081df13,c08a82e0,...) at db_trace_self_wrapper+0x26/frame 0xd83543ec
> > > > kdb_backtrace(c081df13,c08a82e0,c0801bfa,d8354428,d8354428,...) at kdb_backtrace+0x29/frame 0xd83543f8
> > > > panic(c0801bfa,c0845a01,c2bafae4,1,1,...) at panic+0xc9/frame 0xd835441c
> > > > trap_fatal(c0ff6000,cfb5e000,2,0,265abf,...) at trap_fatal+0x353/frame 0xd835445c
> > > > trap_pfault(140da,0,c2baf930,c08b6a40,c282145c,...) at trap_pfault+0x2d7/frame 0xd83544a4
> > > > trap(d8354590) at trap+0x41a/frame 0xd8354584
> > > > calltrap() at calltrap+0x6/frame 0xd8354584
> > > > --- trap 0xc, eip = 0xc07cb2fe, esp = 0xd83545d0, ebp = 0xd835490c ---
> > > > bcopy(c2b36548,c2f194e0,0,0,0,...) at bcopy+0x1a/frame 0xd835490c
> > > > ffs_mount(c2b36548,c2db9000,ff,d8354c08,c2b665e4,...) at ffs_mount+0x15ee/frame 0xd8354a3c
> > > 
> > > From the crash dump in kgdb, do
> > > list *ffs_mount+0x15ee
> > 
> > (kgdb) list *ffs_mount+0x15ee
> > 0xc0748e8e is in ffs_mount (/src/src-9/sys/ufs/ffs/ffs_vfsops.c:483).
> > 478
> > 479                     /*
> > 480                      * If this is a snapshot request, take the snapshot.
> > 481                      */
> > 482                     if (mp->mnt_flag & MNT_SNAPSHOT)
> > 483                             return (ffs_snapshot(mp, fspec));
> > 484             }
> > 485
> > 486             /*
> > 487              * Not an update, or updating the name: look up the name
> 
> It is not useful, bcopy does not create a frame, so the real caller
> of the failing bcopy gets lost.  It could be uncovered with some
> stack digging, but I believe it would be easier just fix bcopy.
> 
> Please apply this patch and reproduce the panic again, then the kgdb
> backtrace should be more useful.

OK, patch is applied. I will reboot the machine later
and see what happens tomorrow in the morning. However,
it might take a few days since the last 2 weeks all was
fine.

BTW, should this patch be used in general or is it just
for debugging? My understanding is that it is something
which could stay in the code...

Thanks,

	-Andre

> 
> diff --git a/sys/i386/i386/support.s b/sys/i386/i386/support.s
> index 967a09e..779fa38 100644
> --- a/sys/i386/i386/support.s
> +++ b/sys/i386/i386/support.s
> @@ -181,11 +181,13 @@ END(bcopyb)
>   *  ws@tools.de     (Wolfgang Solfrank, TooLs GmbH) +49-228-985800
>   */
>  ENTRY(bcopy)
> +	pushl	%ebp
> +	movl	%esp,%ebp
>  	pushl	%esi
>  	pushl	%edi
> -	movl	12(%esp),%esi
> -	movl	16(%esp),%edi
> -	movl	20(%esp),%ecx
> +	movl	8(%ebp),%esi
> +	movl	12(%ebp),%edi
> +	movl	16(%ebp),%ecx
>  
>  	movl	%edi,%eax
>  	subl	%esi,%eax
> @@ -196,12 +198,13 @@ ENTRY(bcopy)
>  	cld					/* nope, copy forwards */
>  	rep
>  	movsl
> -	movl	20(%esp),%ecx
> +	movl	16(%ebp),%ecx
>  	andl	$3,%ecx				/* any bytes left? */
>  	rep
>  	movsb
>  	popl	%edi
>  	popl	%esi
> +	popl	%ebp
>  	ret
>  
>  	ALIGN_TEXT
> @@ -214,7 +217,7 @@ ENTRY(bcopy)
>  	std
>  	rep
>  	movsb
> -	movl	20(%esp),%ecx			/* copy remainder by 32-bit words */
> +	movl	16(%ebp),%ecx			/* copy remainder by 32-bit words */
>  	shrl	$2,%ecx
>  	subl	$3,%esi
>  	subl	$3,%edi
> @@ -223,6 +226,7 @@ ENTRY(bcopy)
>  	popl	%edi
>  	popl	%esi
>  	cld
> +	popl	%ebp
>  	ret
>  END(bcopy)
>  



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130704142919.GA1798>