Date:      Tue, 18 Jun 2013 07:27:58 +0200
From:      Andre Albsmeier <Andre.Albsmeier@siemens.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>
Subject:   Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found
Message-ID:  <20130618052758.GA1467@bali>
In-Reply-To: <201306171530.31208.jhb@freebsd.org>
References:  <20130531122611.GA6607@bali> <201305311051.03157.jhb@freebsd.org> <20130616063942.GA72803@bali> <201306171530.31208.jhb@freebsd.org>

On Mon, 17-Jun-2013 at 21:30:31 +0200, John Baldwin wrote:
> On Sunday, June 16, 2013 2:39:42 am Andre Albsmeier wrote:
> > On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote:
> > > On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
> > > > Each day at 5:15 we are generating snapshots on various machines.
> > > > This used to work perfectly under 7-STABLE for years but since
> > > > we started to use 9.1-STABLE the machine reboots in about 10%
> > > > of all cases.
> > > > 
> > > > After rebooting we find a new snapshot file which is a bit
> > > > smaller than the good ones and has different permissions.
> > > > It does not pass fsck. In this example it is the one
> > > > whose name begins with s3:
> > > > 
> > > > -r--r-----   1 root  operator  snapshot 72802894528 29 May 05:15 s2-2013.05.28-03.15.04
> > > > -r--------   1 root  operator  snapshot 72802893824 29 May 05:15 s3-2013.05.29-03.15.03
> > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s4-2013.05.23-06.38.44
> > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s5-2013.05.24-03.15.03
> > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s6-2013.05.25-03.15.03
> > > > 
> > > > After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
> > > > I see the following LORs (mksnap_ffs starts exactly at 5:15):
> > > > 
> > > > May 29 05:15:00 <kern.crit> palveli kernel: lock order reversal:
> > > > May 29 05:15:00 <kern.crit> palveli kernel: 1st 0xc2371da8 ufs (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240
> > > > May 29 05:15:00 <kern.crit> palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
> > > > May 29 05:15:04 <kern.crit> palveli kernel: lock order reversal:
> > > > May 29 05:15:04 <kern.crit> palveli kernel: 1st 0xc228471c snaplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
> > > > May 29 05:15:04 <kern.crit> palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
> > > > 
> > > > Unfortunately no corefiles are being generated ;-(.
> > > > 
> > > > I have checked and even rebuilt the (UFS1) fs in question
> > > > from scratch. I have also seen this happen on a UFS2 on
> > > > another machine and on a third one when running "dump -L"
> > > > on a root fs.
> > > > 
> > > > Any hints of how to proceed?
> > > 
> > > Would it be possible to set up a serial console that is logged on this machine
> > > to see if it is panicking but failing to write out a crashdump?
> > 
> > Couldn't attach the serial console yet ;-(. But I had people
> > attach a KVMoverIP switch and enabled the various KDB options
> > in the kernel. Now we can see a bit more (see below) -- no
> > crashdump is being generated though.
> 
> :(  Unfortunately these LORs don't really help with discerning the cause of
> the reboot.  If you have remote power access (and still wanted to test this)
> one option would be to change KDB to drop into the debugger on a panic.
> Then you could connect over the KVM and take images of the original panic
> along with a stack trace.
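
For reference, a minimal sketch of what "drop into the debugger on a panic" could look like in configuration terms. It assumes the kernel is built with KDB and DDB (the thread only says "the various KDB options" were enabled, so the exact option set here is an assumption) and uses the debug.debugger_on_panic sysctl:

```
# Kernel configuration file (assumed options; the thread does not list them)
options 	KDB		# kernel debugger framework
options 	DDB		# interactive debugger backend

# /etc/sysctl.conf
# Stop in the debugger on panic instead of rebooting right away.
debug.debugger_on_panic=1
```

With this in place the machine should stop at the db> prompt on a panic, where "bt" prints a stack trace that can be captured over the KVM.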

As described yesterday, I think I know why we don't get dumps:
I dump to da1, and da1 is spun down. On FreeBSD-7 da1 spun up
automatically in this case; on FreeBSD-9 it doesn't. I now
dump to da0, which is already running...
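
A minimal sketch of the corresponding /etc/rc.conf settings, assuming the swap partition on da0 is da0s1b (a hypothetical name; only "da0" is given above) and that cores should be saved to the default /var/crash:

```shell
# /etc/rc.conf fragment (sketch; the partition name is an assumption)
dumpdev="/dev/da0s1b"   # dump to a disk that is already spinning
dumpdir="/var/crash"    # where savecore(8) stores the core at next boot
```

The dump partition has to be large enough for the dump, and dumpon(8) can point the running kernel at the new device without a reboot.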

My plan is to try to get a dump now -- however, I have to
arrange it with the people using the machine. I'll come
back when I have a dump ready...

Thanks,

	-Andre



> 
> -- 
> John Baldwin

-- 
Win98: useless extension to a minor patch release for 32-bit extensions
       and a graphical shell for a 16-bit patch to an 8-bit operating
       system originally coded for a 4-bit microprocessor, written by
       a 2-bit company that can't stand for 1 bit of competition.


