Date:      Sun, 16 Jun 2013 11:55:38 +0200
From:      Andre Albsmeier <Andre.Albsmeier@siemens.com>
To:        Jeremy Chadwick <jdc@koitsu.org>
Cc:        "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>, John Baldwin <jhb@freebsd.org>
Subject:   Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found
Message-ID:  <20130616095538.GA73648@bali>
In-Reply-To: <20130616084937.GA17277@icarus.home.lan>
References:  <20130531122611.GA6607@bali> <201305311051.03157.jhb@freebsd.org> <20130531172523.GA9188@bali> <20130616065441.GA15175@icarus.home.lan> <20130616080239.GA73100@bali> <20130616084937.GA17277@icarus.home.lan>

On Sun, 16-Jun-2013 at 10:49:37 +0200, Jeremy Chadwick wrote:
> On Sun, Jun 16, 2013 at 10:02:39AM +0200, Andre Albsmeier wrote:
> > On Sun, 16-Jun-2013 at 08:54:41 +0200, Jeremy Chadwick wrote:
> > > On Fri, May 31, 2013 at 07:25:23PM +0200, Andre Albsmeier wrote:
> > > > On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote:
> > > > > On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
> > > > > > Each day at 5:15 we are generating snapshots on various machines.
> > > > > > This used to work perfectly under 7-STABLE for years but since
> > > > > > we started to use 9.1-STABLE the machine reboots in about 10%
> > > > > > of all cases.
> > > > > > 
> > > > > > After rebooting we find a new snapshot file which is a bit
> > > > > > smaller than the good ones and with different permissions.
> > > > > > It does not pass fsck. In this example it is the one
> > > > > > whose name begins with s3:
> > > > > > 
> > > > > > -r--r-----   1 root  operator  snapshot 72802894528 29 May 05:15 s2-2013.05.28-03.15.04
> > > > > > -r--------   1 root  operator  snapshot 72802893824 29 May 05:15 s3-2013.05.29-03.15.03
> > > > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s4-2013.05.23-06.38.44
> > > > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s5-2013.05.24-03.15.03
> > > > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s6-2013.05.25-03.15.03
> > > > > > 
> > > > > > After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
> > > > > > I see the following LORs (mksnap_ffs starts exactly at 5:15):
> > > > > > 
> > > > > > May 29 05:15:00 <kern.crit> palveli kernel: lock order reversal:
> > > > > > May 29 05:15:00 <kern.crit> palveli kernel: 1st 0xc2371da8 ufs (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240
> > > > > > May 29 05:15:00 <kern.crit> palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
> > > > > > May 29 05:15:04 <kern.crit> palveli kernel: lock order reversal:
> > > > > > May 29 05:15:04 <kern.crit> palveli kernel: 1st 0xc228471c snaplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
> > > > > > May 29 05:15:04 <kern.crit> palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
> > > > > > 
> > > > > > Unfortunately no corefiles are being generated ;-(.
> > > > > > 
> > > > > > I have checked and even rebuilt the (UFS1) fs in question
> > > > > > from scratch. I have also seen this happen on a UFS2 on
> > > > > > another machine and on a third one when running "dump -L"
> > > > > > on a root fs.
> > > > > > 
> > > > > > Any hints on how to proceed?
> > > > > 
> > > > > Would it be possible to set up a serial console that is logged on this machine
> > > > > to see if it is panicking but failing to write out a crashdump?
> > > > 
> > > > I'll try to arrange that. It'll take a bit since this
> > > > box is 200 km away... 
> > > > 
> > > > Maybe I'll find another one nearby to reproduce it...
> > > 
> > > SPECIFICALLY regarding "lack of crash dumps": I need to see the
> > > following:
> > > 
> > > * cat /etc/rc.conf
> > > * cat /etc/fstab
> > > 
> > > I may need output from other commands, but shall deal with that when I
> > > see output from the above.  Thanks.
> > 
> > No problem, see below...
> > 
> > To make a long story short, the machine dumps core perfectly
> > (tested that a while ago), but not when dealing with _this_
> > issue...
> > 
> > I dump on da1s1b and savecore fetches it from there and puts
> > it on /var (sitting on da0), that's faster.
> > 
> > rc.conf (beware, rc.conf.local exists):
> > ---------------------------------------
> > rcshutdown_timeout=180
> > tmpmfs=YES
> > tmpsize="$(( `/sbin/sysctl -n hw.usermem` / 3000000 ))m"
> > tmpmfs_flags="$tmpmfs_flags -v 1 -n"
> > 
> > background_fsck=NO
> > 
> > nisdomainname=ofw.tld
> > pflog_flags=-S
> > 
> > syslogd_flags=-svv
> > inetd_enable=YES
> > inetd_flags=-l
> > named_flags="-S 1000"
> > named_chrootdir=""
> > rwhod_enable=YES
> > sshd_enable=YES
> > amd_enable=YES
> > amd_flags="-F /etc/amd.conf"
> > nfs_client_enable=YES
> > nfs_access_cache=2
> > mountd_flags=-n
> > rpcbind_enable=YES
> > 
> > ntpdate_enable=YES
> > ntpdate_hosts=ntp
> > ntpd_enable=YES
> > ntpd_flags="-p /var/run/ntpd.pid"
> > 
> > nis_client_enable=YES
> > nis_client_flags="-s -S ofw.tld,nis-16-1,nis-16-2"
> > nis_server_flags=-n
> > nis_yppasswdd_flags="-t /var/yp/src/master.passwd -f -v"
> > 
> > defaultrouter=192.168.16.2
> > 
> > keyrate=fast
> > 
> > sendmail_flags="-bd -q5m"
> > sendmail_submit_flags="$sendmail_flags -ODaemonPortOptions=Addr=localhost"
> > sendmail_msp_queue_flags="-Ac -q30m"
> > sendmail_rebuild_aliases=NO
> > 
> > lpd_enable=YES
> > lpd_flags=-s
> > chkprintcap_enable=YES
> > dumpdev=AUTO
> > clear_tmp_X=NO
> > ldconfig_paths=/usr/local/lib
> > ldconfig_paths_aout=""
> > entropy_file=/boot/entropy-file
> > 
> > 
> > rc.conf.local:
> > --------------
> > hostname=typhon.ofw.tld
> > ifconfig_msk0="inet 192.168.24.1/21"
> > ifconfig_msk0_alias0="inet 192.168.24.10/32"
> > 
> > named_enable=YES
> > nfs_server_enable=YES
> > 
> > nis_client_flags="-s -S ofw.tld,nis-24-1,nis-24-2"
> > nis_server_enable=YES
> > 
> > defaultrouter=192.168.24.2
> > 
> > lpd_flags=-l
> > dumpdev=/dev/da1s1b
> > quota_enable=YES
> > 
> > 
> > fstab:
> > ------
> > /dev/da0s1a	/		ufs	noatime,rw				0 1
> > /dev/da0s1b	none		swap	sw					0 0
> > proc		/proc		procfs	rw					0 0
> > /dev/da0s1d	/usr		ufs	noatime,rw				0 2
> > /dev/da0s1e	/var		ufs	noatime,nosuid,rw			0 2
> > 
> > /dev/da10p1	/share2		ufs	suiddir,groupquota,noatime,nosuid,rw	0 2
> > /dev/da10p2	/raid2		ufs	userquota,noatime,nosuid,rw		0 2
> 
> Thank you.  Can you show me output from the following?

Thanks to you for looking into this...

> 
> * camcontrol devlist

<IBM DDRS-39130W S92A>             at scbus0 target 0 lun 0 (da0,pass0)
<IBM DDRS-39130W S97B>             at scbus0 target 1 lun 0 (da1,pass1)
<AMCC 9690SA-8I  DISK 4.10>        at scbus1 target 0 lun 0 (da10,pass2)

> * gpart show -p da1

=>      63  17849937    da1  MBR  (8.5G)
        63  17849937  da1s1  freebsd  [active]  (8.5G)

And here is gpart show -p da1s1:

=>       0  17849937   da1s1  BSD  (8.5G)
         0        16          - free -  (8.0k)
        16    599984  da1s1a  freebsd-ufs  (293M)
    600000   2000000  da1s1d  freebsd-ufs  (976M)
   2600000  11000000  da1s1e  freebsd-ufs  (5.3G)
  13600000   4249937  da1s1b  freebsd-swap  (2.0G)
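
In case it helps when reading the numbers above: the second column is a
sector count, and the size in parentheses follows from it. A minimal
sketch of that arithmetic, assuming 512-byte sectors (the output does not
state the sector size); sectors_to_mib is just an illustrative helper
name, not an existing tool:

```shell
#!/bin/sh
# Convert a sector count, as printed by "gpart show", to whole MiB.
# Assumption: 512-byte sectors, i.e. 2048 sectors per MiB.
sectors_to_mib() {
    echo $(( $1 / 2048 ))
}

sectors_to_mib 4249937     # da1s1b, the partition used as dump device
sectors_to_mib 17849937    # da1s1, the whole slice
```

Under that assumption da1s1b comes out at about 2075 MiB and da1s1 at
about 8715 MiB, which agrees with the 2.0G and 8.5G shown by gpart.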

> 
> I'm pretty sure I see the problem, but I want to be extra sure.

I am curious already!

	-Andre
> 
> -- 
> | Jeremy Chadwick                                   jdc@koitsu.org |
> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
> | Making life hard for others since 1977.             PGP 4BD6C0CB |
> 

-- 
A: Because it fouls the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?


