Date:      Mon, 8 Jul 2013 08:28:46 +0200
From:      Andre Albsmeier <Andre.Albsmeier@siemens.com>
To:        Jeremy Chadwick <jdc@koitsu.org>
Cc:        Konstantin Belousov <kostikbel@gmail.com>, "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>, John Baldwin <jhb@freebsd.org>
Subject:   Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found
Message-ID:  <20130708062846.GA46217@bali>
In-Reply-To: <20130707123217.GA54979@icarus.home.lan>
References:  <20130616063942.GA72803@bali> <201306171530.31208.jhb@freebsd.org> <20130704051409.GA22021@bali> <20130704052440.GG91021@kib.kiev.ua> <20130704052659.GA23398@bali> <20130704061550.GI91021@kib.kiev.ua> <20130707072553.GA38133@bali> <20130707074112.GD91021@kib.kiev.ua> <20130707121354.GA39055@bali> <20130707123217.GA54979@icarus.home.lan>

On Sun, 07-Jul-2013 at 14:32:17 +0200, Jeremy Chadwick wrote:
> On Sun, Jul 07, 2013 at 02:13:54PM +0200, Andre Albsmeier wrote:
> > On Sun, 07-Jul-2013 at 09:41:12 +0200, Konstantin Belousov wrote:
> > > On Sun, Jul 07, 2013 at 09:25:53AM +0200, Andre Albsmeier wrote:
> > > > OK, here we go (looks better now):
> > > > 
> > > > GNU gdb 6.1.1 [FreeBSD]
> > > > Copyright 2004 Free Software Foundation, Inc.
> > > > GDB is free software, covered by the GNU General Public License, and you are
> > > > welcome to change it and/or distribute copies of it under certain conditions.
> > > > Type "show copying" to see the conditions.
> > > > There is absolutely no warranty for GDB.  Type "show warranty" for details.
> > > > This GDB was configured as "i386-marcel-freebsd"...
> > > > 
> > > > Unread portion of the kernel message buffer:
> > > > dev = stripe/p, block = 592, fs = /palveli
> > > > panic: ffs_blkfree_cg: freeing free block
> > > > KDB: stack backtrace:
> > > > db_trace_self_wrapper(c08207eb,d70fc924,c05fdfc9,c081df13,c08a82e0,...) at db_trace_self_wrapper+0x26/frame 0xd70fc8f4
> > > > kdb_backtrace(c081df13,c08a82e0,c0833a0b,d70fc930,d70fc930,...) at kdb_backtrace+0x29/frame 0xd70fc900
> > > > panic(c0833a0b,c2aae178,250,0,c2af80d4,...) at panic+0xc9/frame 0xd70fc924
> > > > ffs_blkfree_cg(250,0,8000,49f,d70fcad0,...) at ffs_blkfree_cg+0x399/frame 0xd70fc9c8
> > > > ffs_blkfree(c2b35100,c2af8000,c2b0d470,250,0,...) at ffs_blkfree+0xad/frame 0xd70fca00
> > > > indir_trunc(fffa3ff4,ffffffff,0,8000,0,...) at indir_trunc+0x658/frame 0xd70fcae0
> > > > indir_trunc(ffffdff3,ffffffff,c072df0a,c2d68d00,c087abd8,...) at indir_trunc+0x514/frame 0xd70fcbc0
> > > > handle_workitem_freeblocks(0,d70fcc4c,2,246,c2ab1000,...) at handle_workitem_freeblocks+0x2dc/frame 0xd70fcc24
> > > > process_worklist_item(0,0,0,c086ae78,0,...) at process_worklist_item+0x27a/frame 0xd70fcc6c
> > > > softdep_process_worklist(c2b36548,0,54,c0835825,64,...) at softdep_process_worklist+0x91/frame 0xd70fcc9c
> > > > softdep_flush(0,d70fcd08,0,c2aac2f0,0,...) at softdep_flush+0x3e4/frame 0xd70fcccc
> > > > fork_exit(c0738bb0,0,d70fcd08) at fork_exit+0xa2/frame 0xd70fccf4
> > > > fork_trampoline() at fork_trampoline+0x8/frame 0xd70fccf4
> > > > --- trap 0, eip = 0, esp = 0xd70fcd40, ebp = 0 ---
> > > > Uptime: 2d16h29m37s
> > > > Physical memory: 503 MB
> > > > Dumping 95 MB: 80 64 48 32 16
> > > > 
> > > > No symbol "stopped_cpus" in current context.
> > > > No symbol "stoppcbs" in current context.
> > > > #0  doadump (textdump=1) at pcpu.h:249
> > > > 249     pcpu.h: No such file or directory.
> > > >         in pcpu.h
> > > > (kgdb) where
> > > > #0  doadump (textdump=1) at pcpu.h:249
> > > > #1  0xc05fdddd in kern_reboot (howto=260) at /src/src-9/sys/kern/kern_shutdown.c:449
> > > > #2  0xc05fe028 in panic (fmt=<value optimized out>) at /src/src-9/sys/kern/kern_shutdown.c:637
> > > > #3  0xc0717899 in ffs_blkfree_cg (ump=0xc2b35100, fs=0xc2af8000, devvp=0xc2b0d470, bno=592, 
> > > >     size=32768, inum=1183, dephd=0xd70fcad0) at /src/src-9/sys/ufs/ffs/ffs_alloc.c:2151
> > > > #4  0xc0717c8d in ffs_blkfree (ump=0xc2b35100, fs=0xc2af8000, devvp=0xc2b0d470, bno=592, 
> > > >     size=32768, inum=1183, vtype=VREG, dephd=0xd70fcad0) at /src/src-9/sys/ufs/ffs/ffs_alloc.c:2280
> > > > #5  0xc0730348 in indir_trunc (freework=0xc2f99100, dbn=1642816, lbn=-376844)
> > > >     at /src/src-9/sys/ufs/ffs/ffs_softdep.c:7965
> > > > #6  0xc0730204 in indir_trunc (freework=0xc2f99100, dbn=1639680, lbn=-8205)
> > > >     at /src/src-9/sys/ufs/ffs/ffs_softdep.c:7946
> > > > #7  0xc07324bc in handle_workitem_freeblocks (freeblks=0xc2fc1e00, flags=512)
> > > >     at /src/src-9/sys/ufs/ffs/ffs_softdep.c:7588
> > > > #8  0xc0730dfa in process_worklist_item (mp=0xc2b36548, target=10, flags=512)
> > > >     at /src/src-9/sys/ufs/ffs/ffs_softdep.c:1774
> > > > #9  0xc07360c1 in softdep_process_worklist (mp=0xc2b36548, full=0)
> > > >     at /src/src-9/sys/ufs/ffs/ffs_softdep.c:1558
> > > > #10 0xc0738f94 in softdep_flush () at /src/src-9/sys/ufs/ffs/ffs_softdep.c:1414
> > > > #11 0xc05d1b82 in fork_exit (callout=0xc0738bb0 <softdep_flush>, arg=0x0, frame=0xd70fcd08)
> > > >     at /src/src-9/sys/kern/kern_fork.c:988
> > > > #12 0xc07ba904 in fork_trampoline () at /src/src-9/sys/i386/i386/exception.s:279
> > > > (kgdb) up 10
> > > > #10 0xc0738f94 in softdep_flush () at /src/src-9/sys/ufs/ffs/ffs_softdep.c:1414
> > > > 1414                            progress += softdep_process_worklist(mp, 0);
> > > > 
> > > > 	-Andre
> > > 
> > > This looks unrelated, and exactly this panic usually has one of two
> > > causes:
> > > - corrupted filesystem, run fsck to recheck it;
> > 
> > root@palveli:~>fsck /dev/stripe/p 
> > ** /dev/stripe/p
> > ** Last Mounted on /palveli
> > ** Phase 1 - Check Blocks and Sizes
> > ** Phase 2 - Check Pathnames
> > ** Phase 3 - Check Connectivity
> > ** Phase 4 - Check Reference Counts
> > ** Phase 5 - Check Cyl groups
> > 9895 files, 2039706 used, 15697693 free (5397 frags, 1961537 blocks, 0.0% fragmentation)
> > 
> > ***** FILE SYSTEM IS CLEAN *****
> 
> Taken from your previous mail (showing only UFS stuff):
> 
> http://lists.freebsd.org/pipermail/freebsd-stable/2013-June/073817.html
> 
> >>>> fstab:
> >>>> ------
> >>>> /dev/da0s1a	/		ufs	noatime,rw				0 1
> >>>> /dev/da0s1d	/usr		ufs	noatime,rw				0 2
> >>>> /dev/da0s1e	/var		ufs	noatime,nosuid,rw			0 2
> >>>> /dev/da10p1	/share2		ufs	suiddir,groupquota,noatime,nosuid,rw	0 2
> >>>> /dev/da10p2	/raid2		ufs	userquota,noatime,nosuid,rw		0 2
> 
> Where is gstripe(8) in that picture?  Are you **sure** this is the same
> system?  Surely I'm missing something here...

It is the same system that produced the (bad) dump in my
previous mail (the one with the bcopy problem). It is NOT
the system we used to find out why it didn't dump (which
we have now established was due to the spun-down da1).

Just for the sake of clarity: There are two systems showing
this problem when running the daily snapshot. Since users
complained about the disruptions, I have moved the important
stuff from one machine to the other (where I disabled the
snapshot generation) so I can concentrate on this one (the
one the dumps belong to) for finding the problem.

> 
> Can you provide details of the stripe, specifically "gstripe list" so I
> can see what the disks are and then ask you for "smartctl -a" output for
> each of them (to try and rule out disk-level problems that may be
> causing oddities at the layer underneath the filesystem (sometimes fsck
> will not catch this))?

Here is "gstripe list":

Geom name: p
State: UP
Status: Total=2, Online=2
Type: AUTOMATIC
Stripesize: 32768
ID: 2179163030
Providers:
1. Name: stripe/p
   Mediasize: 72802893824 (67G)
   Sectorsize: 512
   Stripesize: 32768
   Stripeoffset: 0
   Mode: r0w0e0
Consumers:
1. Name: da10
   Mediasize: 36401479680 (33G)
   Sectorsize: 512
   Mode: r0w0e0
   Number: 0
2. Name: da11
   Mediasize: 36401479680 (33G)
   Sectorsize: 512
   Mode: r0w0e0
   Number: 1

The disks are old but seem to work properly:

da10 at ahc1 bus 0 scbus1 target 0 lun 0
da10: <IBM DDYS-T36950N SB0A> Fixed Direct Access SCSI-3 device 
da10: 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
da10: Command Queueing enabled
da10: 34715MB (71096640 512 byte sectors: 255H 63S/T 4425C)
da11 at ahc1 bus 0 scbus1 target 1 lun 0
da11: <IBM DDYS-T36950N SB0A> Fixed Direct Access SCSI-3 device 
da11: 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
da11: Command Queueing enabled
da11: 34715MB (71096640 512 byte sectors: 255H 63S/T 4425C)
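
As a quick consistency check of the numbers above, here is a minimal
sketch that recomputes the sizes. It assumes (this is not stated in the
output itself) that an AUTOMATIC gstripe reserves the last sector of
each consumer for its metadata and then rounds the rest down to a whole
number of stripes. The function name is made up for illustration.

```python
SECTOR = 512          # Sectorsize from "gstripe list"
STRIPE = 32768        # Stripesize from "gstripe list"

# Sector count per disk as reported in the da10/da11 probe lines.
sectors_per_disk = 71096640
consumer_bytes = sectors_per_disk * SECTOR


def stripe_provider_size(consumer_bytes, ndisks, stripe=STRIPE, sector=SECTOR):
    """Expected provider size under the assumption described above
    (hypothetical helper, not taken from the gstripe sources)."""
    usable = consumer_bytes - sector   # assumed: last sector holds metadata
    usable -= usable % stripe          # assumed: round down to full stripes
    return usable * ndisks


print(consumer_bytes)                           # 36401479680, the consumer Mediasize
print(stripe_provider_size(consumer_bytes, 2))  # 72802893824, the stripe/p Mediasize
```

Under that assumption both values agree with the listing, so the stripe
geometry at least looks self-consistent.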

On both disks the PER bit is set, so I'll see any read errors
even if they were retried or ECC-corrected (there haven't
been any for ages). When the snapshot problem appeared for
the first time, I also rebuilt the fs from scratch.

	-Andre


