Date: Mon, 8 Jul 2013 08:28:46 +0200
From: Andre Albsmeier
To: Jeremy Chadwick
Subject: Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found
Cc: Konstantin Belousov, "freebsd-stable@freebsd.org", John Baldwin

On Sun, 07-Jul-2013 at 14:32:17 +0200, Jeremy Chadwick wrote:
> On Sun, Jul 07, 2013 at 02:13:54PM +0200, Andre Albsmeier wrote:
> > On Sun, 07-Jul-2013 at 09:41:12 +0200, Konstantin Belousov wrote:
> > > On Sun, Jul 07, 2013 at 09:25:53AM +0200, Andre Albsmeier wrote:
> > > > OK, here we go (looks better now):
> > > >
> > > > GNU gdb 6.1.1 [FreeBSD]
> > > > Copyright 2004 Free Software Foundation, Inc.
> > > > GDB is free software, covered by the GNU General Public License, and you are
> > > > welcome to change it and/or distribute copies of it under certain conditions.
> > > > Type "show copying" to see the conditions.
> > > > There is absolutely no warranty for GDB. Type "show warranty" for details.
> > > > This GDB was configured as "i386-marcel-freebsd"...
> > > >
> > > > Unread portion of the kernel message buffer:
> > > > dev = stripe/p, block = 592, fs = /palveli
> > > > panic: ffs_blkfree_cg: freeing free block
> > > > KDB: stack backtrace:
> > > > db_trace_self_wrapper(c08207eb,d70fc924,c05fdfc9,c081df13,c08a82e0,...) at db_trace_self_wrapper+0x26/frame 0xd70fc8f4
> > > > kdb_backtrace(c081df13,c08a82e0,c0833a0b,d70fc930,d70fc930,...) at kdb_backtrace+0x29/frame 0xd70fc900
> > > > panic(c0833a0b,c2aae178,250,0,c2af80d4,...) at panic+0xc9/frame 0xd70fc924
> > > > ffs_blkfree_cg(250,0,8000,49f,d70fcad0,...) at ffs_blkfree_cg+0x399/frame 0xd70fc9c8
> > > > ffs_blkfree(c2b35100,c2af8000,c2b0d470,250,0,...) at ffs_blkfree+0xad/frame 0xd70fca00
> > > > indir_trunc(fffa3ff4,ffffffff,0,8000,0,...) at indir_trunc+0x658/frame 0xd70fcae0
> > > > indir_trunc(ffffdff3,ffffffff,c072df0a,c2d68d00,c087abd8,...) at indir_trunc+0x514/frame 0xd70fcbc0
> > > > handle_workitem_freeblocks(0,d70fcc4c,2,246,c2ab1000,...) at handle_workitem_freeblocks+0x2dc/frame 0xd70fcc24
> > > > process_worklist_item(0,0,0,c086ae78,0,...) at process_worklist_item+0x27a/frame 0xd70fcc6c
> > > > softdep_process_worklist(c2b36548,0,54,c0835825,64,...) at softdep_process_worklist+0x91/frame 0xd70fcc9c
> > > > softdep_flush(0,d70fcd08,0,c2aac2f0,0,...) at softdep_flush+0x3e4/frame 0xd70fcccc
> > > > fork_exit(c0738bb0,0,d70fcd08) at fork_exit+0xa2/frame 0xd70fccf4
> > > > fork_trampoline() at fork_trampoline+0x8/frame 0xd70fccf4
> > > > --- trap 0, eip = 0, esp = 0xd70fcd40, ebp = 0 ---
> > > > Uptime: 2d16h29m37s
> > > > Physical memory: 503 MB
> > > > Dumping 95 MB: 80 64 48 32 16
> > > >
> > > > No symbol "stopped_cpus" in current context.
> > > > No symbol "stoppcbs" in current context.
> > > > #0  doadump (textdump=1) at pcpu.h:249
> > > > 249     pcpu.h: No such file or directory.
> > > >         in pcpu.h
> > > > (kgdb) where
> > > > #0  doadump (textdump=1) at pcpu.h:249
> > > > #1  0xc05fdddd in kern_reboot (howto=260) at /src/src-9/sys/kern/kern_shutdown.c:449
> > > > #2  0xc05fe028 in panic (fmt=) at /src/src-9/sys/kern/kern_shutdown.c:637
> > > > #3  0xc0717899 in ffs_blkfree_cg (ump=0xc2b35100, fs=0xc2af8000, devvp=0xc2b0d470, bno=592,
> > > >     size=32768, inum=1183, dephd=0xd70fcad0) at /src/src-9/sys/ufs/ffs/ffs_alloc.c:2151
> > > > #4  0xc0717c8d in ffs_blkfree (ump=0xc2b35100, fs=0xc2af8000, devvp=0xc2b0d470, bno=592,
> > > >     size=32768, inum=1183, vtype=VREG, dephd=0xd70fcad0) at /src/src-9/sys/ufs/ffs/ffs_alloc.c:2280
> > > > #5  0xc0730348 in indir_trunc (freework=0xc2f99100, dbn=1642816, lbn=-376844)
> > > >     at /src/src-9/sys/ufs/ffs/ffs_softdep.c:7965
> > > > #6  0xc0730204 in indir_trunc (freework=0xc2f99100, dbn=1639680, lbn=-8205)
> > > >     at /src/src-9/sys/ufs/ffs/ffs_softdep.c:7946
> > > > #7  0xc07324bc in handle_workitem_freeblocks (freeblks=0xc2fc1e00, flags=512)
> > > >     at /src/src-9/sys/ufs/ffs/ffs_softdep.c:7588
> > > > #8  0xc0730dfa in process_worklist_item (mp=0xc2b36548, target=10, flags=512)
> > > >     at /src/src-9/sys/ufs/ffs/ffs_softdep.c:1774
> > > > #9  0xc07360c1 in softdep_process_worklist (mp=0xc2b36548, full=0)
> > > >     at /src/src-9/sys/ufs/ffs/ffs_softdep.c:1558
> > > > #10 0xc0738f94 in softdep_flush () at /src/src-9/sys/ufs/ffs/ffs_softdep.c:1414
> > > > #11 0xc05d1b82 in fork_exit (callout=0xc0738bb0 , arg=0x0, frame=0xd70fcd08)
> > > >     at /src/src-9/sys/kern/kern_fork.c:988
> > > > #12 0xc07ba904 in fork_trampoline () at /src/src-9/sys/i386/i386/exception.s:279
> > > > (kgdb) up 10
> > > > #10 0xc0738f94 in softdep_flush () at /src/src-9/sys/ufs/ffs/ffs_softdep.c:1414
> > > > 1414            progress += softdep_process_worklist(mp, 0);
> > > >
> > > > -Andre
> > >
> > > This looks unrelated, and exactly this panic is usually has one of two
> > > causes:
> > > - corrupted filesystem, run fsck to recheck it;
> >
> > root@palveli:~>fsck /dev/stripe/p
> > ** /dev/stripe/p
> > ** Last Mounted on /palveli
> > ** Phase 1 - Check Blocks and Sizes
> > ** Phase 2 - Check Pathnames
> > ** Phase 3 - Check Connectivity
> > ** Phase 4 - Check Reference Counts
> > ** Phase 5 - Check Cyl groups
> > 9895 files, 2039706 used, 15697693 free (5397 frags, 1961537 blocks, 0.0% fragmentation)
> >
> > ***** FILE SYSTEM IS CLEAN *****
>
> Taken from your previous mail (showing only UFS stuff):
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2013-June/073817.html
>
> >>>> fstab:
> >>>> ------
> >>>> /dev/da0s1a   /        ufs  noatime,rw                               0 1
> >>>> /dev/da0s1d   /usr     ufs  noatime,rw                               0 2
> >>>> /dev/da0s1e   /var     ufs  noatime,nosuid,rw                        0 2
> >>>> /dev/da10p1   /share2  ufs  suiddir,groupquota,noatime,nosuid,rw     0 2
> >>>> /dev/da10p2   /raid2   ufs  userquota,noatime,nosuid,rw              0 2
>
> Where is gstripe(8) in that picture? Are you **sure** this is the same
> system? Surely I'm missing something here...

It is the same system that produced the (bad) dump in my previous mail
(the one with the bcopy problem). It is NOT the same system that we
used for finding out why it didn't dump (which we have now found out:
it was due to the spun-down da1).

Just for the sake of clarity: there are two systems showing this problem
when running the daily snapshot. Since users complained about these
disruptions, I have moved the important stuff from one machine to the
other (where I disabled snapshot generation) so that I can concentrate
on this one (the one the dumps belong to) for finding the problem.

>
> Can you provide details of the stripe, specifically "gstripe list" so I
> can see what the disks are and then ask you for "smartctl -a" output for
> each of them (to try and rule out disk-level problems that may be
> causing oddities at the layer underneathe the filesystem (sometimes fsck
> will not catch this))?
Here is "gstripe list":

Geom name: p
State: UP
Status: Total=2, Online=2
Type: AUTOMATIC
Stripesize: 32768
ID: 2179163030
Providers:
1. Name: stripe/p
   Mediasize: 72802893824 (67G)
   Sectorsize: 512
   Stripesize: 32768
   Stripeoffset: 0
   Mode: r0w0e0
Consumers:
1. Name: da10
   Mediasize: 36401479680 (33G)
   Sectorsize: 512
   Mode: r0w0e0
   Number: 0
2. Name: da11
   Mediasize: 36401479680 (33G)
   Sectorsize: 512
   Mode: r0w0e0
   Number: 1

The disks are old but seem to work properly:

da10 at ahc1 bus 0 scbus1 target 0 lun 0
da10: Fixed Direct Access SCSI-3 device
da10: 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
da10: Command Queueing enabled
da10: 34715MB (71096640 512 byte sectors: 255H 63S/T 4425C)
da11 at ahc1 bus 0 scbus1 target 1 lun 0
da11: Fixed Direct Access SCSI-3 device
da11: 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
da11: Command Queueing enabled
da11: 34715MB (71096640 512 byte sectors: 255H 63S/T 4425C)

On both disks the PER bit is set, so I would see any read errors
even if they were retried or ECC-corrected (there haven't been any
for ages).

When the snapshot problem appeared for the first time, I also rebuilt
the fs from scratch.

-Andre
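
For anyone cross-checking the numbers in this mail, here is a minimal Python sketch. It recomputes the consumer and provider sizes from the sector count and stripe size reported above, and converts three of the hexadecimal words from the ffs_blkfree_cg line of the DDB backtrace to decimal. Two things in it are assumptions rather than facts from this thread: the sizing rule (one metadata sector dropped per consumer, the rest rounded down to whole stripes), and the idea that the words 250, 8000 and 49f correspond to bno, size and inum in the kgdb output. The function names are made up for illustration.

```python
# Values copied from the "gstripe list" output and the da(4) probe lines.
SECTOR_SIZE = 512            # bytes, "Sectorsize"
STRIPE_SIZE = 32768          # bytes, "Stripesize"
SECTORS_PER_DISK = 71096640  # "34715MB (71096640 512 byte sectors ...)"
N_DISKS = 2                  # da10 and da11


def disk_bytes(sectors, sector_size=SECTOR_SIZE):
    """Raw size of one consumer in bytes."""
    return sectors * sector_size


def stripe_bytes(disk_size, n_disks,
                 stripe_size=STRIPE_SIZE, sector_size=SECTOR_SIZE):
    """Provider size under the ASSUMED rule: drop one metadata sector per
    consumer, round down to whole stripes, multiply by the disk count."""
    usable = disk_size - sector_size
    usable -= usable % stripe_size
    return usable * n_disks


def decode_hex_words(words):
    """Convert DDB's bare hexadecimal stack words to decimal integers."""
    return [int(w, 16) for w in words]


if __name__ == "__main__":
    per_disk = disk_bytes(SECTORS_PER_DISK)
    print("consumer bytes:", per_disk)
    print("provider bytes:", stripe_bytes(per_disk, N_DISKS))
    # Words taken from "ffs_blkfree_cg(250,0,8000,49f,d70fcad0,...)".
    print("decoded words: ", decode_hex_words(["250", "8000", "49f"]))
```

Under these assumptions the results agree with the reported figures: 36401479680 bytes per consumer, 72802893824 bytes for stripe/p, and 592, 32768 and 1183 for the decoded words, which are the bno, size and inum values kgdb prints for frame #3.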