From owner-freebsd-stable@FreeBSD.ORG Mon Nov 26 02:21:36 2007 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BDC5B16A418 for ; Mon, 26 Nov 2007 02:21:36 +0000 (UTC) (envelope-from jdc@parodius.com) Received: from mx01.sc1.parodius.com (mx01.sc1.parodius.com [72.20.106.3]) by mx1.freebsd.org (Postfix) with ESMTP id A151013C43E for ; Mon, 26 Nov 2007 02:21:36 +0000 (UTC) (envelope-from jdc@parodius.com) Received: by mx01.sc1.parodius.com (Postfix, from userid 1000) id 6832F1CC079; Sun, 25 Nov 2007 18:21:36 -0800 (PST) Date: Sun, 25 Nov 2007 18:21:36 -0800 From: Jeremy Chadwick To: freebsd-stable@freebsd.org Message-ID: <20071126022136.GA1564@eos.sc1.parodius.com> References: <20071107191611.GA1400@eos.sc1.parodius.com> <20071107232328.GA1678@eos.sc1.parodius.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20071107232328.GA1678@eos.sc1.parodius.com> User-Agent: Mutt/1.5.16 (2007-06-09) Subject: Re: RELENG_6 kernel panic + savecore(8) problem X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Nov 2007 02:21:36 -0000 On Wed, Nov 07, 2007 at 03:23:28PM -0800, Jeremy Chadwick wrote: > On Wed, Nov 07, 2007 at 11:16:11AM -0800, Jeremy Chadwick wrote: > > Tracing pid 3 tid 100001 td 0xc7c6ad80 > > kdb_enter(3228441820,3228796672,3228487817,3867634632,256,...) at kdb_enter+48 > > panic(3228487817,3426817152,256,3228643296,0,...) at panic+206 > > handle_written_inodeblock(3459887104,3688934424,3226775710,3228787204,3228175693,...) at handle_written_inodeblock+1503 > > softdep_disk_write_complete(3688934424,3227842097,3356275348,3867634836,3226342800,...) at softdep_disk_write_complete+241 > > bufdone(3688934424,0,3867634856,3226352850,3356275348,...) at bufdone+126 > > g_vfs_done(3356275348,0,0,3352445440,3355957180) at g_vfs_done+198 > > biodone(3356275348,3228786984,588,3228423470,100,...) at biodone+178 > > g_io_schedule_up(3351686528,76,3351679512,3226344072,3867634980,...) at g_io_schedule_up+137 > > g_up_procbody(0,3867635000,0,0,0,...) at g_up_procbody+122 > > fork_exit(3226344072,0,3867635000) at fork_exit+122 > > fork_trampoline() at fork_trampoline+8 > > A follow-up to this: > > It appears that somehow a few of the filesystems on the disk (it's a > single-disk system) were suffering from some bizarre form of soft update > corruption. And tonight we had the same problem happen again. The backtrace when the machine panic'd is identical: Tracing pid 3 tid 100001 td 0xc7c6ed80 kdb_enter(c06e475e,c073ade0,c06efb55,e6876bc8,100,...) at kdb_enter+0x30 panic(c06efb55,ce04b280,100,c07156c0,0,...) at panic+0xce handle_written_inodeblock(c858d200,dbda0a70,c07388e4,c06a3e4a,e6876c30,...) at handle_written_inodeblock+0x5df softdep_disk_write_complete(dbda0a70,c0652591,c80e65ac,e6876c94,c04e16c4,...) at softdep_disk_write_complete+0xf1 bufdone(dbda0a70,0,e6876ca8,c04e3e06,c80e65ac,...) at bufdone+0x7e g_vfs_done(c80e65ac,0,0,c7d28200,c80a418c) at g_vfs_done+0xc6 biodone(c80e65ac,c0738808,24c,c06dff1c,64,...) at biodone+0xb2 g_io_schedule_up(c7c6ed80,4c,c7c6d218,c04e1bbc,e6876d24,...) at g_io_schedule_up+0x89 g_up_procbody(0,e6876d38,0,0,0,...) at g_up_procbody+0x7a fork_exit(c04e1bbc,0,e6876d38) at fork_exit+0x7a fork_trampoline() at fork_trampoline+0x8 Background fsck reported similar weirdness until I brought down the system into single-user, fsck'd things, and brought the system back up again. Also like last time, savecore found no cores on /dev/ad0s1b, despite one being done after doing "panic" in DDB. I believe I may have figured out why that's occurring, though: Using `rcorder /etc/rc.d/*` I'm able to see that the following rc scripts are run in this respective order: /etc/rc.d/dumpon <-- Runs `dumpon -v /dev/ad0s1b` /etc/rc.d/swap1 <-- Runs `swapon -a` /etc/rc.d/addswap <-- Adds additional swap spaces /etc/rc.d/savecore <-- Runs `savecore /var/crash /dev/ad0s1b` I believe the problem is that /etc/rc.d/swap1 is being run before savecore. I'm guessing that swapon(8) actually destroys/clobbers the existing saved kernel panic/core data, thus one will never get a coredump in /var/crash. I believe some re-organisation of rcorder(8) arguments in the files is in order, but I don't know what should be changed to what. I'll submit a PR for the above, because IMHO that's a serious one. -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB |