From owner-freebsd-stable@FreeBSD.ORG Thu Jun 24 18:15:41 2010
Date: Thu, 24 Jun 2010 11:15:35 -0700
From: Jeremy Chadwick
To: Matthew Lear
Cc: freebsd-stable@freebsd.org
Subject: Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
Message-ID: <20100624181535.GA58443@icarus.home.lan>
In-Reply-To: <1277401934.1874.12.camel@almscliff.bubblegen.co.uk>
References: <1276844904.7519.19.camel@almscliff.bubblegen.co.uk>
 <20100618082127.GA34578@icarus.home.lan>
 <1276876031.7519.39.camel@almscliff.bubblegen.co.uk>
 <20100618174208.GA47470@icarus.home.lan>
 <1276889330.2210.44.camel@almscliff.bubblegen.co.uk>
 <1277155992.1860.3.camel@almscliff.bubblegen.co.uk>
 <20100622074541.GA71157@icarus.home.lan>
 <82A96ECD-676C-4A4D-A328-0CFAABD64D50@gid.co.uk>
 <1277401934.1874.12.camel@almscliff.bubblegen.co.uk>
List-Id: Production branch of FreeBSD source code

On Thu, Jun 24, 2010 at 06:52:14PM +0100, Matthew Lear wrote:
> On Tue, 2010-06-22 at 20:04 +0100, Bob Bishop wrote:
> > Hi,
> >
> > On 22 Jun 2010, at 08:45, Jeremy Chadwick wrote:
> >
> > > On Mon, Jun 21, 2010 at 10:33:12PM +0100, Matthew Lear wrote:
> > >> [tale of woe elided]
> > >
> > > I don't really have any other thoughts on the matter, sadly.
> > > [helpful suggestions elided]
> > >
> > > Anyone else have ideas/recommendations?
> >
> > The disks sure look OK. I wouldn't rule out the controller(s), I've
> > had various chipsets fail in odd ways.
> >
>
> Thanks Bob. I think we all thought the same.
> I've actually just rebooted the machine and FreeBSD no longer boots.
> This isn't what I was expecting at all. Something has clearly gone wrong
> with some file system metadata.
>
> When I commissioned the machine I installed an 'early' bootloader
> (apologies for perhaps using an incorrect term) which boots FreeBSD by
> default (F1 option) or from Drive 1 (F5). Drive 1 is the DVD drive.

I believe this is the boot0 stage of the FreeBSD bootstrap process,
otherwise known as "BootMgr" during the OS installation. I tend to
avoid this and pick "Standard" instead, which lets the system boot
right into boot2/loader.

> It appears to be the case that the early bootloader tries to boot
> FreeBSD and fails. I get the messages:
>
> error 1 lba 795079
> Invalid format
>
> FreeBSD/i386 boot
> Default: 0:ad(0,a)/boot/kernel/kernel
> boot:
> error 1 lba 786815
> No /boot/kernel/kernel
>
> FreeBSD/i386 boot
> Default: 0:ad(0,a)/boot/kernel/kernel
> boot:
>
> ...and I'm at a boot prompt.

You're at the boot0 stage. The bootstrap stage looks wrong: this should
be 0:ad(0,a)/boot/loader, not /boot/kernel/kernel. You should load the
kernel from boot2/loader, not boot0.
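
For reference, the string shown after "Default:" follows the layout
described in boot(8): BIOS drive number, a colon, the disk driver name,
then the unit (optionally a slice) and partition letter in parentheses,
then the path. The Python below is only an illustrative sketch of how
such a string breaks apart; it is not how the bootstrap code parses it.
The names BootSpec and parse_boot_spec are made up for this example,
and the pattern assumes every field except the slice is present, which
the real prompt does not require.

```python
import re
from typing import NamedTuple, Optional


class BootSpec(NamedTuple):
    """Fields of a boot prompt string such as 0:ad(2,a)/boot/loader."""
    bios_drive: int            # drive number as the BIOS counts it
    interface: str             # disk driver name, e.g. "ad"
    unit: int                  # unit number on that driver
    slice_num: Optional[int]   # optional slice number, None if omitted
    partition: str             # BSD partition letter, e.g. "a"
    path: str                  # file to load


# Simplified pattern (an assumption for illustration): the slice is the
# only optional field here.
_SPEC = re.compile(
    r"^(?P<drive>\d+):(?P<iface>[a-z]+)"
    r"\((?P<unit>\d+),(?:(?P<slice>\d+),)?(?P<part>[a-h])\)"
    r"(?P<path>/.*)$"
)


def parse_boot_spec(spec: str) -> BootSpec:
    """Split a boot prompt string into its fields, or raise ValueError."""
    m = _SPEC.match(spec.strip())
    if m is None:
        raise ValueError(f"unrecognised boot spec: {spec!r}")
    slice_text = m.group("slice")
    return BootSpec(
        bios_drive=int(m.group("drive")),
        interface=m.group("iface"),
        unit=int(m.group("unit")),
        slice_num=int(slice_text) if slice_text is not None else None,
        partition=m.group("part"),
        path=m.group("path"),
    )


if __name__ == "__main__":
    # The two candidates from this thread differ in which number is the
    # BIOS drive and which is the driver unit.
    for candidate in ("0:ad(2,a)/boot/loader", "2:ad(0,a)/boot/kernel/kernel"):
        print(candidate, "->", parse_boot_spec(candidate))
```

Read this way, 0:ad(2,a) means "BIOS drive 0, known to FreeBSD as ad2",
while 2:ad(0,a) means "BIOS drive 2, known to FreeBSD as ad0".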

After you powered off the system, did you physically remove the ad0
disk, or is it still in the system?

I would recommend taking ad0 out of the picture (power down the machine
and physically unplug it), and make sure your BIOS is set to boot from
the first hard disk *and* the 2nd hard disk. "Hard disk" in this
context means "any disk that's part of the RAID-1 array".

You want to make sure your other disks (whatever that thing is on
ata0-slave, and the backup disk you have on ad1) *are not* bootable
from the BIOS. If they've ever been used as bootable disks in the past,
then you should have cleared the MBR on them to ensure they couldn't be
booted by the BIOS.

What I'm documenting here is the need to make sure that you don't boot
the wrong device/disk. I'm talking about what the *BIOS* boots, not the
FreeBSD boot0 bootstrap.

You should keep the 2nd disk in the RAID-1 mirror connected to its
current SATA port; do not move it to what ad0 was connected to.

> So, given that ad0 was the failed disk, the bootloader has failed to
> find specific boot data on ad0 and dropped me into a boot prompt.

Actually, it's reporting an I/O error at a specific LBA, indicating it
can't load the kernel.

> I'm tempted to replace the boot line with 0:ad(2,a)/boot/kernel/kernel
> or should that be 2:ad(0,a)/boot/kernel/kernel but I'm a little
> suspicious of doing anything at this point?

I believe you want 0:ad(2,a)/boot/loader, but you'll have to enter this
every time unless you follow what I wrote above (re: BIOS disk boot
order).

> Can anybody offer any guidance of what I can do to restore my system? I
> was able to shut down the machine cleanly (shutdown -p now) and despite
> the RAID mirror going offline, everything seemed to be behaving normally
> (expected I guess given that I just lost some redundancy).
>
> I'm just that little bit more worried now :-( If the disks are ok, what
> on earth could have happened and more importantly, how can I restore
> what was an operational system when I shut it down?!

At this point you need to make a judgement call: which are you going to
spend more time doing:

a) futzing around with this weird situation, or
b) reinstalling everything and restoring data from backups?

If I was in your shoes at this point, I'd probably choose (b) and go
with installing 8.1-RC1 using gmirror for the RAID-1 capability.

There isn't much else I can say about the issue, other than that proper
failure testing may have caught this before it was too late. If there's
anything positive to take away from this experience, it's that. :-)

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |