Date:      Thu, 24 Jun 2010 11:15:35 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Matthew Lear <matt@bubblegen.co.uk>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
Message-ID:  <20100624181535.GA58443@icarus.home.lan>
In-Reply-To: <1277401934.1874.12.camel@almscliff.bubblegen.co.uk>
References:  <1276844904.7519.19.camel@almscliff.bubblegen.co.uk> <20100618082127.GA34578@icarus.home.lan> <1276876031.7519.39.camel@almscliff.bubblegen.co.uk> <20100618174208.GA47470@icarus.home.lan> <1276889330.2210.44.camel@almscliff.bubblegen.co.uk> <1277155992.1860.3.camel@almscliff.bubblegen.co.uk> <20100622074541.GA71157@icarus.home.lan> <82A96ECD-676C-4A4D-A328-0CFAABD64D50@gid.co.uk> <1277401934.1874.12.camel@almscliff.bubblegen.co.uk>

On Thu, Jun 24, 2010 at 06:52:14PM +0100, Matthew Lear wrote:
> On Tue, 2010-06-22 at 20:04 +0100, Bob Bishop wrote:
> > Hi,
> > 
> > On 22 Jun 2010, at 08:45, Jeremy Chadwick wrote:
> > 
> > > On Mon, Jun 21, 2010 at 10:33:12PM +0100, Matthew Lear wrote:
> > >> [tale of woe elided]
> > > 
> > > I don't really have any other thoughts on the matter, sadly.
> > > [helpful suggestions elided]
> > > 
> > > Anyone else have ideas/recommendations?
> > 
> > The disks sure look OK. I wouldn't rule out the controller(s), I've had various chipsets fail in odd ways.
> > 
> 
> Thanks Bob. I think we all thought the same.
> I've actually just rebooted the machine and FreeBSD no longer boots.
> This isn't what I was expecting at all. Something has clearly gone wrong
> with some file system metadata.
> 
> When I commissioned the machine I installed an 'early' bootloader
> (apologies for perhaps using an incorrect term) which boots FreeBSD by
> default (F1 option) or from Drive 1 (F5). Drive 1 is the DVD drive.

I believe this is the boot0 stage of the FreeBSD bootstrap process,
otherwise known as "BootMgr" during the OS installation.  I tend to
avoid this and pick "Standard" instead, which lets the system boot right
into boot2/loader.

> It appears to be the case that the early bootloader tries to boot
> FreeBSD and fails. I get the messages:
> 
> error 1 lba 795079
> Invalid format
> 
> FreeBSD/i386 boot
> Default: 0:ad(0,a)/boot/kernel/kernel
> boot:
> error 1 lba 786815
> No /boot/kernel/kernel
> 
> FreeBSD/i386 boot
> Default: 0:ad(0,a)/boot/kernel/kernel
> boot:
> 
> ...and I'm at a boot prompt.

You're at the boot0 stage.  The bootstrap stage looks wrong: this should
be 0:ad(0,a)/boot/loader, not /boot/kernel/kernel.  You should load the
kernel from boot2/loader, not boot0.

After you powered off the system, did you physically remove the ad0
disk, or is it still in the system?

I would recommend taking ad0 out of the picture (power down the machine
and physically unplug it), and make sure your BIOS is set to boot from
the first hard disk *and* the 2nd hard disk.  "Hard disk" in this
context means "any disk that's part of the RAID-1 array".  You want to
make sure your other disks (whatever that thing is on ata0-slave, and
the backup disk you have on ad1) *are not* bootable from the BIOS.  If
they've ever been used as bootable disks in the past, clear the MBR on
them so the BIOS cannot boot from them.

What I'm documenting here is the need to make sure that you don't boot
the wrong device/disk.  I'm talking about what the *BIOS* boots, not the
FreeBSD boot0 bootstrap.

You should keep the 2nd disk in the RAID-1 mirror connected to its
current SATA port; do not move it to what ad0 was connected to.

> So, given that ad0 was the failed disk, the bootloader has failed to
> find specific boot data on ad0 and dropped me into a boot prompt.

Actually, it's reporting an I/O error at a specific LBA, indicating it
can't read the file it is trying to load (here, the kernel) from disk.
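
To get a feel for where on the disk those two reads failed, the LBAs
from the error messages can be turned into byte offsets.  This is only a
rough sketch: it assumes 512-byte sectors and that the reported LBAs are
counted from the start of the disk, neither of which is confirmed
anywhere in this thread.  The helper name lba_to_offset is made up for
illustration.

```python
# Assumption: 512-byte sectors (not confirmed for these disks).
SECTOR_SIZE = 512


def lba_to_offset(lba: int, sector_size: int = SECTOR_SIZE) -> int:
    """Return the byte offset of the first byte of the given sector."""
    return lba * sector_size


# The two LBAs quoted in the boot error messages.
for lba in (795079, 786815):
    offset = lba_to_offset(lba)
    print(f"lba {lba}: byte offset {offset} (~{offset / 2**20:.1f} MiB)")
```

Under those assumptions both failures land within a few MiB of each
other, a little under 400 MiB into the disk.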

> I'm tempted to replace the boot line with 0:ad(2,a)/boot/kernel/kernel
> or should that be 2:ad(0,a)/boot/kernel/kernel but I'm a little
> suspicious of doing anything at this point?

I believe you want 0:ad(2,a)/boot/loader, but you'll have to enter this
every time unless you follow what I wrote above (re: BIOS disk boot
order).
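
The difference between 0:ad(2,a) and 2:ad(0,a) is easier to see once the
string is split into its fields.  As I read boot(8), the general form is
bios_drive:interface(unit,part)filename, where bios_drive is the number
the BIOS gives the disk and unit is the number FreeBSD will use for it.
The sketch below is a hypothetical parser (parse_boot_spec and BootSpec
are names invented here, not part of FreeBSD) that handles only the
short form used in this thread: no optional slice field and no boot
flags.

```python
import re
from typing import NamedTuple


class BootSpec(NamedTuple):
    bios_drive: int  # drive number as enumerated by the BIOS
    interface: str   # controller/driver type, e.g. "ad" for ATA
    unit: int        # unit number of the drive on that interface
    partition: str   # BSD partition letter, e.g. "a"
    path: str        # file to load from that partition


# Short form only: bios_drive:interface(unit,part)/path
_SPEC = re.compile(r"^(\d+):([a-z]+)\((\d+),([a-h])\)(/.*)$")


def parse_boot_spec(spec: str) -> BootSpec:
    """Split a boot-prompt specification into its fields."""
    m = _SPEC.match(spec)
    if m is None:
        raise ValueError(f"unrecognised boot spec: {spec!r}")
    drive, iface, unit, part, path = m.groups()
    return BootSpec(int(drive), iface, int(unit), part, path)


print(parse_boot_spec("0:ad(2,a)/boot/loader"))
print(parse_boot_spec("2:ad(0,a)/boot/kernel/kernel"))
```

Read this way, 0:ad(2,a) means "first disk the BIOS sees, which FreeBSD
will call ad2", while 2:ad(0,a) means "third disk the BIOS sees, which
FreeBSD will call ad0".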

> Can anybody offer any guidance of what I can do to restore my system? I
> was able to shut down the machine cleanly (shutdown -p now) and despite
> the RAID mirror going offline, everything seemed to be behaving normally
> (expected I guess given that I just lost some redundancy).
> 
> I'm just that little bit more worried now :-( If the disks are ok, what
> on earth could have happened and more importantly, how can I restore
> what was an operational system when I shut it down?!

At this point you need to make a judgement call.  Which would you rather
spend your time on: (a) futzing around with this weird situation, or
(b) reinstalling everything and restoring data from backups?

If I were in your shoes at this point, I'd probably choose (b) and go
with installing 8.1-RC1 using gmirror for the RAID-1 capability.

There isn't much else I can say about the issue, other than that proper
failure testing may have caught this before it was too late.  If there's
anything positive to take away from this experience, it's that.  :-)

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
