Date:      Thu, 24 Jun 2010 23:06:22 +0100
From:      Matthew Lear <matt@bubblegen.co.uk>
To:        Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
Message-ID:  <1277417182.1874.30.camel@almscliff.bubblegen.co.uk>
In-Reply-To: <20100624181535.GA58443@icarus.home.lan>
References:  <1276844904.7519.19.camel@almscliff.bubblegen.co.uk> <20100618082127.GA34578@icarus.home.lan> <1276876031.7519.39.camel@almscliff.bubblegen.co.uk> <20100618174208.GA47470@icarus.home.lan> <1276889330.2210.44.camel@almscliff.bubblegen.co.uk> <1277155992.1860.3.camel@almscliff.bubblegen.co.uk> <20100622074541.GA71157@icarus.home.lan> <82A96ECD-676C-4A4D-A328-0CFAABD64D50@gid.co.uk> <1277401934.1874.12.camel@almscliff.bubblegen.co.uk> <20100624181535.GA58443@icarus.home.lan>

On Thu, 2010-06-24 at 11:15 -0700, Jeremy Chadwick wrote:
> On Thu, Jun 24, 2010 at 06:52:14PM +0100, Matthew Lear wrote:
> > On Tue, 2010-06-22 at 20:04 +0100, Bob Bishop wrote:
> > > Hi,
> > > 
> > > On 22 Jun 2010, at 08:45, Jeremy Chadwick wrote:
> > > 
> > > > On Mon, Jun 21, 2010 at 10:33:12PM +0100, Matthew Lear wrote:
> > > >> [tale of woe elided]
> > > > 
> > > > I don't really have any other thoughts on the matter, sadly.
> > > > [helpful suggestions elided]
> > > > 
> > > > Anyone else have ideas/recommendations?
> > > 
> > > The disks sure look OK. I wouldn't rule out the controller(s), I've had various chipsets fail in odd ways.
> > > 
> > 
> > Thanks Bob. I think we all thought the same.
> > I've actually just rebooted the machine and FreeBSD no longer boots.
> > This isn't what I was expecting at all. Something has clearly gone wrong
> > with some file system metadata.
> > 
> > When I commissioned the machine I installed an 'early' bootloader
> > (apologies for perhaps using an incorrect term) which boots FreeBSD by
> > default (F1 option) or from Drive 1 (F5). Drive 1 is the DVD drive.
> 
> I believe this is the boot0 stage of the FreeBSD bootstrap process,
> otherwise known as "BootMgr" during the OS installation.  I tend to
> avoid this and pick "Standard" instead, which lets the system boot right
> into boot2/loader.
> 
> > It appears to be the case that the early bootloader tries to boot
> > FreeBSD and fails. I get the messages:
> > 
> > error 1 lba 795079
> > Invalid format
> > 
> > FreeBSD/i386 boot
> > Default: 0:ad(0,a)/boot/kernel/kernel
> > boot:
> > error 1 lba 786815
> > No /boot/kernel/kernel
> > 
> > FreeBSD/i386 boot
> > Default: 0:ad(0,a)/boot/kernel/kernel
> > boot:
> > 
> > ...and I'm at a boot prompt.
> 
> You're at the boot0 stage.  The bootstrap stage looks wrong: this should
> be 0:ad(0,a)/boot/loader, not /boot/kernel/kernel.  You should load the
> kernel from boot2/loader, not boot0.
> 
> After you powered off the system, did you physically remove the ad0
> disk, or is it still in the system?
> 

It's still in the system. Given that the disk is OK according to SMART, I
was under the [probably naive] assumption that I'd be able to boot up
normally, access the array on ar0, re-sync the array and carry on as
normal, monitoring for any further errors.

> I would recommend taking ad0 out of the picture (power down the machine
> and physically unplug it), and make sure your BIOS is set to boot from
> the first hard disk *and* the 2nd hard disk.  "Hard disk" in this
> context means "any disk that's part of the RAID-1 array".  You want to
> make sure your other disks (whatever that thing is on ata0-slave, and
> the backup disk you have on ad1) *are not* bootable from the BIOS.  If
> they've ever been used as bootable disks in the past, then you should
> have cleared the MBR on them to ensure they couldn't be booted by the
> BIOS.

Understood.

> 
> What I'm documenting here is the need to make sure that you don't boot
> the wrong device/disk.  I'm talking about what the *BIOS* boots, not the
> FreeBSD boot0 bootstrap.
> 
> You should keep the 2nd disk in the RAID-1 mirror connected to its
> current SATA port; do not move it to what ad0 was connected to.
> 
> > So, given that ad0 was the failed disk, the bootloader has failed to
> > find specific boot data on ad0 and dropped me into a boot prompt.
> 
> Actually, it's reporting an I/O error at a specific LBA, indicating it
> can't load the kernel.
> 
> > I'm tempted to replace the boot line with 0:ad(2,a)/boot/kernel/kernel
> > or should that be 2:ad(0,a)/boot/kernel/kernel but I'm a little
> > suspicious of doing anything at this point?
> 
> I believe you want 0:ad(2,a)/boot/loader, but you'll have to enter this
> every time unless you follow what I wrote above (re: BIOS disk boot
> order).

Again, all understood. I gave this a whirl and saw several ad0 timeout
messages at various LBAs; the boot then hung and dropped me into
single-user mode. 'atacontrol list' showed no devices attached to
channel 0, which I thought was rather odd. I've no idea whether this
indicates a hardware failure or not. Further investigation is required.
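
As a rough illustration of how the two candidate boot lines discussed
above differ, the sketch below splits a boot spec into its fields. It
assumes the bios_drive:interface(unit,[slice,]part)filename layout
described in boot(8); parse_boot_spec and BootSpec are made-up names for
this example and are not part of FreeBSD.

```python
import re
from typing import NamedTuple, Optional


class BootSpec(NamedTuple):
    """Fields of a boot-prompt spec (hypothetical helper type)."""
    bios_drive: int          # drive number as the BIOS sees it
    interface: str           # controller type, e.g. "ad" or "da"
    unit: int                # unit number of the drive on that interface
    slice_no: Optional[int]  # optional slice number, None if omitted
    partition: str           # BSD partition letter, e.g. "a"
    path: str                # file to load


# Assumed layout: bios_drive:interface(unit,[slice,]part)filename
_SPEC = re.compile(
    r"^(?P<drive>\d+):(?P<iface>[a-z]+)"
    r"\((?P<unit>\d+),(?:(?P<slice>\d+),)?(?P<part>[a-h])\)"
    r"(?P<path>/\S*)$"
)


def parse_boot_spec(text: str) -> BootSpec:
    """Split a boot spec string into its fields, or raise ValueError."""
    m = _SPEC.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognised boot spec: {text!r}")
    slice_txt = m.group("slice")
    return BootSpec(
        bios_drive=int(m.group("drive")),
        interface=m.group("iface"),
        unit=int(m.group("unit")),
        slice_no=int(slice_txt) if slice_txt is not None else None,
        partition=m.group("part"),
        path=m.group("path"),
    )


if __name__ == "__main__":
    for spec in ("0:ad(2,a)/boot/loader", "2:ad(0,a)/boot/kernel/kernel"):
        print(spec, "->", parse_boot_spec(spec))
```

Under that reading, 0:ad(2,a) asks for unit 2 on the ad interface via
BIOS drive 0, whereas 2:ad(0,a) asks for unit 0 via BIOS drive 2, so the
two lines are not interchangeable.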

> > Can anybody offer any guidance of what I can do to restore my system? I
> > was able to shut down the machine cleanly (shutdown -p now) and despite
> > the RAID mirror going offline, everything seemed to be behaving normally
> > (expected I guess given that I just lost some redundancy).
> > 
> > I'm just that little bit more worried now :-( If the disks are ok, what
> > on earth could have happened and more importantly, how can I restore
> > what was an operational system when I shut it down?!
> 
> At this point you need to make a judgement call: which are you going to
> spend more time doing: a) futzing around with this weird situation, or
> b) reinstalling everything and restoring data from backups?
> 
> If I was in your shoes at this point, I'd probably choose (b) and go
> with installing 8.1-RC1 using gmirror for the RAID-1 capability.

That's probably fair enough but I'm of the opinion that I'd like to know
what has happened (or rather what FreeBSD has done) to my machine. Given
that the apparently faulty disk is not faulty, something (or probably
more accurately, the OS) has written some absolute LBA values to disk
with the intent of accessing these. Yes the disk has indicated that
there is an error but as to why, well that's the question :-)
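
To get a feel for where on the disk the two failing reads in the boot
messages fall, the sketch below converts the reported LBAs to byte
offsets. It assumes 512-byte sectors and that the numbers printed after
"lba" are absolute sector addresses; neither is confirmed anywhere in
this thread, and lba_to_offset is just an illustrative name.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def lba_to_offset(lba: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Return the byte offset at which the given sector starts."""
    return lba * sector_bytes


if __name__ == "__main__":
    # LBAs taken from the "error 1 lba ..." boot messages quoted above.
    lbas = (795079, 786815)
    for lba in lbas:
        off = lba_to_offset(lba)
        print(f"lba {lba}: byte offset {off} ({off / 2**20:.1f} MiB)")
    gap = abs(lbas[0] - lbas[1])
    print(f"distance between them: {gap} sectors, "
          f"{lba_to_offset(gap)} bytes")
```

If those assumptions hold, both reads land within a few megabytes of
each other near the start of the disk, which fits with both files living
under /boot on the same partition.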

IMO it's all well and good saying upgrade to the next stable release, but
that's not actually finding the cause and trying to resolve the problem
in a sensible manner. I'm fortunate enough that I can easily handle a
bit of down time on the machine. You're absolutely right in saying that
the set up should have been tested prior to commissioning. I agree
completely. However, it's a server that I run at home, I'm not an IT
admin, I don't mind getting my hands dirty and do try to learn from
experience - hopefully! :-) 

> There isn't much else I can say about the issue, other than that proper
> failure testing may have caught this before it was too late.  If there's
> anything positive to take away from this experience, it's that.  :-)
> 

Absolutely.



