Date:      Thu, 24 Jun 2010 17:22:41 -0500
From:      Adam Vande More <amvandemore@gmail.com>
To:        Matthew Lear <matt@bubblegen.co.uk>, Jeremy Chadwick <freebsd@jdc.parodius.com>, freebsd-stable@freebsd.org
Subject:   Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
Message-ID:  <AANLkTimo1Vb461DHw3ZXNwK5BxDcgzKSkdxc3Dnqizge@mail.gmail.com>
In-Reply-To: <1277417182.1874.30.camel@almscliff.bubblegen.co.uk>
References:  <1276844904.7519.19.camel@almscliff.bubblegen.co.uk> <20100618082127.GA34578@icarus.home.lan> <1276876031.7519.39.camel@almscliff.bubblegen.co.uk> <20100618174208.GA47470@icarus.home.lan> <1276889330.2210.44.camel@almscliff.bubblegen.co.uk> <1277155992.1860.3.camel@almscliff.bubblegen.co.uk> <20100622074541.GA71157@icarus.home.lan> <82A96ECD-676C-4A4D-A328-0CFAABD64D50@gid.co.uk> <1277401934.1874.12.camel@almscliff.bubblegen.co.uk> <20100624181535.GA58443@icarus.home.lan> <1277417182.1874.30.camel@almscliff.bubblegen.co.uk>

I haven't followed the entire thread, but I wanted to point out something
important to remember: SMART is not a reliable indicator of failure.
It's certainly better than listening to the drive, but it picks up fewer
than half of drive failures. A few years ago Google released a fairly
in-depth study of failure rates among the disks in their data centers.
You might find it interesting.

On 6/24/10, Matthew Lear <matt@bubblegen.co.uk> wrote:
> On Thu, 2010-06-24 at 11:15 -0700, Jeremy Chadwick wrote:
>> On Thu, Jun 24, 2010 at 06:52:14PM +0100, Matthew Lear wrote:
>> > On Tue, 2010-06-22 at 20:04 +0100, Bob Bishop wrote:
>> > > Hi,
>> > >
>> > > On 22 Jun 2010, at 08:45, Jeremy Chadwick wrote:
>> > >
>> > > > On Mon, Jun 21, 2010 at 10:33:12PM +0100, Matthew Lear wrote:
>> > > >> [tale of woe elided]
>> > > >
>> > > > I don't really have any other thoughts on the matter, sadly.
>> > > > [helpful suggestions elided]
>> > > >
>> > > > Anyone else have ideas/recommendations?
>> > >
>> > > The disks sure look OK. I wouldn't rule out the controller(s), I've
>> > > had various chipsets fail in odd ways.
>> > >
>> >
>> > Thanks Bob. I think we all thought the same.
>> > I've actually just rebooted the machine and FreeBSD no longer boots.
>> > This isn't what I was expecting at all. Something has clearly gone wrong
>> > with some file system metadata.
>> >
>> > When I commissioned the machine I installed an 'early' bootloader
>> > (apologies for perhaps using an incorrect term) which boots FreeBSD by
>> > default (F1 option) or from Drive 1 (F5). Drive 1 is the DVD drive.
>>
>> I believe this is the boot0 stage of the FreeBSD bootstrap process,
>> otherwise known as "BootMgr" during the OS installation.  I tend to
>> avoid this and pick "Standard" instead, which lets the system boot right
>> into boot2/loader.
>>
>> > It appears to be the case that the early bootloader tries to boot
>> > FreeBSD and fails. I get the messages:
>> >
>> > error 1 lba 795079
>> > Invalid format
>> >
>> > FreeBSD/i386 boot
>> > Default: 0:ad(0,a)/boot/kernel/kernel
>> > boot:
>> > error 1 lba 786815
>> > No /boot/kernel/kernel
>> >
>> > FreeBSD/i386 boot
>> > Default: 0:ad(0,a)/boot/kernel/kernel
>> > boot:
>> >
>> > ...and I'm at a boot prompt.
>>
>> You're at the boot0 stage.  The bootstrap stage looks wrong: this should
>> be 0:ad(0,a)/boot/loader, not /boot/kernel/kernel.  You should load the
>> kernel from boot2/loader, not boot0.
>>
>> After you powered off the system, did you physically remove the ad0
>> disk, or is it still in the system?
>>
>
> It's still in the system. Given that the disk is ok relative to SMART, I
> was of the [probably naive] assumption that I'd be able to boot up
> normally, access the array on ar0, re-sync the array and carry on as
> normal monitoring any further errors.
>
>> I would recommend taking ad0 out of the picture (power down the machine
>> and physically unplug it), and make sure your BIOS is set to boot from
>> the first hard disk *and* the 2nd hard disk.  "Hard disk" in this
>> context means "any disk that's part of the RAID-1 array".  You want to
>> make sure your other disks (whatever that thing is on ata0-slave, and
>> the backup disk you have on ad1) *are not* bootable from the BIOS.  If
>> they've ever been used as bootable disks in the past, then you should
>> have cleared the MBR on them to ensure they couldn't be booted by the
>> BIOS.
>
> Understood.
>
>>
>> What I'm documenting here is the need to make sure that you don't boot
>> the wrong device/disk.  I'm talking about what the *BIOS* boots, not the
>> FreeBSD boot0 bootstrap.
>>
>> You should keep the 2nd disk in the RAID-1 mirror connected to its
>> current SATA port; do not move it to what ad0 was connected to.
>>
>> > So, given that ad0 was the failed disk, the bootloader has failed to
>> > find specific boot data on ad0 and dropped me into a boot prompt.
>>
>> Actually, it's reporting an I/O error at a specific LBA, indicating it
>> can't load the kernel.
>>
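To get a feel for where on the disk those two "error 1 lba" messages point, the LBAs can be turned into byte offsets. This is only a rough sketch: it assumes 512-byte logical sectors, and I have not checked whether the bootstrap counts the LBA from the start of the disk or from the start of the slice. The function name is made up for the example.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def lba_to_byte_offset(lba: int, sector_size: int = SECTOR_SIZE) -> int:
    """Return the byte offset at which the given logical block starts."""
    if lba < 0 or sector_size <= 0:
        raise ValueError("lba must be >= 0 and sector_size must be > 0")
    return lba * sector_size


if __name__ == "__main__":
    # The two LBAs reported in the boot messages quoted above.
    for lba in (795079, 786815):
        offset = lba_to_byte_offset(lba)
        print(f"lba {lba}: byte offset {offset} (~{offset / 2**20:.1f} MiB)")
```

Under that assumption both blocks sit a little under 400 MiB into the device, which is at least consistent with them belonging to files under /boot on the first partition.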
>> > I'm tempted to replace the boot line with 0:ad(2,a)/boot/kernel/kernel
>> > or should that be 2:ad(0,a)/boot/kernel/kernel but I'm a little
>> > suspicious of doing anything at this point?
>>
>> I believe you want 0:ad(2,a)/boot/loader, but you'll have to enter this
>> every time unless you follow what I wrote above (re: BIOS disk boot
>> order).
>
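On the 0:ad(2,a) versus 2:ad(0,a) question: as far as I understand the boot prompt syntax in boot(8), the string is bios_drive:interface(unit,[slice,]part)filename, so the number before the colon is the BIOS drive number and the first number inside the parentheses is the unit that FreeBSD will use for the root device. The sketch below just splits a string into those fields so the two variants can be compared side by side. The class and function names are made up for illustration, and the regular expression is my own approximation of the syntax, not something taken from the boot code.

```python
import re
from typing import NamedTuple, Optional


class BootSpec(NamedTuple):
    bios_drive: Optional[int]    # number before the colon, if present
    interface: str               # e.g. "ad"
    unit: int                    # first number inside the parentheses
    slice_number: Optional[int]  # optional slice, e.g. the 1 in ad(0,1,a)
    partition: str               # partition letter, e.g. "a"
    path: str                    # file to load, e.g. "/boot/loader"


# Approximation of: [bios_drive:]interface(unit,[slice,]part)[filename]
_SPEC = re.compile(
    r"^(?:(?P<drive>\d+):)?"
    r"(?P<iface>[a-z]+)"
    r"\((?P<unit>\d+),(?:(?P<slice>\d+),)?(?P<part>[a-h])\)"
    r"(?P<path>/\S*)?$"
)


def parse_boot_spec(spec: str) -> BootSpec:
    """Split a boot prompt string into its fields, or raise ValueError."""
    m = _SPEC.match(spec.strip())
    if m is None:
        raise ValueError(f"not a recognisable boot spec: {spec!r}")
    return BootSpec(
        bios_drive=int(m["drive"]) if m["drive"] is not None else None,
        interface=m["iface"],
        unit=int(m["unit"]),
        slice_number=int(m["slice"]) if m["slice"] is not None else None,
        partition=m["part"],
        path=m["path"] or "",
    )


if __name__ == "__main__":
    for spec in ("0:ad(2,a)/boot/loader", "2:ad(0,a)/boot/kernel/kernel"):
        print(spec, "->", parse_boot_spec(spec))
```

Read that way, 0:ad(2,a)/boot/loader means "read from BIOS drive 0, and treat the root device as ad2", while 2:ad(0,a)/... would ask the BIOS for its third drive and still call the device ad0.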
> Again, all understood. I gave this a whirl and saw several ad0 timeout
> messages at various LBA, the system boot up hung and dropped me into
> single user mode. atacontrol list showed no devices attached to channel
> 0 which I thought was rather odd. I've no idea if this is indicative of
> a hw failure or not. Further investigation is required.
>
>> > Can anybody offer any guidance of what I can do to restore my system? I
>> > was able to shut down the machine cleanly (shutdown -p now) and despite
>> > the RAID mirror going offline, everything seemed to be behaving normally
>> > (expected I guess given that I just lost some redundancy).
>> >
>> > I'm just that little bit more worried now :-( If the disks are ok, what
>> > on earth could have happened and more importantly, how can I restore
>> > what was an operational system when I shut it down?!
>>
>> At this point you need to make a judgement call: which are you going to
>> spend more time doing: a) futzing around with this weird situation, or
>> b) reinstalling everything and restoring data from backups?
>>
>> If I was in your shoes at this point, I'd probably choose (b) and go
>> with installing 8.1-RC1 using gmirror for the RAID-1 capability.
>
> That's probably fair enough but I'm of the opinion that I'd like to know
> what has happened (or rather what FreeBSD has done) to my machine. Given
> that the apparently faulty disk is not faulty, something (or probably
> more accurately, the OS) has written some absolute LBA values to disk
> with the intent of accessing these. Yes the disk has indicated that
> there is an error but as to why, well that's the question :-)
>
> IMO it's all fine and well saying upgrade to the next stable release but
> that's not actually finding the cause and trying to resolve the problem
> in a sensible manner. I'm fortunate enough that I can easily handle a
> bit of down time on the machine. You're absolutely right in saying that
> the set up should have been tested prior to commissioning. I agree
> completely. However, it's a server that I run at home, I'm not an IT
> admin, I don't mind getting my hands dirty and do try to learn from
> experience - hopefully! :-)
>
>> There isn't much else I can say about the issue, other than that proper
>> failure testing may have caught this before it was too late.  If there's
>> anything positive to take away from this experience, it's that.  :-)
>>
>
> Absolutely.
>

-- 
Sent from my mobile device

Adam Vande More


