From: Matthew Lear
To: Jeremy Chadwick
Cc: freebsd-stable@freebsd.org
Subject: Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
Date: Thu, 24 Jun 2010 23:06:22 +0100
Message-ID: <1277417182.1874.30.camel@almscliff.bubblegen.co.uk>
In-Reply-To: <20100624181535.GA58443@icarus.home.lan>
List-Id: Production branch of FreeBSD source code

On Thu, 2010-06-24 at 11:15 -0700, Jeremy Chadwick wrote:
> On Thu, Jun 24, 2010 at 06:52:14PM +0100, Matthew Lear wrote:
> > On Tue, 2010-06-22 at 20:04 +0100, Bob Bishop wrote:
> > > Hi,
> > >
> > > On 22 Jun 2010, at 08:45, Jeremy Chadwick wrote:
> > >
> > > > On Mon, Jun 21, 2010 at 10:33:12PM +0100, Matthew Lear wrote:
> > > >> [tale of woe elided]
> > > >
> > > > I don't really have any other thoughts on the matter, sadly.
> > > > [helpful suggestions elided]
> > > >
> > > > Anyone else have ideas/recommendations?
> > >
> > > The disks sure look OK. I wouldn't rule out the controller(s), I've had various chipsets fail in odd ways.
> > >
> >
> > Thanks Bob. I think we all thought the same.
> > I've actually just rebooted the machine and FreeBSD no longer boots.
> > This isn't what I was expecting at all. Something has clearly gone wrong
> > with some file system metadata.
> >
> > When I commissioned the machine I installed an 'early' bootloader
> > (apologies for perhaps using an incorrect term) which boots FreeBSD by
> > default (F1 option) or from Drive 1 (F5). Drive 1 is the DVD drive.
>
> I believe this is the boot0 stage of the FreeBSD bootstrap process,
> otherwise known as "BootMgr" during the OS installation. I tend to
> avoid this and pick "Standard" instead, which lets the system boot right
> into boot2/loader.
>
> > It appears to be the case that the early bootloader tries to boot
> > FreeBSD and fails. I get the messages:
> >
> > error 1 lba 795079
> > Invalid format
> >
> > FreeBSD/i386 boot
> > Default: 0:ad(0,a)/boot/kernel/kernel
> > boot:
> > error 1 lba 786815
> > No /boot/kernel/kernel
> >
> > FreeBSD/i386 boot
> > Default: 0:ad(0,a)/boot/kernel/kernel
> > boot:
> >
> > ...and I'm at a boot prompt.
>
> You're at the boot0 stage. The bootstrap stage looks wrong: this should
> be 0:ad(0,a)/boot/loader, not /boot/kernel/kernel.
> You should load the kernel from boot2/loader, not boot0.
>
> After you powered off the system, did you physically remove the ad0
> disk, or is it still in the system?
>

It's still in the system. Given that the disk is OK according to SMART,
I was of the [probably naive] assumption that I'd be able to boot up
normally, access the array on ar0, re-sync the array and carry on as
normal, monitoring any further errors.

> I would recommend taking ad0 out of the picture (power down the machine
> and physically unplug it), and make sure your BIOS is set to boot from
> the first hard disk *and* the 2nd hard disk. "Hard disk" in this
> context means "any disk that's part of the RAID-1 array". You want to
> make sure your other disks (whatever that thing is on ata0-slave, and
> the backup disk you have on ad1) *are not* bootable from the BIOS. If
> they've ever been used as bootable disks in the past, then you should
> have cleared the MBR on them to ensure they couldn't be booted by the
> BIOS.

Understood.

>
> What I'm documenting here is the need to make sure that you don't boot
> the wrong device/disk. I'm talking about what the *BIOS* boots, not the
> FreeBSD boot0 bootstrap.
>
> You should keep the 2nd disk in the RAID-1 mirror connected to its
> current SATA port; do not move it to what ad0 was connected to.
>
> > So, given that ad0 was the failed disk, the bootloader has failed to
> > find specific boot data on ad0 and dropped me into a boot prompt.
>
> Actually, it's reporting an I/O error at a specific LBA, indicating it
> can't load the kernel.
>
> > I'm tempted to replace the boot line with 0:ad(2,a)/boot/kernel/kernel
> > or should that be 2:ad(0,a)/boot/kernel/kernel but I'm a little
> > suspicious of doing anything at this point?
>
> I believe you want 0:ad(2,a)/boot/loader, but you'll have to enter this
> every time unless you follow what I wrote above (re: BIOS disk boot
> order).

Again, all understood.
I gave this a whirl and saw several ad0 timeout messages at various
LBAs; the boot then hung and dropped me into single user mode.
atacontrol list showed no devices attached to channel 0, which I thought
was rather odd. I've no idea if this is indicative of a hw failure or
not. Further investigation is required.

> > Can anybody offer any guidance of what I can do to restore my system? I
> > was able to shut down the machine cleanly (shutdown -p now) and despite
> > the RAID mirror going offline, everything seemed to be behaving normally
> > (expected I guess given that I just lost some redundancy).
> >
> > I'm just that little bit more worried now :-( If the disks are ok, what
> > on earth could have happened and more importantly, how can I restore
> > what was an operational system when I shut it down?!
>
> At this point you need to make a judgement call: which are you going to
> spend more time doing: a) futzing around with this weird situation, or
> b) reinstalling everything and restoring data from backups?
>
> If I was in your shoes at this point, I'd probably choose (b) and go
> with installing 8.1-RC1 using gmirror for the RAID-1 capability.

That's probably fair enough, but I'd like to know what has happened (or
rather what FreeBSD has done) to my machine. Given that the apparently
faulty disk is not faulty, something (or probably more accurately, the
OS) has written some absolute LBA values to disk with the intent of
accessing these. Yes, the disk has indicated that there is an error, but
as to why, well, that's the question :-)

IMO it's all well and good saying upgrade to the next stable release,
but that's not actually finding the cause and trying to resolve the
problem in a sensible manner. I'm fortunate enough that I can easily
handle a bit of downtime on the machine. You're absolutely right in
saying that the setup should have been tested prior to commissioning. I
agree completely.
However, it's a server that I run at home and I'm not an IT admin. I
don't mind getting my hands dirty and do try to learn from experience -
hopefully! :-)

> There isn't much else I can say about the issue, other than that proper
> failure testing may have caught this before it was too late. If there's
> anything positive to take away from this experience, it's that. :-)
>

Absolutely.
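
For anyone reading this in the archive: the question above about
0:ad(2,a) versus 2:ad(0,a) comes down to the boot: prompt syntax, which
boot(8) gives as bios_drive:interface(unit,[slice,]part) followed by the
filename. The Python sketch below only splits such a string into its
fields so the two candidates can be compared side by side. It is an
illustration, not a FreeBSD tool; the names BootSpec and parse_boot_spec
are made up for this example, and the pattern assumes partition letters
a to h and a filename starting with "/".

```python
import re
from typing import NamedTuple, Optional


class BootSpec(NamedTuple):
    """Fields of a boot: prompt entry (hypothetical helper type)."""
    bios_drive: int           # drive number as seen by the BIOS
    interface: str            # controller type, e.g. "ad" or "da"
    unit: int                 # unit number of the drive on that interface
    slice_no: Optional[int]   # optional slice number
    partition: str            # BSD partition letter
    filename: str             # file to load, e.g. /boot/loader


# Assumed shape: bios_drive:interface(unit,[slice,]part)filename
_SPEC_RE = re.compile(
    r"(?P<drive>\d+):"
    r"(?P<iface>[a-z]+)"
    r"\((?P<unit>\d+),"
    r"(?:(?P<slice>\d+),)?"
    r"(?P<part>[a-h])\)"
    r"\s*(?P<file>/\S+)"
)


def parse_boot_spec(text: str) -> BootSpec:
    """Split a boot: prompt entry into its fields, or raise ValueError."""
    match = _SPEC_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a boot spec: {text!r}")
    raw_slice = match.group("slice")
    return BootSpec(
        bios_drive=int(match.group("drive")),
        interface=match.group("iface"),
        unit=int(match.group("unit")),
        slice_no=int(raw_slice) if raw_slice is not None else None,
        partition=match.group("part"),
        filename=match.group("file"),
    )


if __name__ == "__main__":
    for candidate in ("0:ad(2,a)/boot/loader", "2:ad(0,a)/boot/kernel/kernel"):
        print(candidate, "->", parse_boot_spec(candidate))
```

Read this way, 0:ad(2,a) names BIOS drive 0 and ad unit 2, while
2:ad(0,a) names BIOS drive 2 and ad unit 0.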