Date: Tue, 15 Jul 2003 13:59:08 -0700
From: Sumit Shah <shah@ucla.edu>
To: David Malone <dwmalone@maths.tcd.ie>
Cc: freebsd-hackers@freebsd.org
Subject: Re: RAID and NFS exports (Possible Data Corruption)
Message-ID: <2D5885DA-B707-11D7-9819-000393DB86CA@ucla.edu>
In-Reply-To: <20030715162006.GA47687@walton.maths.tcd.ie>
Thanks for the reply.

>> ad4: hard error reading fsbn 242727552
>
> The error means that the disk said that there was an error trying
> to read this block.  You say that when you rebooted, the controller
> said a disk had gone bad, so this would sort of confirm this.  (I
> could believe that restarting mountd might upset RAID stuff if
> there were a kernel bug, but it seems very unlikely it could cause
> a disk to go bad.)

The full error was something like this on _both_ of the identical
systems, even _before_ the reboot.  After this message we could not
read, write, or fsck /dev/ar0:

ad7: hard error reading fsbn 291786506 of 0-127 (ad7 bn 291786506; cn 289470 tn 11 sn 53) trying PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: hard error reading fsbn 291786586 of 0-127 (ad7 bn 291786586; cn 289470 tn 13 sn 7) status=59 error=40
ar0: ERROR - array broken

There was also a variety of messages like these:

Jul 14 02:55:39 thorimage1 /kernel: ad7: hard error reading fsbn 291786586 of 0-127 (ad7 bn 291786586; cn 289470 tn 13 sn 7) status=59 error=40

where the "ad7:" prefix named any of the six devices in the array,
seemingly at random.

> My best guess would be that you have a bad batch of disks that
> happen to have failed in similar ways.  It is possible that
> restarting mountd uncovered the errors, 'cos I think mountd
> internally does a remount of the filesystem in question and that
> might cause a chunk of stuff to be flushed out on to the disk,
> highlighting an error.
>
> (I had a bunch of the IBM "deathstar" disks fail on me within the
> space of a week or so, after they'd been in use for about six
> months.)

It certainly sounds reasonable that this problem was merely surfaced
by restarting mountd.  It is still strange, and too much of a
coincidence, that two sets of six disks on two different but
identical machines would fail in exactly the same way within an
hour.  I guess, given the decline in hard-drive quality, things like
this may be becoming more likely.

Thanks,
Sumit
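A note on the "status=59 error=40" fields in the messages above: these
are the raw ATA status and error registers.  Assuming the standard ATA
register bit layout, a minimal C decoder looks like the sketch below.
The key point is that 0x40 in the error register is UNC, an
uncorrectable data error reported by the drive itself, which supports
the diagnosis of genuine media failure rather than a driver artifact.

/*
 * Minimal decoder for ATA status/error register values, assuming the
 * standard ATA bit layout (not code from the ata(4) driver itself).
 */
#include <stdio.h>

struct bit { unsigned char mask; const char *name; };

static const struct bit status_bits[] = {
	{ 0x80, "BSY" },  { 0x40, "DRDY" }, { 0x20, "DF" },    { 0x10, "DSC" },
	{ 0x08, "DRQ" },  { 0x04, "CORR" }, { 0x02, "IDX" },   { 0x01, "ERR" },
};
static const struct bit error_bits[] = {
	{ 0x80, "ICRC" }, { 0x40, "UNC" },  { 0x20, "MC" },    { 0x10, "IDNF" },
	{ 0x08, "MCR" },  { 0x04, "ABRT" }, { 0x02, "TK0NF" }, { 0x01, "AMNF" },
};

static void
decode(const char *label, unsigned char reg, const struct bit *bits, int n)
{
	int i;

	printf("%s=%02x:", label, reg);
	for (i = 0; i < n; i++)
		if (reg & bits[i].mask)
			printf(" %s", bits[i].name);
	printf("\n");
}

int
main(void)
{
	decode("status", 0x59, status_bits, 8);	/* DRDY DSC DRQ ERR */
	decode("error", 0x40, error_bits, 8);	/* UNC */
	return (0);
}

Run against the values logged above, this prints "status=59: DRDY DSC
DRQ ERR" and "error=40: UNC".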
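As for the "remount" mentioned in the quoted reply: on FreeBSD of this
vintage, exports were pushed into the kernel by re-mounting an already
mounted UFS filesystem with MNT_UPDATE and a fresh export_args.  The
sketch below is a rough illustration of that mechanism, not mountd's
actual source; "/export" is a placeholder mount point, and the headers
reflect the 4.x/5.x era mount(2) interface.

/*
 * Rough sketch of how exports reached the kernel circa 2003: a
 * MNT_UPDATE "remount" of the exported filesystem.  mountd first
 * clears existing exports with MNT_DELEXPORT, then repeats the call
 * once per export entry with the new export_args filled in.  The
 * MNT_UPDATE path touching the filesystem is one plausible way a
 * latent media error gets noticed.
 */
#include <sys/param.h>
#include <sys/mount.h>
#include <ufs/ufs/ufsmount.h>	/* struct ufs_args (period headers) */
#include <err.h>
#include <string.h>

int
main(void)
{
	struct ufs_args args;

	memset(&args, 0, sizeof(args));
	args.fspec = NULL;				/* ignored with MNT_UPDATE */
	args.export.ex_flags = MNT_DELEXPORT;		/* drop old exports first */

	if (mount("ufs", "/export", MNT_UPDATE, &args) == -1)
		err(1, "mount(MNT_UPDATE)");
	return (0);
}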