From owner-freebsd-hackers@FreeBSD.ORG Tue Jul 15 13:59:10 2003
Date: Tue, 15 Jul 2003 13:59:08 -0700
From: Sumit Shah
To: David Malone
Cc: freebsd-hackers@freebsd.org
Subject: Re: RAID and NFS exports (Possible Data Corruption)
In-Reply-To: <20030715162006.GA47687@walton.maths.tcd.ie>
Message-Id: <2D5885DA-B707-11D7-9819-000393DB86CA@ucla.edu>

Thanks for the reply.

>> ad4: hard error reading fsbn 242727552
>
> The error means that the disk said that there was an error
> trying to read this block. You say that when you rebooted that the
> controller said a disk had gone bad, so this would sort of confirm
> this.
> (I could believe that restarting mountd might upset raid stuff
> if there were a kernel bug, but it seems very unlikely it could
> cause a disk to go bad.)

The full error was something like the following, on _both_ of the
identical systems, even _before_ the reboot. After this message we
could not read/write/fsck /dev/ar0.

ad7: hard error reading fsbn 291786506 of 0-127 (ad7 bn 291786506; cn 289470 tn 11 sn 53) trying PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: hard error reading fsbn 291786586 of 0-127 (ad7 bn 291786586; cn 289470 tn 13 sn 7) status=59 error=40
ar0: ERROR - array broken

There were also a variety of messages like these:

Jul 14 02:55:39 thorimage1 /kernel: ad7: hard error reading fsbn 291786586 of 0-127 (ad7 bn 291786586; cn 289470 tn 13 sn 7) status=59 error=40

where "ad7: ...." was any of the 6 devices in the array, somewhat
randomly.

>
> My best guess would be that you have a bad batch of disks that
> happen to have failed in similar ways. It is possible that restarting
> mountd uncovered the errors, 'cos I think mountd internally does
> a remount of the filesystem in question and that might cause a chunk
> of stuff to be flushed out on to the disk, highlighting an error.
>
> (I had a bunch of the IBM "deathstar" disks fail on me within the
> space of a week or so, after they'd been in use for about six
> months.

It certainly sounds reasonable that this problem just manifested
itself when mountd was restarted. It's just strange, and too much of a
coincidence, that two sets of six disks on two different but identical
machines would fail in exactly the same way within an hour. I guess
given the decline in hard drive quality, things like this might be
more likely.

Thanks,
Sumit
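
One way to check how the hard errors are spread across the six disks is to tally them per device. The following is a minimal sketch and not part of the original message: the function name tally_hard_errors, the regular expression, and the SAMPLE list are assumptions built only from the log lines quoted in this message. On a real system the lines would be read from the system log instead.

```python
import re
from collections import Counter

# Matches lines such as
#   ad7: hard error reading fsbn 291786586 of 0-127 (...) status=59 error=40
# with or without a syslog prefix in front of the device name.
# The pattern is an assumption derived from the messages quoted in this thread.
HARD_ERROR = re.compile(r"\b(ad\d+): hard error reading fsbn (\d+)")


def tally_hard_errors(lines):
    """Return (error count per device, set of failing block numbers per device)."""
    counts = Counter()
    blocks = {}
    for line in lines:
        match = HARD_ERROR.search(line)
        if match is None:
            continue  # DMA fallback notices, array status lines, etc.
        device = match.group(1)
        fsbn = int(match.group(2))
        counts[device] += 1
        blocks.setdefault(device, set()).add(fsbn)
    return counts, blocks


# Sample input: the messages quoted in this email, nothing more.
SAMPLE = [
    "ad7: hard error reading fsbn 291786506 of 0-127 "
    "(ad7 bn 291786506; cn 289470 tn 11 sn 53) trying PIO mode",
    "ad7: DMA problem fallback to PIO mode",
    "ad7: hard error reading fsbn 291786586 of 0-127 "
    "(ad7 bn 291786586; cn 289470 tn 13 sn 7) status=59 error=40",
    "ar0: ERROR - array broken",
    "Jul 14 02:55:39 thorimage1 /kernel: ad7: hard error reading fsbn 291786586 "
    "of 0-127 (ad7 bn 291786586; cn 289470 tn 13 sn 7) status=59 error=40",
]

if __name__ == "__main__":
    counts, blocks = tally_hard_errors(SAMPLE)
    for device in sorted(counts):
        print(device, counts[device], sorted(blocks[device]))
```

If the counts turned out to be concentrated on a few block numbers of one device, that would point at a single bad disk. Errors scattered evenly over all six devices on both machines would look more like a shared cause such as the controller, cabling, or driver, though a tally like this can only suggest that and not prove it.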