From owner-freebsd-hackers@FreeBSD.ORG  Tue Jul 15 13:59:10 2003
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id AA77037B404
	for <freebsd-hackers@freebsd.org>;
	Tue, 15 Jul 2003 13:59:10 -0700 (PDT)
Received: from periwinkle.noc.ucla.edu (periwinkle.noc.ucla.edu
	[169.232.47.11])
	by mx1.FreeBSD.org (Postfix) with ESMTP id B08CB43FAF
	for <freebsd-hackers@freebsd.org>;
	Tue, 15 Jul 2003 13:59:08 -0700 (PDT)	(envelope-from shah@ucla.edu)
Received: from tigerlily.noc.ucla.edu (tigerlily.noc.ucla.edu [169.232.46.12])
	h6FKx8Qh027754;	Tue, 15 Jul 2003 13:59:08 -0700
Received: from ucla.edu (dhcp246.rip.ucla.edu [149.142.110.246])
	(authenticated bits=0)h6FKx8Po020421;	Tue, 15 Jul 2003 13:59:08 -0700
Date: Tue, 15 Jul 2003 13:59:08 -0700
Content-Type: text/plain; charset=US-ASCII; format=flowed
Mime-Version: 1.0 (Apple Message framework v552)
To: David Malone <dwmalone@maths.tcd.ie>
From: Sumit Shah <shah@ucla.edu>
In-Reply-To: <20030715162006.GA47687@walton.maths.tcd.ie>
Message-Id: <2D5885DA-B707-11D7-9819-000393DB86CA@ucla.edu>
Content-Transfer-Encoding: 7bit
X-Mailer: Apple Mail (2.552)
X-Scanned-By: MIMEDefang 2.25 / SpamAssassin 2.43 / mail.ucla.edu
X-Probable-Spam: no
X-Spam-Hits: -2.3
cc: freebsd-hackers@freebsd.org
Subject: Re: RAID and NFS exports (Possible Data Corruption)
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 15 Jul 2003 20:59:11 -0000

Thanks for the reply.

>> ad4: hard error reading fsbn  242727552
>
> The error means that the disk said that there was an error
> trying to read this block. You say that when you rebooted that the
> controller said a disk had gone bad, so this would sort of confirm
> this. (I could believe that restarting mountd might upset raid stuff
> if there were a kernel bug, but it seems very unlikely it could
> cause a disk to go bad.)

The full error was something like the following, on _both_ of the
identical systems and even _before_ the reboot.  After this message we
could not read, write, or fsck /dev/ar0.

ad7: hard error reading fsbn 291786506 of 0-127 (ad7 bn 291786506; cn 289470 tn 11 sn 53) trying PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: hard error reading fsbn 291786586 of 0-127 (ad7 bn 291786586; cn 289470 tn 13 sn 7) status=59 error=40
ar0: ERROR - array broken
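
For what it's worth, here is a rough sketch of how the status=59
error=40 values might be decoded.  It assumes the driver prints both
registers in hex and that the bits follow the usual ATA status/error
register layout; the table and function names are my own, not anything
taken from the driver.

```python
# Bit names for the ATA status and error registers (assumed standard layout).
ATA_STATUS_BITS = {
    0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
    0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR",
}
ATA_ERROR_BITS = {
    0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
    0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "AMNF",
}


def decode_bits(value, table):
    """Return the names of the bits set in value, highest bit first."""
    return [name for mask, name in sorted(table.items(), reverse=True)
            if value & mask]


def decode_ata(status_hex, error_hex):
    """Decode the hex strings printed after status= and error=."""
    status = decode_bits(int(status_hex, 16), ATA_STATUS_BITS)
    error = decode_bits(int(error_hex, 16), ATA_ERROR_BITS)
    return status, error


if __name__ == "__main__":
    print(decode_ata("59", "40"))
```

Under those assumptions, error=40 would be the UNC (uncorrectable data
error) bit, with ERR set in the status register.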

There was also a variety of messages like these:
Jul 14 02:55:39 thorimage1 /kernel: ad7: hard error reading fsbn 291786586 of 0-127 (ad7 bn 291786586; cn 289470 tn 13 sn 7) status=59 error=40

where the device named in place of ad7 could be any of the 6 devices in
the array, seemingly at random.
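
To check how the errors are spread across the devices, something like
the following could tally the "hard error" lines per device.  This is
only a sketch: the regular expression is my guess at the message format
based on the lines above, it expects each message on a single unwrapped
line, and the sample input is just the excerpt already quoted in this
mail.

```python
import re
from collections import Counter

# Matches e.g. "ad7: hard error reading fsbn 291786506 ..." (assumed format).
HARD_ERROR = re.compile(r"\b(ad\d+): hard error reading fsbn\s+(\d+)")


def count_hard_errors(lines):
    """Count hard read errors per device name in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = HARD_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Lines copied from the excerpt above, with the wrapping undone.
sample = [
    "ad7: hard error reading fsbn 291786506 of 0-127 "
    "(ad7 bn 291786506; cn 289470 tn 11 sn 53) trying PIO mode",
    "ad7: DMA problem fallback to PIO mode",
    "ad7: hard error reading fsbn 291786586 of 0-127 "
    "(ad7 bn 291786586; cn 289470 tn 13 sn 7) status=59 error=40",
    "ar0: ERROR - array broken",
]

if __name__ == "__main__":
    for device, count in sorted(count_hard_errors(sample).items()):
        print(device, count)
```

Feeding it the full /var/log/messages instead of the sample should show
whether one disk dominates or the errors really are spread over all six.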

>
> My best guess would be that you have a bad batch of disks that
> happen to have failed in similar ways. It is possible that restarting
> mountd uncovered the errors, 'cos I think mountd internally does
> a remount of the filesystem in question and that might cause a chunk
> of stuff to be flushed out on to the disk, highlighting an error.
>
> (I had a bunch of the IBM "deathstar" disks fail on me within the
> space of a week or so, after they'd been in use for about six
> months.

It certainly sounds reasonable that restarting mountd merely exposed a
problem that was already there.  It still seems too much of a
coincidence that two sets of six disks, in two different but identical
machines, would fail in exactly the same way within an hour.  I guess
that, given the decline in hard drive quality, things like this might
be more likely.

Thanks,
Sumit