From owner-freebsd-stable@FreeBSD.ORG  Mon Sep 12 13:52:53 2005
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
X-Original-To: freebsd-stable@freebsd.org
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id AC00916A420
	for <freebsd-stable@freebsd.org>; Mon, 12 Sep 2005 13:52:53 +0000 (GMT)
	(envelope-from bs139412@skynet.be)
Received: from outmx028.isp.belgacom.be (outmx028.isp.belgacom.be
	[195.238.3.49]) by mx1.FreeBSD.org (Postfix) with ESMTP id E422443D46
	for <freebsd-stable@freebsd.org>; Mon, 12 Sep 2005 13:52:52 +0000 (GMT)
	(envelope-from bs139412@skynet.be)
Received: from outmx028.isp.belgacom.be (localhost [127.0.0.1])
	by outmx028.isp.belgacom.be (8.12.11/8.12.11/Skynet-OUT-2.22) with
	ESMTP id j8CDqehJ025795
	for <freebsd-stable@freebsd.org>; Mon, 12 Sep 2005 15:52:40 +0200
	(envelope-from <bs139412@skynet.be>)
Received: from tetsuo.maxx.lan (116-190.244.81.adsl.skynet.be [81.244.190.116])
	by outmx028.isp.belgacom.be (8.12.11/8.12.11/Skynet-OUT-2.22) with
	ESMTP id j8CDqXpI025723
	for <freebsd-stable@freebsd.org>; Mon, 12 Sep 2005 15:52:33 +0200
	(envelope-from <bs139412@skynet.be>)
From: MaXX <bs139412@skynet.be>
To: freebsd-stable@freebsd.org
Date: Mon, 12 Sep 2005 15:53:27 +0200
User-Agent: KMail/1.8
References: <20050912120040.02A6B16A41F@hub.freebsd.org>
In-Reply-To: <20050912120040.02A6B16A41F@hub.freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200509121553.27981.bs139412@skynet.be>
Subject: Re: Stress testing and TIMEOUT - WRITE_DMA
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 12 Sep 2005 13:52:53 -0000

On Fri, 26 Aug 2005 03:21:35 -0600 Anthony Chavez <acc@anthonychavez.org> 
wrote:
> My question is simply this: is the fact that I received 4 TIMEOUT
> warnings in the space of roughly 2 weeks significant cause for concern?
Hi,
You may want to have a look at PR 85603 (FS corruption and 'uncorrectable'
DMA errors on ATA disks after unclean shutdown) and see if it applies to you.

Are you running a kernel built around mid June this year?
Did your machine panic before the DMA problems appeared (I think a power 
failure can do the trick too)?

Several of us Usenet users have been experiencing this kind of problem 
(the news://comp.unix.bsd.freebsd.misc thread was named "Disaster Recovery?" 
and started 30 Aug 05). If you have the same problem as us, the fix is easy:
- backup your data with tar (will take a while due to timeouts)
- fdisk + newfs 
- restore your backup
- cvsup + upgrade your kernel
and that's all... I was surprised to see my PostgreSQL database come back 
online without a single error message; Pg really hates it when the FS is 
inconsistent...

In our case the problem was fixed by newfs, even though smartctl 
(sysutils/smartmontools) did report errors at the drive level. After 
newfs'ing the disk there were no more messages (but the old ones are still 
in the drive's log). 

Hope this is relevant to your problem...
--
MaXX

I tested my drive as follows:
On comp.unix.bsd.freebsd.misc MaXX wrote:
> I will stress test the drive to see if it still reliable for some purpose.
I've finished some tests on the drive:

1. filled the drive with huge files (11, 25, 30, 10 GB), 3 simultaneous
writes => no DMA_READ or DMA_WRITE errors; fsck OK

2. copied /usr/ports 18 times, with some distfiles and work folders (2
simultaneous copies, 9 times; about 4 596 000 files) => no DMA_READ or
DMA_WRITE errors; fsck NOT OK: a bunch of errors which seem to be only at
the file system level.

3. md5 sum of 4 596 000 files before corrective fsck: no errors, burning hot
drive

4. clean reboot + fsck: ok; fsck skipped checks.

5. compare md5 before and after reboot: OK, no missing files/folders, newsum
== oldsum.
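
The md5 comparison in steps 3 and 5 could be scripted roughly like this. It
is only a minimal sketch in Python, not the commands actually used for the
test; the function names (build_manifest, diff_manifests) are made up for
illustration.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex md5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its md5."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest


def diff_manifests(old, new):
    """Return (missing, added, changed) relative paths, each sorted."""
    missing = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return missing, added, changed
```

The idea is to call build_manifest() on the test directory before the
reboot, keep the result somewhere off the disk under test, build it again
afterwards and pass both to diff_manifests(). Three empty lists correspond
to "no missing files/folders, newsum == oldsum".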

I then tried to reproduce the initial problem, but found no way to do it...
I killed init, pulled the plug while writing or reading. No way to get those
DMA_* errors back (Note: the kernel was not the same as the one that
failed)...

I give up...

Conclusion: the disk is reliable enough to go back to work with a good
backup policy (maybe in a vinum mirror to be sure). The problem seems to be
bound to the kernel the machine had been running since mid June 05.