Date:      Mon, 12 Sep 2005 15:53:27 +0200
From:      MaXX <bs139412@skynet.be>
To:        freebsd-stable@freebsd.org
Subject:   Re: Stress testing and TIMEOUT - WRITE_DMA
Message-ID:  <200509121553.27981.bs139412@skynet.be>
In-Reply-To: <20050912120040.02A6B16A41F@hub.freebsd.org>
References:  <20050912120040.02A6B16A41F@hub.freebsd.org>

On Fri, 26 Aug 2005 03:21:35 -0600 Anthony Chavez <acc@anthonychavez.org> 
wrote:
> My question is simply this: is the fact that I received 4 TIMEOUT
> warnings in the space of roughly 2 weeks significant cause for concern?
Hi,
You may want to have a look at PR 85603 (FS corruption and 'uncorrectable'
DMA errors on ATA disks after unclean shutdown) and see if it applies to you.

Are you running a kernel built around mid-June this year?
Did your machine panic before the DMA problems appeared (I think a power
failure can do the trick too)?

Several of us Usenet users were experiencing this kind of problem (the
thread on news://comp.unix.bsd.freebsd.misc was named "Disaster Recovery?"
and started 30 Aug 05). If you have the same problem as us, the fix is easy:
- back up your data with tar (will take a while due to timeouts)
- fdisk + newfs
- restore your backup
- cvsup + upgrade your kernel
and that's all... I was surprised to see my PostgreSQL database come back
online without a single error message; Pg really hates it when the FS is
inconsistent...

In our case the problem was fixed by newfs, even though smartctl
(sysutils/smartmontools) did report errors at the drive level. After
newfs'ing the disk there were no more messages (but the old ones are still
in the drive's log).

Hope this is relevant to your problem...
--
MaXX

I tested my drive as follows:
On comp.unix.bsd.freebsd.misc MaXX wrote:
> I will stress test the drive to see if it still reliable for some purpose.
I've finished some tests on the drive:

1. filled the drive with huge files (11, 25, 30 and 10 GB), 3 simultaneous
writes => no DMA_READ or DMA_WRITE errors; fsck OK

2. copied /usr/ports, with some distfiles and work folders, 18 times (2
simultaneous copies, 9 times, about 4 596 000 files) => no DMA_READ or
DMA_WRITE errors; fsck NOT OK: a bunch of errors which seem to be only at
the file system level.

3. md5 sum of 4 596 000 files before corrective fsck: no errors, burning hot
drive

4. clean reboot + fsck: ok; fsck skipped checks.

5. compare md5 before and after reboot: OK, no missing files/folders, newsum
== oldsum.
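
For anyone who wants to repeat the checksum comparison from steps 3 and 5,
here is a minimal sketch in Python. It is not the script used for the tests
above; the function names (md5_of_file, build_manifest, compare_manifests)
are made up for illustration, and it assumes only regular files matter
(symlinks and special files are skipped).

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file's path (relative to root) to its MD5 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Assumption: symlinks and special files are not checked.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest


def compare_manifests(old, new):
    """Return (missing, added, changed) as sorted lists of relative paths."""
    missing = sorted(p for p in old if p not in new)
    added = sorted(p for p in new if p not in old)
    changed = sorted(p for p in old if p in new and old[p] != new[p])
    return missing, added, changed
```

The idea is to call build_manifest() on the test tree before the reboot, keep
the result somewhere off the disk under test, call it again afterwards, and
pass both to compare_manifests(). Three empty lists correspond to the "no
missing files/folders, newsum == oldsum" result above.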

I then tried to reproduce the initial problem, but there was no way to do
it... I killed init and pulled the plug while writing or reading. No way to
get those DMA_* errors back (note: the kernel was not the same as the one
that failed)...

I give up...

Conclusion: the disk is reliable enough to go back to work with a good
backup policy (maybe in a vinum mirror to be sure). The problem seems to be
bound to the kernel the machine had been running since mid-June 05.
 



