From owner-freebsd-stable@FreeBSD.ORG Mon Sep 12 13:52:53 2005
From: MaXX
To: freebsd-stable@freebsd.org
Date: Mon, 12 Sep 2005 15:53:27 +0200
Subject: Re: Stress testing and TIMEOUT - WRITE_DMA
Message-Id: <200509121553.27981.bs139412@skynet.be>
In-Reply-To: <20050912120040.02A6B16A41F@hub.freebsd.org>
References: <20050912120040.02A6B16A41F@hub.freebsd.org>
List-Id: Production branch of FreeBSD source code

On Fri, 26 Aug 2005 03:21:35 -0600 Anthony Chavez wrote:
> My question is simply this: is the fact that I received 4 TIMEOUT
> warnings in the space of roughly 2 weeks significant cause for concern?

Hi,

You may want to look at PR 85603 (FS corruption and 'uncorrectable' DMA
errors on ATA disks after unclean shutdown) and see whether it applies to
you. Are you running a kernel built around mid-June this year? Did your
machine panic before the DMA problems appeared (I think a power failure
can do the trick too)?

Several of us on Usenet experienced this kind of problem (the thread on
news://comp.unix.bsd.freebsd.misc was named "Disaster Recovery?" and
started 30 Aug 05). If you have the same problem as us, the fix is easy:

- back up your data with tar (this will take a while due to the timeouts)
- fdisk + newfs
- restore your backup
- cvsup + upgrade your kernel

and that's all... I was surprised to see my PostgreSQL database come back
online without a single error message; Pg really hates it when the FS is
inconsistent...

In our case the problem was fixed by newfs, even though smartctl
(sysutils/smartmontools) did report errors at the drive level. After
newfs'ing the disk there were no more messages (but the old ones are
still in the drive's log).

Hope this is relevant to your problem...

--
MaXX

I tested my drive as follows:

On comp.unix.bsd.freebsd.misc MaXX wrote:
> I will stress test the drive to see if it is still reliable for some purpose.

I've finished some tests on the drive:

1. filled the drive with huge files (11, 25, 30, 10 GB), 3 simultaneous
   writes => no DMA_READ or DMA_WRITE errors; fsck OK
2. copied /usr/ports 18 times, including some distfiles and work folders
   (2 simultaneous copies, 9 times; about 4 596 000 files) => no DMA_READ
   or DMA_WRITE errors; fsck NOT OK: a bunch of errors which seem to be
   only at the file system level.
3. md5 sums of the 4 596 000 files before the corrective fsck: no errors,
   but the drive was burning hot
4. clean reboot + fsck: OK; fsck skipped the checks.
5. compared the md5 sums from before and after the reboot: OK, no missing
   files/folders, newsum == oldsum.

I then tried to reproduce the initial problem, but there was no way to do
it... I killed init and pulled the plug while writing or reading.
No way to get those DMA_* errors back (note: the kernel was not the same
as the one that failed)... I give up...

Conclusion: the disk is reliable enough to go back to work with a good
backup policy (maybe in a vinum mirror to be sure). The problem seems to
be bound to the kernel the machine had been running since mid-June 05.
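
For anyone who wants to repeat the checksum comparison from steps 3 and 5, here is a minimal sketch in Python. It is not the exact procedure used above (that was done with plain md5 on the command line), and the function names `md5_of_file`, `build_manifest` and `compare_manifests` are made up for illustration. It assumes the tree is not being modified while it is scanned and that only regular files matter.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (path relative to root) to its MD5 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest


def compare_manifests(old, new):
    """Return (missing, added, changed) relative paths, each as a sorted list."""
    missing = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return missing, added, changed
```

The idea is to call `build_manifest` on the test directory before the reboot, save the result (for example with `json.dump`), build a second manifest after the reboot and fsck, and pass both to `compare_manifests`. Three empty lists correspond to the "no missing files/folders, newsum == oldsum" result in step 5.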