Date: Sat, 20 Dec 2003 16:12:01 -0600
From: Vulpes Velox
To: "Oivind H. Danielsen"
cc: freebsd-stable@freebsd.org
Subject: Re: WRITE command timeout
Message-Id: <20031220161201.60833ea2.kitbsdlists@HotPOP.com>

On Sat, 20 Dec 2003 19:07:41 +0100
"Oivind H. Danielsen" wrote:

> Hello.
>
> We have been running FreeBSD 4.6-5.1 systems for 1.5 years and are being
> plagued by these:
>
> Dec 18 15:15:39 <> /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
> Dec 19 15:03:23 <> /kernel: ad0: READ command timeout tag=0 serv=0 - resetting

This is most likely caused by the drive going bad or a bad cable.

> In our rack we have 34 identical drives (IBM IC35L080AVVA07).
>
> 24 drives on Windows 2000 : no problems.
> 4 drives on Linux 2.4.x : no problems.
>
> 2 drives on RELENG_4_8
> (VIA 82C686, VIA C3) : no problems
>
> 4 drives on RELENG_4_8
> (nVIDIA nForce, XP 2000+) : r/w timeouts, fs corruption.
>
> (1 drive/system, 6 FreeBSD boxes)
>
> The good systems have been running the 1.5 years without a hitch. The
> four identical RELENG_4_8 systems have all had corrupted filesystems (at
> least once every two months).
>
>
> We have tried the following:
>
> - Changed ATA100 cables (3 diff. types, all 80-wire)
> - Disabled DMA (use PIO4) (hw.ata.ata_dma="0" in loader.conf)
> - Disabled DMA in BIOS setup
> - Changed motherboard (MSI MS6734, VIA KM400, vt8235 ATA)
> - Changed power supply (added 100W)
> - RELENG_5_1.
>
> None of these changes has helped. The only change seen when disabling
> DMA is additional messages: "timeout waiting for DRQ - resetting".
>
>
> I have searched the net for more information on this topic for over a
> year, and all I find is replies like:
>
> - "Just change the cable, dude.." (did that, still timeouts)
> - "IBM drives are bad for you." (seen this with other drives too)
>   (drives work well on Linux/W2k)
> - "Disabling DMA fixes it." (tried that, it didn't)
> - "ATA is for wimps. SCSI rulezz." (different discussion)
>
>
> # sysctl hw.ata
> hw.ata.ata_dma: 0
> hw.ata.wc: 1
> hw.ata.tags: 0
> hw.ata.atapi_dma: 0
>
> # atacontrol mode 0
> Master = PIO4
> Slave = ???
>
> # atacontrol info 0
> Master: ad0 ATA/ATAPI rev 5
> Slave: no device present
>
>
> dmesg, pciconf and kernel config are attached. No special compilation
> options (except -DIPFW2) are used. I can provide more information on
> request.
>
> We're now running FreeBSD 4.8-RELEASE-p14 and FreeBSD 5.1-RELEASE-p8,
> but the problem has been around since we started out with 4.6 I
> believe. The "good" and "bad" FreeBSD systems all use the same
> kernel/world.
>
>
> The reason why we have used such low-end hardware in these boxes is that
> they are part of a highly redundant cluster solution for crypto
> processing (no storage is used for application purposes). This means the
> system can cope with the occasional fs corruption, but we would still
> prefer to get rid of it.
>
>
> I know this problem has been discussed before, but wanted to add more
> data to the discussion. I don't think all of the reports should be
> attributed to bad HW. Nevertheless, even if the hardware is broken, the
> system should preferably function equally well/bad as with Linux/W2k.
>
>
> Any help is greatly appreciated.
>
>
> Best Regards,
>
> Oivind H. Danielsen
>
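
If it would help to see how often the resets happen on each box, something
like the following could tally them per device and operation. This is only
a rough sketch: the function name count_timeouts and the regular expression
are assumptions, written against the two kernel messages quoted above. The
"timeout waiting for DRQ" message is deliberately not matched.

```python
import re
from collections import Counter

# Assumed pattern, based only on the two messages quoted in this thread, e.g.
# "ad0: WRITE command timeout tag=0 serv=0 - resetting"
TIMEOUT_RE = re.compile(r"\b(ad\d+): (READ|WRITE) command timeout\b")


def count_timeouts(lines):
    """Return a Counter keyed by (device, operation) for matching lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            # match.groups() is a tuple such as ("ad0", "WRITE")
            counts[match.groups()] += 1
    return counts


if __name__ == "__main__":
    # The two log lines from the original report, joined onto single lines.
    sample = [
        "Dec 18 15:15:39 <> /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting",
        "Dec 19 15:03:23 <> /kernel: ad0: READ command timeout tag=0 serv=0 - resetting",
    ]
    for (device, operation), n in sorted(count_timeouts(sample).items()):
        print(device, operation, n)
```

Run over the full system log of each machine (the log location depends on
the syslog setup), the per-box totals might show whether the resets cluster
at particular times or on particular systems.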