From owner-freebsd-stable@FreeBSD.ORG Sat Dec 20 13:32:47 2003
From: "Oivind H. Danielsen"
Date: Sat, 20 Dec 2003 19:07:41 +0100
Subject: WRITE command timeout
List-Id: Production branch of FreeBSD source code

Hello.

We have been running FreeBSD 4.6-5.1 systems for 1.5 years and are
being plagued by these:

Dec 18 15:15:39 <> /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 19 15:03:23 <> /kernel: ad0: READ command timeout tag=0 serv=0 - resetting

In our rack we have 34 identical drives (IBM IC35L080AVVA07).

24 drives on Windows 2000 : no problems.
4 drives on Linux 2.4.x : no problems.
2 drives on RELENG_4_8 (VIA 82C686, VIA C3) : no problems.
4 drives on RELENG_4_8 (nVIDIA nForce, XP 2000+) : r/w timeouts, fs corruption.
(1 drive/system, 6 FreeBSD boxes)

The good systems have been running for the full 1.5 years without a
hitch. The four identical RELENG_4_8 systems have all had corrupted
filesystems (at least once every two months).

We have tried the following:
- Changed ATA100 cables (3 diff. types, all 80-wire)
- Disabled DMA (use PIO4) (hw.ata.ata_dma="0" in loader.conf)
- Disabled DMA in BIOS setup
- Changed motherboard (MSI MS6734, VIA KM400, vt8235 ATA)
- Changed power supply (added 100W)
- Tried RELENG_5_1

None of these changes has helped. The only change seen when disabling
DMA is an additional message: "timeout waiting for DRQ - resetting".

I have searched the net for more information on this topic for over a
year, and all I find is replies like:
- "Just change the cable, dude.." (did that, still timeouts)
- "IBM drives are bad for you." (seen this with other drives too)
  (drives work well on Linux/W2k)
- "Disabling DMA fixes it." (tried that, it didn't)
- "ATA is for wimps. SCSI rulezz." (different discussion)

# sysctl hw.ata
hw.ata.ata_dma: 0
hw.ata.wc: 1
hw.ata.tags: 0
hw.ata.atapi_dma: 0

# atacontrol mode 0
Master = PIO4
Slave = ???

# atacontrol info 0
Master: ad0 ATA/ATAPI rev 5
Slave: no device present

dmesg, pciconf and kernel config are attached. No special compilation
options (except -DIPFW2) are used. I can provide more information on
request.

We're now running FreeBSD 4.8-RELEASE-p14 and FreeBSD 5.1-RELEASE-p8,
but I believe the problem has been around since we started out with
4.6. The "good" and "bad" FreeBSD systems all use the same kernel/world.

We use such low-end hardware in these boxes because they are part of a
highly redundant cluster solution for crypto processing (no storage is
used for application purposes).
This means the system can cope with the occasional fs corruption, but
we would still prefer to get rid of it.

I know this problem has been discussed before, but I wanted to add more
data to the discussion. I don't think all of the reports should be
attributed to bad HW. Nevertheless, even if the hardware is broken,
the system should preferably function as well (or as badly) as it does
under Linux/W2k.

Any help is greatly appreciated.

Best Regards,
Oivind H. Danielsen

Content-Disposition: attachment; filename="dmesg.txt"

Copyright (c) 1992-2003 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 4.8-RELEASE-p14 #0: Sat Dec 6 00:30:50 CET 2003
Timecounter "i8254" frequency 1193182 Hz
CPU: AMD Athlon(TM) XP 2000+ (1662.38-MHz 686-class CPU)
  Origin = "AuthenticAMD" Id = 0x662 Stepping = 2
  Features=0x383fbff
  AMD Features=0xc0400000
real memory = 503234560 (491440K bytes)
avail memory = 485507072 (474128K bytes)
Preloaded elf kernel "kernel" at 0xc03e3000.
Pentium Pro MTRR support enabled
md0: Malloc disk
Using $PIR table, 10 entries at 0xc00f20b0
npx0: on motherboard
npx0: INT 16 interface
pcib0: on motherboard
pci0: on pcib0
pci0: (vendor=0x10de, dev=0x01ac) at 0.1
pci0: (vendor=0x10de, dev=0x01ad) at 0.2
pci0: (vendor=0x10de, dev=0x01aa) at 0.3
isab0: at device 1.0 on pci0
isa0: on isab0
pci0: (vendor=0x10de, dev=0x01b4) at 1.1 irq 5
ohci0: mem 0xe7000000-0xe7000fff irq 10 at device 2.0 on pci0
usb0: OHCI version 1.0
usb0: SMM does not respond, resetting
usb0: on ohci0
usb0: USB revision 1.0
uhub0: (0x10de) OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 3 ports with 3 removable, self powered
ohci1: mem 0xe6800000-0xe6800fff irq 10 at device 3.0 on pci0
usb1: OHCI version 1.0
usb1: on ohci1
usb1: USB revision 1.0
uhub1: (0x10de) OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 3 ports with 3 removable, self powered
pci0: (vendor=0x10de, dev=0x01c3) at 4.0 irq 10
pci0: (vendor=0x10de, dev=0x01b0) at 5.0 irq 5
pci0: (vendor=0x10de, dev=0x01b1) at 6.0 irq 11
pcib1: at device 8.0 on pci0
pci1: on pcib1
pcib2: at device 6.0 on pci1
pci2: on pcib2
fxp0: port 0xb800-0xb83f mem 0xe4000000-0xe40fffff,0xe4800000-0xe4800fff irq 5 at device 4.0 on pci2
fxp0: Ethernet address 00:08:9b:14:72:0f
inphy0: on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp1: port 0xb400-0xb43f mem 0xe3000000-0xe30fffff,0xe3800000-0xe3800fff irq 5 at device 5.0 on pci2
fxp1: Ethernet address 00:08:9b:14:72:10
inphy1: on miibus1
inphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp2: port 0xb000-0xb03f mem 0xe2000000-0xe20fffff,0xe2800000-0xe2800fff irq 5 at device 6.0 on pci2
fxp2: Ethernet address 00:08:9b:14:71:9b
inphy2: on miibus2
inphy2: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp3: port 0xa800-0xa83f mem 0xe1000000-0xe10fffff,0xe1800000-0xe1800fff irq 5 at device 7.0 on pci2
fxp3: Ethernet address 00:08:9b:14:71:9c
inphy3: on miibus3
inphy3: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
atapci0: port 0x9800-0x980f at device 9.0 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
pcib3: at device 30.0 on pci0
pci3: on pcib3
pci3: at 0.0 irq 11
orm0: