From owner-freebsd-hardware Sun Mar 17 17:49:11 2002
Date: Sun, 17 Mar 2002 19:48:54 -0600
Message-ID: <87wuwave61.wl@delta.meridian-enviro.com>
From: "Douglas K. Rand"
To: freebsd-stable@freebsd.org, freebsd-hardware@freebsd.org
Cc: bryanh@meridian-enviro.com
Subject: 3Ware, Western Digital disks, and stray interrupts

We have two pretty much identical systems: both are Tyan Tiger MP S2469
boards with a 3ware 7450 controller and Western Digital WD1000 100GB
disks. One system has 4 disks in a RAID 10 configuration, and the other
has 2 disks in a RAID 1 configuration. One system has only a single
Athlon MP CPU, while the other has 2 Athlon MP CPUs.

We have gone through 5 of the WD1000 disks so far, with a 6th that just
failed the other day. We tested the first 3 failed disks with Western
Digital's drive fitness test, which reported all three drives to be OK.
We tried putting the first failed disk back in and having the 3ware
controller rebuild the array, but the rebuild failed after 2 hours.
We've stopped testing the disks, and just send them back to Western
Digital.

All the failures have been drive timeouts:

Dec 29 16:55:31 doppler[kern.crit] /kernel: twe0: AEN:
Jan 1 23:36:04 doppler[kern.crit] /kernel: twe0: AEN:
Feb 22 18:19:21 doppler[kern.crit] /kernel: twe0: AEN:
Mar 7 20:18:44 vault[kern.crit] /kernel: twe0: AEN:
Mar 16 21:42:02 vault[kern.crit] /kernel: twe0: AEN:

I massaged the last two messages somewhat; more on that later.

So, the first question: Has anybody else seen such a horrible failure
rate with the WD1000 disks?

The other problem we are having, which /may/ be related, is that the
second system (vault, the single CPU box) has had 2 failures that
coincide with a spate of "stray irq 7" messages. We are using swatch to
watch for the twe messages, but for the two failures on vault the
kernel log had the twe message interleaved with stray irq 7 messages:

Mar 7 20:18:44 vault[kern.crit] /kernel: t
Mar 7 20:18:44 vault[kern.err] /kernel: stray irq 7
Mar 7 20:18:44 vault[kern.crit] /kernel: we0
Mar 7 20:18:44 vault[kern.err] /kernel: stray irq 7
Mar 7 20:18:44 vault[kern.crit] /kernel: too many stray irq 7's; not logging any more
Mar 7 20:18:44 vault[kern.crit] /kernel: : AEN:

Mar 16 21:42:02 vault[kern.crit] /kernel: tw
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: e0:
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: A
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: EN:

In both cases, there aren't any kernel logs for 2 hours on either side
of these messages. We have the parallel port disabled in the BIOS, and
after the last failure we took irq 7 away from the PCI and PnP devices.
(None of the previous dmesg output for the system reports any device
using irq 7.) I've put the current dmesg at the end.

So, is the 3ware controller causing the stray irq 7 messages when the
disk fails, or are the stray irq 7 messages causing the 3ware
controller to time out the disk?

Any help would be appreciated.
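
The interleaved kernel log lines quoted above could in principle be stitched back together mechanically before a log watcher sees them. The following is a minimal Python sketch, not a tested tool: it assumes the line layout of the excerpts above ("date host[priority] /kernel: text"), treats any kernel text starting with "stray irq" or "too many stray irq" as noise, and joins the remaining fragments that share a host and timestamp. The names reassemble, LINE_RE, and STRAY_RE are hypothetical and appear nowhere in the original setup.

```python
import re
from itertools import groupby

# Assumed layout, taken from the excerpts in this message:
#   "Mar 7 20:18:44 vault[kern.crit] /kernel: <fragment>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+"
    r"(?P<host>[^\[\s]+)\[(?P<prio>[^\]]+)\]\s+/kernel: ?(?P<text>.*)$"
)

# Kernel text that is treated as noise and dropped.
STRAY_RE = re.compile(r"^(too many )?stray irq \d+")


def reassemble(lines):
    """Drop stray-irq lines and join fragments sharing host and timestamp."""
    parsed = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m is None:
            continue  # not a kernel log line in the assumed layout
        if STRAY_RE.match(m.group("text")):
            continue  # stray interrupt noise
        parsed.append((m.group("stamp"), m.group("host"), m.group("text")))

    merged = []
    # Consecutive fragments with the same (timestamp, host) become one line.
    for (stamp, host), group in groupby(parsed, key=lambda p: (p[0], p[1])):
        text = "".join(fragment for _, _, fragment in group)
        merged.append(f"{stamp} {host} /kernel: {text}")
    return merged
```

reassemble() accepts any iterable of log lines, such as an open file. It is only a heuristic: two genuinely separate kernel messages logged by the same host within the same second would be run together, and whitespace lost at a fragment boundary is not restored.
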
Pretty soon Western Digital is gonna stop taking our phone calls.
Either that, or we'll lose 2 disks before we get the first one
fixed. ;^)

Copyright (c) 1992-2002 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 4.5-RELEASE #1: Wed Feb 13 17:10:19 CST 2002
rand@vault.meridian-enviro.com:/usr/obj/usr/src/sys/VAULT
Timecounter "i8254" frequency 1193182 Hz
Timecounter "TSC" frequency 1400054127 Hz
CPU: AMD Athlon(tm) MP Processor 1600+ (1400.05-MHz 686-class CPU)
Origin = "AuthenticAMD" Id = 0x662 Stepping = 2
Features=0x383fbff
AMD Features=0xc0480000<,AMIE,DSP,3DNow!>
real memory = 268435456 (262144K bytes)
avail memory = 258568192 (252508K bytes)
Preloaded elf kernel "kernel" at 0xc02a8000.
Pentium Pro MTRR support enabled
Using $PIR table, 268435454 entries at 0xc00fdf10
npx0: on motherboard
npx0: INT 16 interface
pcib0: on motherboard
pci0: on pcib0
pcib1: at device 1.0 on pci0
pci1: on pcib1
pci1: at 5.0 irq 10
isab0: at device 7.0 on pci0
isa0: on isab0
atapci0: port 0xf000-0xf00f at device 7.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
chip1: at device 7.3 on pci0
twe0: <3ware Storage Controller> port 0x1430-0x143f mem 0xf4000000-0xf47fffff,0xf4901000-0xf490100f irq 5 at device 8.0 on pci0
twe0: 4 ports, Firmware FE7X 1.03.09.027, BIOS BE7X 1.07.02.002
fxp0: port 0x1400-0x141f mem 0xf4800000-0xf48fffff,0xf4903000-0xf4903fff irq 5 at device 12.0 on pci0
fxp0: Ethernet address 00:90:27:18:d7:45
inphy0: on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
ahc0: port 0x1000-0x10ff mem 0xf4900000-0xf4900fff irq 11 at device 13.0 on pci0
aic7880: Ultra Wide Channel A, SCSI Id=7, 16/255 SCBs
orm0: