Date:      Sun, 17 Mar 2002 19:48:54 -0600
From:      "Douglas K. Rand" <rand@meridian-enviro.com>
To:        freebsd-stable@freebsd.org, freebsd-hardware@freebsd.org
Cc:        bryanh@meridian-enviro.com
Subject:   3Ware, Western Digital disks, and stray interrupts
Message-ID:  <87wuwave61.wl@delta.meridian-enviro.com>

We have two pretty much identical systems: Both are Tyan Tiger MP
S2469 boards with a 3ware 7450 controller and Western Digital WD1000
100GB disks. One system has 4 disks in a RAID 10 configuration, and
the other has 2 disks in a RAID 1 configuration. One system only has a
single Athlon MP CPU, while the other has 2 Athlon MP CPUs.

We have gone through 5 of the WD1000 disks so far, with a 6th that
just failed the other day. The first 3 failures we tested with Western
Digital's drive fitness test, which reported all three drives to be
OK. We tried putting the first failed disk back in and having the
3ware controller rebuild it, but the rebuild failed after 2 hours. We've
stopped testing the disks and now just send them back to Western
Digital.

All the failures have been drive timeouts:

Dec 29 16:55:31 doppler[kern.crit] /kernel: twe0: AEN: <twe0: drive error for unknown unit 2>
Jan  1 23:36:04 doppler[kern.crit] /kernel: twe0: AEN: <twe0: drive timeout for unknown unit 3>
Feb 22 18:19:21 doppler[kern.crit] /kernel: twe0: AEN: <twe0: drive timeout for unknown unit 1>
Mar  7 20:18:44 vault[kern.crit] /kernel: twe0: AEN: <twed0: drive timeout>
Mar 16 21:42:02 vault[kern.crit] /kernel: twe0: AEN: <twed0: drive timeout>

The last two messages were somewhat massaged by me; more on that below.

So, the first question: Has anybody else seen such a horrible failure
rate with the WD1000 disks?

The other problem we are having, which /may/ be related, is that the
second system (vault, the single CPU box) has had 2 failures that
coincide with a spate of "stray irq 7" messages. We are using swatch
to watch for the twe messages, but the two failures on vault have had
the kernel log mixed with the stray irq 7 messages:

Mar  7 20:18:44 vault[kern.crit] /kernel: t
Mar  7 20:18:44 vault[kern.err] /kernel: stray irq 7
Mar  7 20:18:44 vault[kern.crit] /kernel: we0
Mar  7 20:18:44 vault[kern.err] /kernel: stray irq 7
Mar  7 20:18:44 vault[kern.crit] /kernel: too many stray irq 7's; not logging any more
Mar  7 20:18:44 vault[kern.crit] /kernel: : AEN: <twed0: drive timeout>

Mar 16 21:42:02 vault[kern.crit] /kernel: tw
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: e0:
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: A
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: EN: <tw
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: ed0
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: too many stray irq 7's; not logging any more
Mar 16 21:42:02 vault[kern.crit] /kernel: : drive timeout>
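
For anyone who wants to un-mix logs like the two excerpts above before
feeding them to a log watcher, here is a minimal sketch in Python. It is
not what we run; the function name reassemble and the two regular
expressions are my own assumptions, based only on the line layout shown
above. It drops the stray irq lines and glues the remaining fragments
back together per timestamp and host. Unrelated kernel messages logged
in the same second would get merged too, and whitespace lost at a
fragment boundary is not restored.

```python
import re

# Layout assumed from the excerpts above:
#   "Mon DD HH:MM:SS host[facility.level] /kernel: text"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s[\d:]{8})\s(\S+?)\[[\w.]+\]\s/kernel:\s?(.*)$"
)
# Lines that are stray-interrupt noise rather than message fragments.
STRAY_RE = re.compile(r"^(stray irq \d+|too many stray irq \d+'s)")


def reassemble(lines):
    """Join kernel message fragments per (timestamp, host).

    Stray-irq lines are skipped; everything else logged by /kernel in
    the same second on the same host is concatenated in arrival order.
    """
    merged = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a /kernel syslog line in the assumed layout
        stamp, host, text = m.groups()
        if STRAY_RE.match(text):
            continue
        key = (stamp, host)
        merged[key] = merged.get(key, "") + text
    return merged


if __name__ == "__main__":
    import sys

    for (stamp, host), text in reassemble(sys.stdin).items():
        print(stamp, host, text)
```

Run against the Mar 7 excerpt, the fragments "t", "we0" and
": AEN: <twed0: drive timeout>" come back out as a single twe0 message.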

In both cases, there aren't any kernel logs for 2 hours on either side
of this message. We have the parallel port disabled in the BIOS, and
after the last failure we took irq 7 away from the PCI and PnP
devices. (None of the system's previous dmesg output reports any
devices using irq 7.) I've put the current dmesg at the end.
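
A quick way to double-check which devices claim which irq is to group
the probe lines from a dmesg by irq number. The following is only a
sketch; the function name irq_map and the pattern are my own
assumptions, and it only looks at lines that begin with a device name
ending in a unit number (twe0, sio1, ...), so the "stray irq 7" lines
are deliberately ignored. For PCI devices without a driver, the name
reported is the bus (for example pci1), not the device itself.

```python
import re
from collections import defaultdict

# A probe line is assumed to start with a device name plus unit number,
# optionally followed by a colon, and to contain "irq N" somewhere.
IRQ_RE = re.compile(r"^([a-z]+\d+):?\s.*\birq (\d+)\b")


def irq_map(dmesg_lines):
    """Return {irq number: [device names]} for probe lines in a dmesg."""
    by_irq = defaultdict(list)
    for line in dmesg_lines:
        m = IRQ_RE.match(line)
        if m:
            by_irq[int(m.group(2))].append(m.group(1))
    return dict(by_irq)


if __name__ == "__main__":
    import sys

    for irq, devices in sorted(irq_map(sys.stdin).items()):
        print("irq %2d: %s" % (irq, ", ".join(devices)))
```

Any irq that lists more than one device is shared, and irq 7 should not
appear at all if nothing has claimed it.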

So, is the 3ware controller causing the stray irq 7 messages when the
disk fails, or are the stray irq 7 messages causing the 3ware
controller to time out the disk?

Any help would be appreciated. Pretty soon Western Digital is gonna
stop taking our phone calls. Either that, or we'll lose 2 disks
before we get the first one fixed. ;^)


Copyright (c) 1992-2002 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 4.5-RELEASE #1: Wed Feb 13 17:10:19 CST 2002
    rand@vault.meridian-enviro.com:/usr/obj/usr/src/sys/VAULT
Timecounter "i8254"  frequency 1193182 Hz
Timecounter "TSC"  frequency 1400054127 Hz
CPU: AMD Athlon(tm) MP Processor 1600+ (1400.05-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0x662  Stepping = 2
  Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
  AMD Features=0xc0480000<<b19>,AMIE,DSP,3DNow!>
real memory  = 268435456 (262144K bytes)
avail memory = 258568192 (252508K bytes)
Preloaded elf kernel "kernel" at 0xc02a8000.
Pentium Pro MTRR support enabled
Using $PIR table, 268435454 entries at 0xc00fdf10
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Host to PCI bridge> on motherboard
pci0: <PCI bus> on pcib0
pcib1: <PCI to PCI bridge (vendor=1022 device=700d)> at device 1.0 on pci0
pci1: <PCI bus> on pcib1
pci1: <Number Nine model 5348 graphics accelerator> at 5.0 irq 10
isab0: <PCI to ISA bridge (vendor=1022 device=7410)> at device 7.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <AMD 766 ATA100 controller> port 0xf000-0xf00f at device 7.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
chip1: <PCI to Other bridge (vendor=1022 device=7413)> at device 7.3 on pci0
twe0: <3ware Storage Controller> port 0x1430-0x143f mem 0xf4000000-0xf47fffff,0xf4901000-0xf490100f irq 5 at device 8.0 on pci0
twe0: 4 ports, Firmware FE7X 1.03.09.027, BIOS BE7X 1.07.02.002
fxp0: <Intel Pro 10/100B/100+ Ethernet> port 0x1400-0x141f mem 0xf4800000-0xf48fffff,0xf4903000-0xf4903fff irq 5 at device 12.0 on pci0
fxp0: Ethernet address 00:90:27:18:d7:45
inphy0: <i82555 10/100 media interface> on miibus0
inphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
ahc0: <Adaptec 2940 Ultra SCSI adapter> port 0x1000-0x10ff mem 0xf4900000-0xf4900fff irq 11 at device 13.0 on pci0
aic7880: Ultra Wide Channel A, SCSI Id=7, 16/255 SCBs
orm0: <Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xc8fff,0xc9800-0xc9fff on isa0
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A, console
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
Waiting 2 seconds for SCSI devices to settle
twed0: <TwinStor, Rebuilding> on twe0
twed0: 95395MB (195369520 sectors)
twe0: command interrupt
sa0 at ahc0 bus 0 target 6 lun 0
sa0: <HP C1557A U812> Removable Sequential Access SCSI-2 device 
sa0: 10.000MB/s transfers (10.000MHz, offset 15)
Mounting root from ufs:/dev/twed0s1a
ch0 at ahc0 bus 0 target 6 lun 1
ch0: <HP C1557A U812> Removable Changer SCSI-2 device 
ch0: 10.000MB/s transfers (10.000MHz, offset 15)
ch0: 6 slots, 1 drive, 0 pickers, 0 portals
twe0: AEN: <twed0: rebuild started>
twe0: AEN: <twed0: rebuild done>
ch: warning: could not map element source address 0d to a valid element type
pid 3332 (db_metar), uid 7002: exited on signal 11 (core dumped)
ch: warning: could not map element source address 0d to a valid element type
ch: warning: could not map element source address 0d to a valid element type
tw
stray irq 7
e0: 
stray irq 7
A
stray irq 7
EN: <tw
stray irq 7
ed0
stray irq 7
too many stray irq 7's; not logging any more
: drive timeout>
ch: warning: could not map element source address 0d to a valid element type
