From owner-freebsd-stable Sun Nov 28 8:33:45 1999
From: Joe Greco
Message-Id: <199911281633.KAA55332@aurora.sol.net>
Subject: Re: ahc problems (with vinum?)
In-Reply-To: <199911280515.WAA19138_panzer.kdm.org@ns.sol.net> from "Kenneth D. Merry" at "Nov 28, 1999 5:16: 2 am"
To: ken@kdm.org (Kenneth D. Merry)
Date: Sun, 28 Nov 1999 10:33:07 -0600 (CST)
Cc: dgilbert@velocet.ca, stable@freebsd.org

> David Gilbert wrote...
> > >>>>> "Kenneth" == Kenneth D Merry writes:
> >
> > Kenneth> David Gilbert wrote...
> > >> Several times, on a system I've been configuring and testing, I've
> > >> got some maddening ahc0 messages. In general, they complain of a
> > >> timeout on the bus (I think some packet got lost)... and x SCBs are
> > >> aborted.
> > >>
> > >> At this point, some portion of the SCSI bus is unusable... and the
> > >> machine eventually hangs due to this. It does claim that it's
> > >> resetting channel A of the ahc0 controller, but I gather it doesn't
> > >> do any good.
> > >>
> > >> I'm running 3.3-STABLE (as of Thursday, I think) and am trying to
> > >> format and test an 8-drive vinum RAID-5 array.
> >
> > Kenneth> You'll need to provide more information in order for anyone
> > Kenneth> to make sense of your problem. Specifically, please post any
> > Kenneth> and all relevant kernel messages, including your controllers
> > Kenneth> and drives and the errors you've seen printed out, explain
> > Kenneth> your SCSI bus configuration, where it is terminated, etc.
> >
> > Kenneth> The #1 cause of problems is cabling and termination. The
> > Kenneth> second most common cause of problems is bogus drive firmware.
> >
> > Kenneth> In any case, check your cabling and termination, as that is
> > Kenneth> the most likely problem.
> >
> > Regardless of termination, the SCSI bus reset should clear
> > things... the unit will run for hours just fine... get this one error
> > and hang. It is difficult to copy down all the messages --- as they
> > don't get copied into the logs (since the SCSI bus is locked).
>
> Run a serial console on the box. You'll get all the messages that way.
> Seriously, there's no way to adequately diagnose the problem without the
> specific error messages in question. There are any number of conditions
> that can cause a timeout.
>
> > The controller is the 2940 U2W --- the one with a SE and an LVD
> > connector. The LVD bus is connected to a professional 8 drive LVD
> > case which is connected and terminated with the supplied cables. The
> > SE connector is connected to a single drive.
>
> And the SE drive is terminated as well? Are the supplied cables and
> terminator for the LVD segment LVD cables/terminators?

Ken,

Having just spent a week debugging a (very) intermittent SCSI bus problem,
I agree that I've seen some odd behaviour of this sort. What's even more
exasperating is that, at least in some cases, it does appear to recover
the one device that erred, but the rest stop functioning.

I've got serial consoles on my machines, so let me see what I can dig up...

Console: serial port
BIOS drive A: is disk0
BIOS drive C: is disk1
BIOS drive D: is disk2
BIOS drive E: is disk3
BIOS drive F: is disk4
BIOS drive G: is disk5
BIOS drive H: is disk6
BIOS drive I: is disk7
BIOS drive J: is disk8

FreeBSD/i386 bootstrap loader, Revision 0.7 640/65472kB
(jkh@highwing.cdrom.com, Thu Sep 16 22:16:41 GMT 1999)
Loading /boot/defaults/loader.conf
/kernel text=0x10a418 data=0x17b48+0x1a97c syms=[0x4+0x1ee30+0x4+0x206b3]

Hit [Enter] to boot immediately, or any other key for command prompt.

Type '?' for a list of commands, 'help' for more detailed help.
boot: host > boot -s
Copyright (c) 1992-1999 FreeBSD Inc.
Copyright (c) 1982, 1986, 1989, 1991, 1993
        The Regents of the University of California. All rights reserved.
FreeBSD 3.3-RELEASE #0: Mon Nov 22 13:38:07 CST 1999
    root@host:/usr/src/sys/compile/DEMO
Timecounter "i8254" frequency 1193182 Hz
CPU: Pentium II/Xeon/Celeron (686-class CPU)
  Origin = "GenuineIntel" Id = 0x652 Stepping = 2
  Features=0x183fbff
real memory = 536870912 (524288K bytes)
avail memory = 519716864 (507536K bytes)
Programming 24 pins in IOAPIC #0
FreeBSD/SMP: Multiprocessor motherboard
 cpu0 (BSP): apic id: 1, version: 0x00040011, at 0xfee00000
 cpu1 (AP): apic id: 0, version: 0x00040011, at 0xfee00000
 io0 (APIC): apic id: 2, version: 0x00170011, at 0xfec00000
Preloaded elf kernel "kernel" at 0xc027e000.
Pentium Pro MTRR support enabled
Probing for devices on PCI bus 0:
chip0: rev 0x03 on pci0.0.0
chip1: rev 0x03 on pci0.1.0
chip2: rev 0x02 on pci0.4.0
chip3: rev 0x02 on pci0.4.3
ahc0: rev 0x00 int a irq 19 on pci0.6.0
ahc0: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs
hfa0: rev 0x00 int a irq 19 on pci0.9.0
chip4: rev 0x03 on pci0.10.0
ahc1: rev 0x00 int a irq 17 on pci0.11.0
ahc1: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs
ahc2: rev 0x00 int a irq 16 on pci0.12.0
ahc2: aic7880 Wide Channel A, SCSI Id=7, 16/255 SCBs
Probing for devices on PCI bus 1:
Probing for devices on PCI bus 2:
de0: rev 0x22 int a irq 18 on pci2.4.0
de0: SMC 9332BDT 21140A [10-100Mb/s] pass 2.2
de0: address 00:e0:29:3c:fb:84
de1: rev 0x22 int a irq 19 on pci2.5.0
de1: SMC 9332BDT 21140A [10-100Mb/s] pass 2.2
de1: address 00:e0:29:3c:fb:85
Probing for PnP devices:
Probing for devices on the ISA bus:
sc0 on isa
sc0: VGA color <16 virtual consoles, flags=0x0>
atkbdc0 at 0x60-0x6f on motherboard
atkbd0 irq 1 on isa
psm0 not found
sio0 at 0x3f8-0x3ff irq 4 flags 0x10 on isa
sio0: type 16550A, console
sio1 at 0x2f8-0x2ff irq 3 on isa
sio1: type 16550A
sio2: configured irq 5 not in bitmap of probed irqs 0
sio2 not found at 0x3e8
sio3: configured irq 9 not in bitmap of probed irqs 0
sio3 not found at 0x2e8
fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
fdc0: FIFO enabled, 8 bytes threshold
fd0: 1.44MB 3.5in
vga0 at 0x3b0-0x3df maddr 0xa0000 msize 131072 on isa
npx0 on motherboard
npx0: INT 16 interface
we0 at 0x2e8 on isa
we0: kernel is keeping watchdog alive
APIC_IO: Testing 8254 interrupt delivery
APIC_IO: routing 8254 via pin 2
IP packet filtering initialized, divert disabled, rule-based forwarding disabled, logging limited to 100 packets/entry by default
ccd0-15: Concatenated disk drivers
Waiting 2 seconds for SCSI devices to settle
SMP: AP CPU #1 Launched!
de0: enabling 100baseTX port
chda1 at ahc1 bus 0 target 0 lun 0
da1: Fixed Direct Access SCSI-2 device
da1: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
da1: 17366MB (35566480 512 byte sectors: 255H 63S/T 2213C)
da4 at ahc1 bus 0 target 3 lun 0
da4: Fixed Direct Access SCSI-2 device
da4: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
da4: 17366MB (35566480 512 byte sectors: 255H 63S/T 2213C)
da7 at ahc1 bus 0 target 6 lun 0
da7: Fixed Direct Access SCSI-2 device
da7: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
da7: 17366MB (35566480 512 byte sectors: 255H 63S/T 2213C)
da10 at ahc2 bus 0 target 0 lun 0
da10: Fixed Direct Access SCSI-2 device
da10: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled
da10: 17366MB (35566480 512 byte sectors: 64H 32S/T 17366C)
da6 at ahc1 bus 0 target 5 lun 0
da6: Fixed Direct Access SCSI-2 device
da6: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
da6: 17366MB (35566480 512 byte sectors: 255H 63S/T 2213C)
da13 at ahc2 bus 0 target 3 lun 0
da13: Fixed Direct Access SCSI-2 device
da13: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled
da13: 17366MB (35566480 512 byte sectors: 64H 32S/T 17366C)
da5 at ahc1 bus 0 target 4 lun 0
da5: Fixed Direct Access SCSI-2 device
da5: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
da5: 17366MB (35566480 512 byte sectors: 255H 63S/T 2213C)
da16 at ahc2 bus 0 target 6 lun 0
da16: Fixed Direct Access SCSI-2 device
da16: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled
da16: 17366MB (35566480 512 byte sectors: 64H 32S/T 17366C)
da3 at ahc1 bus 0 target 2 lun 0
da3: Fixed Direct Access SCSI-2 device
da3: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
da3: 17366MB (35566480 512 byte sectors: 255H 63S/T 2213C)
da2 at ahc1 bus 0 target 1 lun 0
da2: Fixed Direct Access SCSI-2 device
da2: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
da2: 17366MB (35566480 512 byte sectors: 255H 63S/T 2213C)
da15 at ahc2 bus 0 target 5 lun 0
da15: Fixed Direct Access SCSI-2 device
da15: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled
da15: 17366MB (35566480 512 byte sectors: 64H 32S/T 17366C)
da9 at ahc1 bus 0 target 9 lun 0
da9: Fixed Direct Access SCSI-2 device
da9: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
da9: 17366MB (35566480 512 byte sectors: 255H 63S/T 2213C)
da14 at ahc2 bus 0 target 4 lun 0
da14: Fixed Direct Access SCSI-2 device
da14: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled
da14: 17366MB (35566480 512 byte sectors: 64H 32S/T 17366C)
da8 at ahc1 bus 0 target 8 lun 0
da8: Fixed Direct Access SCSI-2 device
da8: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
da8: 17366MB (35566480 512 byte sectors: 255H 63S/T 2213C)
da12 at ahc2 bus 0 target 2 lun 0
da12: Fixed Direct Access SCSI-2 device
da12: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled
da12: 17366MB (35566480 512 byte sectors: 64H 32S/T 17366C)
da11 at ahc2 bus 0 target 1 lun 0
da11: Fixed Direct Access SCSI-2 device
da11: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled
da11: 17366MB (35566480 512 byte sectors: 64H 32S/T 17366C)
da0 at ahc0 bus 0 target 0 lun 0
da0: Fixed Direct Access SCSI-2 device
da0: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
da0: 4357MB (8925000 512 byte sectors: 255H 63S/T 555C)
anging root device to da0s2a
Enter full pathname of shell or RETURN for /bin/sh:
erase ^H, kill ^U, intr ^C
# mount -a
# cd ~de1: autosense failed: cable problem?
jgreco
# ls
.cshrc          .login.env      .logout         .path           run
.login          .login.env.old  .mailrc         .profile
# sh run&
# dd: /dev/rda17: Device not configured
dd: /dev/rda18: Device not configured
(da13:ahc2:0:3:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
(da13:ahc2:0:3:0): Other SCB Timeout
(da11:ahc2:0:1:0): SCB 0xb - timed out in datain phase, SEQADDR == 0x110
(da11:ahc2:0:1:0): Other SCB Timeout
(da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x110
(da10:ahc2:0:0:0): BDR message in message buffer
(da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x10f
(da10:ahc2:0:0:0): no longer in timeout, status = 34b
ahc2: Issued Channel A Bus Reset. 7 SCBs aborted
(da11:ahc2:0:1:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x153
(da11:ahc2:0:1:0): Other SCB Timeout
(da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x153
(da10:ahc2:0:0:0): BDR message in message buffer
(da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x153
(da10:ahc2:0:0:0): no longer in timeout, status = 34b
ahc2: Issued Channel A Bus Reset. 3 SCBs aborted
(da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
(da10:ahc2:0:0:0): BDR message in message buffer
(da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x10f
(da10:ahc2:0:0:0): no longer in timeout, status = 34b
ahc2: Issued Channel A Bus Reset. 6 SCBs aborted
4357+1 records in
4357+1 records out
4569600000 bytes transferred in 428.640450 secs (10660683 bytes/sec)
# reboot

Console: serial port
BIOS drive A: is disk0
BIOS drive C: is disk1
BIOS drive D: is disk2

"run" is a little script that sucks data in from all SCSI drives with dd
and dumps it to /dev/null, in parallel.

Now, when the bus reset happens, often the drive listed will actually
recover and keep going, but if so, the others will typically stop (their
dd processes just sit waiting for data). This isn't written in stone:
I've seen all drives drop off, and I've also seen the whole thing recover
just fine. I have no idea what the result was for the incident listed
above. It was one of dozens of incidents.

The "reboot" bit is also mildly interesting. FreeBSD (CAM?) seems to have
lots of problems halting or rebooting when a device is unavailable or an
scbus is hung. I'd guess that it is waiting to flush some buffers or
something, except that my tests only do reads, no writes.

... Joe

-------------------------------------------------------------------------------
Joe Greco - Systems Administrator                             jgreco@ns.sol.net
Solaria Public Access UNIX - Milwaukee, WI                         414/342-4847
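
The "run" script itself is not included in the message. A minimal sketch of
what such a parallel read test might look like is below, assuming one
background dd per raw device. The function name read_all, the 1 MiB block
size, and the device names in the usage comment are assumptions, not taken
from the original script (the "4357+1 records" reported for 4569600000 bytes
would at least be consistent with 1 MiB blocks).

```sh
#!/bin/sh
# Hypothetical stand-in for the "run" script; the original is not shown.
# Reads every given device (or file) to /dev/null, all in parallel.
read_all() {
    for dev in "$@"; do
        echo "reading $dev"
        # bs=1024k (1 MiB) is an assumption, not taken from the original.
        dd if="$dev" of=/dev/null bs=1024k &
    done
    # Block until every background dd has exited.
    wait
}

# Example use (device names are illustrative only):
#   read_all /dev/rda0 /dev/rda1 /dev/rda2
```

Because the function waits on every dd, a reader that stalls on a hung bus
keeps the whole test from finishing, which matches the "dd is just waiting
for data" behaviour described above.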