From owner-freebsd-scsi  Mon Dec 14 08:57:48 1998
Return-Path: <owner-freebsd-scsi@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id IAA11717
          for freebsd-scsi-outgoing; Mon, 14 Dec 1998 08:57:48 -0800 (PST)
          (envelope-from owner-freebsd-scsi@FreeBSD.ORG)
Received: from panzer.plutotech.com (panzer.plutotech.com [206.168.67.125])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id IAA11698;
          Mon, 14 Dec 1998 08:57:44 -0800 (PST)
          (envelope-from ken@panzer.plutotech.com)
Received: (from ken@localhost)
          by panzer.plutotech.com (8.9.1/8.8.5) id JAA54340;
          Mon, 14 Dec 1998 09:57:27 -0700 (MST)
From: "Kenneth D. Merry" <ken@plutotech.com>
Message-Id: <199812141657.JAA54340@panzer.plutotech.com>
Subject: Re: CAM and -stable
In-Reply-To: <Pine.BSF.4.00.9812141026360.25881-100000@super-g.inch.com> from spork at "Dec 14, 98 10:29:08 am"
To: spork@super-g.com (spork)
Date: Mon, 14 Dec 1998 09:57:27 -0700 (MST)
Cc: freebsd-scsi@FreeBSD.ORG, freebsd-stable@FreeBSD.ORG
X-Mailer: ELM [version 2.4ME+ PL28s (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-scsi@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

spork wrote...

[ sorry for not responding to your previous message.  I thought Justin
would respond, but evidently he never got around to it. ]

> FWIW, I got the same message today, but the machine didn't lock up.  Any
> ideas?  Anyone?  I'm cc-ing stable this time in hopes of finding someone
> running cam under -stable...
> 
> Here's the messages:
> 
> Dec 12 03:44:18 shell /kernel: (da1:ahc0:0:0:1): tagged openings now 31
> Dec 13 02:01:04 shell /kernel: (da1:ahc0:0:0:1): tagged openings now 2
> Dec 14 02:01:21 shell /kernel: (da0:ahc0:0:0:0): tagged openings now 30
> Dec 14 10:07:42 shell /kernel: (da0:ahc0:0:0:0): SCB 0x1c - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
> Dec 14 10:07:45 shell /kernel: SEQADDR == 0x8
> Dec 14 10:07:45 shell /kernel: SSTAT1 == 0xa
> Dec 14 10:07:45 shell /kernel: (da0:ahc0:0:0:0): Queuing a BDR SCB
> Dec 14 10:07:45 shell /kernel: (da0:ahc0:0:0:0): Bus Device Reset Message Sent
> Dec 14 10:07:45 shell /kernel: (da0:ahc0:0:0:0): no longer in timeout, status = 34b
> Dec 14 10:07:45 shell /kernel: ahc0: Bus Device Reset Sent. 1 SCBs aborted 

The 'timed out while idle' messages basically mean that a command timed out
while we were waiting for it to complete, and there was nothing else going
on at the time.  I think the timeout is generally around 10 seconds.  (It
may actually be 60 now...)

Generally, it's a sign that the device has gone "out to lunch", and we have
to whap it over the head with a BDR to get it to wake up.

Another example of a device with that sort of problem is the Quantum Atlas
II, especially with firmware revisions earlier than LYK8.  Earlier firmware
revisions of that drive will go "out to lunch" when there is a lot of bus
traffic.

You also have another problem, which wasn't evident in your earlier mail.
The tagged openings on one of your RAID partitions have gone down to 2.
This is a prime example of why it's a good idea to print out the number of
tagged openings by default.  (Take note all you whiners out there who've
been complaining about it.)

That indicates that the device keeps sending queue full until we reduce the
number of tagged openings to the lowest possible value (2).  I would
suggest looking in the CMD docs and trying to figure out whether they say
how many simultaneous transactions the device can handle.  Take that
number, divide it by 2 (you've got two partitions on the device), and make
that the maximum number of tags in a quirk entry in the transport layer.
Make the minimum number of tags something slightly less than that.

If they don't say how many tags the thing can handle, a good measurement is
something in the neighborhood of the first 'reduced tags' number you get.
I'm not sure whether the above messages are all of the tagged openings
messages, but if they are, you might assume that the thing can handle around
32 tags total.  Divided by 2, that's 16.  So you could try setting the
maximum for each lun to 16, and the minimum to 10 or so.

Generally, the system will recover all right from the 'timed out while
idle' problem.  After we hit the device with a BDR, all the CCBs that have
already been sent to the device are aborted, and we requeue them all.

> On Fri, 11 Dec 1998, spork wrote:
> 
> > Hi,
> > 
> > I'm about to put two new machines in production, and they're both "core"
> > machines; main dns/auth/mail and a shell machine.  Currently the machines
> > we use in this capacity are 2.1.7.1, and it's been very stable.
> > 
> > Now the new machines share a RAID array hung off of a CMD CRD-5440.  I
> > patched our usual build (980825 -stable) with the July CAM patchkit, as
> > the existing AHC driver couldn't detect any LUNs beyond the first one.
> > 
> > All has been well so far, I've tried to stress the machines as much as
> > possible by running some disk benchmarks over and over, but yesterday one
> > locked up (console frozen) with the following messages being the last
> > thing on the console:
> > 
> > Dec 10 18:13:15 shell /kernel: (da0:ahc0:0:0:0): SCB 0x1e - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
> > Dec 10 18:13:18 shell /kernel: SEQADDR == 0xa
> > Dec 10 18:13:18 shell /kernel: SSTAT1 == 0xb
> > Dec 10 18:13:18 shell /kernel: (da0:ahc0:0:0:0): Queuing a BDR SCB
> > Dec 10 18:13:18 shell /kernel: (da0:ahc0:0:0:0): Bus Device Reset Message Sent
> > Dec 10 18:13:18 shell /kernel: (da0:ahc0:0:0:0): no longer in timeout, status = 34b
> > Dec 10 18:13:18 shell /kernel: ahc0: Bus Device Reset Sent. 2 SCBs aborted
> > 
> > I had to give it a hard reset at this point.
> > 
> > So my questions are:  Is this a known issue?  Does it point to a possible
> > hardware problem?  Will there be a newer cam patchkit for -stable?
> > 
> > I don't think it's a cabling issue, as this is the first I've seen of any
> > anomalies with the scsi subsystem, and the only cabling in question here
> > is a high quality 2' external UW scsi cable from the back of this machine
> > to the RAID array.  The other machine that uses the other host port on the
> > RAID array remained functional during this glitch...
> > 
> > Any ideas?  I was very comfortable with CAM before, but now I'm a little
> > nervous about moving this into production.  Would it be better to try and
> > back out of the patches and use the ahc driver?  Let me know if there's
> > any other info needed.
> > 
> > Following are the boot messages...
> > 
> > Thanks,
> > 
> > Charles
> > 
> > Dec 10 19:27:32 shell /kernel: Copyright (c) 1992-1998 FreeBSD Inc.
> > Dec 10 19:27:32 shell /kernel: Copyright (c) 1982, 1986, 1989, 1991, 1993
> > Dec 10 19:27:32 shell /kernel: The Regents of the University of California.  All rights reserved.
> > Dec 10 19:27:32 shell /kernel: 
> > Dec 10 19:27:32 shell /kernel: FreeBSD 2.2.7-19980825-SNAP #0: Thu Dec 10 12:02:45 EST 1998
> > Dec 10 19:27:32 shell /kernel: spork@shell.inch.com:/usr/src/sys/compile/SHELL
> > Dec 10 19:27:32 shell /kernel: CPU: Pentium II (quarter-micron) (350.80-MHz 686-class CPU)
> > Dec 10 19:27:32 shell /kernel: Origin = "GenuineIntel"  Id = 0x651  Stepping=1
> > Dec 10 19:27:32 shell /kernel: Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,<b16>,<b17>,MMX,<b24>>
> > Dec 10 19:27:32 shell /kernel: real memory  = 268435456 (262144K bytes)
> > Dec 10 19:27:32 shell /kernel: avail memory = 261144576 (255024K bytes)
> > Dec 10 19:27:32 shell /kernel: Probing for devices on PCI bus 0:
> > Dec 10 19:27:32 shell /kernel: chip0 <generic PCI bridge (vendor=8086 device=7190 subclass=0)> rev 2 on pci0:0:0
> > Dec 10 19:27:32 shell /kernel: chip1 <generic PCI bridge (vendor=8086 device=7191 subclass=4)> rev 2 on pci0:1:0
> > Dec 10 19:27:32 shell /kernel: chip2 <Intel 82371AB PCI-ISA bridge> rev 2 on pci0:4:0
> > Dec 10 19:27:32 shell /kernel: chip3 <Intel 82371AB IDE interface> rev 1 on pci0:4:1
> > Dec 10 19:27:32 shell /kernel: chip4 <Intel 82371AB USB interface> rev 1 int d irq 12 on pci0:4:2
> > Dec 10 19:27:32 shell /kernel: chip5 <Intel 82371AB Power management controller> rev 2 on pci0:4:3
> > Dec 10 19:27:32 shell /kernel: fxp0 <Intel EtherExpress Pro 10/100B Ethernet> rev 5 int a irq 10 on pci0:7:0
> > Dec 10 19:27:32 shell /kernel: fxp0: Ethernet address 00:e0:18:90:36:4d
> > Dec 10 19:27:32 shell /kernel: ahc0 <Adaptec 2940 Ultra SCSI adapter> rev 1 int a irq 12 on pci0:9:0
> > Dec 10 19:27:32 shell /kernel: ahc0: aic7880 Wide Channel A, SCSI Id=7, 16/255 SCBs
> > Dec 10 19:27:32 shell /kernel: fxp1 <Intel EtherExpress Pro 10/100B Ethernet> rev 5 int a irq 10 on pci0:10:0
> > Dec 10 19:27:32 shell /kernel: fxp1: Ethernet address 00:a0:c9:e7:ac:7d
> > Dec 10 19:27:32 shell /kernel: vga0 <VGA-compatible display device> rev 211 int a irq 11 on pci0:11:0
> > Dec 10 19:27:32 shell /kernel: Probing for devices on PCI bus 1:
> > Dec 10 19:27:32 shell /kernel: Probing for devices on the ISA bus:
> > Dec 10 19:27:32 shell /kernel: sc0 at 0x60-0x6f irq 1 on motherboard
> > Dec 10 19:27:32 shell /kernel: sc0: VGA color <16 virtual consoles, flags=0x0>
> > Dec 10 19:27:32 shell /kernel: sio0 at 0x3f8-0x3ff irq 4 on isa
> > Dec 10 19:27:32 shell /kernel: sio0: type 16550A
> > Dec 10 19:27:32 shell /kernel: sio1 at 0x2f8-0x2ff irq 3 on isa
> > Dec 10 19:27:32 shell /kernel: sio1: type 16550A
> > Dec 10 19:27:32 shell /kernel: lpt0 at 0x378-0x37f irq 7 on isa
> > Dec 10 19:27:32 shell /kernel: lpt0: Interrupt-driven port
> > Dec 10 19:27:32 shell /kernel: lp0: TCP/IP capable interface
> > Dec 10 19:27:32 shell /kernel: fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
> > Dec 10 19:27:32 shell /kernel: fdc0: FIFO enabled, 8 bytes threshold
> > Dec 10 19:27:32 shell /kernel: fd0: 1.44MB 3.5in
> > Dec 10 19:27:32 shell /kernel: npx0 flags 0x1 on motherboard
> > Dec 10 19:27:32 shell /kernel: npx0: INT 16 interface
> > Dec 10 19:27:32 shell /kernel: IP packet filtering initialized, divert enabled, logging limited to 200 packets/entry
> > Dec 10 19:27:32 shell /kernel: da0 at ahc0 bus 0 target 0 lun 0
> > Dec 10 19:27:32 shell /kernel: da0: <CMD TECH CRD-5440-1 C1-5> Fixed Direct Access SCSI2 device 
> > Dec 10 19:27:32 shell /kernel: da0: 40.0MB/s transfers (20.0MHz, offset 8, 16bit), Tagged Queueing Enabled
> > Dec 10 19:27:32 shell /kernel: da0: 6999MB (14335872 512 byte sectors: 64H 32S/T 6999C)
> > Dec 10 19:27:32 shell /kernel: da1 at ahc0 bus 0 target 0 lun 1
> > Dec 10 19:27:32 shell /kernel: da1: <CMD TECH CRD-5440-1 C1-5> Fixed Direct Access SCSI2 device 
> > Dec 10 19:27:32 shell /kernel: da1: 40.0MB/s transfers (20.0MHz, offset 8, 16bit), Tagged Queueing Enabled
> > Dec 10 19:27:32 shell /kernel: da1: 10431MB (21362688 512 byte sectors: 64H 32S/T 10431C)
> > Dec 10 19:27:32 shell /kernel: WARNING: / was not properly dismounted.
> > Dec 10 19:27:32 shell /kernel: nfs server 10.0.0.1:/var/mail: not responding
> > Dec 10 19:27:32 shell savecore: no core dump
> > 
> > ---
> > Charles Sprickman
> > spork@super-g.com
> > --- 
> >                      "...there's no idea that's so good you can't 
> >                       ruin it with a few well-placed idiots." 
> > 
> > 
> > To Unsubscribe: send mail to majordomo@FreeBSD.org
> > with "unsubscribe freebsd-scsi" in the body of the message
> > 
> 


Ken
-- 
Kenneth Merry
ken@plutotech.com
