From owner-freebsd-bugs Mon Jun 12 16:50:14 2000
Delivered-To: freebsd-bugs@freebsd.org
Received: from freefall.freebsd.org (freefall.FreeBSD.ORG [204.216.27.21])
	by hub.freebsd.org (Postfix) with ESMTP id 8BD3F37B5AB
	for ; Mon, 12 Jun 2000 16:50:02 -0700 (PDT)
	(envelope-from gnats@FreeBSD.org)
Received: (from gnats@localhost)
	by freefall.freebsd.org (8.9.3/8.9.2) id QAA35485;
	Mon, 12 Jun 2000 16:50:03 -0700 (PDT)
	(envelope-from gnats@FreeBSD.org)
Date: Mon, 12 Jun 2000 16:50:03 -0700 (PDT)
Message-Id: <200006122350.QAA35485@freefall.freebsd.org>
To: freebsd-bugs@FreeBSD.org
Cc:
From: Geir Inge Jensen
Subject: Re: i386/19226: SCSI timeouts during heavy load
Reply-To: Geir Inge Jensen
Sender: owner-freebsd-bugs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

The following reply was made to PR i386/19226; it has been noted by GNATS.

From: Geir Inge Jensen
To: "Kenneth D. Merry"
Cc: FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: i386/19226: SCSI timeouts during heavy load
Date: Mon, 12 Jun 2000 16:42:57 -0700

"Kenneth D. Merry" wrote:
>
> [ Please make sure to CC any response to freebsd-gnats-submit@FreeBSD.ORG
> so your response makes it into the gnats database. ]
>
> On Mon, Jun 12, 2000 at 21:37:17 +0000, gij@jk.priv.no wrote:
> >
> > >Number: 19226
> > >Category: i386
> > >Synopsis: SCSI timeouts during heavy load
> > >Confidential: no
> > >Severity: serious
> > >Priority: high
> > >Responsible: freebsd-bugs
> > >State: open
> > >Quarter:
> > >Keywords:
> > >Date-Required:
> > >Class: sw-bug
> > >Submitter-Id: current-users
> > >Arrival-Date: Mon Jun 12 14:40:01 PDT 2000
> > >Closed-Date:
> > >Last-Modified:
> > >Originator: Geir Inge Jensen
> > >Release: FreeBSD 4.0-STABLE i386
> > >Organization:
> > None, only personal opinions expressed.
> > >Environment:
> >
> > Dell PowerEdge 2450 Dual 600MHz. Dell PowerVault 200S. Two AHA29160
> > SCSI cards, both connected to the PowerVault.
> >
> > 3 internal IBM DMVS 18GB disks.
> > 8 external disks in the PowerVault
> > (same disks).
> >
> > Relevant dmesg output:
>
> [ ... ]
>
> It would have probably been helpful to include the dmesg output from the
> disks as well, to get a better idea of the configuration.

da0 at ahc2 bus 0 target 0 lun 0
da0: Fixed Direct Access SCSI-3 device
da0: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da0: 17366MB (35566500 512 byte sectors: 255H 63S/T 2213C)
da2 at ahc0 bus 0 target 0 lun 0
da2: Fixed Direct Access SCSI-3 device
da2: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da2: 17366MB (35566500 512 byte sectors: 255H 63S/T 2213C)
da6 at ahc1 bus 0 target 8 lun 0
da6: Fixed Direct Access SCSI-3 device
da6: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da6: 17366MB (35566500 512 byte sectors: 255H 63S/T 2213C)
da10 at ahc3 bus 0 target 2 lun 0
da10: Fixed Direct Access SCSI-3 device
da10: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da10: 17366MB (35566500 512 byte sectors: 255H 63S/T 2213C)
da1 at ahc2 bus 0 target 1 lun 0
da1: Fixed Direct Access SCSI-3 device
da1: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da1: 17366MB (35566500 512 byte sectors: 255H 63S/T 2213C)
da5 at ahc0 bus 0 target 3 lun 0
da5: Fixed Direct Access SCSI-3 device
da5: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da5: 17366MB (35566500 512 byte sectors: 255H 63S/T 2213C)
da9 at ahc1 bus 0 target 11 lun 0
da9: Fixed Direct Access SCSI-3 device
da9: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da9: 17366MB (35566500 512 byte sectors: 255H 63S/T 2213C)
da4 at ahc0 bus 0 target 2 lun 0
da4: Fixed Direct Access SCSI-3 device
da4: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da4: 17366MB (35566500 512 byte sectors: 255H 63S/T 2213C)
da8 at ahc1 bus 0 target 10 lun 0
da8: Fixed Direct Access SCSI-3 device
da8: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da8: 17366MB (35566500 512 byte sectors: 255H 63S/T 2213C)
da3 at ahc0 bus 0 target 1 lun 0
da3: Fixed Direct Access SCSI-3 device
da3: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da3: 17366MB (35566500 512 byte sectors: 255H 63S/T 2213C)
da7 at ahc1 bus 0 target 9 lun 0
da7: Fixed Direct Access SCSI-3 device
da7: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da7: 17366MB (35566500 512 byte sectors: 255H 63S/T 2213C)

>
> You've got two SCSI busses connected to the *same* array? Is this
> controller a CMD OEM controller by any chance?

The array will automatically terminate the bus in the middle, so that you
get 4 disks on each bus. That's why I tried using only one bus against it,
to check for a malfunction in that autosplitter. But we have a lot of these
PowerVaults running fine with Dell PowerEdge 4350 and FreeBSD 3.3. Some of
the components in the PowerVault are made by Eurologic.
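The split can be checked against the attach lines in the dmesg output
above. The following is a minimal sketch, assuming Python is available;
the helper name disks_by_controller is made up for illustration and the
input is just the "daN at ahcM" lines copied from this message.

```python
import re
from collections import defaultdict

# Attach lines copied from the dmesg output in this message.
DMESG = """\
da0 at ahc2 bus 0 target 0 lun 0
da2 at ahc0 bus 0 target 0 lun 0
da6 at ahc1 bus 0 target 8 lun 0
da10 at ahc3 bus 0 target 2 lun 0
da1 at ahc2 bus 0 target 1 lun 0
da5 at ahc0 bus 0 target 3 lun 0
da9 at ahc1 bus 0 target 11 lun 0
da4 at ahc0 bus 0 target 2 lun 0
da8 at ahc1 bus 0 target 10 lun 0
da3 at ahc0 bus 0 target 1 lun 0
da7 at ahc1 bus 0 target 9 lun 0
"""

# Matches e.g. "da6 at ahc1 bus 0 target 8 lun 0".
ATTACH = re.compile(r"^(da\d+) at (ahc\d+) bus (\d+) target (\d+) lun (\d+)$")


def disks_by_controller(text):
    """Group disk names by controller, ordered by SCSI target ID."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = ATTACH.match(line.strip())
        if match:
            disk, ctrl, _bus, target, _lun = match.groups()
            groups[ctrl].append((int(target), disk))
    return {
        ctrl: [disk for _target, disk in sorted(members)]
        for ctrl, members in sorted(groups.items())
    }


for ctrl, disks in disks_by_controller(DMESG).items():
    print(ctrl, len(disks), " ".join(disks))
```

This lists four disks on ahc0 (targets 0-3) and four on ahc1 (targets
8-11), which matches the 4 + 4 split described above. The remaining three
disks sit on ahc2 and ahc3.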
> > acd0: CDROM at ata0-master using PIO4
> > pass2 at ahc2 bus 0 target 6 lun 0
> > pass2: Fixed Processor SCSI-2 device
> > pass2: 3.300MB/s transfers
> > pass7 at ahc0 bus 0 target 15 lun 0
> > pass7: Removable Processor SCSI-3 device
> > pass7: 3.300MB/s transfers
> > pass12 at ahc1 bus 0 target 15 lun 0
> > pass12: Removable Processor SCSI-3 device
> > pass12: 3.300MB/s transfers
> > pass14 at ahc3 bus 0 target 6 lun 0
> > pass14: Fixed Processor SCSI-2 device
> > pass14: 3.300MB/s transfers
> >
> > >Description:
> >
> > After a while, during heavy disk I/O, the following appears:
> >
> > (da2:ahc0:0:0:0): SCB 0x33 - timed out while idle, SEQADDR == 0x157
> > (da2:ahc0:0:0:0): Queuing a BDR SCB
> > (da6:ahc1:0:8:0): SCB 0x7c - timed out while idle, SEQADDR == 0x157
> > (da6:ahc1:0:8:0): Queuing a BDR SCB
> > (da2:ahc0:0:0:0): SCB 0x33 - timed out while idle, SEQADDR == 0x157
> > (da2:ahc0:0:0:0): no longer in timeout, status = 34b
> > ahc0: Issued Channel A Bus Reset. 7 SCBs aborted
> > (da6:ahc1:0:8:0): SCB 0x7c - timed out while idle, SEQADDR == 0x157
> > (da6:ahc1:0:8:0): no longer in timeout, status = 34b
> > ahc1: Issued Channel A Bus Reset. 7 SCBs aborted
> >
> > And so on. At this time, you don't have any contact with the PowerVault.
> > Of course, the ccd freaks out with this:
> >
> > ccd0: error 5 on component 0 block 80 (ccd block 64)
>
> That (the timeout messages) indicates that from the system's perspective,
> the array hasn't returned a read or write request in 60 seconds. So we
> reset it in an attempt to wake it up.

Yes, I have tried issuing a camcontrol reset/rescan without luck. We have
to reboot the machine to get contact with the disks. Most of the time we
still have contact with the system after the error has occurred. But once
in a while the system completely locks up (probably a deadlock or
something).

I briefly browsed through some patches for Linux (up until the point where
it works on these systems).
There are a lot of changes in the AIC7xxx driver. The sequencer code has
many changes, and they now issue a dummy read to flush write requests.
They apparently had a problem with scanning the same PCI bus twice (both as
a peer and as a child), so they have a fix for that too. But I am not
knowledgeable enough to really tell what's going on.

> > Notice that the error occurs on both buses at the same time! It can
> > take several hours before this happens. But we can reproduce it with
> > some patience and heavy load. The SCBs differ slightly from occasion
> > to occasion.
>
> > This is what we have tried to pinpoint the cause:
> >
> > - Replace all SCSI cables.
> > - Terminate the buses in the BIOS.
> > - Replace the AHA29160's with other AHA29160's.
> > - Replace the AHA29160's with AHA2940U2W's.
> > - Replace the internal PCI bus the cards plug into (PCI tray).
> > - Replace the ES Expander Modules in the PowerVault.
> > - Replace the PowerVault.
> > - Replace the PowerVault with a known good (and older revision) PowerVault
> >   (we have several of these running on Dell PowerEdge 4350 with
> >   3.3-STABLE on them). These older systems run fine.
> > - Test with 4.0-STABLE UP kernel.
> > - Test with 5.0-CURRENT UP kernel.
> > - Keep both external SCSI cards, but use only one of them.
> > - Remove one of the external SCSI cards, and use the internal 7899,
> >   channel B, as well against the PowerVault (i.e. two buses against it).
> > - Running RedHat 6.2 with 2.2.14-5 kernel on the same system.
> >
> > None of the above actions cured it. After some hours, it fails. Note that
> > the old PowerVault we tested from earlier systems contained other disks
> > (Seagate and Quantum), which work fine under 3.3-STABLE.
>
> That's quite a lot of diagnosis. Much better than most people who just say
> "it's broken". :)
>
> > From this testing, we have these conclusions:
> >
> > - There is nothing wrong with the PowerVault and the disk drives.
> > - There is nothing wrong with the SCSI cards.
> >
> > We also have some success stories:
> >
> > - Run the PowerVault from a single PCI card (i.e. remove the other).
> > - Run the PowerVault only from the internal 7899, channel B.
>
> In this configuration, did you have any other SCSI bus connected to the
> PowerVault?

No, only one bus, i.e. the autosplitter is not in action. However, since
this works fine under 3.3 on another system, I don't think it could be the
PowerVault's fault. That splitter works as it's supposed to (also under
Linux).

We also get the error if we put an extra PCI SCSI card into the above setup
(two cards, with only one connected to the PowerVault). Since that also
fails, it can't be a defect in the PowerVault. The only difference between
success and failure is that single idle PCI SCSI card, which suggests a PCI
or interrupt problem.

> > - linux-2.2.14-6.1.1 kernel (provided by Dell) with original HW setup.
> > - linux-2.2.15 kernel with original HW setup.
> >
> > To me, it sounds like a PCI problem (or maybe in the RCC LE chip). It
> > could also be a problem in the AIC7xxx driver, but it even failed with
> > the AHA2940U2W cards (which work fine in our 3.3 systems). But I am
> > only guessing here. However, Linux has obviously found a fix.
>
> I kinda wonder if this RAID array may be a CMD OEM or something.

There is no RAID controller in it. It has components from Eurologic, but I
don't know if the whole thing is made by them. Have a look at
http://www.dell.com/us/en/biz/products/spec_scsis_200_storage.htm for
further information.

> CMD controllers have trouble when you have multiple luns on the same
> controller in use. The symptoms are very similar to what you're
> describing.

It's not a CMD. And they don't share the bus. It's being split in two parts
(as you can see in the added dmesg output).
> The two 'solutions' for a CMD controller are:
> - only use one LUN
> - disable tagged queueing for both luns (you can do this either from CMD's
>   setup utility or from FreeBSD with camcontrol, or by putting a quirk
>   entry in the transport layer.)
>
> > >How-To-Repeat:
> >
> > Access every disk in the system, and produce a lot of I/O. I open all
> > disk devices in raw mode and do a lot of random seeks and reads.
> > However, we have experienced this error on mostly idle machines also.
>
> Except for the idle part, this sounds kinda like the CMD problem.
>
> One thing to try is disabling tagged queueing on both ports of the array.
> For example, to disable tagged queueing for the disk da20:
>
>     camcontrol negotiate da20 -v -T disable -a
>
> Then try running your tests again, and see if the problem happens again.
> If so, it may be that the array has problems with tagged queueing on
> multiple luns, like the CMD array controllers.

I can try this, but I doubt it will help. Or has something changed from 3.3
to 4.0 that requires this? Keep in mind that we use exactly the same
PowerVault with this setup on a lot of 4350's running FreeBSD 3.3.

- Geir Inge.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-bugs" in the body of the message
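
The load described under How-To-Repeat (open a disk device and do many
random seeks and reads) can be sketched roughly as follows. This is an
illustration only, not the program used for the report: the function name
random_reads, the 16-sector block size and the seed parameter are
assumptions, and only the 512-byte sector size comes from the dmesg
output.

```python
import os
import random

SECTOR = 512  # sector size reported for the disks in the dmesg output


def random_reads(path, span_bytes, count, block_sectors=16, seed=None):
    """Do `count` block-aligned reads at random offsets in `path`.

    Offsets are chosen within the first `span_bytes` of the device or
    file. Returns the total number of bytes read.
    """
    rng = random.Random(seed)
    block = block_sectors * SECTOR
    last_block = span_bytes // block - 1
    if last_block < 0:
        raise ValueError("span_bytes is smaller than one block")
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(count):
            # Keep offsets aligned to the block size so that raw
            # devices accept the request.
            offset = rng.randint(0, last_block) * block
            os.lseek(fd, offset, os.SEEK_SET)
            total += len(os.read(fd, block))
    finally:
        os.close(fd)
    return total
```

To approximate the reported load, one would run this concurrently against
every disk (one process per device, as root), with span_bytes set to
35566500 * 512 for these drives. Device paths such as /dev/da2 are an
assumption based on the dmesg names above.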