Date: Mon, 18 Aug 1997 21:59:02 -0600
From: "Justin T. Gibbs" <gibbs@plutotech.com>
To: Greg Lehey <grog@lemis.com>
Cc: freebsd-scsi@FreeBSD.ORG (FreeBSD SCSI Mailing List)
Subject: Re: Bus resets. Grrrr.
Message-ID: <199708190359.VAA10011@pluto.plutotech.com>
In-Reply-To: Your message of "Sun, 17 Aug 1997 10:59:43 +0930." <199708170129.KAA03776@freebie.lemis.com>

>This is the third time in a row that I haven't been able to complete a backup
>because of "recoverable" SCSI errors.  Here's a pretty typical scenario:

What version of the kernel are you using, and which AHC options are
enabled in your kernel config file?

>Aug 17 10:27:19 freebie /kernel: sd0: SCB 0x4 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
>
>What does this mean?  What can time out when nothing's happening?  Or is
>this a timeout accepting a new command when it shouldn't have to?  Is this
>a device or a driver logic error?

The message simply indicates the state of the SCSI bus at the time the
timeout occurred.  In this case, the sequencer was sitting in its idle
loop waiting for new work to be queued by the kernel driver or for a
device to reconnect and continue a previously established connection.
To understand why a timeout can occur in this state or any other, it's
probably best to explain the timeout model that is used in FreeBSD.

The "type" drivers (e.g. sd, cd, st, worm) specify a timeout value for
each command they issue.  When the Adaptec driver queues a command to
the controller, it sets up a "callout" that will occur timeout ms in
the future for the just queued command.  In this particular case, that
"callout" fired while the sequencer wasn't working on any commands.

So, what does a "timeout while idle" tell us?  It means that either the
timeout that the type driver (in this case the "st" driver) specified
was too short, or the aic7xxx driver lost the command somewhere en
route to or from the device.  The latter problem did occur under heavy
load prior to my latest "spin lock" change to the driver.  The first
problem seems really common in the st driver, especially when older
media or a rewind operation is involved.  You can try bumping up the
timeouts in sys/scsi/st.c to see if this solves your problem.
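
The timeout model described above can be sketched as a small simulation.
This is a toy model in Python, not the aic7xxx driver's code: the class
ToyController, its methods, and the 1000 ms timeout are hypothetical
names and values chosen purely for illustration.  It shows only the core
idea, that every queued command arms its own callout, and that a callout
firing while no command is connected on the bus produces a "timed out
while idle" report.

```python
import heapq


class ToyController:
    """Toy model of per-command timeouts; hypothetical, not driver code."""

    def __init__(self):
        self.now_ms = 0
        self.callouts = []        # min-heap of (deadline_ms, scb)
        self.outstanding = set()  # SCBs queued but not yet completed
        self.connected = None     # SCB currently on the bus, None when idle
        self.events = []          # (deadline_ms, scb, bus_state) per timeout

    def queue(self, scb, timeout_ms):
        """Queue a command and arm a callout timeout_ms in the future."""
        self.outstanding.add(scb)
        heapq.heappush(self.callouts, (self.now_ms + timeout_ms, scb))

    def complete(self, scb):
        """Mark a command done; a later callout for it is simply ignored.

        Assumption: ignoring a stale callout stands in for cancelling it.
        """
        self.outstanding.discard(scb)
        if self.connected == scb:
            self.connected = None

    def advance(self, ms):
        """Advance the clock and fire every callout that has come due."""
        self.now_ms += ms
        while self.callouts and self.callouts[0][0] <= self.now_ms:
            deadline, scb = heapq.heappop(self.callouts)
            if scb not in self.outstanding:
                continue  # command finished in time, nothing to report
            state = "idle" if self.connected is None else "busy"
            self.events.append((deadline, scb, state))


ctl = ToyController()
ctl.queue(scb=4, timeout_ms=1000)  # e.g. a slow rewind, device disconnected
ctl.queue(scb=5, timeout_ms=1000)
ctl.advance(500)
ctl.complete(5)                    # finishes well inside its timeout
ctl.advance(500)                   # SCB 4's callout fires on an idle bus
print(ctl.events)
```

In this run SCB 5 completes before its deadline and is ignored, while
SCB 4 is still outstanding when its callout fires with nothing connected.
Passing a larger timeout_ms for SCB 4 plays the role of bumping up the
timeout in the type driver.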
>
>Aug 17 10:27:31 freebie /kernel: SEQADDR = 0x9 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa
>Aug 17 10:27:31 freebie /kernel: sd0: Queueing an Abort SCB
>Aug 17 10:27:31 freebie /kernel: sd0: Abort Message Sent
>Aug 17 10:27:31 freebie /kernel: sd0: SCB 0x4 - timed out in message out phase, SCSISIGI == 0xa4
>Aug 17 10:27:31 freebie /kernel: SEQADDR = 0x9a SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0x2
>
>If I understand this correctly, this means that the abort SCB wasn't received
>either, so the driver does a bus reset:

What it means is that the tape drive accepted the connection from the
controller and most likely accepted the ABORT message, but took longer
than the driver allowed to process the abort request, free the bus, and
thus signal that the abort was successful.  So, we take out the hammer
and reset the bus.  The timeout in the aic7xxx driver for abort
requests may be too short.

>
>Aug 17 10:27:31 freebie /kernel: ahc0: Issued Channel A Bus Reset. 3 SCBs aborted
>Aug 17 10:27:32 freebie /kernel: Clearing bus reset
>Aug 17 10:27:32 freebie /kernel: Clearing 'in-reset' flag
>Aug 17 10:27:32 freebie /kernel: sd0: no longer in timeout
>
>... which works.
>
>Aug 17 10:27:32 freebie /kernel: sd0: SCB 0x4 - timed out in command phase, SCSISIGI == 0x84
>
>So why do we get another timeout?  Or is this overlapping?

This is due to the way that timeout processing occurs in the current
driver.  If I have 20 operations outstanding, each with a timeout
queued, those timeouts are still in effect while I am doing recovery
processing.  So, even if I just unwedged the bus, a timeout for a
pending command can fire "logically too soon".

>Aug 17 10:27:32 freebie /kernel: SEQADDR = 0x42 SCSISEQ = 0x12 SSTAT0 = 0x7 SSTAT1 = 0x2
>Aug 17 10:27:32 freebie /kernel: sd0: abort message in message buffer
>Aug 17 10:27:32 freebie /kernel: sd1: SCB 0x3 timedout while recovery in progress
>Aug 17 10:27:32 freebie /kernel: sd0: SCB 1 - Abort Completed.
>Aug 17 10:27:32 freebie /kernel: sd0: no longer in timeout
>Aug 17 10:27:32 freebie /kernel: sd1: UNIT ATTENTION asc:29,0
>Aug 17 10:27:32 freebie /kernel: sd1:  Power on, reset, or bus device reset occurred
>Aug 17 10:27:32 freebie /kernel: , retries:3
>
>So sd3 complains, but carries on with no harm done,

Yup.  The "sd" type driver seems to handle a bus reset gracefully.

>Aug 17 10:27:32 freebie /kernel: st0: UNIT ATTENTION asc:29,0
>Aug 17 10:27:32 freebie /kernel: st0:  Power on, reset, or bus device reset occurred
>Aug 17 10:27:32 freebie /kernel: st0: Target Busy
>
>but the tape dies.  Is there a good reason for this?  I would have
>thought that it would make sense for a power on or reset, but not
>for a bus reset.  Does a tape unit lose its position or data when
>it receives a bus reset?

As you can tell from the sense code that is returned, your tape drive
draws no distinction between "Power on", "reset" (i.e. bus reset), and
"bus device reset", and it is probably returning "Target Busy" because
it is going through its self test.  Any information regarding tape
position is almost certainly lost, as are, probably, the
compression/density settings.  The "st" driver should be able to
restore the drive to its previous condition, though, since it has all
of the information needed to do so.  This is a bug.

>Is anybody doing anything about this?

In a way I am, and in a way I'm not.  All of these problems are well
known to me and are being addressed in a complete rewrite of the
FreeBSD SCSI layer.  Ken Merry and I are working diligently on this
task, and I have chosen to put 99% of my effort into completing this
work instead of attempting to patch around the outstanding problems in
the current implementation.  The current status of the rewrite includes
support for the Adaptec aic7xxx cards, disks, cdroms, application pass
through (what scsi(8) would use to access devices), and fledgling
target mode support.
Once we have a tape driver, target mode, and some remaining error
recovery issues dealt with, the code will be available for beta test
while all remaining controller drivers are ported to the new system and
a few remaining features are added.  If you are willing to do without
some features and device support and are interested in giving early
feedback on the work, contact me in private mail.

>
>Greg
>

--
Justin T. Gibbs
===========================================
  FreeBSD: Turning PCs into workstations
===========================================