From owner-freebsd-scsi Sat Aug 16 18:30:25 1997
Return-Path:
Received: (from root@localhost) by hub.freebsd.org (8.8.5/8.8.5) id SAA22952 for freebsd-scsi-outgoing; Sat, 16 Aug 1997 18:30:25 -0700 (PDT)
Received: from nico.telstra.net (nico.telstra.net [139.130.204.16]) by hub.freebsd.org (8.8.5/8.8.5) with SMTP id SAA22945 for ; Sat, 16 Aug 1997 18:30:18 -0700 (PDT)
Received: from freebie.lemis.com (gregl1.lnk.telstra.net [139.130.136.133]) by nico.telstra.net (8.6.10/8.6.10) with ESMTP id LAA14147 for ; Sun, 17 Aug 1997 11:29:45 +1000
From: Greg Lehey
Received: (grog@localhost) by freebie.lemis.com (8.8.7/8.6.12) id KAA03776 for freebsd-scsi@freebsd.org; Sun, 17 Aug 1997 10:59:44 +0930 (CST)
Message-Id: <199708170129.KAA03776@freebie.lemis.com>
Subject: Bus resets. Grrrr.
To: freebsd-scsi@freebsd.org (FreeBSD SCSI Mailing List)
Date: Sun, 17 Aug 1997 10:59:43 +0930 (CST)
Organisation: LEMIS, PO Box 460, Echunga SA 5153, Australia
Phone: +61-8-8388-8250
Fax: +61-8-8388-8250
Mobile: +61-41-739-7062
WWW-Home-Page: http://www.lemis.com/~grog
X-Mailer: ELM [version 2.4ME+ PL32 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-scsi@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

This is the third time in a row that I haven't been able to complete a
backup because of "recoverable" SCSI errors. Here's a pretty typical
scenario:

Aug 17 10:27:19 freebie /kernel: sd0: SCB 0x4 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0

What does this mean? What can time out when nothing's happening? Or
is this a timeout accepting a new command when it shouldn't have to?
Is this a device or a driver logic error?

Aug 17 10:27:31 freebie /kernel: SEQADDR = 0x9 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa
Aug 17 10:27:31 freebie /kernel: sd0: Queueing an Abort SCB
Aug 17 10:27:31 freebie /kernel: sd0: Abort Message Sent
Aug 17 10:27:31 freebie /kernel: sd0: SCB 0x4 - timed out in message out phase, SCSISIGI == 0xa4
Aug 17 10:27:31 freebie /kernel: SEQADDR = 0x9a SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0x2

If I understand this correctly, this means that the abort SCB wasn't
received either, so the driver does a bus reset:

Aug 17 10:27:31 freebie /kernel: ahc0: Issued Channel A Bus Reset. 3 SCBs aborted
Aug 17 10:27:32 freebie /kernel: Clearing bus reset
Aug 17 10:27:32 freebie /kernel: Clearing 'in-reset' flag
Aug 17 10:27:32 freebie /kernel: sd0: no longer in timeout

... which works.

Aug 17 10:27:32 freebie /kernel: sd0: SCB 0x4 - timed out in command phase, SCSISIGI == 0x84

So why do we get another timeout? Or is this overlapping?

Aug 17 10:27:32 freebie /kernel: SEQADDR = 0x42 SCSISEQ = 0x12 SSTAT0 = 0x7 SSTAT1 = 0x2
Aug 17 10:27:32 freebie /kernel: sd0: abort message in message buffer
Aug 17 10:27:32 freebie /kernel: sd1: SCB 0x3 timedout while recovery in progress
Aug 17 10:27:32 freebie /kernel: sd0: SCB 1 - Abort Completed.
Aug 17 10:27:32 freebie /kernel: sd0: no longer in timeout
Aug 17 10:27:32 freebie /kernel: sd1: UNIT ATTENTION asc:29,0
Aug 17 10:27:32 freebie /kernel: sd1: Power on, reset, or bus device reset occurred
Aug 17 10:27:32 freebie /kernel: , retries:3

So sd1 complains, but carries on with no harm done,

Aug 17 10:27:32 freebie /kernel: st0: UNIT ATTENTION asc:29,0
Aug 17 10:27:32 freebie /kernel: st0: Power on, reset, or bus device reset occurred
Aug 17 10:27:32 freebie /kernel: st0: Target Busy

but the tape dies. Is there a good reason for this? I would have
thought that it would make sense for a power on or reset, but not for
a bus reset. Does a tape unit lose its position or data when it
receives a bus reset? Is anybody doing anything about this?

Greg
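
A note on reading the SCSISIGI values in the messages above: the following is a minimal sketch of a decoder, not the driver's own code. It assumes the bit layout I recall from the aic7xxx register definitions (the top three bits are the C/D, I/O and MSG lines, which together encode the bus phase; the low bits are ATN, SEL, BSY, REQ and ACK). The function name decode_scsisigi is hypothetical, and treating "neither BSY nor SEL asserted" as bus free is a simplification.

```python
# Assumed layout of the SCSISIGI register (from the aic7xxx register
# definitions as I recall them; treat as an assumption, not a reference).
PHASE_MASK = 0xE0  # C/D (0x80), I/O (0x40), MSG (0x20)

PHASES = {
    0x00: "data out",
    0x40: "data in",
    0x80: "command",
    0xA0: "message out",
    0xC0: "status",
    0xE0: "message in",
}

ATN, SEL, BSY, REQ, ACK = 0x10, 0x08, 0x04, 0x02, 0x01
CONTROL_BITS = [(ATN, "ATN"), (SEL, "SEL"), (BSY, "BSY"), (REQ, "REQ"), (ACK, "ACK")]


def decode_scsisigi(value):
    """Return (phase name, list of asserted control signals)."""
    asserted = [name for bit, name in CONTROL_BITS if value & bit]
    if not value & (BSY | SEL):
        # Simplification: with neither BSY nor SEL asserted the bus is
        # free, so the phase bits carry no meaning.
        return "bus free", asserted
    # 0x20 and 0x60 are not valid phase encodings, hence the fallback.
    phase = PHASES.get(value & PHASE_MASK, "reserved")
    return phase, asserted


if __name__ == "__main__":
    # The three SCSISIGI values that appear in the log excerpts.
    for raw in (0x00, 0xA4, 0x84):
        phase, asserted = decode_scsisigi(raw)
        print(f"SCSISIGI == {raw:#x}: {phase}, asserted: {asserted}")
```

Under these assumptions, 0xa4 and 0x84 decode to message out and command phase with BSY asserted, which agrees with the phase names the driver prints next to them, and 0x0 decodes to a free bus, which agrees with "timed out while idle".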