From owner-freebsd-scsi Sat Mar 29 09:27:33 1997
Return-Path:
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id JAA03603
          for freebsd-scsi-outgoing; Sat, 29 Mar 1997 09:27:33 -0800 (PST)
Received: from seine.cs.umd.edu (seine.cs.umd.edu [128.8.128.59])
          by freefall.freebsd.org (8.8.5/8.8.5) with ESMTP id JAA03596
          for ; Sat, 29 Mar 1997 09:27:31 -0800 (PST)
Received: by seine.cs.umd.edu (8.8.5/UMIACS-0.9/04-05-88)
          id MAA07478; Sat, 29 Mar 1997 12:27:29 -0500 (EST)
Message-Id: <199703291727.MAA07478@seine.cs.umd.edu>
To: scsi@freebsd.org
Cc: rohit@cs.umd.edu
Subject: Re: AHA2940 bug(s) still exist in 2.2.1
Date: Sat, 29 Mar 1997 12:27:28 -0500
From: Rohit Dube
Sender: owner-freebsd-scsi@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Hi,

I had posted the following to hardware earlier. Am reposting this to
scsi with some minor edits in the hope that it may help give the
developers some additional clues.

-->
I am seeing some weird problems with a couple of machines running
2.2-970225-GAMMA.

Every night when we run amanda's 'amdump', these machines crash. The
crash can also be triggered by a 'dump' to /dev/null or a 'dd'. (Not
entirely deterministic, but all 3 crash the machines most of the time.)
We tried 2.1.5, 2.1.7, 2.2-961006-SNAP and 2.2.1, which all exhibit the
same behaviour.

We have the following hardware on the machines which are crashing -
(curtailed dmesg output showing only the PCI devices)

Probing for devices on PCI bus 0:
chip0 rev 3 on pci0:0
chip1 rev 1 on pci0:7:0
chip2 rev 0 on pci0:7:1
vga0 rev 0 int a irq 12 on pci0:9
de0 rev 32 int a irq 10 on pci0:11
de0: SMC 9332BDT 21140A [10-100Mb/s] pass 2.0
de0: address 00:00:c0:03:6b:f9
ahc0 rev 0 int a irq 11 on pci0:12
ahc0: aic7880 Single Channel, SCSI Id=7, 16 SCBs
ahc0 waiting for scsi devices to settle
(ahc0:0:0): "MICROP 4421-07 0329SJ 0329" type 0 fixed SCSI 2
sd0(ahc0:0:0): Direct-Access 2047MB (4193360 512 byte sectors)
(ahc0:6:0): "SONY CD-ROM CDU-76S 1.2d" type 5 removable SCSI 2
cd0(ahc0:6:0): CD-ROM cd present [400000 x 2048 byte records]

The console shows the following error messages (which are not logged,
as the disk is inaccessible):

sd0(ahc0:0:0): no longer in timeout
ahc0: Issued Channel A Bus Reset: 2SCBs aborted
Clearing bus reset
Clearing 'in-reset' flag
Sd0(ahc0:0:0): SCB 0x1 - timed out while idle LASTPHASE == 0x1, SCSIISGI = 0x0
SEQADDR == 0x12

The above message repeats with different values for SEQADDR. The first
message which gets printed out says something like 'timed out in
command phase'. I can't paraphrase it here as it happened in the middle
of the night and scrolled off.

After resetting following this occurrence, the disk is not visible even
to the Adaptec probe on boot-up. We must power-cycle.

The block position where the error is triggered varies, by the way.

Has somebody else seen a problem like this before? Or would otherwise
know what is going on here?

Any help greatly appreciated! Just can't afford to have these machines
go down every night while doing a backup!!

Thanks.
--rohit.

PS: I am attaching the output of 'scsi -f /dev/rsd0 -m1' and 'df' here,
if that is of any use in tracking this problem.

# scsi -f /dev/rsd0 -m1
AWRE (Auto Write Reallocation Enbld): 1
ARRE (Auto Read Reallocation Enbld): 1
TB (Transfer Block): 0
RC (Read Continuous): 0
EER (Enable Early Recovery): 0
PER (Post Error): 0
DTE (Disable Transfer on Error): 0
DCR (Disable Correction): 0
Read Retry Count: 14
Correction Span: 28
Head Offset Count: 0
Data Strobe Offset Count: 0
Write Retry Count: 15
Recovery Time Limit: 0

# df
Filesystem   1K-blocks    Used    Avail Capacity  Mounted on
/dev/sd0a        47183   13098    30311    30%    /
/dev/sd0s1f    1822738  504147  1172772    30%    /usr
/dev/sd0s1e      98479    1372    89229     2%    /var
procfs               4       4        0   100%    /proc
amd:96               0       0        0   100%    /fs
<--
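
For anyone trying to reproduce this, the 'dd' trigger mentioned in the
message can be sketched roughly as below. This is only a guess at the
kind of load involved: a read-only sequential pass that throws the data
away. The helper name read_pass and the 65536-byte block size are
assumptions for illustration, not the exact command that was run;
/dev/rsd0 is the raw disk named in the 'scsi' command above.

```shell
#!/bin/sh
# Read-only sequential pass over a device or file, discarding the data.
# read_pass and the default block size are hypothetical, for illustration.
read_pass() {
    src="$1"
    bs="${2:-65536}"
    # dd exits non-zero on a read error; its transfer summary goes to stderr.
    if dd if="$src" of=/dev/null bs="$bs" 2>/dev/null; then
        echo "pass ok: $src"
    else
        echo "pass failed: $src"
        return 1
    fi
}

# Example (would need root, and reads the whole disk):
#   read_pass /dev/rsd0
```

Since nothing is written, a pass like this should be safe to repeat, but
on the affected machines it may well hang the bus in the way described.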