Date:      Sat, 29 Mar 1997 12:27:28 -0500
From:      Rohit Dube <rohit@cs.umd.edu>
To:        scsi@freebsd.org
Cc:        rohit@cs.umd.edu
Subject:   Re: AHA2940 bug(s) still exist in 2.2.1
Message-ID:  <199703291727.MAA07478@seine.cs.umd.edu>

Hi,

I had posted the following to hardware earlier. I am reposting it
to scsi with some minor edits, in the hope that it gives the
developers some additional clues.

-->

I am seeing crashes on a couple of machines
running 2.2-970225-GAMMA.

Every night when we run amanda's 'amdump', these machines
crash. The crash can also be triggered by a 'dump'
to /dev/null or a 'dd'. (It is not entirely deterministic, but all 3
crash the machines most of the time.) We tried
2.1.5, 2.1.7, 2.2-961006-SNAP and 2.2.1, which all exhibit the same behaviour.
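
For anyone who wants to try reproducing this, the 'dd' case amounts to
a sequential read of the whole disk with the data thrown away. A minimal
sketch follows. The helper name read_all and the 64k block size are
illustrative assumptions only, not the exact command that was run here;
/dev/rsd0 is the raw device of the disk in question.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): read a device or file
# sequentially and discard the data. This is the kind of sustained
# read load that triggers the crash described above.
# The 64k block size is an assumption, not a value from this report.
read_all() {
    dd if="$1" of=/dev/null bs=64k
}

# Example, as root on an affected machine:
#   read_all /dev/rsd0
```

Since the crash is not deterministic, it may take more than one pass
over the disk before the timeout messages appear.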

We have the following hardware on the machines which are crashing
(dmesg output trimmed to show only the PCI devices):

Probing for devices on PCI bus 0:
chip0 <Intel 82439> rev 3 on pci0:0
chip1 <Intel 82371SB PCI-ISA bridge> rev 1 on pci0:7:0
chip2 <Intel 82371SB IDE interface> rev 0 on pci0:7:1
vga0 <VGA-compatible display device> rev 0 int a irq 12 on pci0:9
de0 <Digital 21140A Fast Ethernet> rev 32 int a irq 10 on pci0:11
de0: SMC 9332BDT 21140A [10-100Mb/s] pass 2.0
de0: address 00:00:c0:03:6b:f9
ahc0 <Adaptec 2940 Ultra SCSI host adapter> rev 0 int a irq 11 on pci0:12
ahc0: aic7880 Single Channel, SCSI Id=7, 16 SCBs
ahc0 waiting for scsi devices to settle
(ahc0:0:0): "MICROP 4421-07   0329SJ 0329" type 0 fixed SCSI 2
sd0(ahc0:0:0): Direct-Access 2047MB (4193360 512 byte sectors)
(ahc0:6:0): "SONY CD-ROM CDU-76S 1.2d" type 5 removable SCSI 2
cd0(ahc0:6:0): CD-ROM cd present [400000 x 2048 byte records]


The console shows the following error messages (which are not logged, as
the disk is inaccessible):

sd0(ahc0:0:0): no longer in timeout
ahc0: Issued Channel A Bus Reset: 2SCBs aborted
Clearing bus reset
Clearing 'in-reset' flag
Sd0(ahc0:0:0): SCB 0x1 - timed out while idle
               LASTPHASE == 0x1, SCSIISGI = 0x0
	       SEQADDR == 0x12

The above message repeats with different values for SEQADDR.

The first message which gets printed out says something like
'timed out in command phase'. I can't quote it exactly here as it happened
in the middle of the night and scrolled off.

After a reset following this occurrence, the disk is not visible even to
the Adaptec probe on boot-up. We have to power-cycle the machine.
The block position where the error is triggered varies, by the way.

Has anybody else seen a problem like this before, or does anyone know
what is going on here?

Any help greatly appreciated! Just can't afford to have these machines
go down every night while doing a backup!!

Thanks.

--rohit.

PS: I am attaching the output of 'scsi -f /dev/rsd0 -m1' and 'df' here,
    if that is of any use in tracking this problem.

#scsi -f /dev/rsd0 -m1
AWRE (Auto Write Reallocation Enbld):  1 
ARRE (Auto Read Reallocation Enbld):  1 
TB (Transfer Block):  0 
RC (Read Continuous):  0 
EER (Enable Early Recovery):  0 
PER (Post Error):  0 
DTE (Disable Transfer on Error):  0 
DCR (Disable Correction):  0 
Read Retry Count:  14 
Correction Span:  28 
Head Offset Count:  0 
Data Strobe Offset Count:  0 
Write Retry Count:  15 
Recovery Time Limit:  0 

# df
Filesystem  1K-blocks     Used    Avail Capacity  Mounted on
/dev/sd0a       47183    13098    30311    30%    /
/dev/sd0s1f   1822738   504147  1172772    30%    /usr
/dev/sd0s1e     98479     1372    89229     2%    /var
procfs              4        4        0   100%    /proc
amd:96              0        0        0   100%    /fs

<--


