From owner-freebsd-hackers  Fri Mar  5 12:15:14 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from midten.fast.no (midten.fast.no [195.139.251.11])
	by hub.freebsd.org (Postfix) with ESMTP id 165E415151
	for <freebsd-hackers@FreeBSD.ORG>; Fri,  5 Mar 1999 12:14:57 -0800 (PST)
	(envelope-from tegge@fast.no)
Received: from fast.no (IDENT:tegge@midten.fast.no [195.139.251.11])
	by midten.fast.no (8.9.1/8.9.1) with ESMTP id VAA18648;
	Fri, 5 Mar 1999 21:14:35 +0100 (CET)
Message-Id: <199903052014.VAA18648@midten.fast.no>
To: sthaug@nethelp.no
Cc: stephw@xs4all.nl, freebsd-hackers@FreeBSD.ORG
Subject: Re: adaptec 2940u2w hangs on external disks
From: Tor.Egge@fast.no
In-Reply-To: Your message of "Fri, 05 Mar 1999 17:02:35 +0100"
References: <22812.920649755@verdi.nethelp.no>
X-Mailer: Mew version 1.70 on Emacs 19.34.1
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Date: Fri, 05 Mar 1999 21:14:35 +0100
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> A possibly related problem we've seen here: FreeBSD sometimes needs a
> hard reset (hit reset button) to reboot, while a software reboot will
> hang during bootup.
> 
> This happens on FreeBSD boxes with 3.1R or 3.1-STABLE, onboard Adaptec
> 7890 U2W controller. Various (Seagate, IBM) LVD disks on LVD chain,
> *and* DAT on single-ended chain. Using verbose boot, we see that the
> hang occurs while probing the DAT and/or the CDROM player on the SE
> chain.

I've also noticed this problem.  Having more disks on a machine gives
a higher probability of a hang.

I'm using a serial console and

	options         BREAK_TO_DEBUGGER

in the kernel config file.

Sending a break to enter the kernel debugger does not work once the
hang has occurred.

The virtual NMI pushbutton does not work once the hang has occurred
either (SMP kernel, IOAPIC reprogrammed to treat irq 3 as an NMI to be
delivered to CPU#0).

Programming CPU#1 to run with interrupts disabled (and lapic.tpr set to
255) while sending about 100K IPIs/second to CPU#0, in order to sample
the program counter on CPU#0, does not help.  CPU#1 stops running when
the hang occurs:


	e0122535 -> scsi_interpret_sense+0x1
	e011e59c -> xpt_release_devq+0x4
	e012ff21 -> ahc_action+0x1
	e021a002 -> splcam+0x46
	e01306a1 -> ahc_action+0x781
	e011f983 -> xpt_set_transfer_settings+0x7b
	e01302fb -> ahc_action+0x3db
	e012b501 -> ahc_find_syncrate+0x1
	e012b6ea -> ahc_update_target_msg_request+0xce
	HANG


	e0122586 -> scsi_interpret_sense+0x52
	e014725e -> free+0x3a
	e0130462 -> ahc_action+0x542
	e011f923 -> xpt_set_transfer_settings+0x1b
	e022aefa -> strncpy+0x16
	e011f9e3 -> xpt_set_transfer_settings+0xdb
	e0219fbc -> splcam+0x0
	e012b501 -> ahc_find_syncrate+0x1
	e012b6e7 -> ahc_update_target_msg_request+0xcb
	HANG
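
For readers who want to redo the address-to-symbol mapping shown in
the samples above, here is a minimal sketch in C.  It is not the tool
that produced those traces.  The start addresses in symtab[] are
assumptions inferred by subtracting each printed offset from its
sampled address; the names struct sym, sym_lookup and sym_format are
made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One entry of a symbol table sorted by ascending start address. */
struct sym {
	uint32_t    addr;
	const char *name;
};

/*
 * Start addresses inferred from the samples (sampled PC minus printed
 * offset, e.g. 0xe0122535 - 0x1).  They are assumptions, not values
 * read from a real kernel image.
 */
static const struct sym symtab[] = {
	{ 0xe011e598, "xpt_release_devq" },
	{ 0xe011f908, "xpt_set_transfer_settings" },
	{ 0xe0122534, "scsi_interpret_sense" },
	{ 0xe012b500, "ahc_find_syncrate" },
	{ 0xe012b61c, "ahc_update_target_msg_request" },
	{ 0xe012ff20, "ahc_action" },
	{ 0xe0147224, "free" },
	{ 0xe0219fbc, "splcam" },
	{ 0xe022aee4, "strncpy" },
};

#define NSYMS (sizeof(symtab) / sizeof(symtab[0]))

/*
 * Binary search for the last symbol whose start address is <= pc.
 * Returns NULL if pc lies below the first symbol in the table.
 */
static const struct sym *
sym_lookup(uint32_t pc)
{
	size_t lo = 0, hi = NSYMS;	/* search window is [lo, hi) */

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (symtab[mid].addr <= pc)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo == 0) ? NULL : &symtab[lo - 1];
}

/* Format pc as "name+0xoffset" into buf and return buf. */
static char *
sym_format(uint32_t pc, char *buf, size_t len)
{
	const struct sym *s = sym_lookup(pc);

	if (s == NULL)
		snprintf(buf, len, "%08x", (unsigned)pc);
	else
		snprintf(buf, len, "%s+0x%x", s->name,
		    (unsigned)(pc - s->addr));
	return buf;
}
```

Calling sym_format() on a sampled address fills buf with the symbol
name and offset.  Because the table holds only start addresses and no
sizes, an address past the end of a function is attributed to the
nearest preceding symbol.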

Going back to a UP kernel and adding limited debug output has given
the following reconstructed call stacks at the point of the hang:

	camisr
	probedone
	xpt_action
	xpt_set_transfer_settings
	ahc_action
	ahc_set_width
	ahc_update_target_msg_request
	unpause_sequencer

and

	camisr
	probedone
	xpt_action
	xpt_set_transfer_settings
	ahc_action
	ahc_set_syncrate
	ahc_update_target_msg_request
	unpause_sequencer


where ahc_inb(ahc, INTSTAT) in unpause_sequencer seems to hang.

Removing AHC_ALLOW_MEMIO from the kernel configuration file caused
the hang to occur a few lines earlier (while setting a new value for
TARGET_MSG_REQUEST or TARGET_MSG_REQUEST + 1).

I'm using a modified splvm (which blocks cam interrupts) and a modified
splsoftcam (which blocks cam interrupts during device probing), but this
does not prevent the hangs.

- Tor Egge

