Date:      Mon, 3 Jun 1996 13:52:44 +1000
From:      Bruce Evans <bde@zeta.org.au>
To:        deborah@microunity.com, gusw@zedat.fu-berlin.de
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: Adaptec 2940 U makes fatal bus resets!
Message-ID:  <199606030352.NAA22079@godzilla.zeta.org.au>

>I thus made a 2.2-960501-SNAP kernel and tried it out -- nothing went
>better with the new kernel. So, I had to help myself: The problem, and
>the fixes for it -- yes, it seems like I fixed the problem -- point
>into a major problem that FreeBSD might have particularly on fast
>machines: timeout timers or counters seem to be initialized too small,

Maybe.

>and thus, timeout states occur prematurely. Two evidences from
>different parts of the kernel: (1) the fdc driver and (2) the aic7xxx
>driver.

>(1) FDC driver

>Please look at this (i386/isa/fdc.c):

>int
>in_fdc(fdcu_t fdcu)
>{
>	int baseport = fdc_data[fdcu].baseport;
>	int i, j = 100000;
>	while ((i = inb(baseport+FDSTS) & (NE7_DIO|NE7_RQM))
>		!= (NE7_DIO|NE7_RQM) && j-- > 0)
>		if (i == NE7_RQM)
>			return fdc_err(fdcu, "ready for output in input\n");
>	if (j <= 0)
>		return fdc_err(fdcu, "input ready timeout\n");
>...

>This is obviously a counter, not a timer. My machine is fast, it
>counts considerably more in the same amount of time, and thus results
>in nasty timeouts (that even lock the machine sometimes)

It actually acts as a timer.  inb() is very slow on all machines.  On
all ISA machines, inb() takes about 1-1.25 usec.  On PCI machines, it
may be faster, but it probably won't be more than a few times faster,
and certainly can't be more than 100 times faster.  The initial count
is large enough to allow for a speedup of a few thousand.
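To illustrate why a counted polling loop behaves as a timer, here is a
rough user-space sketch (not the driver code).  fake_inb(), struct
fake_port, poll_ready() and the ST_* bits are made-up stand-ins for the
real inb() and NE7_* status bits; the only assumption is that each
iteration costs one port read of roughly fixed duration.

```c
#include <stdint.h>

#define ST_RQM 0x80	/* hypothetical stand-in for NE7_RQM */
#define ST_DIO 0x40	/* hypothetical stand-in for NE7_DIO */

/* Simulated status port: becomes ready after a fixed number of reads. */
struct fake_port {
	uint32_t reads;		/* status reads performed so far */
	uint32_t ready_after;	/* read number at which both bits are set */
	uint32_t ns_per_read;	/* assumed cost of one port read */
};

static uint8_t
fake_inb(struct fake_port *p)
{
	p->reads++;
	return (p->reads >= p->ready_after) ? (ST_RQM | ST_DIO) : 0;
}

/*
 * Poll at most max_reads times.  Returns 0 if the port became ready,
 * -1 if the count ran out.  Since every iteration does exactly one
 * port read, the elapsed time is simply reads * ns_per_read.
 */
int
poll_ready(struct fake_port *p, uint32_t max_reads, uint64_t *elapsed_ns)
{
	int rc = -1;

	for (uint32_t n = 0; n < max_reads; n++) {
		if ((fake_inb(p) & (ST_RQM | ST_DIO)) == (ST_RQM | ST_DIO)) {
			rc = 0;
			break;
		}
	}
	*elapsed_ns = (uint64_t)p->reads * p->ns_per_read;
	return rc;
}
```

In this model the CPU speed does not appear anywhere: the elapsed time
depends only on the number of reads and the cost of each read.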

On my ASUS P55TP4XE (rev.2.4), inb(0x1f0+FDSTS) actually takes 1180 ns,
so the loop goes only about 10/9 or 11/9 times as fast as on my slow ISA
systems, and the loop times out after about 118 ms.  Timeouts occurred
because of a bug elsewhere in the driver and unusual behaviour of the
UMC i/o chip.  The chip sometimes interrupted early in response to i/o
commands.  This caused the driver to enter the spinloop too early and
busy-wait until i/o completion.  118 ms is long enough for i/o to
complete in most cases except after a seek, when it usually takes
slightly less than one disk revolution (200 ms or 167 ms) for i/o to
complete.  Increasing the timeout masked the problem.  The fix was to
clear all the interrupts generated by reset instead of just one.  This
has been fixed in -current and -stable for a couple of months.
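The arithmetic behind these figures can be checked with a small sketch.
The helper names are made up for illustration, and the 300 rpm and 360
rpm spindle speeds are inferred from the 200 ms and 167 ms revolution
times quoted above.

```c
#include <stdint.h>

/* Worst-case duration of the counted loop, in microseconds,
 * assuming one port read per iteration. */
uint64_t
spin_timeout_us(uint32_t iterations, uint32_t ns_per_read)
{
	return ((uint64_t)iterations * ns_per_read) / 1000u;
}

/* Time for one disk revolution, in microseconds (rounded down). */
uint64_t
revolution_us(uint32_t rpm)
{
	return 60000000u / rpm;
}
```

With 100000 iterations at 1180 ns per read, spin_timeout_us() gives
118000 us, which is shorter than revolution_us(360) (166666 us) and
revolution_us(300) (200000 us).  So a spinloop entered too early, right
after a seek, can run out before the i/o completes.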

>  We need to depend the init value of j on the speed of the
>machine. And, after all, we shouldn't just count and block the whole
>machine from doing better things. Insert a tsleep()!

Interrupt handlers can't call tsleep().  In this case, there is nothing
better to do than to busy-wait, since setting up a timeout would take
much longer than the expected wait time.

>First I define a constant with the counter value times 10, for a basic
>safety, such that it can be predefined as an option in the config
>file. I use the old value 100000 for my i486/33 ISA machine, and the
>times 10 value for the i586/133 PCI -- the timeouts didn't occur since
>I did this! But one can clearly watch the machine hang for a few
>milliseconds, when e.g. fdformat(8) is running (see how the regular
>blinking of the cursor stucks) -- I bet that a tsleep() instead of the
>counter would fix this for ever.

I saw it by watching systat.  An i586/133 PCI shouldn't have a 10% overhead
for floppy interrupts!  This also showed that increasing the timeout was
the wrong fix.

>O.K. that's for the FD controller driver, but the real nasty thing
>will be fixed now!

>(2) the PCI ahc driver (i386/scsi/aic7xxx.c)

I don't know much about this.

>O.K. since I no longer trust the time/tick/hz management and proper
>adjustment of my kernel to high CPU speeds, I decided to just increase
>the timeout values by the same factor of 10.

The timeout() and clock interrupt and higher level parts (including
everything to do with hz) can be trusted.

>void
>ahc_scb_timeout(unit, ahc, scb)
>...

>			timeout(ahc_timeout, (caddr_t)scb, ( TOFACT * 2 * hz));

2 seconds was already a lot.  The fatal problem is probably in poor handling
of SCSI errors.
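For reference, a small sketch of what the tick argument means.
timeout() takes a count of clock ticks, and hz is the number of ticks
per second, so 2 * hz is 2 seconds whatever the CPU speed.  The helper
name is made up, and the hz values in the note below are only examples.

```c
/* Convert a timeout() tick count to milliseconds for a given hz. */
unsigned long
ticks_to_ms(unsigned long ticks, unsigned long hz)
{
	return ticks * 1000ul / hz;
}
```

With hz = 100, ticks_to_ms(2 * hz, hz) is 2000 ms, and with the factor
of 10 from the quoted change, ticks_to_ms(10 * 2 * hz, hz) is 20000 ms.
The results are the same for hz = 1000, since hz cancels out.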

Bruce


