Date:      31 Jan 2001 19:21:07 -0500
From:      Andrew Heybey <ath@niksun.com>
To:        freebsd-scsi@freebsd.org
Cc:        Ian Dowse <iedowse@maths.tcd.ie>
Subject:   Re: Corruption on ahc reads - seems PCI latency related
Message-ID:  <85r91jqmmj.fsf@stiegl.niksun.com>
In-Reply-To: Ian Dowse's message of "Wed, 31 Jan 2001 22:53:10 +0000"
References:  <200101312253.aa86550@salmon.maths.tcd.ie>

Ian Dowse <iedowse@maths.tcd.ie> writes:

> We have a heavily loaded 4.2-STABLE NFS fileserver machine that
> has recently developed a file corruption problem. The corruption
> seems to be occurring during reads from one SCSI disk (da0). It
> appears that small regions (usually 18 bytes) of a read are 'missed',
> so the buffer cache ends up with mostly the new data, but some
> bytes are from whatever happened to be in the buffer cache before
> the read.
> 

  [...]

> The odd thing is that we can only reproduce the corruption when
> reading from da0 (Quantum 9Gb), while writing over NFS to another
> disk (I have only tried da2). Swapping out da0 with another similar
> disk did not help.
> 
> Anyway, today I tried fiddling with the PCI latency timer settings,
> and it seems that reducing the value of the ahc PCI latency timer
> makes the corruption go away. On this motherboard (Supermicro with
> onboard SCSI) the default PCI latency timer value on all devices
> is 0x40.  If I reduce this to 0x20 on ahc0,ahc1,fxp0,fxp1,pcib1,
> then I can't repeat the corruption. When I put it back to 0x40 on
> ahc0 and ahc1 the corruption returns.
> 
> Has anyone any ideas on what this might mean? If a FIFO somewhere
> is filling or a DMA is failing, shouldn't an error get back to the
> driver or OS somehow? Or is this just a sign of dying hardware?

This sounds almost exactly like a problem I had with 3.1 in 1999.
Under heavy disk and network load I would see exactly this problem.
Fiddling with the PCI latency registers seemed to fix the problem at
first but then it came back.  See kern/10243.  However (as noted at
the end of the PR) my problem went away with
sys/dev/aic7xxx/aic7xxx.seq revision 1.91.

Looking at the diffs from 1.90 to 1.91, the fix for the bug is:

+ultra2_dmafifoflush:
                or      DFCNTRL, FIFOFLUSH;
-               test    DFSTATUS, FIFOEMP jz . - 1;
+               /*
+                * The FIFOEMP status bit on the Ultra2 class
+                * of controllers seems to be a bit flaky.
+                * It appears that if the FIFO is full and the
+                * transfer ends with some data in the REQ/ACK
+                * FIFO, FIFOEMP will fall temporarily
+                * as the data is transferred to the PCI bus.
+                * This glitch lasts for fewer than 5 clock cycles,
+                * so we work around the problem by ensuring the
+                * status bit stays false through a full glitch
+                * window.
+                */
+               test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+               test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+               test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+               test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+               test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+
+ultra2_dmafifoempty:
+               /* Don't clobber an inprogress host data transfer */
+               test    DFSTATUS, MREQPEND      jnz ultra2_dmafifoempty;
+

In -stable, the corresponding code seems to be (rev 1.94.2.8):

ultra2_dmafifoflush:
		if ((ahc->bugs & AHC_AUTOFLUSH_BUG) != 0) {
			/*
			 * On Rev A of the aic7890, the autoflush
			 * features doesn't function correctly.
			 * Perform an explicit manual flush.  During
			 * a manual flush, the FIFOEMP bit becomes
			 * true every time the PCI FIFO empties
			 * regardless of the state of the SCSI FIFO.
			 * It can take up to 4 clock cycles for the
			 * SCSI FIFO to get data into the PCI FIFO
			 * and for FIFOEMP to de-assert.  Here we
			 * guard against this condition by making
			 * sure the FIFOEMP bit stays on for 5 full
			 * clock cycles.
			 */
			or	DFCNTRL, FIFOFLUSH;
			test	DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
			test	DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
			test	DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
			test	DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
		}
		test	DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
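
Both versions above rely on the same idea: do not trust a single read of
FIFOEMP, but require it to read as set several times in a row, and start
over if it ever reads clear. A minimal C sketch of that debounce follows.
It is only an illustration, not driver code: the FIFOEMP bit value, the
read callback, the trace helper, and the function names are all
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit value only; not taken from the real DFSTATUS layout. */
#define FIFOEMP 0x01u

/* Hypothetical callback that returns one sample of a status register. */
typedef uint8_t (*read_status_fn)(void *ctx);

/*
 * Sample the status register until `mask` has read as set on `need`
 * consecutive samples.  Any sample with the bit clear restarts the run,
 * which mirrors each "test ... jz ultra2_dmafifoflush" jumping back to
 * the top of the sequence.
 *
 * Returns the number of samples taken, or 0 if `max_reads` samples were
 * not enough.
 */
static size_t
wait_stable_set(read_status_fn read_status, void *ctx, uint8_t mask,
    unsigned need, size_t max_reads)
{
	unsigned run = 0;
	size_t n;

	assert(need > 0);
	for (n = 1; n <= max_reads; n++) {
		if ((read_status(ctx) & mask) != 0) {
			if (++run == need)
				return (n);
		} else {
			run = 0;	/* glitch seen: start over */
		}
	}
	return (0);
}

/* Test helper: replay a fixed list of samples, repeating the last one. */
struct trace_ctx {
	const uint8_t *samples;
	size_t len;
	size_t pos;
};

static uint8_t
read_trace(void *ctx)
{
	struct trace_ctx *t = ctx;
	size_t i = (t->pos < t->len) ? t->pos : t->len - 1;

	t->pos++;
	return (t->samples[i]);
}
```

With a trace in which the bit is briefly set, drops, and then stays set,
a single test would fall through on the early false reading, while
requiring five consecutive set samples waits for the later stable run.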

Maybe AHC_AUTOFLUSH_BUG does not get set for all the chips that
actually have the bug?  That is a WAG, since I am by no means an ahc
expert.

andrew