Date: 31 Jan 2001 19:21:07 -0500
From: Andrew Heybey <ath@niksun.com>
To: freebsd-scsi@freebsd.org
Cc: Ian Dowse <iedowse@maths.tcd.ie>
Subject: Re: Corruption on ahc reads - seems PCI latency related
Message-ID: <85r91jqmmj.fsf@stiegl.niksun.com>
In-Reply-To: Ian Dowse's message of "Wed, 31 Jan 2001 22:53:10 +0000"
References: <200101312253.aa86550@salmon.maths.tcd.ie>

Ian Dowse <iedowse@maths.tcd.ie> writes:

> We have a heavily loaded 4.2-STABLE NFS fileserver machine that
> has recently developed a file corruption problem. The corruption
> seems to be occurring during reads from one SCSI disk (da0). It
> appears that small regions (usually 18 bytes) of a read are 'missed',
> so the buffer cache ends up with mostly the new data, but some
> bytes are from whatever happened to be in the buffer cache before
> the read.
> [...]
> The odd thing is that we can only reproduce the corruption when
> reading from da0 (Quantum 9Gb), while writing over NFS to another
> disk (I have only tried da2). Swapping out da0 with another similar
> disk did not help.
>
> Anyway, today I tried fiddling with the PCI latency timer settings,
> and it seems that reducing the value of the ahc PCI latency timer
> makes the corruption go away. On this motherboard (Supermicro with
> onboard SCSI) the default PCI latency timer value on all devices
> is 0x40. If I reduce this to 0x20 on ahc0,ahc1,fxp0,fxp1,pcib1,
> then I can't repeat the corruption. When I put it back to 0x40 on
> ahc0 and ahc1 the corruption returns.
>
> Has anyone any ideas on what this might mean? If a FIFO somewhere
> is filling or a DMA is failing, shouldn't an error get back to the
> driver or OS somehow? Or is this just a sign of dying hardware?

This sounds almost exactly like a problem I had with 3.1 in 1999.
Under heavy disk and network load I would see the same corruption.
Fiddling with the PCI latency registers seemed to fix the problem
at first, but then it came back. See kern/10243.

However (as noted at the end of the PR) my problem went away with
sys/dev/aic7xxx/aic7xxx.seq revision 1.91. Looking at the diffs from
1.90 to 1.91, the fix for the bug is:

+ultra2_dmafifoflush:
         or      DFCNTRL, FIFOFLUSH;
-        test    DFSTATUS, FIFOEMP jz . - 1;
+        /*
+         * The FIFOEMP status bit on the Ultra2 class
+         * of controllers seems to be a bit flaky.
+         * It appears that if the FIFO is full and the
+         * transfer ends with some data in the REQ/ACK
+         * FIFO, FIFOEMP will fall temporarily
+         * as the data is transferred to the PCI bus.
+         * This glitch lasts for fewer than 5 clock cycles,
+         * so we work around the problem by ensuring the
+         * status bit stays false through a full glitch
+         * window.
+         */
+        test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+        test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+        test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+        test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+        test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+
+ultra2_dmafifoempty:
+        /* Don't clobber an inprogress host data transfer */
+        test    DFSTATUS, MREQPEND jnz ultra2_dmafifoempty;
+

In -stable, the corresponding code seems to be (rev 1.94.2.8):

ultra2_dmafifoflush:
        if ((ahc->bugs & AHC_AUTOFLUSH_BUG) != 0) {
                /*
                 * On Rev A of the aic7890, the autoflush
                 * features doesn't function correctly.
                 * Perform an explicit manual flush. During
                 * a manual flush, the FIFOEMP bit becomes
                 * true every time the PCI FIFO empties
                 * regardless of the state of the SCSI FIFO.
                 * It can take up to 4 clock cycles for the
                 * SCSI FIFO to get data into the PCI FIFO
                 * and for FIFOEMP to de-assert. Here we
                 * guard against this condition by making
                 * sure the FIFOEMP bit stays on for 5 full
                 * clock cycles.
                 */
                or      DFCNTRL, FIFOFLUSH;
                test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
                test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
                test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
                test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
        }
        test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;

Maybe AHC_AUTOFLUSH_BUG does not get set for all the chips that
actually have the bug? That is a WAG, since I am by no means an ahc
expert.

andrew
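
For readers who do not read sequencer assembly, the repeated-test idiom in both revisions can be sketched in C. This is only an illustration of the idea (accept a status bit only after it has read as set on several consecutive polls), not the driver's code. The names sim_reg, sim_read, wait_stable and FIFOEMP_BIT, and the bit value 0x01, are made up for the example, and the register is simulated by replaying a fixed trace of values. Unlike the sequencer loops, the sketch does not re-issue the flush when it starts over.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit position, for illustration only. */
#define FIFOEMP_BIT 0x01u

/*
 * Hypothetical simulated status register: replays a fixed trace of
 * values, one per read. The trace must contain at least one entry.
 */
struct sim_reg {
    const uint8_t *trace;
    size_t len;
    size_t pos;
};

static uint8_t sim_read(struct sim_reg *r)
{
    /* Once the trace is exhausted, keep returning its last value. */
    size_t i = r->pos < r->len ? r->pos : r->len - 1;

    r->pos++;
    return r->trace[i];
}

/*
 * Poll until `mask` has been seen set on `needed` consecutive reads.
 * Any read with the bit clear restarts the count, which is what the
 * "jz ultra2_dmafifoflush" branches do in the sequencer code.
 * Returns the number of reads performed, or 0 if the condition was
 * not met within max_reads.
 */
static size_t wait_stable(struct sim_reg *r, uint8_t mask,
                          unsigned needed, size_t max_reads)
{
    unsigned run = 0;

    for (size_t n = 1; n <= max_reads; n++) {
        if (sim_read(r) & mask) {
            if (++run == needed)
                return n;
        } else {
            run = 0;
        }
    }
    return 0;
}
```

With a trace in which the bit briefly reads as set before the FIFO has really drained, a single-test loop (needed = 1) returns during the glitch, while needed = 5 keeps polling until the bit has held for five reads in a row.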