Date:      Thu, 12 Aug 2004 19:32:11 GMT
From:      Wayne Cox <wc_fbsd@xxiii.com>
To:        freebsd-gnats-submit@FreeBSD.org
Subject:   kern/70379: System hangs under heavy disk IO with SiI 3112 SATA150 controller and Western Digital drive
Message-ID:  <200408121932.i7CJWBaV033200@www.freebsd.org>
Resent-Message-ID: <200408121940.i7CJePbL061005@freefall.freebsd.org>


>Number:         70379
>Category:       kern
>Synopsis:       System hangs under heavy disk IO with SiI 3112 SATA150 controller and Western Digital drive
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Aug 12 19:40:25 GMT 2004
>Closed-Date:
>Last-Modified:
>Originator:     Wayne Cox
>Release:        FreeBSD 5.2.1-RELEASE-p9 i386
>Organization:
Twenty-Three, Inc.  xxiii.com
>Environment:
FreeBSD stimpy.xxiii.com 5.2.1-RELEASE-p9 FreeBSD 5.2.1-RELEASE-p9 #4: Wed Aug 11 11:34:00 EDT 2004     root@stimpy.xxiii.com:/usr/src/sys/i386/compile/WMC  i386

Generic PC with Celeron 433MHz CPU, Adaptec 1210SA Serial-ATA controller using SiI 3112 chipset, Western Digital "Raptor" WD360GD SATA disk.

GENERIC kernel.

make.conf has  CFLAGS= -O -pipe;  NOPROFILE=true

Although I have experimented with patches and with kernel configuration and compilation, the problem is identical on a bone-stock installation.

>Description:
  Under heavy disk IO, the system reports a series of errors on the console, similar to "ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=xxxxxxx"

  Sometimes it recovers, but may go on to repeat similar errors.  At some point, it WILL error and hang.  There is no "panic" message or anything on the console.  Keyboard is unresponsive to ctrl-alt-del.  A hardware reset or power cycle is required to reboot.  Data corruption can be severe.
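
  As a rough way to see how often these timeouts occur before a hang, the matching lines can be counted in the system log. This is only a sketch: the helper name count_ata_timeouts is hypothetical, and /var/log/messages as the default log location is an assumption (the messages were observed on the console).

```shell
# Hypothetical helper: count WRITE_DMA timeout messages for one ATA device.
# $1 = device name (e.g. ad4), $2 = log file (assumed default: /var/log/messages)
count_ata_timeouts() {
    dev="$1"
    log="${2:-/var/log/messages}"
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c -- "${dev}: TIMEOUT - WRITE_DMA" "$log"
}
```

  For example, "count_ata_timeouts ad4" run after a reboot would show how many retries were logged before the hang, provided the messages reached the disk at all.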

  This may be similar or identical to kern/69446 or i386/59895, but those happened under somewhat different circumstances, and they make no mention of the timing issue (see the Fix section).

  SATA drives are rapidly supplanting IDE and SCSI drives, and it sure would be nice to be able to use them reliably.

>How-To-Repeat:
  Do random-seek-intensive IO on the drive.  I had a large (4GB) backup .tar on one filesystem, and was attempting to extract it to another filesystem on the same drive/spindle, e.g.:
  cd /fs2 ; tar -xvf /fs1/BigBackup.tar 

  Sequential reads don't seem to cause trouble.  For example:
    cat /fs1/BigBackup.tar > /dev/null
  causes no hiccups.

  Also, pulling files over the network hasn't caused problems.  Using rsh to pull a .tar from a remote system and un-tarring locally, or simply ftping big files works ok, even though they approach the 10MB/sec wire speed.

>Fix:
  This is a fairly slow system (433MHz) to start with.  In one similar bug report (kern/69446), the author couldn't even get the basic install to run.  So I'm speculating that it might be some sort of timing issue in the ata driver???

  One work-around I found is to artificially slow down the IO.  In the above example of reproducing the problem, I was able to successfully restore the file by wasting many CPU cycles piping the data through some compression, e.g.:
  cd /fs2 ; gzip -c --best /fs1/BigBackup.tar | zcat | tar -xvf -
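
  The same slowing-down idea could be sketched without the compression step by copying the data in small chunks with a pause between them. This is not what I ran: the throttle function below is a hypothetical helper, the chunk size and pause are arbitrary, and whether it avoids the hang on this hardware is untested.

```shell
# Hypothetical helper: copy stdin to stdout in chunks, sleeping between chunks.
# $1 = maximum chunk size in bytes (assumed default 65536)
# $2 = pause between chunks in seconds (assumed default 1)
throttle() {
    chunk="${1:-65536}"
    pause="${2:-1}"
    buf=$(mktemp) || return 1
    while :; do
        # One read of at most $chunk bytes; dd does not read ahead,
        # so nothing is lost between iterations.
        dd bs="$chunk" count=1 2>/dev/null > "$buf"
        # An empty buffer means end of input.
        [ -s "$buf" ] || break
        cat "$buf"
        sleep "$pause"
    done
    rm -f "$buf"
}
```

  It would be used in place of the gzip/zcat pair, e.g.:
  cd /fs2 ; throttle 65536 1 < /fs1/BigBackup.tar | tar -xvf -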

  I'm no kernel programmer.  But just as a shot in extreme darkness, I found some code in src/sys/dev/ata/ata-lowlevel.c setting a timeout value ("int timeout = 5000") with a comment "might be less for fast devices".  I tried changing it to 3000 and to 8000 and recompiling, but neither produced any apparent change.

>Release-Note:
>Audit-Trail:
>Unformatted:


