From owner-freebsd-stable@FreeBSD.ORG  Fri May 11 18:11:07 2007
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
X-Original-To: freebsd-stable@freebsd.org
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id F322316A40A
	for <freebsd-stable@freebsd.org>; Fri, 11 May 2007 18:11:06 +0000 (UTC)
	(envelope-from jdc@koitsu.dyndns.org)
Received: from alnrmhc13.comcast.net (alnrmhc13.comcast.net [206.18.177.53])
	by mx1.freebsd.org (Postfix) with ESMTP id C93B913C4B0
	for <freebsd-stable@freebsd.org>; Fri, 11 May 2007 18:11:06 +0000 (UTC)
	(envelope-from jdc@koitsu.dyndns.org)
Received: from icarus.home.lan
	(c-71-198-0-135.hsd1.ca.comcast.net[71.198.0.135])
	by comcast.net (alnrmhc13) with ESMTP
	id <20070511181058b1300dp8dve>; Fri, 11 May 2007 18:11:06 +0000
Received: by icarus.home.lan (Postfix, from userid 1000)
	id 3D2331FA01D; Fri, 11 May 2007 11:10:58 -0700 (PDT)
Date: Fri, 11 May 2007 11:10:58 -0700
From: Jeremy Chadwick <koitsu@FreeBSD.org>
To: Richard Puga <puga@mauibuilt.com>
Message-ID: <20070511181058.GA34752@icarus.home.lan>
Mail-Followup-To: Richard Puga <puga@mauibuilt.com>, freebsd-stable@freebsd.org
References: <4644A5C6.4BB0BAD@mauibuilt.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4644A5C6.4BB0BAD@mauibuilt.com>
User-Agent: Mutt/1.5.15 (2007-04-06)
Cc: freebsd-stable@freebsd.org
Subject: Re: Intel ICH5 UDMA100 controller TIMEOUT - READ_DMA
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 11 May 2007 18:11:07 -0000

On Fri, May 11, 2007 at 07:20:06AM -1000, Richard Puga wrote:
> I am working with a new IBM XSeries 226 server.
> 
> It worked fine with the original 80 gig drives.
> 
> Upon replacing them with 2 new Hitachi 500 gig drives I get DMA timeouts
> at random times while using the on board Intel SATA controller.
> 
> I put a Promise SATA controller in the machine and everything works
> great.

You didn't mention which FreeBSD version and kernel build date you're
running.  The output of uname -a would be very useful here.

>  kernel: ad3: TIMEOUT - READ_DMA retrying (1 retry left) LBA=0
>  kernel: ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=324524575
>  kernel: ad2: TIMEOUT - READ_DMA retrying (1 retry left) LBA=3780487
>  kernel: ad2: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=2651511
> and so on....

The interesting part is that the LBAs are all over the place; they're
not sequential, which suggests (in my opinion) the drive itself is fine.
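
If you want to check the spread over a longer stretch of
/var/log/messages, something like the following would do.  This is only
a rough Python sketch: the regex is my assumption based on the four
lines you quoted, and the function name lbas_by_device is made up for
illustration.  It is not anything that ships with FreeBSD.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   kernel: ad2: TIMEOUT - READ_DMA retrying (1 retry left) LBA=3780487
# The pattern is an assumption based on the messages quoted in this thread.
TIMEOUT_RE = re.compile(
    r"(?P<dev>ad\d+): TIMEOUT - (?P<cmd>[A-Z_0-9]+) retrying .*LBA=(?P<lba>\d+)"
)

def lbas_by_device(log_text):
    """Return {device: [LBA, ...]} for every ATA timeout line in log_text."""
    result = defaultdict(list)
    for match in TIMEOUT_RE.finditer(log_text):
        result[match.group("dev")].append(int(match.group("lba")))
    return dict(result)

# The four messages from the original report, with the wrapped line rejoined.
sample = """\
kernel: ad3: TIMEOUT - READ_DMA retrying (1 retry left) LBA=0
kernel: ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=324524575
kernel: ad2: TIMEOUT - READ_DMA retrying (1 retry left) LBA=3780487
kernel: ad2: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=2651511
"""

for dev, lbas in sorted(lbas_by_device(sample).items()):
    print(dev, "count", len(lbas), "min", min(lbas), "max", max(lbas))
```

A failing platter area would usually show the same LBA, or a tight
range of them, coming up again and again.  Widely scattered values on
more than one disk point more towards the controller, cabling or
driver.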

> atapci1: <Intel ICH5 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x14a0-0x14af at device 31.1 on pci0
> 
> ad4: 476940MB <Hitachi HDT725050VLA360 V56OA73A> at ata2-master SATA150
> ad6: 476940MB <Hitachi HDT725050VLA360 V56OA73A> at ata3-master SATA150

Some clarification:

These drives are not attached to atapci1.  They're attached to a
different PCI device.  UDMA100 is the ATA/IDE port (read: old PATA), not
a SATA port.  What you should be looking at is something that looks
like this:

atapci0: <Intel ICH5 SATA150 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f irq 18 at device 31.2 on pci0

(The above example is from a machine we have sitting around doing
heavy I/O work due to MySQL.  We have no disk problems there.)
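
If it helps when going through a full dmesg, the disk attach lines
already carry the channel and the negotiated mode, and the mode is what
hints at the kind of port the disk sits on.  A small Python sketch for
pulling those fields out follows.  The regex is my guess at the format
based on the two ad4/ad6 lines you pasted, and parse_disk_line is a
made-up helper name.

```python
import re

# Matches disk attach lines like the ones quoted in this thread, e.g.
#   ad4: 476940MB <Hitachi HDT725050VLA360 V56OA73A> at ata2-master SATA150
# The layout is assumed from those lines only.
DISK_RE = re.compile(
    r"^(?P<dev>ad\d+): (?P<size_mb>\d+)MB <(?P<model>[^>]+)> "
    r"at (?P<channel>ata\d+)-(?P<role>master|slave) (?P<mode>\S+)$"
)

def parse_disk_line(line):
    """Return a dict of fields from one attach line, or None if it doesn't match."""
    match = DISK_RE.match(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["size_mb"] = int(info["size_mb"])
    return info

disk = parse_disk_line(
    "ad4: 476940MB <Hitachi HDT725050VLA360 V56OA73A> at ata2-master SATA150"
)
print(disk["dev"], disk["channel"], disk["role"], disk["mode"])
```

Both of your Hitachi drives report SATA150 here, which is why I say the
UDMA100 controller line is the wrong one to be looking at.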

Now...

I have seen similar behaviour to what you've described on an Intel-based
SATA controller (ICH6) with a Western Digital drive that I have
personally used and determined to be reliable on Windows and verified as
such with WD's testing software under DOS too.  I've only seen this
happen *once* on the system.  That system:

FreeBSD eos.sc1.parodius.com 6.2-STABLE FreeBSD 6.2-STABLE #0: Thu Mar 8 10:41:09 PST 2007 root@eos.sc1.parodius.com:/usr/obj/usr/src/sys/EOS  i386

atapci0@pci0:31:2:      class=0x010180 card=0x628015d9 chip=0x26528086 rev=0x03 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82801FR/FRW ICH6R/ICH6RW SATA Controller'
    class      = mass storage
    subclass   = ATA

Master:  ad0 <WDC WD2500KS-00MJB0/02.01C03> Serial ATA II
Slave:       no device present

ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
g_vfs_done():ad0s1d[WRITE(offset=16821780480, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=16826417152, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=813531136, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=817922048, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=870563840, length=16384)]error = 5

And SMART (smartctl) shows absolutely no signs of any problems with the
drive (the Temperature_Celsius "In_the_past" entry is how the drive came
from the factory -- I think Western Digital was doing some testing, who
knows.)

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   214   214   021    Pre-fail  Always       -       4283
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4145
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
190 Temperature_Celsius     0x0022   063   042   045    Old_age   Always   In_the_past 37
194 Temperature_Celsius     0x0022   113   092   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%      3925         -
# 2  Extended offline    Completed without error       00%      3921         -
# 3  Short offline       Completed without error       00%      3920         -
# 4  Short offline       Completed without error       00%      3080         -
# 5  Short offline       Completed without error       00%      3039         -
# 6  Short offline       Completed without error       00%      2898         -
# 7  Short offline       Completed without error       00%      2613         -
# 8  Short offline       Completed without error       00%        43         -
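
For anyone who would rather not read the attribute table by eye, here
is a minimal Python sketch that flags rows worth a second look.  It
assumes the ten-column layout shown above.  The function names and the
rule (WHEN_FAILED is set, or WORST has reached THRESH) are my own
choices, not part of smartmontools.

```python
def parse_smart_attributes(table_text):
    """Parse rows of a smartctl-style attribute table into dicts."""
    rows = []
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric attribute ID and have 10 columns;
        # the "ID# ATTRIBUTE_NAME ..." header is skipped by the isdigit test.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "when_failed": fields[8],
            "raw": fields[9],
        })
    return rows

def suspicious(rows):
    """Rows that failed now or in the past, or whose WORST reached THRESH."""
    return [
        r for r in rows
        if r["when_failed"] != "-" or (r["thresh"] > 0 and r["worst"] <= r["thresh"])
    ]

# A few rows copied from the table above.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
190 Temperature_Celsius     0x0022   063   042   045    Old_age   Always   In_the_past 37
194 Temperature_Celsius     0x0022   113   092   000    Old_age   Always       -       37
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

for row in suspicious(parse_smart_attributes(sample)):
    print(row["id"], row["name"], row["when_failed"])
```

On this drive only attribute 190 gets flagged, which is the factory
temperature oddity already mentioned.  The reallocation, pending-sector
and CRC counters are all clean.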

Finally, one can see for RELENG_6 that there are still ongoing changes.
There were some recent ones regarding DMA, but I believe they were for
ATAPI devices and not ATA (disk) devices.

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/

Note my system kernel is from March 8th.  Since then, there have been a
lot of changes regarding DMA, including some "oops I broke this" fixes
which may explain what I am seeing, and maybe what you are too.  Those
fixes are in regards to 64-bit DMA though, and I believe most of my
systems (and yours?) are using 48-bit DMA.

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/ata-dma.c

Soren might know what's going on here though...

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |