From: Jeremy Chadwick
Date: Fri, 11 May 2007 11:10:58 -0700
To: Richard Puga
Cc: freebsd-stable@freebsd.org
Subject: Re: Intel ICH5 UDMA100 controller TIMEOUT - READ_DMA
Message-ID: <20070511181058.GA34752@icarus.home.lan>
In-Reply-To: <4644A5C6.4BB0BAD@mauibuilt.com>

On Fri, May 11, 2007 at 07:20:06AM -1000, Richard Puga wrote:
> I am working with a new IBM XSeries 226 server.
>
> It worked fine with the original 80 gig drives.
>
> Upon replacing them with 2 new Hitichi 500 gig drives I get DMA timouts
> at random times while using the on board Intel SATA controller.
>
> I put a Promice SATA controller in the machine and everything works
> great.

There's no mention of which FreeBSD version and kernel build date you're
using.  The output of uname -a would be very useful here.

> kernel: ad3: TIMEOUT - READ_DMA retrying (1 retry left) LBA=0
> kernel: ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left)
> LBA=324524575
> kernel: ad2: TIMEOUT - READ_DMA retrying (1 retry left) LBA=3780487
> kernel: ad2: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=2651511
> and so on....

The interesting part is that the LBAs are all over the place; they're
not sequential, which means (in my opinion) the drive itself is fine.

> atapci1: port
> 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x14a0-0x14af at device 31.1 on pci0
>
> ad4: 476940MB at ata2-master SATA150
> ad6: 476940MB at ata3-master SATA150

Some clarification: these drives are not attached to atapci1.  They're
attached to a different PCI device.  UDMA100 is the ATA/IDE port (read:
old PATA), not an SATA port.  What you should be pointing to is
something that looks like this:

atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f irq 18 at device 31.2 on pci0

(The above example is from a machine we have sitting around doing heavy
I/O work due to MySQL.  We have no disk problems there.)

Now... I have seen similar behaviour to what you've described on an
Intel-based SATA controller (ICH6) with a Western Digital drive that I
have personally used and determined to be reliable on Windows, and
verified as such with WD's testing software under DOS too.  I've only
seen this happen *once* on the system.

That system:

FreeBSD eos.sc1.parodius.com 6.2-STABLE FreeBSD 6.2-STABLE #0: Thu Mar 8 10:41:09 PST 2007 root@eos.sc1.parodius.com:/usr/obj/usr/src/sys/EOS i386

atapci0@pci0:31:2: class=0x010180 card=0x628015d9 chip=0x26528086 rev=0x03 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82801FR/FRW ICH6R/ICH6RW SATA Controller'
    class    = mass storage
    subclass = ATA

Master: ad0 Serial ATA II
Slave:  no device present

ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
ad0: timeout waiting to issue command
ad0: error issuing WRITE_DMA command
g_vfs_done():ad0s1d[WRITE(offset=16821780480, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=16826417152, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=813531136, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=817922048, length=16384)]error = 5
g_vfs_done():ad0s1d[WRITE(offset=870563840, length=16384)]error = 5

And SMART (smartctl) shows absolutely no signs of any problems with the
drive (the Temperature_Celsius "In_the_past" flag is how the drive came
from the factory; I think Western Digital was doing some testing, who
knows).

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   214   214   021    Pre-fail  Always       -       4283
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4145
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
190 Temperature_Celsius     0x0022   063   042   045    Old_age   Always   In_the_past 37
194 Temperature_Celsius     0x0022   113   092   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%      3925         -
# 2  Extended offline    Completed without error       00%      3921         -
# 3  Short offline       Completed without error       00%      3920         -
# 4  Short offline       Completed without error       00%      3080         -
# 5  Short offline       Completed without error       00%      3039         -
# 6  Short offline       Completed without error       00%      2898         -
# 7  Short offline       Completed without error       00%      2613         -
# 8  Short offline       Completed without error       00%        43         -

Finally, one can see for RELENG_6 that there are still ongoing changes.
There were some recent ones regarding DMA, but I believe they were for
ATAPI devices and not ATA (disk) devices.

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/

Note my system kernel is from March 8th.
Since then, there have been a lot of changes regarding DMA, including
some "oops I broke this" fixes which may explain what I am seeing, and
maybe what you are seeing too.  Those changes concern 64-bit DMA,
though, and I believe most of my systems (and yours?) are using 48-bit
DMA.

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/ata-dma.c

Soren might know what's going on here though...

-- 
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP: 4BD6C0CB |
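
A minimal sketch of the "LBAs are all over the place" check discussed
earlier in this message, assuming Python is available and that the
wrapped WRITE_DMA48 log line has been rejoined before parsing.  The
names parse_timeouts, lba_spread and SAMPLE are hypothetical, made up
for illustration, and the regular expression only covers the message
format quoted in this thread.

```python
import re

# Matches ata timeout messages in the form quoted in this thread, e.g.
# "ad2: TIMEOUT - READ_DMA retrying (1 retry left) LBA=3780487".
TIMEOUT_RE = re.compile(
    r"(?P<dev>ad\d+): TIMEOUT - (?P<cmd>\w+) retrying .*?LBA=(?P<lba>\d+)"
)


def parse_timeouts(log_text):
    """Return a list of (device, command, lba) tuples found in log_text."""
    return [
        (m.group("dev"), m.group("cmd"), int(m.group("lba")))
        for m in TIMEOUT_RE.finditer(log_text)
    ]


def lba_spread(events):
    """Map each device to (timeout count, lowest LBA, highest LBA)."""
    spread = {}
    for dev, _cmd, lba in events:
        count, lo, hi = spread.get(dev, (0, lba, lba))
        spread[dev] = (count + 1, min(lo, lba), max(hi, lba))
    return spread


# Log lines from the quoted report, with the wrapped line rejoined.
SAMPLE = """\
kernel: ad3: TIMEOUT - READ_DMA retrying (1 retry left) LBA=0
kernel: ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=324524575
kernel: ad2: TIMEOUT - READ_DMA retrying (1 retry left) LBA=3780487
kernel: ad2: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=2651511
"""

if __name__ == "__main__":
    spread = lba_spread(parse_timeouts(SAMPLE))
    for dev, (count, lo, hi) in sorted(spread.items()):
        print(f"{dev}: {count} timeouts, LBA range {lo}..{hi}")
```

A wide gap between the lowest and highest LBA on one device is
consistent with timeouts scattered across the disk rather than confined
to one bad region.  It is only a hint and does not by itself prove the
drive is healthy.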