From: Jeremy Chadwick
To: Clint Olsen
Cc: stable@freebsd.org
Date: Tue, 16 Sep 2008 10:58:58 -0700
Subject: Re: Help debugging DMA_READ errors

On Tue, Sep 16, 2008 at 10:04:52AM -0700, Clint Olsen wrote:
> Ok, I've had some flakiness with my 6.3-STABLE
> (Sun May 25 21:55:57 PDT
> 2008) box. I assume that these errors are indicative of a system-level
> problem rather than a single disk:

Not necessarily, but FreeBSD makes debugging this kind of situation
fairly difficult. It takes time and a lot of patience. If the problem
is easily reproducible, that helps significantly.

> Event 1
> -------
> Sep 14 05:12:54 belle kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=216477719
>
> Result: Hard reset required
>
> Event 2
> -------
> Sep 16 02:11:09 belle kernel: ad4: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=172088735
> Sep 16 02:13:08 belle kernel: acd0: WARNING - READ_TOC taskqueue timeout - completing request directly
> Sep 16 02:13:09 belle kernel: acd0: timeout waiting for ATAPI ready
> Sep 16 02:13:09 belle kernel: acd0: error issuing ATA PACKET command
> Sep 16 02:13:09 belle kernel: acd0: WARNING - READ_TOC freeing taskqueue zombie request
> Sep 16 02:13:09 belle kernel: acd0: timeout waiting for ATAPI ready
> Sep 16 02:13:09 belle kernel: acd0: error issuing ATA PACKET command
> ...last two repeating until reset...
>
> Result: Hard reset required

The ad4 error looks very similar to your ad0 timeout earlier, just on a
different disk. acd0 is a CD/DVD drive. ad4 is a hard disk.

What exactly were you doing with the system at the time these errors
appeared? Were you using the CD/DVD drive? Was there a disc in the
drive that was mounted? If none of these apply, I'm baffled as to what
would read acd0 and cause what you see here.

I have a feeling all of these might be driven by a single southbridge
controller, which could be going bad, or "wedged" in some way. You've
now seen errors on ad0 (PATA device), ad4 (SATA device), and acd0
(unknown, but probably a PATA device).

> Disk configuration:
>
> ad0: 114473MB at ata0-master UDMA100
> ad4: 114473MB at ata2-master SATA150
> ad6: 476940MB at ata3-master SATA150

Can you please provide full details of what these disks are connected to?
I'd like to see dmesg output for ata0, ata2, and ata3, as well as the
atapci devices those ataX devices are attached to, ditto with vmstat -i
output. Are there any other errors in your logs around that time (e.g.
watchdog timeouts of any kind on network devices)?

Additionally, it would be very useful if you could install
ports/sysutils/smartmontools and provide the following output:

# smartctl -a /dev/ad0
# smartctl -a /dev/ad4

This will help determine whether either disk logged the DMA errors
reported, whether the disks are going bad, whether your machine somehow
lost power briefly, or whether you might have a voltage/PSU problem of
some kind.

> I'm using one of those eSATA converter brackets in the back of the machine
> for ad6. I'm guessing this doesn't have to do with this problem since that
> disk wasn't mentioned.

I can't say for certain. The above information will help.

> Any advice you can offer will be much appreciated.

The best advice I can give you is the above, combined with the Wiki
document below, which I've been writing as time permits. It is in no
way complete, and it may simply raise more questions than it answers.

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

The bottom line is that, if the problems you're seeing are the "same
thing" others are seeing, then you are not alone. As I said initially,
finding the source of these problems is difficult, and they are often
"unique" to each individual's machine. For some, replacing cables, the
entire motherboard, disk controller, or just the PSU helped; for others,
the problem disappeared on its own; in other cases, the problem was so
severe that they ended up switching to Linux.

-- 
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP: 4BD6C0CB |
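
The diagnostics requested above (ATA-related dmesg lines, vmstat -i, and
smartctl -a for each disk) can be gathered into a single report. This is
only a rough sketch, not a standard tool: the function names
filter_ata_lines, run_section, and collect_report and the report layout
are made up for illustration. It assumes smartctl from
sysutils/smartmontools is installed and that the commands are run as root.

```shell
#!/bin/sh
# Illustrative sketch only: function names and report layout are
# assumptions, not part of FreeBSD or smartmontools.

# Read kernel messages on stdin and keep only lines that mention an
# ATA controller, channel, disk, or ATAPI device (atapciN, ataN, adN, acdN).
filter_ata_lines() {
    grep -E '(^|[^[:alnum:]])(atapci|ata|ad|acd)[0-9]+' || true
}

# Print a heading, then run the given command; if the tool is not
# installed, say so instead of failing.
run_section() {
    echo "==== $* ===="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "($1 not found; skipped)"
    fi
    echo
}

# Build the full report for the disks named as arguments (e.g. ad0 ad4).
collect_report() {
    echo "==== dmesg (ATA-related lines) ===="
    dmesg 2>/dev/null | filter_ata_lines
    echo
    run_section vmstat -i
    for disk in "$@"; do
        run_section smartctl -a "/dev/$disk"
    done
}
```

After sourcing the file, something like
"collect_report ad0 ad4 > ata-report.txt" (the output file name is
arbitrary) would produce one file to attach to a reply.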