From: Stephen Hurd
To: Jeremy Chadwick
Cc: freebsd-stable@freebsd.org
Date: Wed, 27 Feb 2008 10:32:48 -0800
Subject: Re: ad0 READ_DMA TIMEOUT errors on install of 7.0-RELEASE
Message-id: <47C5ACD0.8000009@sasktel.net>
In-reply-to: <20080227121129.GA76419@eos.sc1.parodius.com>
References: <47C52948.2070500@sasktel.net> <20080227121129.GA76419@eos.sc1.parodius.com>

Jeremy Chadwick wrote:
>> And after the reboot, the READ_DMA timeouts were back.
>>
>
> You're not the only one seeing this behaviour.  There are too many posts
> in the past reporting similar.
> Here's the breakdown:
>
> * Some have switched to alternate operating systems (usually Linux) for
>   a short while and seen no sign of DMA timeouts.
>

Booting the 6.3-RELEASE CD seems to make the problem go away... possibly
7.0 stresses the HD more?

> However: in your case, your disk does look to have problems based on the
> SMART output you provided.  It does not matter how new/old the disk is,
> by the way.  I'll point out the problematic stats.  You need to replace
> the disk ASAP.
>

Yeah, that's pretty much what I figured, the timing (ie: the moment I
boot 7.0-RELEASE) is the only bit that seems fishy.  This HD has been
powered on pretty much continuously for around three years.  Given that
it's a Maxtor, I'm honestly a bit surprised that it's lasted as well as
it has.

>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       4
>>
>
> This shows you've had 4 reallocated sectors, meaning your disk does in
> fact have bad blocks.  In 90% of the cases out there, bad blocks
> continue to "grow" over time, due to whatever reason (I remember reading
> an article explaining it, but I can't for the life of me find the URL).
>

This is unusual now?  I've always "known" that a small number of bad
blocks is normal.  Time to readjust my knowledge again?

>> 194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       48
>>
>
> This is excessive, and may be attributing to problems.  A hard disk
> running at 48C is not a good sign.  This should really be somewhere
> between high 20s and mid 30s.
>

Yeah, this is a known problem with this drive... it's been running hot
for years.  I always figured it was due to the rotational speed increase
in commodity drives.

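As a reading aid for the two attribute rows quoted so far, here is a minimal Python sketch that splits a smartctl attribute line into its ten columns. The `SmartAttr` type and `parse_attr` helper are hypothetical names, not part of smartmontools, and the column layout is assumed to match the header line quoted above.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    """One row of the 'Vendor Specific SMART Attributes' table (hypothetical type)."""
    attr_id: int
    name: str
    flag: int
    value: int       # normalized value reported by the drive
    worst: int       # lowest normalized value seen so far
    thresh: int      # failure threshold for the normalized value
    attr_type: str   # "Pre-fail" or "Old_age"
    updated: str
    when_failed: str
    raw: str         # vendor-specific raw value, kept as text


def parse_attr(line: str) -> SmartAttr:
    """Split one whitespace-separated attribute row into named fields."""
    parts = line.split(None, 9)
    if len(parts) != 10:
        raise ValueError(f"expected 10 columns, got {len(parts)}: {line!r}")
    return SmartAttr(
        attr_id=int(parts[0]),
        name=parts[1],
        flag=int(parts[2], 16),
        value=int(parts[3]),
        worst=int(parts[4]),
        thresh=int(parts[5]),
        attr_type=parts[6],
        updated=parts[7],
        when_failed=parts[8],
        raw=parts[9].strip(),
    )


# The two rows quoted in this message.
ROWS = [
    "5 Reallocated_Sector_Ct 0x0033 253 253 063 Pre-fail Always - 4",
    "194 Temperature_Celsius 0x0032 253 253 000 Old_age Always - 48",
]

for row in ROWS:
    attr = parse_attr(row)
    margin = attr.value - attr.thresh
    print(f"{attr.name}: raw={attr.raw}, normalized margin over threshold={margin}")
```

In both rows the normalized VALUE (253) sits well above THRESH, which is consistent with WHEN_FAILED showing "-"; the reply is reading the RAW_VALUE column (4 reallocated sectors, 48 C) rather than the normalized one.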
>> Error 2 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
>>   When the command that caused the error occurred, the device was in an unknown state.
>> Error 1 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
>>   When the command that caused the error occurred, the device was in an unknown state.
>>
>
> These are automated SMART log entries confirming the DMA failures.  The
> fact that SMART saw them means that the disk is also aware of said
> issues.  These may have been caused by the reallocated sectors.  It's
> also interesting that the LBAs are different than the ones FreeBSD
> reported issues with.
>

If that power on lifetime is accurate, that was at least a year ago...
but I can't find any documentation as to when the power-on lifetime
wraps or what it actually indicates.  I'm assuming that it is total
power on time since the drive was manufactured.  If it's total hours as
a 16-bit integer, it shouldn't wrap.  Is there a way of getting the
"current" power-on lifetime value that you're aware of?

That power on minutes is interesting, but its current value is lower
than the value at the error (but higher than the power uptime of the
system):

  9 Power_On_Minutes        0x0032   219   219   000    Old_age   Always       -       1061h+40m

Also interesting is that after getting more errors from FreeBSD, I did
not get more errors in smartctl.

> My advice to you is: replace the disk ASAP.  This problem will only get
> worse.  Try another hard disk brand too (I don't have anything "against"
> Maxtor, but usually its recommended to avoid a brand you have problems
> with until the next time you have issues, then switch brands, etc.
> etc...).  I'm very fond of Western Digital's SE16, RE, and RE2 series
> currently.  But avoid Fujitsu and Samsung (both have a long track record
> of having buggy drive firmwares, forcing vendors to make custom
> workarounds for issues); stick with Seagate, Western Digital, or Maxtor.
>

Yeah, that's my plan...
but I wanted to stake out some whining rights in advance so I can do the
"But you said it was a bad HD or cable!  Now I'm out $x00 and my system
still doesn't work!  Help me or I switch to DragonFly BSD/Desktop
BSD/Linux which is perfect and has no problems!" thing.  Then go on
Slashdot and post long rambling messages about how FreeBSD is dead and
it doesn't matter that the manpages on any given Linux box are useless.
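
On the power-on lifetime question earlier in this message, the arithmetic can be sketched in a few lines of Python. This is only a sanity check under stated assumptions: nothing in the thread confirms that either counter is 16 bits wide, and the helper names `split_hours` and `wrap_point` are made up for illustration.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365  # ignoring leap days


def split_hours(hours: int) -> tuple[int, int]:
    """Return (days, leftover_hours) for a power-on hour count."""
    return divmod(hours, HOURS_PER_DAY)


def wrap_point(bits: int) -> int:
    """Number of ticks after which an unsigned counter of this width returns to 0."""
    return 2 ** bits


# Value from the quoted SMART error log.
error_hours = 5171
days, hours = split_hours(error_hours)
print(f"{error_hours} hours = {days} days + {hours} hours")

# ASSUMPTION: the error-log lifetime is a 16-bit count of hours.
hour_wrap = wrap_point(16)
print(f"16-bit hour counter wraps after {hour_wrap} hours "
      f"(about {hour_wrap / HOURS_PER_YEAR:.2f} years)")

# Roughly three years of continuous power-on time, as described above.
three_years = 3 * HOURS_PER_YEAR
print(f"three years powered on is about {three_years} hours")

# ASSUMPTION: Power_On_Minutes is a 16-bit count of minutes.
wrap_h, wrap_m = divmod(wrap_point(16), 60)
print(f"16-bit minute counter wraps after {wrap_h}h+{wrap_m}m")
```

If the hour counter really were 16 bits of total hours, three years of uptime would be nowhere near the wrap point, so 5171 hours would date the logged errors to well over a year ago. A 16-bit minute counter, by contrast, would wrap roughly every 45 days, which would be one possible reason for the Power_On_Minutes reading (1061h+40m) being lower than the error-log value; that is a guess, not something the drive documentation in this thread establishes.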