From owner-freebsd-stable@FreeBSD.ORG  Wed Feb 27 18:32:51 2008
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2E5FE1065679
	for <freebsd-stable@freebsd.org>; Wed, 27 Feb 2008 18:32:51 +0000 (UTC)
	(envelope-from shurd@sasktel.net)
Received: from misav09.sasknet.sk.ca (misav09.sasknet.sk.ca [142.165.20.173])
	by mx1.freebsd.org (Postfix) with ESMTP id DBEF58FC17
	for <freebsd-stable@freebsd.org>; Wed, 27 Feb 2008 18:32:50 +0000 (UTC)
	(envelope-from shurd@sasktel.net)
Received: from bgmpomr2.sasknet.sk.ca ([142.165.72.23]) by misav09 with
	InterScan Messaging Security Suite; Wed, 27 Feb 2008 12:32:50 -0600
Received: from server.hurd.local
	(adsl-76-202-204-46.dsl.lsan03.sbcglobal.net [76.202.204.46])
	by bgmpomr2.sasknet.sk.ca (SaskTel eMessaging Service)
	with ESMTPA id <0JWW008C5U6OAI30@bgmpomr2.sasknet.sk.ca>; Wed,
	27 Feb 2008 12:32:50 -0600 (CST)
Date: Wed, 27 Feb 2008 10:32:48 -0800
From: Stephen Hurd <shurd@sasktel.net>
In-reply-to: <20080227121129.GA76419@eos.sc1.parodius.com>
To: Jeremy Chadwick <koitsu@freebsd.org>
Message-id: <47C5ACD0.8000009@sasktel.net>
MIME-version: 1.0
Content-type: text/plain; charset=ISO-8859-1; format=flowed
Content-transfer-encoding: 7BIT
References: <47C52948.2070500@sasktel.net>
	<20080227121129.GA76419@eos.sc1.parodius.com>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.8.1.9)
	Gecko/20071123 SeaMonkey/1.1.6
Cc: freebsd-stable@freebsd.org
Subject: Re: ad0 READ_DMA TIMEOUT errors on install of 7.0-RELEASE
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 27 Feb 2008 18:32:51 -0000

Jeremy Chadwick wrote:
>> And after the reboot, the READ_DMA timeouts were back.
>>     
>
> You're not the only one seeing this behaviour.  There are too many posts
> in the past reporting similar.  Here's the breakdown:
>
> * Some have switched to alternate operating systems (usually Linux) for
>   a short while and seen no sign of DMA timeouts.
>   

Booting the 6.3-RELEASE CD seems to make the problem go away... possibly 
7.0 stresses the HD more?

> However: in your case, your disk does look to have problems based on the
> SMART output you provided.  It does not matter how new/old the disk is,
> by the way.  I'll point out the problematic stats.  You need to replace
> the disk ASAP.
>   

Yeah, that's pretty much what I figured; the timing (i.e., the moment I 
boot 7.0-RELEASE) is the only bit that seems fishy.  This HD has been 
powered on pretty much continuously for around three years.  Given that 
it's a Maxtor, I'm honestly a bit surprised that it's lasted as well as 
it has.

>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       4
>>     
>
> This shows you've had 4 reallocated sectors, meaning your disk does in
> fact have bad blocks.  In 90% of the cases out there, bad blocks
> continue to "grow" over time, due to whatever reason (I remember reading
> an article explaining it, but I can't for the life of me find the URL).
>   

This is unusual now?  I've always "known" that a small number of bad 
blocks is normal.  Time to readjust my knowledge again?

>> 194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       48
>>     
>
> This is excessive, and may be contributing to problems.  A hard disk
> running at 48C is not a good sign.  This should really be somewhere
> between high 20s and mid 30s.
>   

Yeah, this is a known problem with this drive... it's been running hot 
for years.  I always figured it was due to the rotational speed increase 
in commodity drives.
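
For reference, the two attribute rows quoted so far can be pulled apart 
mechanically.  This is only a rough sketch: it assumes the ten-column 
layout shown in the smartctl output above, the function names are made 
up for illustration, and the 35C limit is just one reading of "mid 30s" 
from this thread, not a vendor figure.

```python
# Rows copied from the smartctl output quoted in this thread.
ROWS = [
    "  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       4",
    "194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       48",
]

# Assumption: an illustrative limit taken from "mid 30s" in the thread.
TEMP_LIMIT_C = 35


def parse_attribute(line):
    """Split one attribute row into named fields (hypothetical helper)."""
    fields = line.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        # The raw column can contain spaces on some drives, so join the rest.
        "raw": " ".join(fields[9:]),
    }


def flag_attribute(attr):
    """Return a short note if the attribute looks worrying, else None."""
    if attr["id"] == 5 and int(attr["raw"]) > 0:
        return "%s reallocated sector(s)" % attr["raw"]
    if attr["id"] == 194 and int(attr["raw"]) > TEMP_LIMIT_C:
        return "temperature %sC is above %dC" % (attr["raw"], TEMP_LIMIT_C)
    return None


notes = [n for n in (flag_attribute(parse_attribute(r)) for r in ROWS) if n]
for note in notes:
    print(note)
```

Both rows get flagged with these limits, which matches what was pointed 
out above; whether either one explains the timeouts is a separate question.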

>> Error 2 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
>>   When the command that caused the error occurred, the device was in an unknown state.
>> Error 1 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
>>   When the command that caused the error occurred, the device was in an unknown state.
>>     
>
> These are automated SMART log entries confirming the DMA failures.  The
> fact that SMART saw them means that the disk is also aware of said
> issues.  These may have been caused by the reallocated sectors.  It's
> also interesting that the LBAs are different than the ones FreeBSD
> reported issues with.
>   

If that power on lifetime is accurate, that was at least a year ago... 
but I can't find any documentation as to when the power-on lifetime 
wraps or what it actually indicates.  I'm assuming that it is total 
power on time since the drive was manufactured.  If it's total hours as 
a 16-bit integer, it shouldn't wrap.  Is there a way of getting the 
"current" power-on lifetime value that you're aware of?  The 
Power_On_Minutes attribute is interesting, but its current value is 
lower than the value at the error (though higher than the uptime of the 
system):
  9 Power_On_Minutes        0x0032   219   219   000    Old_age   Always       -       1061h+40m
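
The arithmetic behind these counters is easy to check.  The sketch below 
only uses the numbers quoted in this message; the 16-bit counter width 
is an assumption (I have no documentation for how this drive stores 
either value), and the helper name is made up for illustration.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365


def split_hours(hours):
    """Return (days, leftover_hours), the form smartctl prints."""
    return divmod(hours, HOURS_PER_DAY)


# Lifetime recorded in the SMART error log entries above.
error_hours = 5171
error_days, error_rest = split_hours(error_hours)
print(error_days, error_rest)  # 215 11

# Assumption: an unsigned 16-bit counter.
max_u16 = 2**16 - 1

# If that counter holds hours, it lasts roughly 7.5 years before wrapping.
years_until_wrap = max_u16 / HOURS_PER_YEAR
print(round(years_until_wrap, 2))

# About three years of continuous power fits without wrapping.
three_years_hours = 3 * HOURS_PER_YEAR
print(three_years_hours, three_years_hours <= max_u16)

# If a 16-bit counter holds minutes instead, it wraps after 65536 minutes.
wrap_h, wrap_m = divmod(2**16, 60)
print(wrap_h, wrap_m)  # 1092 16

# The current Power_On_Minutes reading, converted to minutes.
power_on_minutes = 1061 * 60 + 40
print(power_on_minutes, power_on_minutes < 2**16)
```

So an hour counter of that width should not have wrapped in three years, 
while the 1061h+40m reading sits just under where a 16-bit minute counter 
would wrap.  That is suggestive at best, not proof of how the drive 
actually counts.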

Also interesting is that after getting more errors from FreeBSD, I did 
not get more errors in smartctl.

> My advice to you is: replace the disk ASAP.  This problem will only get
> worse.  Try another hard disk brand too (I don't have anything "against"
> Maxtor, but usually it's recommended to avoid a brand you have problems
> with until the next time you have issues, then switch brands, etc.
> etc...).  I'm very fond of Western Digital's SE16, RE, and RE2 series
> currently.  But avoid Fujitsu and Samsung (both have a long track record
> of having buggy drive firmwares, forcing vendors to make custom
> workarounds for issues); stick with Seagate, Western Digital, or Maxtor.
>   

Yeah, that's my plan... but I wanted to stake out some whining rights in 
advance so I can do the "But you said it was a bad HD or cable!  Now I'm 
out $x00 and my system still doesn't work!  Help me or I switch to 
DragonFly BSD/Desktop BSD/Linux which is perfect and has no problems!" 
thing.  Then go on Slashdot and post long rambling messages about how 
FreeBSD is dead and it doesn't matter that the manpages on any given 
Linux box are useless.