From owner-freebsd-stable@FreeBSD.ORG  Wed Feb 27 20:20:58 2008
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id EDA6A1065672
	for <freebsd-stable@freebsd.org>; Wed, 27 Feb 2008 20:20:57 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.freebsd.org (Postfix) with ESMTP id 8B9EB8FC17
	for <freebsd-stable@freebsd.org>; Wed, 27 Feb 2008 20:20:57 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from phobos.samsco.home (phobos.samsco.home [192.168.254.11])
	(authenticated bits=0)
	by pooker.samsco.org (8.13.8/8.13.8) with ESMTP id m1RKKoWW029163;
	Wed, 27 Feb 2008 13:20:50 -0700 (MST)
	(envelope-from scottl@samsco.org)
Message-ID: <47C5C622.5000209@samsco.org>
Date: Wed, 27 Feb 2008 13:20:50 -0700
From: Scott Long <scottl@samsco.org>
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US;
	rv:1.8.1.11) Gecko/20071128 SeaMonkey/1.1.7
MIME-Version: 1.0
To: Stephen Hurd <shurd@sasktel.net>
References: <47C52948.2070500@sasktel.net>	<20080227121129.GA76419@eos.sc1.parodius.com>
	<47C5ACD0.8000009@sasktel.net>
In-Reply-To: <47C5ACD0.8000009@sasktel.net>
X-Enigmail-Version: 0.95.6
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-1.4 required=5.4 tests=ALL_TRUSTED autolearn=failed
	version=3.1.8
X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on pooker.samsco.org
Cc: Jeremy Chadwick <koitsu@freebsd.org>, freebsd-stable@freebsd.org
Subject: Re: ad0 READ_DMA TIMEOUT errors on install of 7.0-RELEASE
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 27 Feb 2008 20:20:58 -0000

Stephen Hurd wrote:
>>
>> This shows you've had 4 reallocated sectors, meaning your disk does in
>> fact have bad blocks.  In 90% of the cases out there, bad blocks
>> continue to "grow" over time, due to whatever reason (I remember reading
>> an article explaining it, but I can't for the life of me find the URL).
>>   
> 
> This is unusual now?  I've always "known" that a small number of bad 
> blocks is normal.  Time to readjust my knowledge again?

Modern drives hide bad sectors by keeping a pool of spare tracks and
automatically remapping bad sectors to that pool.  The problem arises
when the drive has aged enough that it has run out of spares.
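
As a rough illustration of that remapping, here is a toy Python model.
The class name SpareRemapper, the pool size of 4, and the LBA values are
all made up for the example; real firmware does this internally and the
details vary by vendor.

```python
class SpareRemapper:
    """Toy model of a drive that remaps bad sectors to a fixed spare pool.

    Hypothetical illustration only; not how any particular firmware works.
    """

    def __init__(self, spare_count):
        self.free_spares = list(range(spare_count))  # unused spare slots
        self.remap = {}  # bad LBA -> spare slot index

    def mark_bad(self, lba):
        """Remap a bad LBA. Return True if the defect could be hidden,
        False if the spare pool is exhausted."""
        if lba in self.remap:
            return True  # already remapped earlier
        if not self.free_spares:
            return False  # out of spares: the bad sector becomes visible
        self.remap[lba] = self.free_spares.pop(0)
        return True

    @property
    def reallocated_count(self):
        """Number of remapped sectors, analogous to a reallocated-sector count."""
        return len(self.remap)


drive = SpareRemapper(spare_count=4)
results = [drive.mark_bad(lba) for lba in (100, 2048, 4096, 9000, 12345)]
print(results)
print(drive.reallocated_count)
```

The first four defects are absorbed silently; the fifth has nowhere to
go, which is the point at which the host starts seeing errors.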

> 
>>> 194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       48
>>>     
>>
>> This is excessive, and may be contributing to problems.  A hard disk
>> running at 48C is not a good sign.  This should really be somewhere
>> between high 20s and mid 30s.
>>   
> 
> Yeah, this is a known problem with this drive... it's been running hot 
> for years.  I always figured it was due to the rotational speed increase 
> in commodity drives.

48C is high, but I wouldn't consider it excessive.  Drives that start 
generating "excessive" heat tend to fail shortly thereafter.  I do agree 
that the heat is probably shortening the lifespan of the drive.

> 
>>> Error 2 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
>>>   When the command that caused the error occurred, the device was in an unknown state.
>>> Error 1 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
>>>   When the command that caused the error occurred, the device was in an unknown state.
>>>     
>>
>> These are automated SMART log entries confirming the DMA failures.  The
>> fact that SMART saw them means that the disk is also aware of said
>> issues.  These may have been caused by the reallocated sectors.  It's
>> also interesting that the LBAs are different than the ones FreeBSD
>> reported issues with.
>>   
> 
> If that power on lifetime is accurate, that was at least a year ago... 
> but I can't find any documentation as to when the power-on lifetime 
> wraps or what it actually indicates.  I'm assuming that it is total 
> power on time since the drive was manufactured.  If it's total hours as 
> a 16-bit integer, it shouldn't wrap.  Is there a way of getting the 
> "current" power-on lifetime value that you're aware of?  That power on 
> minutes is interesting, but its current value is lower than the value at 
> the error (but higher than the power uptime of the system):
>  9 Power_On_Minutes        0x0032   219   219   000    Old_age   Always       -       1061h+40m
> 
> Also interesting is that after getting more errors from FreeBSD, I did 
> not get more errors in smartctl.
> 
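
On the power-on lifetime arithmetic, a quick Python check of the numbers
quoted above.  It assumes the counter is a raw unsigned 16-bit count of
hours, which nothing in this thread confirms; the function name is made
up for the example.

```python
HOURS_PER_DAY = 24


def split_hours(total_hours):
    """Split a power-on hour count into (days, leftover hours)."""
    return divmod(total_hours, HOURS_PER_DAY)


# The SMART error log reported 5171 hours.
days, hours = split_hours(5171)
print(days, hours)

# Assumption: an unsigned 16-bit hour counter wraps after 2**16 hours.
wrap_hours = 2 ** 16
wrap_years = wrap_hours / (HOURS_PER_DAY * 365.25)
print(round(wrap_years, 1))

# Attribute 9 currently reads 1061h+40m; convert it to minutes.
current_minutes = 1061 * 60 + 40
print(current_minutes)
```

Under that assumption the counter would take roughly seven and a half
years of continuous power-on time to wrap.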

The errors you're getting from FreeBSD are not directly related to
SMART.  The driver thinks that commands are timing out and that the
drive is becoming unresponsive.  Whether they actually are is another
question.  Given that this problem changes behavior with the version of
FreeBSD that you're running (and even happens in completely virtual
environments like VMware), I'm betting that it's a driver problem and not
a hardware problem.  Even so, you should probably think about migrating
your data off to a new drive sometime soon.

I'd like to attack these driver problems.  What I need is to spend a
couple of days with an affected system that can reliably reproduce the
problem, instrumenting and testing the driver.  I have a number of
theories about what might be going wrong, but nothing that I'm sure
about.  If you are willing to set up your system with remote power and
remote serial, and if we knew a reliable way to reproduce the problem,
I could probably have the problem identified and fixed pretty quickly.

Scott