From: Scott Long
Date: Wed, 27 Feb 2008 13:20:50 -0700
To: Stephen Hurd
Cc: Jeremy Chadwick, freebsd-stable@freebsd.org
Subject: Re: ad0 READ_DMA TIMEOUT errors on install of 7.0-RELEASE
Message-ID: <47C5C622.5000209@samsco.org>
In-Reply-To: <47C5ACD0.8000009@sasktel.net>
References: <47C52948.2070500@sasktel.net>
 <20080227121129.GA76419@eos.sc1.parodius.com>
 <47C5ACD0.8000009@sasktel.net>

Stephen Hurd wrote:
>>
>> This shows you've had 4 reallocated sectors, meaning your disk does in
>> fact have bad blocks.
>> In 90% of the cases out there, bad blocks
>> continue to "grow" over time, due to whatever reason (I remember reading
>> an article explaining it, but I can't for the life of me find the URL).
>>
>
> This is unusual now? I've always "known" that a small number of bad
> blocks is normal. Time to readjust my knowledge again?

Modern drives hide bad sectors by keeping a pool of spare tracks and
automatically remapping bad sectors to that pool. The problem arises
once the drive has aged enough that it has run out of spares.

>
>>> 194 Temperature_Celsius 0x0032 253 253 000 Old_age Always - 48
>>>
>>
>> This is excessive, and may be contributing to problems. A hard disk
>> running at 48C is not a good sign. This should really be somewhere
>> between high 20s and mid 30s.
>>
>
> Yeah, this is a known problem with this drive... it's been running hot
> for years. I always figured it was due to the rotational speed increase
> in commodity drives.

48C is high, but I wouldn't consider it excessive. Drives that start
generating "excessive" heat tend to fail shortly thereafter. I do agree
that the heat is probably shortening the lifespan of the drive.

>
>>> Error 2 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
>>> When the command that caused the error occurred, the device was in
>>> an unknown state.
>>> Error 1 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
>>> When the command that caused the error occurred, the device was in
>>> an unknown state.
>>>
>>
>> These are automated SMART log entries confirming the DMA failures. The
>> fact that SMART saw them means that the disk is also aware of said
>> issues. These may have been caused by the reallocated sectors. It's
>> also interesting that the LBAs are different than the ones FreeBSD
>> reported issues with.
>>
>
> If that power on lifetime is accurate, that was at least a year ago...
> but I can't find any documentation as to when the power-on lifetime
> wraps or what it actually indicates. I'm assuming that it is total
> power on time since the drive was manufactured. If it's total hours as
> a 16-bit integer, it shouldn't wrap. Is there a way of getting the
> "current" power-on lifetime value that you're aware of? That power on
> minutes is interesting, but its current value is lower than the value at
> the error (but higher than the power uptime of the system):
> 9 Power_On_Minutes 0x0032 219 219 000 Old_age Always - 1061h+40m
>
> Also interesting is that after getting more errors from FreeBSD, I did
> not get more errors in smartctl.
>

The errors you're getting from FreeBSD have nothing directly to do with
SMART. The driver thinks that commands are timing out and that the drive
is becoming unresponsive. Whether they actually are is another question.
Given that this problem changes behavior with the version of FreeBSD
that you're running (and even happens in completely virtual environments
like VMware), I'm betting that it's a driver problem and not a hardware
problem, though you should probably think about migrating your data off
to a new drive sometime soon.

I'd like to attack these driver problems. What I need is to spend a
couple of days with an affected system that can reliably reproduce the
problem, instrumenting and testing the driver. I have a number of
theories about what might be going wrong, but nothing that I'm
definitely sure about. If you are willing to set up your system with
remote power and remote serial, and if we knew a reliable way to
reproduce the problem, I could probably have the problem identified and
fixed pretty quickly.

Scott
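
The power-on arithmetic quoted above can be checked with a short Python
sketch. It assumes, as the quoted text does, that the lifetime counter is
an unsigned 16-bit count of whole hours; the thread does not confirm how
the drive actually stores it. The function names are illustrative only.

```python
# Sanity-check the power-on figures quoted in the thread.
# Assumption (not confirmed by the thread): the lifetime counter is an
# unsigned 16-bit count of whole hours.

HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365.25  # average year, including leap days


def split_hours(hours):
    """Return (days, leftover_hours) for a whole number of hours."""
    return divmod(hours, HOURS_PER_DAY)


def wrap_years(bits=16):
    """Years of continuous power-on time before a `bits`-wide hour counter wraps."""
    return (2 ** bits) / HOURS_PER_YEAR


if __name__ == "__main__":
    days, hours = split_hours(5171)
    print(f"5171 hours = {days} days + {hours} hours")
    print(f"A 16-bit hour counter wraps after about {wrap_years():.1f} years")
```

Under that assumption, 5171 hours matches the "215 days + 11 hours" in the
SMART log, and an hour counter would need roughly seven and a half years
of continuous power-on time before wrapping.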