From owner-freebsd-questions@FreeBSD.ORG  Mon Sep  5 17:19:14 2005
Date: Mon, 5 Sep 2005 13:19:12 -0400
From: Jason Morgan <jwm-freebsd@sentinelchicken.net>
To: FreeBSD Questions <freebsd-questions@freebsd.org>
Message-ID: <20050905171912.GF67960@sentinelchicken.net>
References: <20050905151332.P16924@saturn.araneidae.co.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20050905151332.P16924@saturn.araneidae.co.uk>
User-Agent: Mutt/1.4.2.1i
Subject: Re: Hard disk woes

On Mon, Sep 05, 2005 at 03:16:13PM +0000, Michael Abbott wrote:
> I'm having some very odd behaviour from one of my hard disks and I wonder
> what anybody makes of it.
> 
> In brief, the hard disk in question works just fine much of the time, but
> when high volume data transfers are requested I get the following in
> /var/log/messages:
> 
> Sep  3 15:21:02 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
> Sep  3 15:21:02 saturn /kernel: ata3: resetting devices .. done
> Sep  3 15:21:12 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
> Sep  3 15:21:12 saturn /kernel: ata3: resetting devices .. done
> Sep  3 15:21:23 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
> Sep  3 15:21:23 saturn /kernel: ata3: resetting devices .. done
> Sep  3 15:21:33 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
> Sep  3 15:21:33 saturn /kernel: ad6: trying fallback to PIO mode
> Sep  3 15:21:33 saturn /kernel: ata3: resetting devices .. done
> Sep  3 15:21:43 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
> Sep  3 15:21:43 saturn /kernel: ata3: resetting devices .. ata3-slave: ATA identify retries exceeded
> Sep  3 15:21:43 saturn /kernel: done
> 
> After this point the hard disk in question is frozen until I reboot, and
> any process that tries to touch it is similarly frozen (doesn't even
> respond to kill -9).  `shutdown -r` is enough to restore operation, and
> the rest of the system seemed happy enough.
> 
> Another interesting effect.  I placed a replacement hard disk on the same
> ATA bus (as a slave, device ad7) and tried copying files from ad6 to ad7.
> This time when ad6 froze and the kernel decided to give up on ata3 (and so
> decided to disable ad7 at the same time, naturally enough) the entire
> system froze!  No response from the console, stone cold dead, hard reset
> needed.
> 
> 
> So some questions seem to me to arise from this.
> 
> 1.  Why does FreeBSD handle this so ungracefully?  If restarting is
> sufficient to bring ata3 back then can't the ata driver do a proper
> restart?
> 
> 2.  Goodness me, FreeBSD froze!  I know it's a hardware failure, but
> still: it's on an auxiliary ATA controller with no system files attached.
> Is this problem of general interest?  It's certainly a massive hint to me
> not to consider (parallel) ATA for RAID!
> 
> 3.  Any thoughts on what is wrong with the hard disk in question?  I've
> changed ATA controllers, so it seems to be the disk, not the controller.
> The behaviour is very odd.  If I copy files off one at a time, eg using:
>  	find . -type f -exec cp {} "$TARGET/"{} \; -exec echo -n '.' \;
> the disk seems to hang in there, but if I just do
>  	cp -R . "$TARGET"
> then it freezes!  (This statement may not have been thoroughly tested:
> having to restart each time gets old quite quickly.)
> 
> 
> Ok, now for the boring bits.
> 
> $ uname -a
> FreeBSD saturn.araneidae.co.uk 4.11-RELEASE-p11 FreeBSD 4.11-RELEASE-p11 #6: Sat Aug 27 16:33:58 GMT 2005     root@saturn.araneidae.co.uk:/usr/obj/usr/src/sys/GENERIC  i386
> $ dmesg | grep ata
> atapci0: <HighPoint HPT370 ATA100 controller> port 0xa000-0xa0ff,0x9c00-0x9c03,0x9800-0x9807,0x9400-0x9403,0x9000-0x9007 irq 12 at device 11.0 on pci0
> ata2: at 0x9000 on atapci0
> ata3: at 0x9800 on atapci0
> atapci1: <VIA 8233 ATA133 controller> port 0xa800-0xa80f at device 17.1 on pci0
> ata0: at 0x1f0 irq 14 on atapci1
> ata1: at 0x170 irq 15 on atapci1
> atapci2: <HighPoint HPT372 ATA133 controller> port 0xc400-0xc4ff,0xc000-0xc003,0xbc00-0xbc07,0xb800-0xb803,0xb400-0xb407 irq 10 at device 19.0 on pci0
> ata4: at 0xb400 on atapci2
> ata5: at 0xbc00 on atapci2
> ad0: 39083MB <Maxtor 4D040H2> [79408/16/63] at ata0-master UDMA100
> ad1: 190782MB <SAMSUNG SP2014N> [387621/16/63] at ata0-slave UDMA133
> ad4: 76319MB <ST380021A> [155061/16/63] at ata2-master UDMA100
> ad6: 76319MB <ST380021A> [155061/16/63] at ata3-master UDMA100
> acd0: DVD-ROM <CREATIVEDVD-ROM DVD2240E 12/24/97> at ata1-master PIO4
> $ sudo atacontrol cap ata3 0
> ATA channel 3, Master, device ad6:
> 
> ATA/ATAPI revision    5
> device model          ST380021A
> serial number         3HV0MYL9
> firmware revision     3.10
> cylinders             16383
> heads                 16
> sectors/track         63
> lba supported         156301488 sectors
> lba48 not supported
> dma supported
> overlap not supported
> 
> Feature                      Support  Enable    Value   Vendor
> write cache                    yes      yes
> read ahead                     yes      yes
> dma queued                     no       no      0/00
> SMART                          yes      no
> microcode download             yes      yes
> security                       yes      no
> power management               yes      yes
> advanced power management      no       no      65278/FEFE
> automatic acoustic management  yes      yes     128/80  128/80
> $
> 
> That's everything I can think of.
> 

Just a general comment:

I had a very similar problem a while back. After replacing the drive in
question, then replacing the motherboard, I discovered it was a power
issue. The power supply was freaking out at medium to high loads, which
was causing the device to continually reset.
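
One rough way to compare things before and after swapping in a different
power supply is to count the command timeouts per device in the log. This
is only a minimal sketch, assuming the kernel messages look like the ones
quoted above; count_ata_timeouts is just a name I made up for illustration,
not an existing tool.

```shell
# Count ATA READ/WRITE command timeouts per device in a syslog-style file.
# Hypothetical helper; assumes lines of the form
#   "... /kernel: ad6: READ command timeout tag=0 serv=0 - resetting"
count_ata_timeouts() {
    awk '/ad[0-9]+: (READ|WRITE) command timeout/ {
        # Find the field holding the device name (e.g. "ad6:") and tally it.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^ad[0-9]+:$/) { sub(":", "", $i); n[$i]++; break }
    }
    END { for (d in n) print d, n[d] }' "$1"
}
```

Run it as `count_ata_timeouts /var/log/messages`; it prints one line per
device with the device name followed by the number of timeouts seen.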

Jason