From: Jason Morgan
To: FreeBSD Questions
Date: Mon, 5 Sep 2005 13:19:12 -0400
Subject: Re: Hard disk woes
Message-ID: <20050905171912.GF67960@sentinelchicken.net>
In-Reply-To: <20050905151332.P16924@saturn.araneidae.co.uk>

On Mon, Sep 05, 2005 at 03:16:13PM +0000, Michael Abbott wrote:
> I'm having some very odd behaviour from one of my hard disks and I wonder
> what anybody makes of it.
>
> In brief, the hard disk in question works just fine much of the time, but
> when high volume data transfers are requested I get the following in
> /var/log/messages:
>
> Sep 3 15:21:02 saturn /kernel: ad6: READ command timeout tag=0 serv=0 -
> resetting
> Sep 3 15:21:02 saturn /kernel: ata3: resetting devices .. done
> Sep 3 15:21:12 saturn /kernel: ad6: READ command timeout tag=0 serv=0 -
> resetting
> Sep 3 15:21:12 saturn /kernel: ata3: resetting devices .. done
> Sep 3 15:21:23 saturn /kernel: ad6: READ command timeout tag=0 serv=0 -
> resetting
> Sep 3 15:21:23 saturn /kernel: ata3: resetting devices .. done
> Sep 3 15:21:33 saturn /kernel: ad6: READ command timeout tag=0 serv=0 -
> resetting
> Sep 3 15:21:33 saturn /kernel: ad6: trying fallback to PIO mode
> Sep 3 15:21:33 saturn /kernel: ata3: resetting devices .. done
> Sep 3 15:21:43 saturn /kernel: ad6: READ command timeout tag=0 serv=0 -
> resetting
> Sep 3 15:21:43 saturn /kernel: ata3: resetting devices .. ata3-slave: ATA
> identify retries exceeded
> Sep 3 15:21:43 saturn /kernel: done
>
> After this point the hard disk in question is frozen until I reboot, and
> any process that tries to touch it is similarly frozen (doesn't even
> respond to kill -9). `shutdown -r` is enough to restore operation, and
> the rest of the system seemed happy enough.
>
> Another interesting effect. I placed a replacement hard disk on the same
> ATA bus (as a slave, device ad7) and tried copying files from ad6 to ad7.
> This time when ad6 froze and the kernel decided to give up on ata3 (and so
> decided to disable ad7 at the same time, naturally enough) the entire
> system froze! No response from the console, stone cold dead, hard reset
> needed.
>
>
> So some questions seem to me to arise from this.
>
> 1. Why does FreeBSD handle this so ungracefully? If restarting is
> sufficient to bring ata3 back then can't the ata driver do a proper
> restart?
>
> 2. Goodness me, FreeBSD froze!
> I know it's a hardware failure, but
> still: it's on an auxiliary ATA controller with no system files attached.
> Is this problem of general interest? It's certainly a massive hint to me
> not to consider (parallel) ATA for RAID!
>
> 3. Any thoughts on what is wrong with the hard disk in question? I've
> changed ATA controllers, so it seems to be the disk, not the controller.
> The behaviour is very odd. If I copy files off one at a time, e.g. using:
>     find . -type f -exec cp {} "$TARGET/"{} \; -exec echo -n '.' \;
> the disk seems to hang in there, but if I just do
>     cp -R . "$TARGET"
> then it freezes! (This statement may not have been thoroughly tested:
> having to restart each time gets old quite quickly.)
>
>
> Ok, now for the boring bits.
>
> $ uname -a
> FreeBSD saturn.araneidae.co.uk 4.11-RELEASE-p11 FreeBSD 4.11-RELEASE-p11
> #6: Sat Aug 27 16:33:58 GMT 2005
> root@saturn.araneidae.co.uk:/usr/obj/usr/src/sys/GENERIC i386
> $ dmesg | grep ata
> atapci0: port
> 0xa000-0xa0ff,0x9c00-0x9c03,0x9800-0x9807,0x9400-0x9403,0x9000-0x9007 irq
> 12 at device 11.0 on pci0
> ata2: at 0x9000 on atapci0
> ata3: at 0x9800 on atapci0
> atapci1: port 0xa800-0xa80f at device 17.1 on
> pci0
> ata0: at 0x1f0 irq 14 on atapci1
> ata1: at 0x170 irq 15 on atapci1
> atapci2: port
> 0xc400-0xc4ff,0xc000-0xc003,0xbc00-0xbc07,0xb800-0xb803,0xb400-0xb407 irq
> 10 at device 19.0 on pci0
> ata4: at 0xb400 on atapci2
> ata5: at 0xbc00 on atapci2
> ad0: 39083MB [79408/16/63] at ata0-master UDMA100
> ad1: 190782MB [387621/16/63] at ata0-slave UDMA133
> ad4: 76319MB [155061/16/63] at ata2-master UDMA100
> ad6: 76319MB [155061/16/63] at ata3-master UDMA100
> acd0: DVD-ROM at ata1-master PIO4
> $ sudo atacontrol cap ata3 0
> ATA channel 3, Master, device ad6:
>
> ATA/ATAPI revision    5
> device model          ST380021A
> serial number         3HV0MYL9
> firmware revision     3.10
> cylinders             16383
> heads                 16
> sectors/track         63
> lba supported         156301488 sectors
> lba48 not supported   dma supported
> overlap not supported
>
> Feature                        Support  Enable  Value       Vendor
> write cache                    yes      yes
> read ahead                     yes      yes
> dma queued                     no       no      0/00
> SMART                          yes      no
> microcode download             yes      yes
> security                       yes      no
> power management               yes      yes
> advanced power management      no       no      65278/FEFE
> automatic acoustic management  yes      yes     128/80      128/80
> $
>
> That's everything I can think of.
>

Just a general comment: I had a very similar problem a while back. After
replacing the drive in question, then replacing the motherboard, I
discovered it was a power issue. The power supply was freaking out at
medium to high loads, which was causing the device to continually reset.

Jason
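
For anyone wanting to see how often a drive resets under different loads
(for instance before and after swapping the power supply), here is a
minimal sketch that tallies the command timeout messages per device. It
assumes the /var/log/messages format quoted above with each entry on a
single unwrapped line. Only READ timeouts appear in the quoted log; matching
WRITE as well is an assumption, and the names count_timeouts and
count_timeouts_in_file are made up for this illustration.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   Sep 3 15:21:02 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
# WRITE is an assumption; only READ appears in the quoted log.
TIMEOUT_RE = re.compile(r"/kernel: (ad\d+): (?:READ|WRITE) command timeout")


def count_timeouts(lines):
    """Return a Counter mapping device name (e.g. 'ad6') to timeout count."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_timeouts_in_file(path="/var/log/messages"):
    """Tally timeouts in a log file; undecodable bytes are replaced."""
    with open(path, errors="replace") as log:
        return count_timeouts(log)
```

Calling count_timeouts_in_file() after a test copy and comparing the tally
for ad6 between runs gives a rough measure of whether a hardware change made
any difference.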