From: Michael Abbott
To: FreeBsd Questions
Date: Mon, 5 Sep 2005 15:16:13 +0000 (GMT)
Message-ID: <20050905151332.P16924@saturn.araneidae.co.uk>
Subject: Hard disk woes

I'm having some very odd behaviour from one of my hard disks and I wonder
what anybody makes of it.

In brief, the hard disk in question works just fine much of the time, but
when high volume data transfers are requested I get the following in
/var/log/messages:

Sep 3 15:21:02 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
Sep 3 15:21:02 saturn /kernel: ata3: resetting devices .. done
Sep 3 15:21:12 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
Sep 3 15:21:12 saturn /kernel: ata3: resetting devices .. done
Sep 3 15:21:23 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
Sep 3 15:21:23 saturn /kernel: ata3: resetting devices .. done
Sep 3 15:21:33 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
Sep 3 15:21:33 saturn /kernel: ad6: trying fallback to PIO mode
Sep 3 15:21:33 saturn /kernel: ata3: resetting devices .. done
Sep 3 15:21:43 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
Sep 3 15:21:43 saturn /kernel: ata3: resetting devices .. ata3-slave: ATA identify retries exceeded
Sep 3 15:21:43 saturn /kernel: done

After this point the hard disk in question is frozen until I reboot, and
any process that tries to touch it is similarly frozen (it doesn't even
respond to kill -9). `shutdown -r` is enough to restore operation, and the
rest of the system seemed happy enough.

Another interesting effect. I placed a replacement hard disk on the same
ATA bus (as a slave, device ad7) and tried copying files from ad6 to ad7.
This time, when ad6 froze and the kernel decided to give up on ata3 (and so
decided to disable ad7 at the same time, naturally enough), the entire
system froze! No response from the console, stone cold dead, hard reset
needed.

So some questions seem to me to arise from this.

1. Why does FreeBSD handle this so ungracefully? If restarting is
   sufficient to bring ata3 back then can't the ata driver do a proper
   restart?

2. Goodness me, FreeBSD froze! I know it's a hardware failure, but still:
   it's on an auxiliary ATA controller with no system files attached. Is
   this problem of general interest? It's certainly a massive hint to me
   not to consider (parallel) ATA for RAID!

3. Any thoughts on what is wrong with the hard disk in question? I've
   changed ATA controllers, so it seems to be the disk, not the
   controller.

   The behaviour is very odd. If I copy files off one at a time, e.g.
   using:

       find . -type f -exec cp {} "$TARGET/"{} \; -exec echo -n '.' \;

   the disk seems to hang in there, but if I just do

       cp -R . "$TARGET"

   then it freezes! (This statement may not have been thoroughly tested:
   having to restart each time gets old quite quickly.)

OK, now for the boring bits.

$ uname -a
FreeBSD saturn.araneidae.co.uk 4.11-RELEASE-p11 FreeBSD 4.11-RELEASE-p11 #6: Sat Aug 27 16:33:58 GMT 2005 root@saturn.araneidae.co.uk:/usr/obj/usr/src/sys/GENERIC i386

$ dmesg | grep ata
atapci0: port 0xa000-0xa0ff,0x9c00-0x9c03,0x9800-0x9807,0x9400-0x9403,0x9000-0x9007 irq 12 at device 11.0 on pci0
ata2: at 0x9000 on atapci0
ata3: at 0x9800 on atapci0
atapci1: port 0xa800-0xa80f at device 17.1 on pci0
ata0: at 0x1f0 irq 14 on atapci1
ata1: at 0x170 irq 15 on atapci1
atapci2: port 0xc400-0xc4ff,0xc000-0xc003,0xbc00-0xbc07,0xb800-0xb803,0xb400-0xb407 irq 10 at device 19.0 on pci0
ata4: at 0xb400 on atapci2
ata5: at 0xbc00 on atapci2
ad0: 39083MB [79408/16/63] at ata0-master UDMA100
ad1: 190782MB [387621/16/63] at ata0-slave UDMA133
ad4: 76319MB [155061/16/63] at ata2-master UDMA100
ad6: 76319MB [155061/16/63] at ata3-master UDMA100
acd0: DVD-ROM at ata1-master PIO4

$ sudo atacontrol cap ata3 0
ATA channel 3, Master, device ad6:

ATA/ATAPI revision    5
device model          ST380021A
serial number         3HV0MYL9
firmware revision     3.10
cylinders             16383
heads                 16
sectors/track         63
lba supported         156301488 sectors
lba48 not supported
dma supported
overlap not supported

Feature                        Support  Enable  Value       Vendor
write cache                    yes      yes
read ahead                     yes      yes
dma queued                     no       no      0/00
SMART                          yes      no
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no      65278/FEFE
automatic acoustic management  yes      yes     128/80      128/80
$

That's everything I can think of.
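
For anyone wanting to reproduce the one-file-at-a-time copy from question 3, it can be sketched as a small shell function. This is only an illustration under stated assumptions, not a fix for the failing disk: the name copy_one_by_one and the pause between files are hypothetical and not part of the original message, and unlike the find one-liner it creates any missing directories under the target before each copy.

```shell
#!/bin/sh
# Hypothetical helper (not from the original message): copy every regular
# file under SRC to TARGET one at a time, sleeping between files so the
# source disk is never asked for one long burst of reads.
# Limitation: file names containing newlines are not handled.
copy_one_by_one() {
    src=$1
    pause=${3:-1}                 # seconds between files (assumed default)

    # Make the target an absolute path, since we cd into the source below.
    mkdir -p "$2" || return 1
    target=$(cd "$2" && pwd) || return 1

    # Subshell so the cd does not change the caller's working directory.
    (
        cd "$src" || exit 1
        find . -type f -print | while IFS= read -r file; do
            # Recreate the file's directory under the target, then copy it.
            if mkdir -p "$target/$(dirname "$file")" &&
               cp -p "$file" "$target/$file"; then
                printf '.'
            else
                echo "failed: $file" >&2
            fi
            sleep "$pause"
        done
        echo
    )
}
```

With hypothetical mount points, `copy_one_by_one /mnt/ad6 /mnt/ad7 1` would copy the tree with a one second pause after each file and report any file that could not be read on stderr.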