From owner-freebsd-questions@FreeBSD.ORG  Sun Jul 24 23:56:13 2011
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Message-ID: <4E2CB114.3060408@netfence.it>
Date: Mon, 25 Jul 2011 01:56:04 +0200
From: Andrea Venturoli <ml@netfence.it>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; it-IT;
	rv:1.9.2.18) Gecko/20110711 Thunderbird/3.1.11
MIME-Version: 1.0
To: freebsd-questions@freebsd.org
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Scanned-By: MIMEDefang 2.71 on 10.1.2.13
Subject: ATA troubles

Hello everyone.

For those interested, this post is a follow-up to:
http://www.mailinglistarchive.com/html/freebsd-questions%40freebsd.org/2011-06/msg00018.html
However, I'll summarize it here.


At the beginning of June, I installed two WD 1TB Caviar Green SATA 
drives in an Intel-S5000-based production box of mine and it was hell!
This server runs FreeBSD 7.3/i386 off a SAS RAID, and the two new drives 
were meant to form a gstripe volume for secondary storage.
I started getting:
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing
> request directly
> ad4: WARNING - SMART taskqueue timeout - completing request directly
> ad8: WARNING - SMART taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing
> request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing
> request directly
> ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing
> request directly
> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request
> directly
and the box would reboot within minutes.
This also prevented me from running tests with smartctl.
Note that the box previously had a single SATA drive that worked perfectly.

It was suggested that I run wdidle.exe from DOS to prevent the drives 
from spinning down, and it helped: I was then at least able to fsck the 
stripe and copy something onto it.
Still, I kept getting the above messages; the drives would also 
occasionally hang and then restart. Uptime rose to a few hours, but the 
box would still reboot.

In the meantime the drives went bad (confirmed by smartd, the BIOS and 
the WD tools) and I had them replaced.

When they came back, I decided to set up a test box: the hardware is 
completely different from the production box, but FreeBSD again runs 
from a SCSI drive and the two WD drives form an additional stripe.
First I ran the WD tools to check the drives and they passed every test 
(including the long one).

Then I installed FreeBSD 7.3/i386 and smartctl, and verified the disks again.

I created the stripe, fscked it, and copied about 420GB of data via 
rsync over NFS. It seemed to work fine, but, after about 15 hours, the 
box rebooted after:
> ad6: FAILURE - device detached
> g_vfs_done():stripe/backup[WRITE(offset=1709926940672, length=131072)]error = 6
> /mnt/local: got error 6 while accessing filesystem
> panic: softdep_deallocate_dependencies: unrecovered I/O error

Subsequent retries always gave the same result, until I disabled 
softupdates on the stripe. I was then able to complete the rsync.

Not quite satisfied, I made a local-to-local copy and started getting a lot of:
> Jul 24 18:54:28 mydavid kernel: ad4: WARNING - READ_DMA48 UDMA ICRC error (retrying request) LBA=1620416000
> Jul 24 18:54:28 mydavid kernel: ad4: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=1620416000
> Jul 24 18:54:28 mydavid kernel: g_vfs_done():stripe/backup[READ(offset=1659305967616, length=131072)]error = 5
> Jul 24 18:54:42 mydavid kernel: ad6: WARNING - READ_DMA48 UDMA ICRC error (retrying request) LBA=1621920384
> Jul 24 18:54:42 mydavid kernel: ad6: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=1621920384
> Jul 24 18:54:42 mydavid kernel: g_vfs_done():stripe/backup[READ(offset=1660846522368, length=131072)]error = 5
I ran smartctl's short test on both drives and they were OK; I tried the 
offline test, but it got interrupted on both (???).
In spite of the messages above, the copy looked like it was working...

However, I was logged in via ssh and had to turn off the client, so I 
stopped the copy, went to the console and started it again.
Now it looks like one drive is no longer working properly...
> Jul 24 23:48:36 mydavid kernel: ad6: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=1671887488
> Jul 24 23:48:36 mydavid kernel: g_vfs_done():stripe/backup[READ(offset=1712012836864, length=131072)]error = 5
> Jul 24 23:48:39 mydavid kernel: ad6: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=1671897856
> Jul 24 23:48:39 mydavid kernel: g_vfs_done():stripe/backup[READ(offset=1712023420928, length=131072)]error = 5
> Jul 24 23:48:41 mydavid kernel: ad6: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=1671897888
> Jul 24 23:48:41 mydavid kernel: g_vfs_done():stripe/backup[READ(offset=1712023486464, length=131072)]error = 5
Also, smartd is complaining:
> Jul 24 23:41:59 mydavid smartd[2630]: Device: /dev/ad6, 38 Currently unreadable (pending) sectors
> Jul 24 23:50:56 mydavid smartd[538]: Device: /dev/ad6, 39 Currently unreadable (pending) sectors
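
For anyone who wants to cross-check the FAILURE lines above, here is a 
minimal Python sketch that decodes the hex status= and error= values and 
counts failures per drive. It assumes the standard ATA status/error 
register bit layout; the names for bits that do not appear in my logs 
(DF, DRQ, CORR, INDEX, MC, MCR, NM, ILI) are just my own labels, and the 
helper names decode() and summarize() are made up for this example.

```python
import re
from collections import Counter

# ATA status register bits (assumed standard layout; labels for bits
# not seen in the logs above are illustrative only).
STATUS_BITS = {
    0x80: "BUSY", 0x40: "READY", 0x20: "DF", 0x10: "DSC",
    0x08: "DRQ", 0x04: "CORR", 0x02: "INDEX", 0x01: "ERROR",
}

# ATA error register bits (same caveat as above).
ERROR_BITS = {
    0x80: "ICRC", 0x40: "UNCORRECTABLE", 0x20: "MC", 0x10: "NID_NOT_FOUND",
    0x08: "MCR", 0x04: "ABORTED", 0x02: "NM", 0x01: "ILI",
}

# Matches kernel lines such as:
#   ad6: FAILURE - READ_DMA48 status=51<...> error=40<...> LBA=1671887488
FAILURE_RE = re.compile(
    r"(?P<dev>ad\d+): FAILURE - (?P<cmd>\S+) "
    r"status=(?P<status>[0-9a-fA-F]+)<[^>]*> "
    r"error=(?P<error>[0-9a-fA-F]+)<[^>]*> "
    r"LBA=(?P<lba>\d+)"
)


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if value & bit]


def summarize(lines):
    """Count FAILURE lines per (device, decoded error flags)."""
    counts = Counter()
    for line in lines:
        m = FAILURE_RE.search(line)
        if m:
            flags = tuple(decode(int(m.group("error"), 16), ERROR_BITS))
            counts[(m.group("dev"), flags)] += 1
    return counts


if __name__ == "__main__":
    # Path is an assumption; adjust to wherever the kernel messages go.
    with open("/var/log/messages") as log:
        for (dev, flags), n in sorted(summarize(log).items()):
            print(dev, ",".join(flags), n)
```

With this, status=51 decodes to READY, DSC, ERROR, matching what the 
kernel prints, and the two error values seen so far (10 and 40) come out 
as NID_NOT_FOUND and UNCORRECTABLE respectively.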