From owner-freebsd-stable@FreeBSD.ORG  Thu Jul 18 08:25:27 2013
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 147B58BF;
 Thu, 18 Jul 2013 08:25:27 +0000 (UTC) (envelope-from rb@gid.co.uk)
Received: from mx0.gid.co.uk (mx0.gid.co.uk [194.32.164.250])
 by mx1.freebsd.org (Postfix) with ESMTP id C0550E7B;
 Thu, 18 Jul 2013 08:25:26 +0000 (UTC)
Received: from [194.32.164.26] (80-46-130-69.static.dsl.as9105.com
 [80.46.130.69])
 by mx0.gid.co.uk (8.14.2/8.14.2) with ESMTP id r6I8POHj066332;
 Thu, 18 Jul 2013 09:25:25 +0100 (BST) (envelope-from rb@gid.co.uk)
Subject: Re: Drive failures with ada on FreeBSD-9.1,
 driver bug or wiring issue?
Mime-Version: 1.0 (Apple Message framework v1283)
Content-Type: text/plain; charset=windows-1252
From: Bob Bishop <rb@gid.co.uk>
In-Reply-To: <60F7BE75-5E2F-471E-A9CE-AF4CD17D96E2@karthauser.co.uk>
Date: Thu, 18 Jul 2013 09:25:19 +0100
Content-Transfer-Encoding: quoted-printable
Message-Id: <281DBD06-81D5-4DDD-9464-B96C80C22C3F@gid.co.uk>
References: <20130716225013.1C63B23A@babel.karthauser.co.uk>
 <60F7BE75-5E2F-471E-A9CE-AF4CD17D96E2@karthauser.co.uk>
To: Dr Josef Karthauser <joe@karthauser.co.uk>
X-Mailer: Apple Mail (2.1283)
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>,
 "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 18 Jul 2013 08:25:27 -0000

Hi,

On 18 Jul 2013, at 08:29, Dr Josef Karthauser wrote:

> Hi there,
>
> I'm scratching my head. I've just migrated to a Supermicro chassis
> and at the same time gone from FreeBSD 9.0 to 9.1-RELEASE.
>
> The machine in question is running a ZFS mirror configuration on two
> ada devices (with an 8 GB gmirror carved out for swap).
>
> Since doing so I've been having strange dropouts on the drives; they
> just disappear from the bus like so:
>
> (ada2:ahcich2:0:0:0): removing device entry
> (aprobe0:ahcich2:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error
> (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
> (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
> (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted
> (aprobe0:ahcich2:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error
> (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
> (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
> (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted
>
>
> At first I thought it was a failing drive: one of the drives did
> this, and I limped along on a single drive for a week until I could
> get someone up to the rack to plug a third drive in. We resilvered
> the zpool onto the new device and ran with the failed drive still
> plugged in (but not responding to a reset on the ada bus with
> camcontrol) for a week or so.
>
> Then the new drive dropped out in exactly the same way, followed in
> short order by the remaining original drive!
>
> After rebooting the machine and observing all three drives probing
> and available, I resilvered the gmirror and zpool again onto the two
> devices that I thought were reliable, but before the resilvering had
> completed the new drive dropped out again.
>
> I'm scratching my head now. I can't imagine that it's a wiring
> problem, as they are all on individual SATA buses and individually
> cabled.
>
> SMART isn't reporting any drive issues either... :/
>
> So I'm wondering: is it a driver issue with 9.1-RELEASE? If I
> upgrade to 9-RELENG, would I expect that to resolve the problem?
> (Have there been any ada bus issues reported since last December?)
>
> The hardware in question is:
>
> ahci0: <Intel Cougar Point AHCI SATA controller> port 0xf050-0xf057,0xf040-0xf043,0xf030-0xf037,0xf020-0xf023,0xf000-0xf01f mem 0xdfb02000-0xdfb027ff irq 19 at device 31.2 on pci0
> ahci0: AHCI v1.30 with 6 3Gbps ports, Port Multiplier not supported
> ahcich0: <AHCI channel> at channel 0 on ahci0
> ahcich1: <AHCI channel> at channel 1 on ahci0
> ahcich2: <AHCI channel> at channel 2 on ahci0
> ahcich3: <AHCI channel> at channel 3 on ahci0
> ahcich4: <AHCI channel> at channel 4 on ahci0
> ahcich5: <AHCI channel> at channel 5 on ahci0
> ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
> ada0: <WDC WD1000FYPS-01ZKB0 02.01B01> ATA-8 SATA 2.x device
> ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> ada0: Command Queueing enabled
> ada0: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
> ada0: Previously was known as ad4
> ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
> ada1: <WDC WD1000FYPS-01ZKB0 02.01B01> ATA-8 SATA 2.x device
> ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> ada1: Command Queueing enabled
> ada1: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
> ada1: Previously was known as ad6
> ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
> ada2: <WDC WD1000FYPS-01ZKB0 02.01B01> ATA-8 SATA 2.x device
> ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> ada2: Command Queueing enabled
> ada2: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
> ada2: Previously was known as ad8
>
>
> Any ideas would be greatly welcomed.
>
> Thanks,
> Joe

Me too (over a long period, with various hardware).

A general problem with energy-saving drives is that controllers don't
understand them. Typically the drive decides to go into some
power-saving mode, the controller wants to do some operation, the drive
takes too long to come ready, and the controller decides the drive has
gone away.
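
For what it's worth, the status and error bytes in the quoted log can be
decoded by hand. Below is a minimal Python sketch, assuming the
conventional ATA status and error register bit layout. The table and
function names are made up for illustration, and only the bit names that
actually appear in the log (BSY, DRDY, SERV, ERR, ABRT) are confirmed by
this thread.

```python
# Conventional ATA status register bits (assumed layout; only BSY, DRDY,
# SERV and ERR are confirmed by the log quoted in this thread).
ATA_STATUS_BITS = [
    (0x80, "BSY"),
    (0x40, "DRDY"),
    (0x20, "DF"),
    (0x10, "SERV"),
    (0x08, "DRQ"),
    (0x04, "CORR"),
    (0x02, "IDX"),
    (0x01, "ERR"),
]

# Conventional ATA error register bits (assumed layout; only ABRT is
# confirmed by the log quoted in this thread).
ATA_ERROR_BITS = [
    (0x80, "ICRC"),
    (0x40, "UNC"),
    (0x20, "MC"),
    (0x10, "IDNF"),
    (0x08, "MCR"),
    (0x04, "ABRT"),
    (0x02, "NM"),
    (0x01, "ILI"),
]


def decode_bits(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


if __name__ == "__main__":
    # Values taken from the "ATA status: d1 ..., error: 04" log line.
    print("status:", " ".join(decode_bits(0xD1, ATA_STATUS_BITS)))
    print("error: ", " ".join(decode_bits(0x04, ATA_ERROR_BITS)))
```

BSY being set is at least consistent with a drive that had not come
ready when the controller probed it, though on its own it doesn't prove
the power-saving theory.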

You have to persuade the controller to wait longer for the drive to come
ready, and/or persuade the drive to stay awake. This isn't necessarily
easy, e.g. the controller's ready wait may not be programmable.

(Or avoid such drives like the plague; life's too short.)

--
Bob Bishop
rb@gid.co.uk