From owner-freebsd-questions@FreeBSD.ORG  Sun Jan 11 10:24:33 2015
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id C2368D37
 for <freebsd-questions@freebsd.org>; Sun, 11 Jan 2015 10:24:33 +0000 (UTC)
Received: from mail.unitedinsong.com.au (mail.unitedinsong.com.au
 [150.101.178.33])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 343F9F9E
 for <freebsd-questions@freebsd.org>; Sun, 11 Jan 2015 10:24:32 +0000 (UTC)
Received: from laptop2.herveybayaustralia.com.au
 (laptop2.herveybayaustralia.com.au [192.168.0.185])
 (using TLSv1 with cipher ECDHE-RSA-AES128-SHA (128/128 bits))
 (No client certificate requested)
 by mail.unitedinsong.com.au (Postfix) with ESMTPSA id A7F39273AA
 for <freebsd-questions@freebsd.org>; Sun, 11 Jan 2015 20:24:20 +1000 (EST)
Message-ID: <54B24F53.4080904@herveybayaustralia.com.au>
Date: Sun, 11 Jan 2015 20:24:19 +1000
From: Da Rock <freebsd-questions@herveybayaustralia.com.au>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:24.0) Gecko/20100101 Thunderbird/24.3.0
MIME-Version: 1.0
To: freebsd-questions@freebsd.org
Subject: [SOLVED] Re: ZFS replacing drive issues
References: <54A9D9E6.2010008@herveybayaustralia.com.au>
 <54A9E3CC.1010009@hiwaay.net> <54AB25A7.4040901@herveybayaustralia.com.au>
 <20150109191850.GA58984@vash.rhavenn.local>
In-Reply-To: <20150109191850.GA58984@vash.rhavenn.local>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-questions>, 
 <mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions/>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
 <mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 11 Jan 2015 10:24:33 -0000

On 01/10/15 05:18, Henrik Hudson wrote:
> On Tue, 06 Jan 2015, Da Rock wrote:
>
>> On 05/01/2015 11:07, William A. Mahaffey III wrote:
>>> On 01/04/15 18:25, Da Rock wrote:
>>>> I haven't seen anything specifically on this when googling, but I'm
>>>> having a strange issue in replacing a degraded drive in ZFS.
>>>>
>>>> The drive has been REMOVED from ZFS pool, and so I ran 'zpool replace
>>>> <pool> <old device> <new device>'. This normally just works, and I
>>>> have checked that I have removed the correct drive via serial number.
>>>>
>>>> After resilvering, it still shows that it is in a degraded state, and
>>>> that the old and the new drive have been REMOVED.
>>>>
>>>> No matter what I do, I can't seem to get the zfs system online and in
>>>> a good state.
>>>>
>>>> I'm running a raidz1 on 9.1 and zfs is v28.
>>>>
>>>> Cheers
>>>> _______________________________________________
>>>> freebsd-questions@freebsd.org mailing list
>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-questions
>>>> To unsubscribe, send any mail to
>>>> "freebsd-questions-unsubscribe@freebsd.org"
>>>>
>>> Someone posted a similar problem a few weeks ago; rebooting fixed it
>>> for them (as opposed to trying to get zfs to fix itself w/ management
>>> commands), might try that if feasible .... $0.02, no more, no less ....
>>>
>> Sorry, that didn't work unfortunately. I had to wait a bit until I could
>> do it between it trying to resilver and workload. It came online at
>> first, but then went back to removed when I checked again later.
>>
>> Any other diags I can do? I've already run smartctl on all the drives
>> (5hrs+) and they've come back clean. There's not much to go on in the
>> logs either. Do a small number of drives just naturally error when
>> placed in a raid or something?
> a) try a 'zpool clear' to perhaps force it to clear errors, but to
> be safe I'd still do "c" below.
>
> b) Did you physically remove the old drive and replace it and then
> run a zpool replace? Did the devices have the same device ID or did
> you use GPT ids?
>
> c) If it's a mirror try just removing the device, zpool remove pool
> device and then re-attaching it via zpool attach.
>
> henrik
>
Thanks for that info, I'll try it next time.
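
For anyone finding this in the archives, the replace-then-clear sequence 
discussed above can be sketched roughly as follows. The pool name "tank" 
and the device names ada3/ada4 are placeholders (assumptions), not my 
actual layout, and the script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only. POOL, OLD and NEW are placeholder names (assumptions).
POOL="${POOL:-tank}"
OLD="${OLD:-ada3}"
NEW="${NEW:-ada4}"

# Print the command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

replace_disk() {
    run zpool status -v "$POOL"              # see which device is degraded/removed
    run zpool replace "$POOL" "$OLD" "$NEW"  # start resilvering onto the new disk
    run zpool status -v "$POOL"              # re-check while the resilver runs
}

clear_errors() {
    run zpool clear "$POOL"                  # suggestion (a): clear device errors
    run zpool status -x "$POOL"              # short health summary
}

replace_disk
clear_errors
```

Running it without RUN=1 first is a cheap way to double-check the device 
names before touching the pool.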

Meanwhile, I had to spend more than a few hours (about 2 days actually - 
each test takes 5+ hours, and it had some hissy fits, some of which 
occurred at about 90% completion, the little #$%!) going through the 
drives and running tests using smartctl and the vendor's tools. Turns out 
I had a DOA, but with a twist: using smartctl, the test would run on 
other drives, maybe up to 50%, and then stop and say the test failed. On 
the DOA it would pass.
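
The smartctl side of that testing looks roughly like the sketch below. 
The device names are assumptions (substitute your own), and the script 
only prints the commands unless RUN=1 is set. The long self-test runs in 
the background on the drive itself, so the results have to be read back 
from the self-test log once it has finished.

```shell
#!/bin/sh
# Sketch only. The device list is an assumption; adjust it to your system.
DISKS="${DISKS:-/dev/ada0 /dev/ada1}"

# Print the command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Check overall health and start the extended (long) self-test on each disk.
smart_check() {
    for d in $DISKS; do
        run smartctl -H "$d"
        run smartctl -t long "$d"
    done
}

# Read back the self-test log once the tests have had time to finish.
smart_results() {
    for d in $DISKS; do
        run smartctl -l selftest "$d"
    done
}

smart_check
smart_results
```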

I then turned to the vendor tools, and ran through each drive (I had 8 
to test amongst my lot, as I got more than a bit curious/suspicious about 
what was happening overall). I tried testing them all in one machine, but 
they interfered with one another, so I needed to test each individually 
and try to save the result (a tricky one given the ridiculous tools 
supplied; I know a good tradesman never blames his tools, but take Windows 
for example... :) ). Once that was all sorted (24 hours' work later), I 
found the DOA drive for my raid would pass a simple test, go through 
maybe 50% of the longer test, and then come up with a failed test - but 
with absolutely no error code (one is expected).

So it was a bit of an odd duck. As a general rule I find the vendor 
rather good and support is second to none, but the drives aren't exactly 
top dollar either so I have no complaints - but this did send me into a 
bit of a spin. At least the experience has been enlightening :)

For reference, smartctl and such aren't taken seriously by vendors. They 
will accept a return if SMART has been tripped (failed health test), but 
other than that you need to use their tools for diags. Maybe not news to 
some, but there's a lot of fluff out there that says otherwise.

Thanks again for the pointers, guys!