Message-ID: <54B24F53.4080904@herveybayaustralia.com.au>
Date: Sun, 11 Jan 2015 20:24:19 +1000
From: Da Rock
To: freebsd-questions@freebsd.org
Subject: [SOLVED] Re: ZFS replacing drive issues
References: <54A9D9E6.2010008@herveybayaustralia.com.au> <54A9E3CC.1010009@hiwaay.net> <54AB25A7.4040901@herveybayaustralia.com.au> <20150109191850.GA58984@vash.rhavenn.local>
In-Reply-To: <20150109191850.GA58984@vash.rhavenn.local>

On 01/10/15 05:18, Henrik Hudson wrote:
> On Tue, 06 Jan 2015, Da Rock wrote:
>
>> On 05/01/2015 11:07, William A. Mahaffey III wrote:
>>> On 01/04/15 18:25, Da Rock wrote:
>>>> I haven't seen anything specifically on this when googling, but I'm
>>>> having a strange issue in replacing a degraded drive in ZFS.
>>>>
>>>> The drive has been REMOVED from ZFS pool, and so I ran 'zpool replace
>>>> '. This normally just works, and I
>>>> have checked that I have removed the correct drive via serial number.
>>>>
>>>> After resilvering, it still shows that it is in a degraded state, and
>>>> that the old and the new drive have been REMOVED.
>>>>
>>>> No matter what I do, I can't seem to get the zfs system online and in
>>>> a good state.
>>>>
>>>> I'm running a raidz1 on 9.1 and zfs is v28.
>>>>
>>>> Cheers
>>>> _______________________________________________
>>>> freebsd-questions@freebsd.org mailing list
>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-questions
>>>> To unsubscribe, send any mail to
>>>> "freebsd-questions-unsubscribe@freebsd.org"
>>>>
>>> Someone posted a similar problem a few weeks ago; rebooting fixed it
>>> for them (as opposed to trying to get zfs to fix itself w/ management
>>> commands), might try that if feasible .... $0.02, no more,l no less ....
>>>
>> Sorry, that didn't work unfortunately. I had to wait a bit until I could
>> do it between it trying to resilver and workload. It came online at
>> first, but then went back to removed when I checked again later.
>>
>> Any other diags I can do? I've already run smartctl on all the drives
>> (5hrs+) and they've come back clean. There's not much to go on in the
>> logs either. Do a small number of drives just naturally error when
>> placed in a raid or something?
> a) try a 'zpool clear' to perhaps force it to clear errors, but to
> be safe I'd still do "c" below.
>
> b) Did you physically remove the old drive and replace it and then
> run a zpool replace? Did the devices have the same device ID or did
> you use GPT ids?
>
> c) If it's a mirror try just removing the device, zpool remove pool
> device and then re-attaching it via zpool attach.
>
> henrik
>

Thanks for that info, I'll try it next time.

Meanwhile, I had to spend more than a few hours (about 2 days actually: each test takes 5+ hours, and it had some hissy fits, some of which occurred at about 90% completion, the little #$%!) going through the drives and running tests using smartctl and the vendor's tools.

Turns out I had a DOA, but with a twist: using smartctl, the test would run on other drives, maybe up to 50%, and then stop and say the test failed. On the DOA it would pass.

I then turned to the vendor tools and ran through each drive (I had 8 to test amongst my lot, as I got more than a bit curious/suspicious about what was happening overall). I tried testing them all in one machine and they all interfered with one another, so I needed to test each one individually and try to save the result (a tricky one given the ridiculous tools supplied; I know a good tradesman never blames his tools, but take Windows for example... :) ).

Once that was all sorted (24 hours' work later), I found the DOA drive for my raid would pass a simple test, go through maybe 50% of the longer test, and then come up with a failed test, but with absolutely no error code (one is expected). So it was a bit of an odd duck.

As a general rule I find the vendor rather good and their support second to none, but the drives aren't exactly top dollar either, so I have no complaints. This did send me into a bit of a spin, though. At least the experience has been enlightening :)

For reference, smartctl and such aren't taken seriously by vendors. They will accept it if SMART has been tripped (failed health test), but other than that you need to use their tools for diags.
Maybe not news to some, but there's a lot of fluff out there that says otherwise.

Thanks again for the pointers, guys!
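
For anyone landing on this thread later, here is a rough sketch of the one-drive-at-a-time check order described above, written as a small Python dry run that only prints the commands rather than executing them. The pool name "tank" and the device names are hypothetical placeholders, and the helper names are made up for illustration. It covers only the smartctl side; the vendor tools are separate and not shown. Note that `smartctl -t long` only starts the extended self-test and returns straight away, so the self-test log has to be read after the drive has finished (hours later in my case).

```python
import shlex


def drive_check_plan(pool, devices):
    """Build the list of commands to run, in order, one drive at a time.

    Nothing is executed here; each entry is an argv list that could be
    passed to subprocess.run() by hand once you are happy with the plan.
    """
    # Start with the pool state, including per-device error details.
    plan = [["zpool", "status", "-v", pool]]
    for dev in devices:
        # Overall SMART health verdict (the part vendors will accept).
        plan.append(["smartctl", "-H", dev])
        # Kick off the extended self-test; this returns immediately.
        plan.append(["smartctl", "-t", "long", dev])
        # Read the self-test log AFTER the test has finished on the drive.
        plan.append(["smartctl", "-l", "selftest", dev])
    return plan


def format_plan(plan):
    """Render the plan as one shell-quoted command per line."""
    return "\n".join(shlex.join(argv) for argv in plan)


if __name__ == "__main__":
    # Hypothetical pool and device names; substitute your own.
    print(format_plan(drive_check_plan("tank", ["/dev/ada1", "/dev/ada2"])))
```

Keeping it as a printed plan is deliberate: the long tests take hours per drive, so it is easier to run each step by hand and save the output per drive than to have a script sit and wait.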