Date:      Tue, 26 Jan 2010 08:46:19 -0800
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        freebsd-stable@freebsd.org
Subject:   Re: ZFS "zpool replace" problems
Message-ID:  <20100126164619.GA50461@icarus.home.lan>
In-Reply-To: <20100126160320.6ed67b92.gerrit@pmp.uni-hannover.de>
References:  <20100126143021.GA47535@icarus.home.lan> <20100126160320.6ed67b92.gerrit@pmp.uni-hannover.de>

On Tue, Jan 26, 2010 at 04:03:20PM +0100, Gerrit Kühn wrote:
> On Tue, 26 Jan 2010 06:30:21 -0800 Jeremy Chadwick
> <freebsd@jdc.parodius.com> wrote about Re: ZFS "zpool replace" problems:
> JC> 2) How did you attach ad18?  Did you tell the system about it using
> JC>    atacontrol?  If so, what commands did you use?
> 
> Yes. The drives did not appear automatically (verified with atacontrol
> list). Then I first tried reinit ata9, but that did not work out, so I did
> a detach/attach for ata9, then the drive was there (with list and also
> the device node appeared).

The procedure -- at least on Intel controllers in AHCI mode -- is:

- zpool offline <pool> <disk>
- atacontrol detach ataX (where X = channel associated with disk)
- Physically remove bad disk
- Physically insert new disk
- Wait about 15 seconds for the disk and controller to settle
- atacontrol attach ataX (where X = previous channel detached)
- zpool replace <pool> <disk>
- zpool online <pool> <disk>
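
As a rough sketch only, the steps above could be written out as a small
dry-run script.  It prints the commands in order rather than running
them, so nothing touches the pool or the controller.  The function name
"hotswap_plan" and the pool name "tank" are made up for illustration;
ad18 and ata9 are the disk and channel mentioned earlier in this thread.

```sh
#!/bin/sh
# Dry-run sketch: prints the hot-swap commands, does NOT execute them.
# hotswap_plan is a hypothetical helper name, not a FreeBSD utility.
hotswap_plan() {
    pool=$1
    disk=$2
    chan=$3
    if [ -z "$pool" ] || [ -z "$disk" ] || [ -z "$chan" ]; then
        echo "usage: hotswap_plan <pool> <disk> <channel>" >&2
        return 1
    fi
    echo "zpool offline $pool $disk"
    echo "atacontrol detach $chan"
    echo "# physically swap the disk, then wait about 15 seconds"
    echo "atacontrol attach $chan"
    echo "zpool replace $pool $disk"
    echo "zpool online $pool $disk"
}

# Example values: "tank" is an assumed pool name; ad18/ata9 come from this thread.
hotswap_plan tank ad18 ata9
```

Printing the plan first makes it easy to double-check the pool, disk and
channel names before typing anything for real.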

"reinit" shouldn't be needed at all -- in fact, I've seen reinit cause
some craziness (even on Intel controllers), including a system deadlock,
but this was back during the RELENG_6 and RELENG_7 days.  Great
improvements have been made to ata(4) since then.

If you need me to validate the above procedure (it's been a while since
I've had to hot-swap a disk), I can do so.  I do have a 4-disk
Supermicro SuperServer 5015B-MTB (ICH9-based) sitting on my workbench
which I can test with.

> Meanwhile I took out the ad18 drive again and tried to use a different
> drive. But that was listed as "UNAVAIL" with corrupted data by zfs.
> Probably it already branded the disk for resilvering and is looking for
> exactly this one now. I also put in the disk which caused the problem
> above again. The resilvering process started again, but very soon the
> drive got detached again resulting in the same situation I described above.

It honestly sounds like hot-swapping is causing some chaos on your
system.  Are all of the controllers involved configured for AHCI?  If
not, physical removal/insertion should be done only when the system
power is off.  If so, mav@ or others may be able to help figure out
what's going on in the underlying ata(4) layer.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



