Date:      Wed, 18 May 2011 08:13:13 +0200
From:      Per von Zweigbergk <pvz@itassistans.se>
To:        freebsd-fs@freebsd.org
Subject:   HAST + ZFS self healing? Hot spares?
Message-ID:  <85EC77D3-116E-43B0-BFF1-AE1BD71B5CE9@itassistans.se>

I've been investigating HAST as a possibility for adding synchronous replication and failover to a set of two NFS servers backed by ZFS. The servers themselves contain quite a few disks: 20 of them (7200 RPM SAS disks), to be exact (if I didn't lose count again...), plus two quick but small SSDs for the ZIL and two not-as-quick but larger SSDs for L2ARC.

These machines weren't originally designed with synchronous replication in mind - they were designed to be NFS file servers (used as VMware data stores) backed by ZFS. They contain LSI MegaRAID 9260 controllers (as an aside, these were perhaps not the best choice for ZFS, since they lack a true JBOD mode; I have worked around this by making single-disk RAID-0 arrays, and then using those single-disk arrays to make up the zpool).
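For the curious, the workaround amounts to creating one single-drive RAID-0 logical disk per physical drive. From memory, with the sysutils/megacli port it looks something like this (the adapter number and enclosure:slot pairs are placeholders for whatever the controller actually reports):

  # one single-disk RAID-0 logical drive per physical disk, so the
  # controller exposes 20 separate mfidN devices to the OS
  MegaCli -CfgLdAdd -r0 [252:0] -a0
  MegaCli -CfgLdAdd -r0 [252:1] -a0
  ...and so on for the remaining disks.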

Now, I've been considering making an active/passive (or, possibly, active/passive + passive/active) synchronously replicated pair of servers out of these, and my eyes fall on HAST.

Initially, my thoughts land on simply creating HAST resources for the corresponding pairs of disks and SSDs in servers A and B, and then using these HAST resources to make up the ZFS pool.
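To make that concrete, I'm imagining something along these lines in /etc/hast.conf on both machines (host names, addresses and device names are of course made up, and there would be one resource block per disk pair, plus blocks for the SSDs):

  resource disk0 {
          on filer-a {
                  local /dev/mfid0
                  remote 10.0.0.2
          }
          on filer-b {
                  local /dev/mfid0
                  remote 10.0.0.1
          }
  }

and then, on whichever node is active:

  hastctl create disk0            # once per resource, to set up metadata
  hastctl role primary disk0      # "secondary" on the other node
  zpool create tank hast/disk0 hast/disk1 ... \
          log mirror hast/log0 hast/log1 \
          cache hast/cache0 hast/cache1

with the vdev layout (plain stripe, raidz, or mirrors of HAST resources) being exactly the open question below.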

But this raises two questions:

---

1. Hardware failure management. In case of a hardware failure, I'm not exactly sure what will happen, but I suspect the single-disk RAID-0 array containing the failed disk will simply fail. I assume it will still exist, but refuse to be read or written. My understanding is that HAST handles this situation by routing all I/O to the secondary server if the disk on the primary side dies, or simply by cutting off replication if the disk on the secondary server fails.

I have not seen any "hot spare" mechanism in HAST, but I would think that I could edit the cluster configuration file to manually configure a hot spare in case I receive an alert. Would I have to restart all of hastd to do this, though? Or is it sufficient to bring the resource into init and back into secondary using hastctl?
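In other words, something like this (assuming, and this is the part I haven't verified, that a SIGHUP is enough to make hastd pick up the edited file):

  hastctl role init disk7          # take the failed resource out of service
  vi /etc/hast.conf                # point its "local" line at the spare disk
  pkill -HUP hastd                 # re-read the configuration (hopefully?)
  hastctl create disk7             # initialize HAST metadata on the new disk
  hastctl role secondary disk7     # and let it resync from the healthy node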

Of course, it may just be infinitely simpler to configure spares at the ZFS level, keep entire spare HAST resources, and just do a zpool replace, swapping out an entire resource (a pair of disks, one in each server) whenever one of the disks backing it fails. Still, it would be good to know what I can reconfigure on the fly with HAST itself.
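That variant would just be the familiar (resource names hypothetical):

  # dedicate a whole HAST resource (one disk in *each* server) as a spare
  zpool add tank spare hast/spare0
  # when either half of hast/disk3 dies, swap the whole resource out
  zpool replace tank hast/disk3 hast/spare0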

---

2. ZFS self-healing. As far as I understand it, ZFS does self-healing, in that all data is checksummed, and if one disk in a mirror happens to contain corrupted data, ZFS will re-read the same data from the other disk in the mirror (and rewrite the corrupted copy). I don't see how this could currently work in a configuration where ZFS is not mirroring itself, but is instead running on top of HAST. Am I wrong about this? Or is there some way to achieve the same self-healing effect with HAST?

---

So, what is it: do I have to give up ZFS's self-healing (one of the really neat features of ZFS) if I go for HAST? Of course, I could mirror the drives first with HAST, and then mirror the two HAST mirrors using a ZFS mirror, but that would be wasteful and a little silly. I might even be able to get away with using "copies=2" in this scenario. Or I could use raidz on top of the HAST mirrors, wasting less disk, but causing a performance hit.
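To spell out those options (device names hypothetical again):

  # (a) HAST mirrors each disk pair across servers, ZFS mirrors pairs of
  #     HAST resources on top: self-healing works, but only a quarter of
  #     the raw disk space is usable
  zpool create tank mirror hast/disk0 hast/disk1 \
                   mirror hast/disk2 hast/disk3 ...

  # (b) unmirrored HAST resources plus copies=2: HAST covers disk loss,
  #     and ZFS should be able to heal a bad checksum from the second
  #     copy, at half the usable capacity
  zpool create tank hast/disk0 hast/disk1 ...
  zfs set copies=2 tank

  # (c) raidz over HAST resources: less space wasted than (a), but with
  #     the usual raidz performance hit for this kind of workload
  zpool create tank raidz hast/disk0 hast/disk1 hast/disk2 ...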

I mean, ideally, ZFS would have a really neat synchronous replication feature built into it. Or ZFS could be HAST-aware, and know how to ask HAST to fetch a copy of a block from the remote block device in a HAST mirror in case the checksum on the local block device doesn't match. Or HAST could itself have some kind of block-level checksums, and do self-healing itself. (This would probably be the easiest to implement. The secondary site could even continually read the same data as the primary site does, merely to check the checksums on disk, not to send anything over the wire. It's not like it's doing anything else useful with that untapped read performance.)

So, what's the current state of solving this problem? Is there any work being done in this area? Have I overlooked some technology I might use to achieve this goal?


