Date:      Fri, 20 May 2011 01:03:43 +0200
From:      Per von Zweigbergk <pvz@itassistans.se>
To:        Pawel Jakub Dawidek <pjd@FreeBSD.org>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: HAST + ZFS self healing? Hot spares?
Message-ID:  <4DD5A1CF.70807@itassistans.se>
In-Reply-To: <20110519181436.GB2100@garage.freebsd.pl>
References:  <85EC77D3-116E-43B0-BFF1-AE1BD71B5CE9@itassistans.se> <20110519181436.GB2100@garage.freebsd.pl>

On 2011-05-19 20:14, Pawel Jakub Dawidek wrote:
> On Wed, May 18, 2011 at 08:13:13AM +0200, Per von Zweigbergk wrote:
>> I've been investigating HAST as a possibility for adding synchronous replication and failover to a set of two NFS servers backed by ZFS. The servers themselves contain quite a few disks. 20 of them (7200 RPM SAS disks), to be exact. (If I didn't lose count again...) Plus two quick but small SSDs for ZIL and two not-as-quick but larger SSDs for L2ARC.
> [...]
>
> The configuration you should try first is to connect each disks pair
> using HAST and create ZFS pool on top of those HAST devices.
>
> Let's assume you have 4 data disks (da0-da3), 2 SSD disks for ZIL
> (da4-da5) and 2 SSD disks for L2ARC (da6-da7).
>
> Then you create the following HAST devices:
>
> /dev/hast/data0 = MachineA(da0) + MachineB(da0)
> /dev/hast/data1 = MachineA(da1) + MachineB(da1)
> /dev/hast/data2 = MachineA(da2) + MachineB(da2)
> /dev/hast/data3 = MachineA(da3) + MachineB(da3)
>
> /dev/hast/slog0 = MachineA(da4) + MachineB(da4)
> /dev/hast/slog1 = MachineA(da5) + MachineB(da5)
>
> /dev/hast/cache0 = MachineA(da6) + MachineB(da6)
> /dev/hast/cache1 = MachineA(da7) + MachineB(da7)
>
> And then you create ZFS pool of your choice. Here you specify
> redundancy, so if there is any you will have ZFS self-healing:
>
> zpool create tank raidz1 hast/data{0,1,2,3} log mirror hast/slog{0,1} cache hast/cache{0,1}
Raidz on top of hast is one possibility, although raidz does add 
overhead to the equation. I'll have to find out how much. It's also 
possible to just mirror twice, although that would essentially mean 
every write going over the wire twice. Raidz might be the better 
bargain here, since it would only increase the number of writes on the 
wire by 1/n, where n is the number of data drives, at the cost of CPU 
time to calculate parity. Testing will tell.
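
For my own reference, each HAST pair in that layout would be declared in 
hast.conf roughly like this, if I'm reading hast.conf(5) right (host names 
and addresses here are made up):

resource data0 {
        on MachineA {
                local /dev/da0
                remote 10.0.0.2
        }
        on MachineB {
                local /dev/da0
                remote 10.0.0.1
        }
}
# ...and the same pattern for data1-data3, slog0-slog1 and cache0-cache1.

And the twice-mirrored layout I mentioned would be something along the 
lines of:

zpool create tank mirror hast/data{0,1} mirror hast/data{2,3} log mirror hast/slog{0,1} cache hast/cache{0,1}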
>> 1. Hardware failure management. In case of a hardware failure, I'm not exactly sure what will happen, but I suspect the single-disk RAID-0 array containing the failed disk will simply fail. I assume it will still exist, but refuse to be read or written. In this situation I understand HAST will handle this by routing all I/O to the secondary server, in case the disk on the primary side dies, or simply by cutting off replication if the disk on the secondary server fails.
> HAST sends all write requests to both nodes (if the secondary is present)
> and read requests only to the primary node. In some cases reads can be sent
> to the secondary node, for example when synchronization is in progress and
> the secondary has more recent data, or when reading from the local disk
> failed (either because of a single EIO or because the entire disk went bad).
>
> In other words, HAST itself can handle the failure of one of the mirrored
> disks.
>
> If an entire hast/<resource> dies for some reason (e.g. the secondary is
> down and the local disk dies), then ZFS redundancy kicks in.
Very well, that is how failures are handled. But how do we *recover* 
from a disk failure? Without taking the entire server down, that is.
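
(I assume spotting the failure in the first place is just a matter of 
watching the usual status commands on each node, something like:

hastctl status        # per-resource role, dirty bytes, complete/degraded state
zpool status tank     # whether ZFS itself has noticed any vdev errors

but that part seems uncontroversial.)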

I already know how to deal with my HBA to hot-add and hot-remove 
devices. And how to deal with hardware failures on the *secondary* node 
seems fairly straightforward; after all, it doesn't really matter if the 
mirroring becomes degraded for a few seconds while I futz around with 
restarting hastd and such. The primary sees the secondary disappear for 
a few seconds; when it comes back, it will just truck all of the dirty 
data over. Big deal.
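
If I've understood hastctl(8) correctly, replacing a disk on the secondary 
would go roughly like this, possibly with a hastd restart thrown in, which 
is harmless there (data2/da2 are just made-up names for the failed resource 
and its replacement disk):

# on the secondary, after physically swapping the disk:
hastctl role init data2          # take the resource out of service
hastctl create data2             # initialize HAST metadata on the fresh da2
hastctl role secondary data2     # rejoin; the primary resynchronizes the data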

But what if the drive fails on the primary side? On the primary server I 
can't just restart hastd at my leisure, since the underlying filesystem 
relies on it not going away. Ideally I'd want to just be able to tell hast 
that "hey, there's a new drive you can use, just suck over all the data 
from the secondary onto this drive, and route I/O from the secondary in 
the meantime" - without restarting hastd. Is this possible?

Of course I could just avoid the problem by failing over the entire 
server whenever I want to replace hardware on the primary, making it the 
secondary. But causing a 20-second I/O hiccup (just guessing about the 
actual failover time here) in my virtualization environment just because 
I want to swap a hard drive seems unreasonable.
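
For completeness, my understanding is that such a full failover would look 
more or less like this (pool name from the example above, and glossing over 
CARP/fencing entirely):

# on the current primary:
zpool export tank
hastctl role secondary all
# on the new primary:
hastctl role primary all
zpool import tank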

These unresolved questions are why I would feel safer simply running 
ZFS on the metal and running HAST on zvols. :-) If running ZFS on top of 
a zvol is a bad idea, there is always the option of simply exporting the 
HAST resource backed by zvols as an iSCSI target and running VMFS on the 
drives. But that does mean losing some of the cooler features of ZFS, 
which is a shame.
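
Concretely, that alternative would mean carving a zvol out of each node's 
local pool and pointing HAST at it, something like this (the pool and 
resource names and the size are invented for illustration):

# on each node, on its local pool:
zfs create -V 500G zdata/hast0

# and in hast.conf:
resource hast0 {
        on MachineA {
                local /dev/zvol/zdata/hast0
                remote 10.0.0.2
        }
        on MachineB {
                local /dev/zvol/zdata/hast0
                remote 10.0.0.1
        }
}

with either ZFS or an iSCSI target on top of /dev/hast/hast0, as described 
above.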


