Date: Wed, 18 May 2011 10:59:37 +0300
From: Daniel Kalchev <daniel@digsys.bg>
To: freebsd-fs@freebsd.org
Subject: Re: HAST + ZFS self healing? Hot spares?
Message-ID: <4DD37C69.5020005@digsys.bg>
In-Reply-To: <85EC77D3-116E-43B0-BFF1-AE1BD71B5CE9@itassistans.se>

On 18.05.11 09:13, Per von Zweigbergk wrote:
> I've been investigating HAST as a possibility in adding synchronous replication and failover to a set of two NFS servers backed by ZFS. The servers themselves contain quite a few disks. 20 of them (7200 RPM SAS disks), to be exact. (If I didn't lose count again...) Plus two quick but small SSDs for ZIL and two not-as-quick but larger SSDs for L2ARC.

Your idea is to have a hot standby server that replaces the primary, should the primary fail (hardware-wise)? You will probably need CARP in addition to HAST in order to maintain the same shared IP address.

> Initially, my thoughts land on simply creating HAST resources for the corresponding pairs of disks and SSDs in servers A and B, and then using these HAST resources to make up the ZFS pool.

This would be the most natural decision, especially if you have identical hardware on both servers. Let's call this variant 1 (a rough sketch follows below).

Variant 2 would be to create local ZFS pools (as you already have), create ZVOLs in them, and let HAST manage those ZVOLs. You would then use the HAST providers for whatever storage needs you have. Performance might not be what you expect, and you have to trust HAST's checksumming.
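For variant 1, I imagine something along these lines. This is only a rough, untested sketch - the hostnames, addresses, resource names, device names and pool name are made-up placeholders:

  # /etc/hast.conf, identical on both nodes; one resource per disk
  # (repeat for the remaining disks and for the SSDs)
  resource disk0 {
          on nodeA {
                  local /dev/da0
                  remote 10.0.0.2
          }
          on nodeB {
                  local /dev/da0
                  remote 10.0.0.1
          }
  }
  resource disk1 {
          on nodeA {
                  local /dev/da1
                  remote 10.0.0.2
          }
          on nodeB {
                  local /dev/da1
                  remote 10.0.0.1
          }
  }

  # on both nodes: make sure hastd is running (hastd_enable="YES"
  # in rc.conf) and initialize the metadata for every resource
  hastctl create disk0
  hastctl create disk1

  # on the node that is to be primary (repeat for every resource)
  hastctl role primary disk0
  hastctl role primary disk1

  # the /dev/hast/* providers exist only on the current primary;
  # build the pool from them in whatever layout you planned for
  # the local disks
  zpool create tank mirror hast/disk0 hast/disk1

  # a manual failover would then look roughly like this
  hastctl role secondary disk0   # on the old primary, if still alive
  hastctl role primary disk0     # on the new primary
  zpool import -f tank           # on the new primary, once all
                                 # resources are primary there
  # ...plus CARP (or similar) to move the shared service IP

The role switching has to be done for every resource, so with 20+ resources you would want to script it.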
> 1. Hardware failure management. In case of a hardware failure, I'm not exactly sure what will happen, but I suspect the single-disk RAID-0 array containing the failed disk will simply fail. I assume it will still exist, but refuse to be read or written. In this situation I understand HAST will handle this by routing all I/O to the secondary server, in case the disk on the primary side dies, or simply by cutting off replication if the disk on the secondary server fails.

Having local ZFS pools makes hardware management easier, as ZFS is designed for exactly this - that is variant 2. With variant 1 you will have several issues:

- You have to handle disk failure and array management at the controller level. You will need to check whether this actually works; you may end up with a new array name and thus have to edit config files.
- There is no hot spare mechanism in HAST, and I do not believe you can switch to the secondary easily. Switching roles will certainly make the HAST device node disappear on the (former) primary server and reappear on the secondary. Perhaps someone can suggest a proper way to handle this.

> 2. ZFS self-healing. As far as I understand it, ZFS does self-healing, in that all data is checksummed, and if one disk in a mirror happens to contain corrupted data, ZFS will re-read the same data from the other disk in the ZFS mirror. I don't see any way this could work in a configuration where ZFS is not mirroring itself, but rather, running on top of HAST, currently. Am I wrong about this? Or is there any way to achieve this same self-healing effect except with HAST?

HAST is a simple mirror. It only makes sure that the blocks on the local and remote drives contain the same data, and I do not believe its checksumming is strong enough to compare with ZFS. Therefore, your best bet for data protection is to put ZFS on top of HAST: in your example, create 20 HAST resources, one per disk, and build the ZFS pool out of those HAST resources. ZFS will then be able to heal itself if the data on the HAST resources ever becomes inconsistent (for whatever reason).

Some people have reported using HAST for the SLOG as well. I do not know whether using HAST for the L2ARC makes any sense; on failure you will import the pool on the slave node, and that wipes the L2ARC anyway.

> I mean, ideally, ZFS would have a really neat synchronous replication feature built into it. Or ZFS could be HAST-aware, and know how to ask HAST to bring it a copy of a block of data on the remote block device in a HAST mirror in case the checksum on the local block device doesn't match. Or HAST would itself have some kind of block-level checksums, and do self-healing itself. (This would probably be the easiest to implement. The secondary site could even continually be reading the same data as the primary site is, merely to check the checksums on disk, not to send it over the wire. It's not like it's doing anything else useful with that untapped read performance.)

With HAST, no (hast) storage providers exist on the secondary node. Therefore, you cannot do any I/O on the secondary node until it becomes primary.

I, too, would be interested in the failure management scenario with HAST+ZFS, as I am currently experimenting with a similar system.

Daniel