Subject: Re: HAST + ZFS + NFS + CARP
From: Borja Marcos <borjam@sarenet.es>
Date: Thu, 11 Aug 2016 10:11:15 +0200
To: Julien Cigar
Cc: Jordan Hubbard, freebsd-fs@freebsd.org

> On 04 Jul 2016, at 21:31, Julien Cigar wrote:
>
>> To get specific again, I am not sure I would do what you are contemplating given your circumstances since it’s not the cheapest / simplest solution. The cheapest / simplest solution would be to create 2 small ZFS servers and simply do zfs snapshot replication between them at periodic intervals, so you have a backup copy of the data for maximum safety as well as a physically separate server in case one goes down hard. Disk storage is the cheap part now, particularly if you have data redundancy and can therefore use inexpensive disks, and ZFS replication is certainly “good enough” for disaster recovery. As others have said, adding additional layers will only increase the overall fragility of the solution, and “fragile” is kind of the last thing you need when you’re frantically trying to deal with a server that has gone down for what could be any number of reasons.
>>
>> I, for example, use a pair of FreeNAS Minis at home to store all my media and they work fine at minimal cost. I use one as the primary server that talks to all of the VMWare / Plex / iTunes server applications (and serves as a backup device for all my iDevices) and it replicates the entire pool to another secondary server that can be pushed into service as the primary if the first one loses a power supply / catches fire / loses more than 1 drive at a time / etc.
>> Since I have a backup, I can also just use RAIDZ1 for the 4x4Tb drive configuration on the primary and get a good storage / redundancy ratio (I can lose a single drive without data loss but am also not wasting a lot of storage on parity).
>
> You're right, I'll definitively reconsider the zfs send / zfs receive
> approach.

Sorry to be so late to the party.

Unless you have a *hard* requirement for synchronous replication, I would avoid it like the plague. Synchronous replication sounds sexy, but it has several disadvantages: complexity, and, if you wish to keep an off-site replica, a definite performance impact, because distance increases delay.

Asynchronous replication with ZFS, on the other hand, has several advantages.

First and foremost: the snapshot-and-replicate approach is a terrific short-term “backup” solution that will allow you to recover quickly from all-too-common incidents, like your own software corrupting data. A ZFS snapshot is trivial to roll back and it won’t involve a costly “backup recovery” procedure. You can do both replication *and* keep a snapshot retention policy à la Apple’s Time Machine. (A rough sketch of such a replication loop is appended below.)

Second: I mentioned distance when keeping off-site replicas, as distance necessarily increases delay. Asynchronous replication doesn’t have that problem.

Third: with some care you can do one-to-N replication, even with different replication frequencies per destination.

Several years ago, in 2009 I think, I set up a system that worked quite well. It was based on NFS and ZFS. The requirements were a bit particular, which in this case greatly simplified things for me.

I had a farm of front-end web servers (running Apache) that took all of their content from an NFS server. The NFS server used ZFS as the file system. This might not be useful for everyone, but in this case the web servers were CPU-bound due to plenty of PHP crap. As the front-ends weren’t supposed to write to the file server (and indeed that was undesirable for security reasons) I could afford to export the NFS file systems read-only.

The server was replicated to a sibling at 1 or 2 minute intervals, I don’t remember exactly. And the interesting part was this: I used Heartbeat to decide which of the servers was the master. When Heartbeat decided which one was the master, a specific IP address was assigned to it and the NFS service was started. So the front-ends would happily mount it.

What happened in case of a server failure?

Heartbeat would detect it in a minute, more or less. Assuming a master failure, the former slave would become master, assigning itself the NFS server IP address and starting up NFS. Meanwhile, the front-ends had a silly script running at 1-minute intervals that simply read a file from the NFS-mounted filesystem. In case of a read error it would force an unmount of the NFS filesystem and enter a loop trying to mount it again until it succeeded.

It looks kludgy, but it meant that in case of a server loss (ZFS on FreeBSD wasn’t that stable at the time and we suffered a couple of them) the website was titsup for maybe two minutes, recovering automatically. It worked.
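For what it's worth, the snapshot-and-replicate loop doesn't need much more than a cron job along these lines. This is just a sketch, not what we actually ran: the pool name "tank", the host "replica.example.net" and the "repl-" snapshot prefix are made up, and it assumes the previous replication snapshot still exists on both sides.

    #!/bin/sh
    # Minimal asynchronous replication sketch; run from cron every few minutes.
    # "tank" and "replica.example.net" are placeholders.
    POOL=tank
    DEST=replica.example.net

    # Most recent replication snapshot of the pool dataset, if any.
    PREV=$(zfs list -H -t snapshot -o name -s creation -d 1 ${POOL} | grep "@repl-" | tail -1)
    NOW="${POOL}@repl-$(date +%Y%m%d%H%M%S)"

    zfs snapshot -r "${NOW}"
    if [ -n "${PREV}" ]; then
            # Incremental stream: only what changed since ${PREV}.
            zfs send -R -i "${PREV}" "${NOW}" | ssh ${DEST} zfs receive -F ${POOL}
    else
            # First run: full stream.
            zfs send -R "${NOW}" | ssh ${DEST} zfs receive -F ${POOL}
    fi

A real script would also prune old snapshots on both sides according to whatever retention policy you want to keep.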
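The read-only export itself was nothing special; on FreeBSD it is a one-liner in exports(5). The path and network below are placeholders, not our real configuration:

    # /etc/exports on each NFS head
    /tank/www   -ro -alldirs -network 192.168.10.0 -mask 255.255.255.0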
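And the "silly script" on the front-ends was essentially this, reconstructed from memory as a sketch: /data, the .alive test file and the nfs.example.net service address are placeholders, and it relies on a soft NFS mount so that reads fail instead of hanging forever when the server disappears.

    #!/bin/sh
    # Runs from cron every minute on each front-end.
    # "nfs.example.net" stands for the Heartbeat-managed service address
    # shared by the two NFS heads; /data and the export path are made up.
    MNT=/data
    SRC=nfs.example.net:/tank/www

    if ! cat ${MNT}/.alive > /dev/null 2>&1; then
            # The read failed (or the mount is gone): force-unmount and
            # retry until the surviving master answers again.
            umount -f ${MNT} 2> /dev/null
            until mount -t nfs -o ro,soft ${SRC} ${MNT}; do
                    sleep 5
            done
    fi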
Both NFS servers were in the same datacenter, but I could have added geographical dispersion by using BGP to announce the NFS IP address to our routers.

There are better solutions, but this one involved no fancy software licenses and no expensive hardware, and it was quite reliable. The only problem we had (maybe I was just too daring) was that we were bitten by a ZFS deadlock bug several times. But it worked anyway.




Borja.