Date:      Thu, 11 Aug 2016 11:10:16 +0200
From:      Julien Cigar <julien@perdition.city>
To:        Borja Marcos <borjam@sarenet.es>
Cc:        Jordan Hubbard <jkh@ixsystems.com>, freebsd-fs@freebsd.org
Subject:   Re: HAST + ZFS + NFS + CARP
Message-ID:  <20160811091016.GI70364@mordor.lan>
In-Reply-To: <E7D42341-D324-41C7-B03A-2420DA7A7952@sarenet.es>
References:  <6035AB85-8E62-4F0A-9FA8-125B31A7A387@gmail.com> <20160703192945.GE41276@mordor.lan> <20160703214723.GF41276@mordor.lan> <65906F84-CFFC-40E9-8236-56AFB6BE2DE1@ixsystems.com> <B48FB28E-30FA-477F-810E-DF4F575F5063@gmail.com> <61283600-A41A-4A8A-92F9-7FAFF54DD175@ixsystems.com> <20160704183643.GI41276@mordor.lan> <AE372BF0-02BE-4BF3-9073-A05DB4E7FE34@ixsystems.com> <20160704193131.GJ41276@mordor.lan> <E7D42341-D324-41C7-B03A-2420DA7A7952@sarenet.es>



On Thu, Aug 11, 2016 at 10:11:15AM +0200, Borja Marcos wrote:
> 
> > On 04 Jul 2016, at 21:31, Julien Cigar <julien@perdition.city> wrote:
> > 
> >> To get specific again, I am not sure I would do what you are
> >> contemplating given your circumstances since it's not the cheapest /
> >> simplest solution.  The cheapest / simplest solution would be to create
> >> 2 small ZFS servers and simply do zfs snapshot replication between them
> >> at periodic intervals, so you have a backup copy of the data for
> >> maximum safety as well as a physically separate server in case one goes
> >> down hard.  Disk storage is the cheap part now, particularly if you
> >> have data redundancy and can therefore use inexpensive disks, and ZFS
> >> replication is certainly "good enough" for disaster recovery.  As
> >> others have said, adding additional layers will only increase the
> >> overall fragility of the solution, and "fragile" is kind of the last
> >> thing you need when you're frantically trying to deal with a server
> >> that has gone down for what could be any number of reasons.
> >> 
> >> I, for example, use a pair of FreeNAS Minis at home to store all my
> >> media and they work fine at minimal cost.  I use one as the primary
> >> server that talks to all of the VMWare / Plex / iTunes server
> >> applications (and serves as a backup device for all my iDevices) and it
> >> replicates the entire pool to another secondary server that can be
> >> pushed into service as the primary if the first one loses a power
> >> supply / catches fire / loses more than 1 drive at a time / etc.  Since
> >> I have a backup, I can also just use RAIDZ1 for the 4x4Tb drive
> >> configuration on the primary and get a good storage / redundancy ratio
> >> (I can lose a single drive without data loss but am also not wasting a
> >> lot of storage on parity).
> > 
> > You're right, I'll definitely reconsider the zfs send / zfs receive
> > approach.
> 
> Sorry to be so late to the party.
> 
> Unless you have a *hard* requirement for synchronous replication, I would
> avoid it like the plague. Synchronous replication sounds sexy, but it has
> several disadvantages: it adds complexity, and if you wish to keep an
> off-site replica it will definitely impact performance, since distance
> increases delay.
> 
> Asynchronous replication with ZFS has several advantages, however.
> 
> First and foremost: the snapshot-replicate approach is a terrific
> short-term "backup" solution that will allow you to recover quickly from
> all-too-common incidents, like your own software corrupting data. A ZFS
> snapshot is trivial to roll back and it won't involve a costly "backup
> recovery" procedure. You can do both replication *and* keep some snapshot
> retention policy à la Apple's Time Machine.
> 
> Second: I mentioned distance when keeping off-site replicas, as distance
> necessarily increases delay. Asynchronous replication doesn't have that
> problem.
> 
> Third: with some care you can do one-to-N replication, even involving
> different replication frequencies.
> 
> Several years ago, in 2009 I think, I set up a system that worked quite
> well. It was based on NFS and ZFS. The requirements were a bit
> particular, which in this case greatly simplified it for me.
> 
> I had a farm of front-end web servers (running Apache) that took all of
> the content from an NFS server. The NFS server used ZFS as the file
> system. This might not be useful for everyone, but in this case the web
> servers were CPU bound due to plenty of PHP crap. As the front-ends
> weren't supposed to write to the file server (and indeed it was
> undesirable for security reasons) I could afford to export the NFS file
> systems in read-only mode.
> 
> The server was replicated to a sibling at 1 or 2 minute intervals, I
> don't remember which. And the interesting part was this: I used Heartbeat
> to decide which of the servers was the master. When Heartbeat decided
> which one was the master, a specific IP address was assigned to it and
> the NFS service was started. So, the front-ends would happily mount it.
> 
> What happened in case of a server failure?
> 
> Heartbeat would detect it in a minute, more or less. Assuming a master
> failure, the former slave would become master, assigning itself the NFS
> server IP address and starting up NFS. Meanwhile, the front-ends had a
> silly script running at 1 minute intervals that simply read a file from
> the NFS-mounted filesystem. In case of a read error it would force an
> unmount of the NFS share and enter a loop trying to mount it again until
> it succeeded.
> 
> It looks kludgy, but it means that in case of a server loss (ZFS on
> FreeBSD wasn't that stable at the time and we suffered a couple of them)
> the website was titsup for maybe two minutes, recovering automatically.
> It worked.
> 
> Both NFS servers were in the same datacenter, but I could have added
> geographical dispersion by using BGP to announce the NFS IP address to
> our routers.
> 
> There are better solutions, but this one involved no fancy software
> licenses, no expensive hardware, and it was quite reliable. The only
> problem we had (maybe I was just too daring) was that we were bitten by a
> ZFS deadlock bug several times. But it worked anyway.
> 

As I said in a previous post, I tested the zfs send / zfs receive approach
(with zrep) and it works (more or less) perfectly, so I concur with
everything you said, especially about off-site replication and synchronous
replication.
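
For the record, stripped of zrep's bookkeeping, an incremental send/receive
cycle boils down to roughly this (a minimal sketch; "tank/data" and the
host "backup" are placeholders, not my real names):

  # take a new snapshot and send the increment since the previous one
  zfs snapshot tank/data@repl-new
  zfs send -i tank/data@repl-prev tank/data@repl-new | \
      ssh backup zfs receive -F tank/data

  # once the receive has succeeded, retire the old snapshot on both sides
  zfs destroy tank/data@repl-prev
  ssh backup zfs destroy tank/data@repl-prev
  zfs rename tank/data@repl-new tank/data@repl-prev
  ssh backup zfs rename tank/data@repl-new tank/data@repl-prev

zrep essentially automates this rotation and adds locking and failover
bookkeeping on top of it.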

Out of curiosity I'm also testing a ZFS + iSCSI + CARP setup at the moment.
I'm still in the early tests and haven't done any heavy writes yet, but ATM
it works as expected: I haven't managed to corrupt the zpool.
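
The general idea is roughly the following (a sketch only; the IQN,
addresses and da* device names are made up): one node exports a disk with
ctld, the other attaches it with iscsictl and builds a plain ZFS mirror
out of its local disk and the iSCSI one, while CARP provides the floating
service IP:

  # /etc/ctl.conf on the node exporting its disk
  portal-group pg0 {
      discovery-auth-group no-authentication
      listen 192.168.10.2
  }
  target iqn.2016-08.city.perdition:storage0 {
      auth-group no-authentication
      portal-group pg0
      lun 0 {
          path /dev/da1
      }
  }

  # on the other node: attach the remote disk and create the mirrored pool
  iscsictl -A -p 192.168.10.2 -t iqn.2016-08.city.perdition:storage0
  zpool create tank mirror da1 da2   # da1 = local disk, da2 = iSCSI disk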

I think that with the following assumptions the failover from MASTER
(old master) -> BACKUP (new master) can be done quite safely (the
opposite *MUST* always be done manually IMHO):
1) Don't mount the zpool at boot (see the sketch after this list)
2) Ensure that the failover script is not executed at boot
3) Once the failover script has been executed and the BACKUP is the
new MASTER, assume that it will remain so unless changed manually
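
For 1) and 2), something along these lines works (a sketch with a made-up
pool name; the cachefile property is just one way of keeping the pool out
of the boot-time import):

  # keep the pool out of /boot/zfs/zpool.cache so that it is never
  # auto-imported (and therefore never mounted) at boot
  zpool set cachefile=none tank

  # the failover script is not hooked into rc.conf at all; it is only
  # run by hand or from a devd hook on a CARP transition (see below)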

This is to avoid the case of a catastrophic power loss in the DC and a
possible split-brain scenario when both nodes go off / on simultaneously.
2) is especially important with a CARPed interface, where the state can
sometimes flip from BACKUP -> MASTER -> BACKUP at boot.
For 3) you must adapt the advskew of the CARPed interface, so that even
if the BACKUP (now master) has an unplanned shutdown/reboot the old
MASTER (now backup) doesn't take over unless done manually. So you should
do something like:

sysrc ifconfig_bge0_alias0="vhid 54 advskew 10 pass xxx alias 192.168.10.15/32"
ifconfig bge0 vhid 54 advskew 10

in the failover script (where the "new" advskew (10) is smaller than the
advskew of the old master, now backup).

The failover should only be done for unplanned events, so if you reboot
the MASTER for some reason (freebsd-update, etc.) the failover script on
the BACKUP should handle that case.
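
For what it's worth, one way to trigger the failover script on a real CARP
transition rather than from rc at boot is a devd hook; a rough sketch,
assuming the vhid 54 / bge0 pair from above and a hypothetical script path:

  # /etc/devd/carp-failover.conf
  notify 30 {
      match "system"    "CARP";
      match "subsystem" "54@bge0";
      match "type"      "MASTER";
      action "/usr/local/sbin/zfs-failover.sh";
  };

The script itself still has to refuse to run right after boot, per 2)
above.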

(more soon...)

Julien

> 
> Borja.

-- 
Julien Cigar
Belgian Biodiversity Platform (http://www.biodiversity.be)
PGP fingerprint: EEF9 F697 4B68 D275 7B11  6A25 B2BB 3710 A204 23C0
No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.



