Date:      Fri, 30 Oct 2015 08:06:55 -0500
From:      Josh Paetzel <josh@tcbug.org>
To:        Jan Bramkamp <crest@rlwinm.de>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: iSCSI/ZFS strangeness
Message-ID:  <9D4FE448-28EC-45F6-B525-E660E3AF57B0@tcbug.org>
In-Reply-To: <563262C4.1040706@rlwinm.de>
References:  <20151029015721.GA95057@mail.michaelwlucas.com> <563262C4.1040706@rlwinm.de>


> On Oct 29, 2015, at 1:17 PM, Jan Bramkamp <crest@rlwinm.de> wrote:
>
>> On 29/10/15 02:57, Michael W. Lucas wrote:
>> The initiators can both access the iSCSI-based pool--not
>> simultaneously, of course. But CARP, devd, and some shell scripting
>> should get me a highly available pool that can withstand the demise of
>> any one iSCSI server and any one initiator.
>>
>> The hope is that the pool would continue to work even if an iSCSI host
>> shuts down. When the downed iSCSI host returns, the initiators should
>> log back in and the pool auto-resilver.
>
> I would recommend against using CARP for this because CARP is prone to
> split-brain situations, and in this case they could destroy your whole
> storage pool. If the current head node fails, the replacement has to
> `zpool import -f` the pool, and in a split-brain situation both head
> nodes would continue writing to the iSCSI targets.
>
> I would move the leader election to an external service like consul,
> etcd, or zookeeper. This is one case where the added complexity is
> worth it. If you can't run an external service for this, e.g. because
> it would exceed the scope of the chapter you're writing, please
> simplify the setup with more reliable hardware, good monitoring, and
> manual failover for maintenance. CARP isn't designed to implement
> reliable (enough) master election for your storage cluster.
>
> Adding iSCSI to your storage stack adds complexity and overhead. For
> setups which still fit inside a single rack, SAS (with geom_multipath)
> is normally faster and cheaper. On the other hand, you can't spread
> out SAS storage far enough to implement disaster tolerance should you
> really need it, and it certainly is a more involved setup.
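The external leader election Jan suggests can be sketched with consul's built-in lock primitive, which holds a session-backed lock while a command runs and releases it if the holder dies. This is only a hedged sketch: the pool name `tank`, the lock prefix, and the assumption that a consul agent is reachable are all made up, and a real deployment would also need to fence the old head node before importing.

```shell
#!/bin/sh
# Sketch: become the head node only after winning an external lock.
# `consul lock PREFIX CMD` runs CMD while holding the lock and kills
# it if the consul session is lost. Names below are hypothetical.
consul lock storage/head-node '
    zpool import -f tank &&
    service ctld onestart &&
    sleep infinity
'
```

The point of the design is that `zpool import -f` only ever runs on the single node currently holding the lock, so a network partition between the heads cannot produce two simultaneous writers.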


I'll impart some wisdom here.

1) HA with two nodes is impossible to do right.  You need a third system to achieve quorum.
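Point 1 is the standard majority-quorum argument: with two nodes, any partition leaves each side holding exactly half the votes, so neither side can safely seize the pool; a third voter guarantees one side of any split still has a majority. A small illustrative sketch:

```python
def has_quorum(votes_held: int, cluster_size: int) -> bool:
    """A node (or partition) may act as head node only if it
    holds a strict majority of the cluster's votes."""
    return votes_held > cluster_size // 2

# Two nodes: a split leaves 1 vote on each side -- neither has
# quorum, so neither may safely `zpool import -f`.
assert not has_quorum(1, 2)

# Three nodes: the larger side of any split holds 2 of 3 votes.
assert has_quorum(2, 3)
assert not has_quorum(1, 3)
```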

2) You can do SAS over optical these days. Perfect for having mirrored JBODs in different fire suppression zones of a datacenter.
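For the single-rack SAS route mentioned above, a hedged sketch of gmultipath(8) usage: the device names are hypothetical, assuming da0/da1 and da2/da3 are pairs of paths to the same two disks.

```shell
# Sketch: combine two SAS paths to one disk into a single
# multipath provider, then build the pool on those providers.
gmultipath label -v DISK1 /dev/da0 /dev/da1
gmultipath label -v DISK2 /dev/da2 /dev/da3
gmultipath status    # lists each multipath device and its paths

zpool create tank mirror /dev/multipath/DISK1 /dev/multipath/DISK2
```

With mirrored JBODs in separate zones, each multipath provider survives the loss of one path and the pool survives the loss of one enclosure.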

3) I've seen a LOT of "cobbled together with shell script" HA rigs.  They mostly get disabled eventually as it's realized that they go split brain in the edge cases and destroy the storage.  What we did was go passive/passive and then address those cases as "how could we have avoided going passive/passive". It took two years.

4) Leverage mav@'s ALUA support.  For block access this will make your life much easier.
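For point 4, the export side lives in ctld. A minimal ctld.conf sketch for serving a zvol over iSCSI follows; the target name and zvol path are made up, and the ALUA/HA-specific tuning (which is separate from this basic config) is not shown.

```
portal-group pg0 {
        discovery-auth-group no-authentication
        listen 0.0.0.0
}

target iqn.2015-10.org.example:tank0 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/zvol/tank/vol0
        }
}
```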

5) Give me a call. I type slow and tend to leave things out, but would happily do one or more brain dump sessions.


