Date:      Fri, 29 Mar 2013 10:24:20 -0400
From:      J David <j.david.lists@gmail.com>
To:        Kamil Choudhury <Kamil.Choudhury@anserinae.net>
Cc:        "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: Building ZFS out of ISCSI LUNs?
Message-ID:  <CABXB=RRyyA8uw+n6rKzkm+_sSdXv-riCdokJiTWxvsi0GM=vRQ@mail.gmail.com>
In-Reply-To: <F9A7386EC2A26E4293AF13FABCCB32B3A650C514@janus.anserinae.net>
References:  <F9A7386EC2A26E4293AF13FABCCB32B3A650C514@janus.anserinae.net>

On Thu, Mar 28, 2013 at 10:27 PM, Kamil Choudhury <
Kamil.Choudhury@anserinae.net> wrote:

> Summary: export LUNs from various (independent, share-nothing) storage
> nodes, cobble them together into vdevs, and then create storage pools on
> them.
>
> Am I insane, or could this (with a layer to coordinate access to the
> LUNs) be a pretty nifty way to create a distributed storage system?
>

It is possible, and I've done it as a test, but it's not a good idea.
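
A minimal sketch of what such a setup can look like, with made-up device
names: once the remote LUNs are attached, they show up as ordinary da(4)
devices on the head node, and ZFS will happily pool them.

  # hypothetical: da2/da3 come from storage node A, da4/da5 from node B
  zpool create tank mirror da2 da4 mirror da3 da5
  zpool status tank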

If a storage node (box of drives) dies, you lose every drive in it.  To
minimize the impact of losing a box of drives, you have to increase the
number of boxes.  Realistically the smallest number of drives per 1U server
is 4.  So if a node fails, you still knock a drive out of four vdevs at
once, and your pool is probably shot until you fix it anyway.  You can
create four mirrors from one 4-drive box to another, but then if you
lose a box, each of those mirrors is a single drive failure away from
permanent data loss.  You can get more expensive boxes, put 10 drives in
each, RAID 6 them together, and then mirror the result, but that's the
exact opposite of putting ZFS close to the disks where it belongs, so
performance dives some more.
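
To make those two layouts concrete, here are hypothetical sketches with
invented device names.  Drive-for-drive mirrors across two 4-drive
boxes, where losing a box leaves every vdev one failure away from data
loss:

  # box A exports da2-da5, box B exports da6-da9
  zpool create tank mirror da2 da6 mirror da3 da7 \
                    mirror da4 da8 mirror da5 da9

versus each box doing RAID 6 internally and exporting one big LUN, so
ZFS only mirrors the two LUNs, far away from the real disks:

  # da2 and da6 are hypothetical RAID 6 LUNs, one per box
  zpool create tank mirror da2 da6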

SAS shelves are cheap, and SAS shelf failures are rarer than server
chassis failures (fewer parts and good redundancy), so by swapping out a
limited-complexity node with good redundancy for a complex node with
none, you've paid more for significantly less reliability and moderately
awful performance.

Also, SAS is significantly faster at this than reading disks with SATA
and exporting them over 1G Ethernet via iSCSI.  If you use 10G Ethernet
you will close the gap, but of course you need multipath, which means
two 10G $witches.
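
Some rough, back-of-envelope numbers (typical figures, not measurements
from any particular setup):

  1 GbE               ~125 MB/s wire speed, less after iSCSI/TCP overhead
  10 GbE              ~1.25 GB/s wire speed
  SAS 6 Gb/s x4 port  ~2.4 GB/s to a single shelf

One or two modern SATA disks streaming sequentially will saturate a
single 1 GbE path.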

Finally, this is not in any way "distributed."  The headend is still a
single point of failure. Hooking an active and standby headend to the same
pool of disks via Ethernet is conceptually easier than doing it via SAS,
but the financial and performance costs aren't worth it.  And even if you
do that, you still have to put any ZIL/L2ARC SSDs in the shared storage
too, otherwise the standby can't import the pool, and putting an SSD on
the wrong side of an Ethernet switch when its raison d'être is low
latency really blunts its effectiveness.
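
For what it's worth, the failover itself would boil down to something
like the following (pool name hypothetical), and a plain import only
works if every vdev the pool references, any log device included, is
visible from the standby:

  # on the old headend, if it is still alive:
  zpool export tank

  # on the standby, once it can see all of the LUNs (and the SSDs):
  zpool import -f tank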

So by the time you get two ZFS headends, two 10G switches, and enough 1U
"shared nothing" drive servers that you can build pools of usable size that
don't get crushed by the loss of a single node, you have spent enough to
buy a nice NetApp and a Mercedes to deliver it in.  And you still have to
solve the problem of how to fail over from the active headend to the
standby.  And the resulting performance will be embarrassing compared to a
dual-head SAS setup at the same price. (Which admittedly still has the same
how-to-failover problem.)

ZFS is *not* a building block for a distributed storage system, or even a
high-availability one.  In fact, to the best of my knowledge, there are
*zero* functioning production-quality distributed filesystems that can be
implemented using only FreeBSD.


