Date: Fri, 29 Mar 2013 10:24:20 -0400
From: J David
To: Kamil Choudhury
Cc: "freebsd-fs@freebsd.org"
Subject: Re: Building ZFS out of ISCSI LUNs?

On Thu, Mar 28, 2013 at 10:27 PM, Kamil Choudhury <Kamil.Choudhury@anserinae.net> wrote:

> Summary: export LUNs from various (independent, share-nothing) storage
> nodes, cobble them together into vdevs, and then create storage pools on
> them.
>
> Am I insane, or could this (with a layer to coordinate access to the
> LUNs) be a pretty nifty way to create a distributed storage system?

It is possible, and I've done it as a test, but it's not a good idea.

If a storage node (box of drives) dies, you lose every drive in it. To minimize the impact of losing a box of drives, you have to increase the number of boxes, and realistically the smallest number of drives per 1U server is four. So if a node fails, you still lose four vdevs and your pool is probably shot until you fix it anyway. You can create four mirrors from one 4-drive box to another, but then if you lose a box you're one drive failure away from permanent data loss. You can get more expensive boxes, put 10 drives in each, RAID 6 them together, and then mirror the result, but that's the exact opposite of putting ZFS close to the disks where it belongs, so performance dives some more.

SAS shelves are cheap, and SAS shelf failures are rarer than server chassis failures (fewer parts and good redundancy), so by swapping out a low-complexity node with good redundancy for a complex node with none, you've paid more for significantly less reliability and moderately awful performance.
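To make the cross-box mirror layout above concrete, here is a minimal sketch. The pool name and device numbering are hypothetical, and it assumes each box's LUNs have already been attached with an iSCSI initiator and show up as da devices:

    # Box A exports da1-da4, box B exports da5-da8 (hypothetical numbering).
    # Mirror each drive in box A against a drive in box B, so losing an
    # entire box only degrades every mirror instead of destroying the pool.
    zpool create tank \
        mirror da1 da5 \
        mirror da2 da6 \
        mirror da3 da7 \
        mirror da4 da8

    # With box B gone, every mirror is running on a single disk; one more
    # drive failure anywhere in box A is permanent data loss.
    zpool status tank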
Also, SAS is significantly faster at this than reading disks over SATA and exporting them over 1G Ethernet via iSCSI. If you use 10G Ethernet you will close the gap, but of course you need multipath, which means two 10G $witches.

Finally, this is not in any way "distributed." The headend is still a single point of failure. Hooking an active and a standby headend to the same pool of disks via Ethernet is conceptually easier than doing it via SAS, but the financial and performance costs aren't worth it. And even if you do that, you still have to put any ZIL/L2ARC SSDs in the shared storage too, otherwise the standby can't import the pool (rough sketch at the end of this message), and putting an SSD on the wrong side of an Ethernet switch when its raison d'être is low latency really blunts its effectiveness.

So by the time you get two ZFS headends, two 10G switches, and enough 1U "shared nothing" drive servers that you can build pools of usable size that don't get crushed by the loss of a single node, you have spent enough to buy a nice NetApp and a Mercedes to deliver it in. And you still have to solve the problem of how to fail over from the active headend to the standby. And the resulting performance will be embarrassing compared to a dual-head SAS setup at the same price. (Which admittedly still has the same how-to-fail-over problem.)

ZFS is *not* a building block for a distributed storage system, or even a high-availability one. In fact, to the best of my knowledge, there are *zero* functioning production-quality distributed filesystems that can be implemented using only FreeBSD.
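For what it's worth, the shared-SSD and failover piece boils down to something like this. Pool and device names are hypothetical, and it assumes the log/cache devices are SSD-backed LUNs exported over iSCSI so that both heads can reach them:

    # ZIL and L2ARC have to live on LUNs both heads can reach, not on local
    # SSDs inside the active head, or the standby can't import the pool.
    zpool add tank log da9       # da9: SSD-backed iSCSI LUN (hypothetical)
    zpool add tank cache da10    # da10: likewise

    # Crude manual failover on the standby head, assuming something external
    # guarantees the old active head is really dead and stays dead:
    zpool import -f tank

Making that forced import safe and automatic is exactly the how-to-fail-over problem that never goes away in this design.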