Date: Fri, 29 Mar 2013 10:24:20 -0400
From: J David
To: Kamil Choudhury
Cc: "freebsd-fs@freebsd.org"
Subject: Re: Building ZFS out of ISCSI LUNs?

On Thu, Mar 28, 2013 at 10:27 PM, Kamil Choudhury <Kamil.Choudhury@anserinae.net> wrote:

> Summary: export LUNs from various (independent, share-nothing) storage
> nodes, cobble them together into vdevs, and then create storage pools on
> them.
>
> Am I insane, or could this (with a layer to coordinate access to the
> LUNs) be a pretty nifty way to create a distributed storage system?

It is possible, and I've done it as a test, but it's not a good idea.

If a storage node (box of drives) dies, you lose every drive in it. To minimize the impact of losing a box of drives, you have to increase the number of boxes, and realistically the smallest number of drives per 1U server is four. So if a node fails, you still lose four vdevs and your pool is probably shot until you fix it anyway. You can create four mirrors from one 4-drive box to another, but then if you lose a box you're one drive failure away from permanent data loss. You can get more expensive boxes, put 10 drives in each, RAID 6 them together, and then mirror the result, but that's the exact opposite of putting ZFS close to the disks where it belongs, so performance dives some more.

SAS shelves are cheap, and SAS shelf failures are rarer than server chassis failures (fewer parts and good redundancy), so by swapping out a low-complexity node with good redundancy for a complex node with none, you've paid more for significantly less reliability and moderately awful performance.
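To make the cross-box mirror layout above concrete, here is a minimal sketch. The pool name and device numbering are hypothetical, and it assumes each box's LUNs have already been attached with an iSCSI initiator and show up as da devices:

    # Box A exports da1-da4, box B exports da5-da8 (hypothetical numbering).
    # Mirror each drive in box A against a drive in box B, so losing an
    # entire box only degrades every mirror instead of destroying the pool.
    zpool create tank \
        mirror da1 da5 \
        mirror da2 da6 \
        mirror da3 da7 \
        mirror da4 da8

    # With box B gone, every mirror is running on a single disk; one more
    # drive failure anywhere in box A is permanent data loss.
    zpool status tank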
Also, SAS is significantly faster at this than reading disks over SATA and exporting them over 1G Ethernet via iSCSI. If you use 10G Ethernet you will close the gap, but of course you need multipath, which means two 10G $witches.

Finally, this is not in any way "distributed." The headend is still a single point of failure. Hooking an active and a standby headend to the same pool of disks via Ethernet is conceptually easier than doing it via SAS, but the financial and performance costs aren't worth it. And even if you do that, you still have to put any ZIL/L2ARC SSDs in the shared storage too, otherwise the standby can't import the pool (rough sketch at the end of this message), and putting an SSD on the wrong side of an Ethernet switch when its raison d'être is low latency really blunts its effectiveness.

So by the time you get two ZFS headends, two 10G switches, and enough 1U "shared nothing" drive servers that you can build pools of usable size that don't get crushed by the loss of a single node, you have spent enough to buy a nice NetApp and a Mercedes to deliver it in. And you still have to solve the problem of how to fail over from the active headend to the standby. And the resulting performance will be embarrassing compared to a dual-head SAS setup at the same price. (Which admittedly still has the same how-to-fail-over problem.)

ZFS is *not* a building block for a distributed storage system, or even a high-availability one. In fact, to the best of my knowledge, there are *zero* functioning production-quality distributed filesystems that can be implemented using only FreeBSD.
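For what it's worth, the shared-SSD and failover piece boils down to something like this. Pool and device names are hypothetical, and it assumes the log/cache devices are SSD-backed LUNs exported over iSCSI so that both heads can reach them:

    # ZIL and L2ARC have to live on LUNs both heads can reach, not on local
    # SSDs inside the active head, or the standby can't import the pool.
    zpool add tank log da9       # da9: SSD-backed iSCSI LUN (hypothetical)
    zpool add tank cache da10    # da10: likewise

    # Crude manual failover on the standby head, assuming something external
    # guarantees the old active head is really dead and stays dead:
    zpool import -f tank

Making that forced import safe and automatic is exactly the how-to-fail-over problem that never goes away in this design.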