Date: Thu, 18 Aug 2016 12:17:37 +0100
From: Gary Palmer <gpalmer@freebsd.org>
To: Ben RUBSON <ben.rubson@gmail.com>
Cc: FreeBSD FS <freebsd-fs@freebsd.org>
Subject: Re: HAST + ZFS + NFS + CARP

Isn't this exactly what the lockf command was designed to do for you?
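
For instance, the replication job could be wrapped roughly like this.
It is only a sketch: the lock file name is made up, run_exclusive is a
hypothetical helper, and the zrep path and log file are the
placeholders from the script quoted below.

```sh
#!/bin/sh
# Assumption: /var/run/replic.lock is a made-up lock file name.
LOCKFILE="${LOCKFILE:-/var/run/replic.lock}"

# Run "$@" only if nobody else holds the lock on $LOCKFILE.
#   -s    stay silent if the lock is busy
#   -t 0  give up at once instead of waiting for the lock
run_exclusive() {
	lockf -s -t 0 "$LOCKFILE" "$@"
}

# In the cron script (zrep path and log file as in the quoted example):
#   run_exclusive /blah/path/zrep sync all >> /var/log/zfsrepli.log
```

If the lock is already held, lockf exits with EX_TEMPFAIL (75) and the
command is not run. The kernel drops the lock when the process goes
away, so a crashed run does not leave a stale lock behind the way a
leftover file or directory does.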

I'd also suggest rmdir rather than rm -rf
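
If you stay with the mkdir approach, it might look roughly like this.
LOCKDIR and its default path are hypothetical, run_locked is a name I
made up, and do_your_job is just the placeholder from the quoted
snippet.

```sh
#!/bin/sh
# Assumption: LOCKDIR and its default are hypothetical; the thread
# only says <lockdir>.
LOCKDIR="${LOCKDIR:-${TMPDIR:-/tmp}/replic.lock}"

# Placeholder from the quoted snippet; put the real job here.
do_your_job() {
	echo "replicating"
}

run_locked() {
	# mkdir creates the directory or fails in a single step, so only
	# one caller can get past this test at a time.
	if ! mkdir "$LOCKDIR" 2>/dev/null; then
		echo "cannot take lock $LOCKDIR, skipping this run" >&2
		return 1
	fi
	do_your_job
	rc=$?
	# rmdir only removes an empty directory, unlike rm -rf.
	rmdir "$LOCKDIR"
	return "$rc"
}

run_locked
```

Since rmdir refuses anything but an empty directory, a typo in the
path cannot take real data with it. Note that a job killed halfway
still leaves the directory behind, which is where lockf has the edge.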

On Thu, Aug 18, 2016 at 09:40:50AM +0200, Ben RUBSON wrote:
> Yep, this is better:
> 
> if mkdir <lockdir>
> then
> 	do_your_job
> 	rm -rf <lockdir>
> fi
> 
> 
> 
> > On 18 Aug 2016, at 09:38, InterNetX - Juergen Gotteswinter <juergen.gotteswinter@internetx.com> wrote:
> > 
> > Uhm, I haven't really investigated whether it is or not. Add a "sync"
> > after that? Or replace it?
> > 
> > But anyway, thanks for the hint. Will dig into this!
> > 
> > On 18.08.2016 at 09:36, krad wrote:
> >> I didn't think touch was atomic; mkdir is, though.
> >> 
> >> On 18 August 2016 at 08:32, InterNetX - Juergen Gotteswinter
> >> <juergen.gotteswinter@internetx.com> wrote:
> >> 
> >> 
> >> 
> >>    On 17.08.2016 at 20:03, Linda Kateley wrote:
> >>> I just do consulting so I don't always get to see the end of the
> >>> project. Although we are starting to do more ongoing support so we can
> >>> see the progress..
> >>> 
> >>> I have worked with some of the guys from high-availability.com for maybe
> >>> 20 years. RSF-1 is the cluster that is bundled with Nexenta. It works
> >>> beautifully with OmniOS/illumos. The one customer I have running it in
> >>> prod is an ISP in South America running OpenStack and ZFS on FreeBSD as
> >>> iSCSI. Big boxes, 90+ drives per frame. If someone would like to try it, I
> >>> have some contacts there. Ping me offlist.
> >> 
> >>    No offense, but it sounds a bit like marketing.
> >> 
> >>    Here: we have been running a Nexenta HA setup for several years, with
> >>    one catastrophic failure due to split brain.
> >> 
> >>> 
> >>> You do risk losing data if you batch zfs send. It is very hard to run
> >>> that real time.
> >> 
> >>    Depends on how much data changes, i.e. the delta size.
> >> 
> >> 
> >>    You have to take the snap, then send the snap. Most
> >>> people run it in cron. Even if it's not in cron, you would want one to
> >>> finish before you start the next.
> >> 
> >>    That's the reason why lock files were invented. Tools like zrep handle
> >>    that themselves via additional ZFS properties.
> >> 
> >>    Or, if one does not trust a single layer:
> >> 
> >>    -- snip --
> >>    #!/bin/sh
> >>    if [ ! -f /var/run/replic ] ; then
> >>            touch /var/run/replic
> >>            /blah/path/zrep sync all >> /var/log/zfsrepli.log
> >>            rm -f /var/run/replic
> >>    fi
> >>    -- snip --
> >> 
> >>    something like this, simple
> >> 
> >>     If you lose the sending host before
> >>> the receive is complete you won't have a full copy.
> >> 
> >>    If RSF fails and you end up in split brain, you lose way more. Been
> >>    there, seen that.
> >> 
> >>    With ZFS though you
> >>> will probably still have the data on the sending host, however long it
> >>> takes to bring it back up. RSF-1 runs in the ZFS stack and sends the
> >>> writes to the second system. It's kind of pricey, but actually much less
> >>> expensive than commercial alternatives.
> >>> 
> >>> Anytime you run anything sync it adds latency but makes things safer..
> >> 
> >>    Not surprising; it all depends on the use case.
> >> 
> >>> There is also a cool tool I like, called Zerto, for VMware. It sits in
> >>> the hypervisor and sends a sync copy of a write locally and then an
> >>> async copy remotely. It's pretty cool. Although I haven't run it myself, I
> >>> have a bunch of customers running it. I believe it works with Proxmox too.
> >>> 
> >>> Most people I run into (these days) don't mind losing 5 or even 30
> >>> minutes of data. Small shops.
> >> 
> >>    You talk about minutes; what delta size are we talking about here? Why
> >>    not use zrep in a loop, for example?
> >> 
> >>     They usually have a copy somewhere else.
> >>> Or the cost of 5-30 minutes isn't that great. I used to work as a
> >>> datacenter architect for Sun/Oracle, with only Fortune 500 companies.
> >>> There, losing 1 sec could put large companies out of business. I worked
> >>> with banks and exchanges.
> >> 
> >>    Again, use case. I bet 99% of this list are not operating Fortune 500
> >>    bank filers.
> >> 
> >>    They couldn't ever lose a single transaction. Most people
> >>> nowadays do the replication/availability in the application though and
> >>> don't care about underlying hardware, especially disk.
> >>> 
> >>> 
> >>> On 8/17/16 11:55 AM, Chris Watson wrote:
> >>>> Of course, if you are willing to accept some amount of data loss that
> >>>> opens up a lot more options. :)
> >>>> 
> >>>> Some may find that acceptable though. Like turning off fsync with
> >>>> PostgreSQL to get much higher throughput. As long as you are made
> >>>> *very* aware of the risks.
> >>>> 
> >>>> It's good to have input in this thread from one with more experience
> >>>> with RSF-1 than the rest of us. You confirm what others have said
> >>>> about RSF-1, that it's stable and works well. What were you deploying
> >>>> it on?
> >>>> 
> >>>> Chris
> >>>> 
> >>>> Sent from my iPhone 5
> >>>> 
> >>>> On Aug 17, 2016, at 11:18 AM, Linda Kateley <lkateley@kateley.com>
> >>>> wrote:
> >>>> 
> >>>>> The question I always ask, as an architect, is "can you lose 1 minute
> >>>>> worth of data?" If you can, then batched replication is perfect. If
> >>>>> you can't, then HA. Every place I have positioned it, RSF-1 has
> >>>>> worked extremely well. If I remember right, it works at the DMU. I
> >>>>> would suggest trying it. They have been trying to have a full FreeBSD
> >>>>> solution, and I have several customers running it well.
> >>>>> 
> >>>>> linda
> >>>>> 
> >>>>> 
> >>>>> On 8/17/16 4:52 AM, Julien Cigar wrote:
> >>>>>> On Wed, Aug 17, 2016 at 11:05:46AM +0200, InterNetX - Juergen
> >>>>>> Gotteswinter wrote:
> >>>>>>> 
> >>>>>>> On 17.08.2016 at 10:54, Julien Cigar wrote:
> >>>>>>>> On Wed, Aug 17, 2016 at 09:25:30AM +0200, InterNetX - Juergen
> >>>>>>>> Gotteswinter wrote:
> >>>>>>>>> 
> >>>>>>>>> On 11.08.2016 at 11:24, Borja Marcos wrote:
> >>>>>>>>>>> On 11 Aug 2016, at 11:10, Julien Cigar <julien@perdition.city>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>> 
> >>>>>>>>>>> As I said in a previous post, I tested the zfs send/receive
> >>>>>>>>>>> approach (with zrep) and it works (more or less) perfectly, so
> >>>>>>>>>>> I concur with all you said, especially about off-site
> >>>>>>>>>>> replication and synchronous replication.
> >>>>>>>>>>> 
> >>>>>>>>>>> Out of curiosity I'm also testing a ZFS + iSCSI + CARP setup at
> >>>>>>>>>>> the moment. I'm in the early tests and haven't done any heavy
> >>>>>>>>>>> writes yet, but ATM it works as expected: I haven't managed to
> >>>>>>>>>>> corrupt the zpool.
> >>>>>>>>>> I must be too old school, but I don't quite like the idea of
> >>>>>>>>>> using an essentially unreliable transport (Ethernet) for
> >>>>>>>>>> low-level filesystem operations.
> >>>>>>>>>> 
> >>>>>>>>>> In case something went wrong, that approach could risk
> >>>>>>>>>> corrupting a pool. Although, frankly, ZFS is extremely
> >>>>>>>>>> resilient. One of mine even survived a SAS HBA problem that
> >>>>>>>>>> caused some silent corruption.
> >>>>>>>>> Try a dual split import :D I mean, zpool import -f on 2 machines
> >>>>>>>>> hooked up to the same disk chassis.
> >>>>>>>> Yes this is the first thing on the list to avoid .. :)
> >>>>>>>> 
> >>>>>>>> I'm still busy testing the whole setup here, including the
> >>>>>>>> MASTER -> BACKUP failover script (CARP), but I think you can
> >>>>>>>> prevent that thanks to:
> >>>>>>>> 
> >>>>>>>> - As long as ctld is running on the BACKUP, the disks are locked
> >>>>>>>> and you can't import the pool (even with -f). For example (filer2
> >>>>>>>> is the BACKUP):
> >>>>>>>> https://gist.github.com/silenius/f9536e081d473ba4fddd50f59c56b58f
> >>>>>>>> 
> >>>>>>>> - The shared pool should not be mounted at boot, and you should
> >>>>>>>> ensure that the failover script is not executed during boot time
> >>>>>>>> either. This is to handle the case wherein both machines turn off
> >>>>>>>> and/or restart at the same time. Indeed, the CARP interface can
> >>>>>>>> "flip" its status if both machines are powered on at the same
> >>>>>>>> time, for example:
> >>>>>>>> https://gist.github.com/silenius/344c3e998a1889f988fdfc3ceba57aaf
> >>>>>>>> and you will have a split-brain scenario
> >>>>>>>> 
> >>>>>>>> - Sometimes you'll need to reboot the MASTER for some $reasons
> >>>>>>>> (freebsd-update, etc) and the MASTER -> BACKUP switch should not
> >>>>>>>> happen, this can be handled with a trigger file or something like
> >>>>>>>> that
> >>>>>>>> 
> >>>>>>>> - I still have to check if the order is OK, but I think that as
> >>>>>>>> long as you shut down the replication interface and adapt the
> >>>>>>>> advskew (including the config file) of the CARP interface before
> >>>>>>>> the zpool import -f in the failover script, you can be relatively
> >>>>>>>> confident that nothing will be written on the iSCSI targets
> >>>>>>>> 
> >>>>>>>> - A zpool scrub should be run at regular intervals
> >>>>>>>> 
> >>>>>>>> This is my MASTER -> BACKUP CARP script ATM
> >>>>>>>> 
> >>>>>>>> https://gist.github.com/silenius/7f6ee8030eb6b923affb655a259bfef7
> >>>>>>>> 
> >>>>>>>> Julien
> >>>>>>>> 
> >>>>>>> The 100 euro question, without having looked at that script in
> >>>>>>> detail: yes, at first glance it's super simple, but why are
> >>>>>>> solutions like RSF-1 so much more powerful / feature-rich? There's
> >>>>>>> a reason for that, which is that they try to cover every possible
> >>>>>>> situation (which makes more than sense here).
> >>>>>> I've never used "rsf-1" so I can't say much more about it, but I have
> >>>>>> no doubts about its ability to handle "complex situations", where
> >>>>>> multiple nodes / networks are involved.
> >>>>>> 
> >>>>>>> That script works for sure, within very limited cases imho
> >>>>>>> 
> >>>>>>>>> Kaboom, really ugly kaboom. That's what is very likely to happen
> >>>>>>>>> sooner or later, especially when it comes to homegrown automation
> >>>>>>>>> solutions. Even the commercial products, where much more time/work
> >>>>>>>>> goes into such solutions, fail on a regular basis.
> >>>>>>>>> 
> >>>>>>>>>> The advantage of ZFS send/receive of datasets is, however, that
> >>>>>>>>>> you can consider it essentially atomic. A transport corruption
> >>>>>>>>>> should not cause trouble (apart from a failed "zfs receive") and
> >>>>>>>>>> with snapshot retention you can even roll back. You can't roll
> >>>>>>>>>> back zpool replications :)
> >>>>>>>>>> 
> >>>>>>>>>> ZFS receive does a lot of sanity checks as well. As long as your
> >>>>>>>>>> zfs receive doesn't involve a rollback to the latest snapshot, it
> >>>>>>>>>> won't destroy anything by mistake. Just make sure that your
> >>>>>>>>>> replica datasets aren't mounted and zfs receive won't complain.
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> Cheers,
> >>>>>>>>>> 
> >>>>>>>>>> Borja.
> >>>>>>>>>> 
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> freebsd-fs@freebsd.org mailing list
> >>>>>>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> >>>>>>>>>> To unsubscribe, send any mail to
> >>>>>>>>>> "freebsd-fs-unsubscribe@freebsd.org"