From: Adam Vande More
To: Dieter BSD
Cc: freebsd-hackers@freebsd.org
Date: Fri, 24 Feb 2012 15:28:50 -0600
Subject: Re: OS support for fault tolerance

On Fri, Feb 24, 2012 at 3:10 PM, Dieter BSD wrote:

> Depends on what sort of work the machine is doing. If the job is
> something that can be done again, you could simply try again, if
> you still get different answers try a third machine or wade in and
> start manually inspecting things until you find the problem.
> If the job is time critical or you can't get the same inputs again,
> then the machine needs to get it right the first time. How many
> 9s of reliability do you need and how many resources can you throw
> at it? 2x hardware can be good for better than 5 9s. (high quality
> hardware and software, and technicians standing by with cold spares)
> I've heard that mil gear uses 3x hardware.
>
> Building a 5 9s system is... non-trivial. So I'm wondering what sort
> of reliability we can get with 2x off the shelf commodity hardware
> and a bit of software? Similar to mirroring/RAID but with whole
> computers rather than just disks. Classic Unix technique of doing
> 10-20% of the work and getting 80-90% of the result.
>

I don't have anything particularly insightful to add to this
conversation, but it is something I've looked into a bit. The solution
that seemed most promising to me is Remus. I don't know if anyone here
has heard of it, so I offer a link:

http://static.usenix.org/event/nsdi08/tech/full_papers/cully/cully_html/

I understand this doesn't correlate exactly with the OP's point, but
there is good material there regardless.

-- 
Adam Vande More