Date:      Fri, 24 Feb 2012 16:10:09 -0500
From:      "Dieter BSD" <dieterbsd@engineer.com>
To:        freebsd-hackers@freebsd.org
Subject:   Re: OS support for fault tolerance
Message-ID:  <20120224211011.300960@gmx.com>

>> The problem then is how to feed both machines the same inputs, and
>> compare the outputs. Do we need a third machine to supervise?
>> Can we have each machine keep an eye on the other, avoiding the
>> need for a third machine?
>
> A pair would work as long as the only failures are "obvious" (e.g.
> crashes).  If they simply disagree as to the result, how would we
> determine which one was right?

Depends on what sort of work the machine is doing.  If the job is
something that can be done again, you could simply try again.  If
you still get different answers, try a third machine, or wade in and
start manually inspecting things until you find the problem.
If the job is time critical or you can't get the same inputs again,
then the machine needs to get it right the first time.  How many
9s of reliability do you need, and how many resources can you throw
at it?  2x hardware can be good for better than 5 9s (given high quality
hardware and software, and technicians standing by with cold spares).
I've heard that mil gear uses 3x hardware.
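
To make the "try again, then ask a third machine" idea concrete, here is
a rough sketch.  Everything in it is hypothetical: run_redundant, the
callables standing in for machines, and the retry count are made up for
illustration, and it assumes the job is repeatable and that results can
be compared with == and hashed.

```python
from collections import Counter


def run_redundant(job, machines, tiebreaker=None, max_retries=1):
    """Run `job` on a pair of machines and compare the answers.

    `machines` is a pair of callables standing in for the two computers
    (hypothetical; a real system would ship the job over a network).
    On disagreement the job is retried; if the pair still disagrees,
    an optional third machine casts the deciding vote.
    """
    first, second = machines
    for _ in range(max_retries + 1):
        result_a, result_b = first(job), second(job)
        if result_a == result_b:
            return result_a

    if tiebreaker is None:
        # A pair alone cannot say which side is wrong.
        raise RuntimeError("replicas disagree; manual inspection needed")

    # Majority vote among the last two answers and the third machine.
    votes = Counter([result_a, result_b, tiebreaker(job)])
    result, count = votes.most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among three replicas")
    return result
```

With only two machines the sketch can detect a disagreement but not
resolve it, which is the point made in the quoted question.  It also
does nothing about the supervisor itself failing.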

Building a 5 9s system is... non-trivial.  So I'm wondering: what sort
of reliability can we get with 2x off-the-shelf commodity hardware
and a bit of software?  Similar to mirroring/RAID, but with whole
computers rather than just disks.  It's the classic Unix technique of
doing 10-20% of the work and getting 80-90% of the result.
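
For a back-of-the-envelope feel for the numbers, here is a small sketch.
It rests on big assumptions that are mine, not measurements: the two
machines fail independently, failover is instant and perfect, and only
"obvious" failures (crashes) count.  Real commodity pairs share power,
network, software bugs and operators, so treat the result as an upper
bound.  The function names are made up for illustration.

```python
import math

MINUTES_PER_YEAR = 365.25 * 24 * 60


def pair_availability(single):
    """Availability of a mirrored pair, assuming independent failures:
    the service is down only when both machines are down at once."""
    return 1.0 - (1.0 - single) ** 2


def downtime_minutes_per_year(availability):
    """Expected downtime per year for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def nines(availability):
    """Number of 9s, e.g. 0.999 is about 3."""
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    for single in (0.99, 0.999):
        pair = pair_availability(single)
        print(f"single {nines(single):.1f} nines -> pair {nines(pair):.1f} nines, "
              f"{downtime_minutes_per_year(pair):.1f} min/year down")
```

Under these assumptions mirroring doubles the number of 9s, and 5 9s
works out to a little over five minutes of downtime per year.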

>> Which then leads to the issue of how to avoid problems when *it*
>> breaks.
>
> For some reason, this reminds me of a Dr. Seuss story:
> http://www.goodreads.com/review/show/49519038

*grin*  Gotta love Dr. Seuss.


