Date: Fri, 24 Feb 2012 15:28:50 -0600
From: Adam Vande More <amvandemore@gmail.com>
To: Dieter BSD <dieterbsd@engineer.com>
Cc: freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance
Message-ID: <CA%2BtpaK2c3AjUF%2Bmy5=52xOHEFq0Q2a3nwXJKkfrjbhN4vQAv7A@mail.gmail.com>
In-Reply-To: <20120224211011.300960@gmx.com>
References: <20120224211011.300960@gmx.com>

On Fri, Feb 24, 2012 at 3:10 PM, Dieter BSD <dieterbsd@engineer.com> wrote:

> Depends on what sort of work the machine is doing. If the job is
> something that can be done again, you could simply try again, if
> you still get different answers try a third machine or wade in and
> start manually inspecting things until you find the problem.
> If the job is time critical or you can't get the same inputs again,
> then the machine needs to get it right the first time. How many
> 9s of reliability do you need and how many resources can you throw
> at it? 2x hardware can be good for better than 5 9s. (high quality
> hardware and software, and technicians standing by with cold spares)
> I've heard that mil gear uses 3x hardware.
>
> Building a 5 9s system is... non-trivial. So I'm wondering what sort
> of reliability we can get with 2x off the shelf commodity hardware
> and a bit of software? Similar to mirroring/RAID but with whole
> computers rather than just disks. Classic Unix technique of doing
> 10-20% of the work and getting 80-90% of the result.
>

I don't have anything particularly insightful to add to this
conversation, but it is something I've looked into a bit. The solution
that seemed most promising to me is Remus. I don't know whether anyone
here has heard of it, so I offer a link:

http://static.usenix.org/event/nsdi08/tech/full_papers/cully/cully_html/

I understand this doesn't correspond exactly to the OP's point, but there
is good material there regardless.

--
Adam Vande More
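
For readers who want the quoted "try again, then try a third machine" idea
in concrete form, here is a minimal sketch. It is not Remus and not
anything proposed in this thread: the name redundant_run is hypothetical,
and each "replica" is assumed to be a plain callable standing in for the
same job run on a separate machine.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def redundant_run(replicas: Sequence[Callable[[], Hashable]]) -> Hashable:
    """Cross-check a job across replicas (hypothetical helper, illustration only).

    The first two replicas are always run. Further replicas are consulted
    only if those two disagree, and the first value that any two replicas
    agree on is returned.
    """
    if len(replicas) < 2:
        raise ValueError("need at least two replicas to cross-check")

    results = [replicas[0](), replicas[1]()]
    if results[0] == results[1]:
        return results[0]

    # Disagreement: bring in spare replicas one at a time as tie-breakers.
    for replica in replicas[2:]:
        results.append(replica())
        value, votes = Counter(results).most_common(1)[0]
        if votes >= 2:
            return value

    # No two replicas agree: this is the "wade in and inspect" case.
    raise RuntimeError(f"replicas disagree, manual inspection needed: {results!r}")
```

With only two replicas a mismatch can be detected but not resolved, which
is why the sketch raises in that case; a third replica is what makes a
tie-break possible.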