Date:      Tue, 25 May 1999 11:39:33 -0400
From:      Graeme Tait <graeme@echidna.com>
To:        Juergen Nickelsen <jnickelsen@acm.org>
Cc:        Alex Heiphetz <heiphetz@cvzoom.net>, freebsd-questions@freebsd.org
Subject:   Re: 100% dependability/failsafe/security/hardware
Message-ID:  <374AC435.15C6C178@echidna.com>
References:  <388916.3136624169@ockholm.jn.berlin.snafu.de>



Juergen Nickelsen wrote:
> 
> --On Mon, 24. Mai 1999 18:52 -0400 Alex Heiphetz <heiphetz@cvzoom.net>
> wrote:
> 
> > 3. How to provide 100% failsafe system?
> 
> *All* hardware redundant: CPUs, RAM, secondary storage, data paths,
> power supplies, fans, UPSs, etc.; proactive hardware monitoring

<snip of description of highly reliable $$$y$tem>


I would say you need to start this project with an acceptable downtime target.
There is no such thing as perfection, and you can't engineer an appropriate
solution without quantifying your requirements. Beyond a certain point, the
reliability target becomes the primary driver of design and cost.
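
To make a downtime target concrete, here is a minimal sketch that converts an
availability figure into hours of permitted downtime per year. The function
name and the list of targets are illustrative assumptions, not figures from
this thread, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_budget_hours(availability, period_hours=HOURS_PER_YEAR):
    """Hours of downtime allowed in a period for a given availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours


# Illustrative targets only; pick the one that matches your requirement.
for target in (0.99, 0.999, 0.9999, 0.99999):
    hours = downtime_budget_hours(target)
    print(f"{target:.3%} uptime allows {hours:.3f} hours down per year")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the target
ends up driving design and cost.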

The first question I would ask is what reliability you expect from your Internet
connectivity. I'm assuming from your original post that these servers are
accessed via the Internet. I have modest experience of colocating our server
with a quality provider. In the last six months, we've lost connectivity to the
server for a total of about 3 hours, part a network problem involving the colo,
part failure of a UPS feeding our server in the colo. There have been other
instances of degraded connectivity. Our plain-vanilla-Pentium/FreeBSD box has
*never* missed a beat. Any other downtime has been elective on our part, for
upgrades, etc.

I have longer-term experience of using a provider hosted by Exodus, the major
Internet colo provider. They have experienced the odd few hours downtime
(complete outages) a year due to Exodus's problems - again they have both lost
network connectivity, and lost AC power, plus there have been many (usually
brief) instances of degraded connectivity.

So if you are dependent on such a single point of failure, there's not much point
in engineering your equipment to vastly higher standards. Absent widespread
implementation of advanced DNS features (as posted on recently either here or to
freebsd-isp) that would allow multiple systems to be located separately but
answer to a common host name with automatic failover, I don't know how you can
easily circumvent unreliability in your Internet provider.
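
The arithmetic behind this can be sketched as follows. When every link in a
chain must be up, the overall availability is the product of the individual
availabilities, assuming failures are independent. The connectivity figure uses
the roughly 3 hours lost in six months mentioned above; the server
availabilities are hypothetical values chosen only for illustration.

```python
def serial_availability(components):
    """Availability of a chain in which every component must be up.

    Assumes the components fail independently of one another.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


HOURS_IN_SIX_MONTHS = 182.5 * 24          # approximation
connectivity = 1 - 3 / HOURS_IN_SIX_MONTHS  # about 3 hours lost in six months

# Hypothetical server availabilities, from modest to very expensive.
for server in (0.999, 0.9999, 0.99999):
    overall = serial_availability([connectivity, server])
    print(f"server {server:.5f} -> overall {overall:.5f}")
```

However good the server becomes, the overall figure can never exceed the
connectivity figure.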


I'm trying to deal with this issue for the server we operate. Our problem is
that we may not be able to achieve satisfactory repair time if our hardware at
the colo fails, because of access problems and personnel availability.

The current plan is to have duplicate hardware, one system active, one hot
standby (but possibly offloading background tasks like offline analysis and
backup). Rather than buy one verrry expensive system, we'll have two modest (but
decent quality), inexpensive systems - probably costing less overall, and
probably appreciably more reliable overall. The only common components would be
ones that are the responsibility of our colo provider, and they have plenty of
resources to fix problems.
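
As a rough illustration of why two modest machines can beat one expensive one:
if either machine alone can carry the service, the pair is down only when both
are down. The sketch below assumes independent failures and instant, perfect
failover, neither of which holds exactly in practice, and the availability
values are hypothetical.

```python
def parallel_availability(replicas):
    """Availability of a redundant group where any one replica is enough.

    Assumes independent failures and perfect failover (both optimistic).
    """
    all_down = 1.0
    for availability in replicas:
        all_down *= (1.0 - availability)
    return 1.0 - all_down


modest = 0.99      # hypothetical availability of one inexpensive machine
expensive = 0.999  # hypothetical availability of one costly machine

pair = parallel_availability([modest, modest])
print(f"one expensive machine: {expensive:.4f}")
print(f"two modest machines:   {pair:.4f}")
```

The shared components at the colo (network, power) remain in series with the
pair, so they still cap the overall result.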

BTW, a major benefit of total duplication is that upgrades can be performed and
tested on the standby machine, service switched over to it, and the same upgrade
then applied to what was the live machine. If you want 100% uptime, you need a
plan to eliminate elective downtime, and reduce the risk of live changes to the
online system.

The hard part of all this is (1) detecting failure (in either machine); (2)
achieving automatic failover; and (3) what I see as the hardest of all, how to
deal with dynamic data (such as a live database) on the running machine, and
ensure the backup machine picks up where the other left off, as nearly as
possible, without loss of essential data (like orders or email).
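
Point (1) is often approached with a heartbeat: the standby expects a periodic
signal from the active machine and treats a long enough silence as failure. The
following is a minimal sketch of that timeout logic only. The class name, the
15-second default and the way heartbeats arrive are all assumptions, and it
does nothing about points (2) and (3).

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat from a peer and flags a timeout."""

    def __init__(self, timeout_s=15.0, now=None):
        self.timeout_s = timeout_s
        # Treat construction time as the first heartbeat.
        self.last_seen = time.monotonic() if now is None else now

    def record_heartbeat(self, now=None):
        """Call whenever a heartbeat from the peer is received."""
        self.last_seen = time.monotonic() if now is None else now

    def peer_failed(self, now=None):
        """True if no heartbeat has arrived within the timeout."""
        now = time.monotonic() if now is None else now
        return (now - self.last_seen) > self.timeout_s


# Explicit timestamps make the behaviour easy to follow.
monitor = HeartbeatMonitor(timeout_s=10.0, now=0.0)
print(monitor.peer_failed(now=5.0))    # within the timeout
print(monitor.peer_failed(now=10.5))   # silence longer than the timeout
```

Note that silence alone cannot distinguish a dead peer from a broken link
between the two machines, so a real setup needs some second check before the
standby takes over.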


BTW, one thing that concerns me particularly here is that I've seen at least two
cases of "uninterruptible" power fail in colo situations. Because this means all
systems (both duplicate servers) suffer an unclean shutdown, there's the
potential for both to be taken out at once, especially as in such situations,
it's not unusual to see multiple power glitches or line transients. So one might
want to furnish each machine with a separate, small UPS offering status
feedback, enabling a clean shutdown on prolonged primary power loss.
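
The decision such a UPS enables might look like the sketch below: ride out
short glitches, but shut down cleanly once the outage is prolonged or the
battery runs low. The function, its inputs and the thresholds are hypothetical;
how the status is actually read depends on the UPS and its monitoring software.

```python
def should_shut_down(on_battery, seconds_on_battery, charge_pct,
                     max_seconds=300, min_charge_pct=25):
    """Decide whether to start a clean shutdown.

    All thresholds are illustrative assumptions, not vendor recommendations.
    """
    if not on_battery:
        return False  # mains power is present, nothing to do
    prolonged = seconds_on_battery >= max_seconds
    battery_low = charge_pct <= min_charge_pct
    return prolonged or battery_low


# A brief glitch is ignored; a long outage or a low battery is not.
print(should_shut_down(True, 20, 95))
print(should_shut_down(True, 600, 80))
print(should_shut_down(True, 60, 20))
```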


--
Graeme Tait






