From owner-freebsd-hackers Wed Feb 25 11:22:24 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8)
	id LAA23685 for freebsd-hackers-outgoing; Wed, 25 Feb 1998 11:22:24 -0800 (PST)
	(envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from Kitten.mcs.com (Kitten.mcs.com [192.160.127.90])
	by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id LAA23674
	for ; Wed, 25 Feb 1998 11:22:10 -0800 (PST)
	(envelope-from karl@Mars.mcs.net)
Received: from Mars.mcs.net (karl@Mars.mcs.net [192.160.127.85])
	by Kitten.mcs.com (8.8.7/8.8.2) with ESMTP id NAA21026;
	Wed, 25 Feb 1998 13:21:46 -0600 (CST)
Received: (from karl@localhost) by Mars.mcs.net (8.8.7/8.8.2)
	id NAA13910; Wed, 25 Feb 1998 13:21:46 -0600 (CST)
Message-ID: <19980225132146.02016@mcs.net>
Date: Wed, 25 Feb 1998 13:21:46 -0600
From: Karl Denninger
To: Wilko Bulte
Cc: Jay Nelson , blkirk@float.eli.net, hackers@FreeBSD.ORG
Subject: Re: SCSI Bus redundancy...
References: <199802251848.TAA01481@yedi.iaf.nl>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.84
In-Reply-To: <199802251848.TAA01481@yedi.iaf.nl>; from Wilko Bulte on Wed, Feb 25, 1998 at 07:48:31PM +0100
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

This is a tricky problem to solve "correctly". I have seen several
potential solutions, and all have problems. I've actually INSTALLED AND
USED a couple of them; they cover what they are designed to cover quite
well, but aren't perfect.

Let's say you have two machines, one in "hot standby" mode, the other
active. They monitor each other over a private interconnect. Both are "on"
the disk bus (perhaps through an active/active RAID controller), but only
one is using it. If the first fails, the second activates itself, fsck's
the disks, mounts them, changes its Ethernet MAC address to that of the
failed machine, and comes online.
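That failover sequence can be sketched roughly like this. This is a toy illustration, not anyone's shipping implementation: the heartbeat interval, miss limit, interface name (`fxp0`), device names, and command spellings are all made-up assumptions, and a real standby node would of course execute the commands rather than just build the list.

```python
# Hypothetical sketch of the hot-standby takeover described above.
# All constants and device/interface names are illustrative assumptions.
HEARTBEAT_INTERVAL = 2.0   # seconds between expected peer heartbeats
MISS_LIMIT = 3             # consecutive missed intervals before declaring death

def peer_is_dead(last_heartbeat, now,
                 interval=HEARTBEAT_INTERVAL, limit=MISS_LIMIT):
    """True once `limit` heartbeat intervals have passed with no word
    from the peer over the private interconnect."""
    return (now - last_heartbeat) > interval * limit

def takeover_commands(disks, peer_mac, iface="fxp0"):
    """The takeover sequence from the text: fsck the disks, mount them,
    then assume the failed machine's Ethernet MAC address."""
    cmds = []
    for d in disks:
        cmds.append(f"fsck -p {d}")   # preen; a crash may force a full pass
        cmds.append(f"mount {d}")
    cmds.append(f"ifconfig {iface} ether {peer_mac}")  # impersonate the peer
    return cmds
```

Note that the whole scheme hinges on `peer_is_dead` never returning a false positive while the peer still owns the bus, which is exactly the split-brain hazard discussed below.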
If the first failed due to a software problem and went down "gracefully",
unmounting the disks, the restart time is measured in seconds. If it blew
chunks, then fsck has to run - and you damn well better be using a
journaled filesystem, or this is going to take a LONG time (ie: 20 minutes
to an hour if you have some large disk storage involved here). This is one
reason, by the way, that LFS being in a "working" state is important to
these kinds of efforts. IBM has a solution that they've sold for quite
some time based on AIX (which inherently uses JFS, a journaled
filesystem) which does exactly this.

So far, so good. Now, where are the problems:

1) What if the second machine THINKS the first is dead, but it's wrong?

This could be extremely bad. It's one of the failure scenarios that the
cluster people don't like to talk about, because the consequence of being
"wrong" about this could be the destruction of the disk packs involved.

There ARE some solutions to this if you use a raw interface to the disks
and each machine "checkpoints" to a specific sector on a regular basis.
You SHOULD be able to detect, reliably, whether the other machine is
working this way. But it's a non-trivial problem to solve, and the risk
of being wrong is that you trash the entire working storage set on the
disk subsystem.

2) Concurrent *filesystem* access under Unix is a real bitch.

I've yet to see a *good* solution to this problem. I've seen lots of
hacks, but no real solutions. I consider concurrent RAW disk slice access
to be next to worthless, but I understand that some DBMS companies find
that "solution" ideal for their particular application.

What I've thought about for a long time is architecting an active/active
solution to this problem. It's tricky as hell to do right, but you'd
basically have a bulletproof final installation in which you could take a
hammer to any *ONE* device of a redundant set in the final configuration,
and the noticeable impact from the outside would be *zero*.
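The sector-checkpoint idea in point 1 can be sketched as follows: each node periodically stamps its own reserved sector on the shared raw device with an increasing sequence number, and the peer decides the node is alive only if that number has advanced since it last looked. Everything here is an illustrative assumption - the sector assignments, the record layout, and the use of an ordinary file standing in for a raw device like `/dev/rda0`:

```python
import struct
import time

SECTOR_SIZE = 512
# Each node owns one reserved sector near the start of the shared device.
# These offsets, and the (sequence, timestamp) record layout, are made up.
NODE_SECTOR = {"a": 0, "b": 1}
RECORD = struct.Struct("<QQ")  # (sequence number, wall-clock timestamp)

def write_checkpoint(dev, node, seq, now=None):
    """Stamp this node's reserved sector with a fresh sequence number."""
    buf = RECORD.pack(seq, int(now if now is not None else time.time()))
    with open(dev, "r+b") as f:
        f.seek(NODE_SECTOR[node] * SECTOR_SIZE)
        f.write(buf.ljust(SECTOR_SIZE, b"\0"))  # pad to a full sector

def read_checkpoint(dev, node):
    """Read back a node's (sequence, timestamp) checkpoint record."""
    with open(dev, "rb") as f:
        f.seek(NODE_SECTOR[node] * SECTOR_SIZE)
        seq, ts = RECORD.unpack(f.read(SECTOR_SIZE)[:RECORD.size])
    return seq, ts

def peer_advanced(dev, node, last_seen_seq):
    """True if the peer's sequence number moved since we last looked -
    i.e. the peer is still alive and touching the disk."""
    seq, _ = read_checkpoint(dev, node)
    return seq > last_seen_seq
```

The non-trivial part the text warns about is not shown: deciding how long to wait for the sequence number to advance before taking over, and what to do when both sides conclude the other is dead at the same time.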
--
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly to FULL DS-3 Service
                             | NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost

On Wed, Feb 25, 1998 at 07:48:31PM +0100, Wilko Bulte wrote:
> As Jay Nelson wrote...
> > On Tue, 24 Feb 1998, Ben Kirkpatrick, ELI wrote:
> >
> > > I've been wondering about the SCSI redundancy problems that come up now
> > > and then (read: I've been chewing on paint chips again). What parts are
> > > failing? In my experience, only disks have failed once installed;
> > > controllers have only failed during poor installations, and very rarely
> > > at that.
> > > But what I was really wondering is this bit about having two SCSI cards
> > > on one SCSI bus. On one of my old Adaptecs it _looks_ like I can change
> > > the controller from ID 7 to anything else. With controllers at, say, 6
> > > and 7, would there be a way in software for both controllers to access
> > > the disks? Or even for the standby controller to just scan the bus now
> > > and then?
> > > Okay, I'm going off the deep end, back to my white-out (old formula).
> > >
> > > --Ben Kirkpatrick
> >
> > This is normally done with differential controllers between two
> > different machines -- and, yes, it works. I don't think it's possible
>
> See Digital Unix TruClusters; they indeed only want differential for
> the shared SCSI buses.
>
> > with single-ended controllers. Concurrent file access from two
> > different machines is a _lot_ more troublesome because of the locking
> > problems. I don't know of any standard Unices that support this out of
> > the box. It usually takes two special daemons that run on both
> > machines willing to communicate with each other.
>
> Digital Unix TruClusters do DRD (distributed raw device) now. Things
> like Oracle Parallel Server love this.
> A cluster filesystem is another
> kettle of fish, of course. But not impossible; see OpenVMS.
>
> > If you want both controllers on the same machine for high
> > availability, you'll need to write some software to monitor status and
> > take the appropriate actions if there is a failure. Otherwise, I don't
>
> See www.veritas.com for a number of whitepapers on High Availability.
> Veritas calls their product FirstWatch.
>
> Wilko
> _ ______________________________________________________________________
> | / o / / _ Bulte      email: wilko @ yedi.iaf.nl  http://www.tcja.nl/~wilko
> |/|/ / / /( (_)        Arnhem, The Netherlands - Do, or do not. There is no 'try'
> --------------- Support your local daemons: run [Free,Net,Open]BSD Unix --
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-hackers" in the body of the message

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message