Date:      Thu, 11 Jul 2002 08:43:38 -1000
From:      Clifton Royston <cliftonr@lava.net>
To:        Andy Sporner <sporner@nentec.de>
Cc:        aaron g <click46@operamail.com>, freebsd-cluster@FreeBSD.ORG
Subject:   Re: SPREAD clusters
Message-ID:  <20020711084338.B18402@lava.net>
In-Reply-To: <3D2D483E.4040100@nentec.de>; from sporner@nentec.de on Thu, Jul 11, 2002 at 10:56:30AM +0200
References:  <20020709212404.16403.qmail@operamail.com> <3D2D483E.4040100@nentec.de>

On Thu, Jul 11, 2002 at 10:56:30AM +0200, Andy Sporner wrote:
> Hi,
> 
> I will try to answer both emails at once ;-) (wow! multitasking batch mode!)
> 
> I think a good clarification is in order. ;-)
> 
> My idea of a "perfect cluster" is one where applications don't realize
> that they are on one, in addition to achieving the five 9's of
> reliability (99.999% uptime).
> 
> In my working experience I have seen the clustering system at Fermi-Lab
> in the early 90's and had my opinions about it.  I worked with clusters
> on Dynix/PTX (Sequent) and have even given back some enhancements since
> 1995.  In the mid '90s I formed a company to make load-balancing
> applications, and the rights got sold to a company in Boston that makes
> 1U servers.  Now I am working for a German company making a very high
> speed network control switch (one that can make complex routing
> decisions about network traffic) which can be used to front-end such a
> cluster as I am proposing--though a software solution will work equally
> well ;-)
>
> While working on Sequent clusters, I got familiar with the Numa-Q product
> and its workings.  I came to the conclusion that it only addressed
> the SMP bottleneck (Amdahl's law) but really didn't add that much more
> reliability.
> 
> So the idea for Phase 2 was to make a 'Numa'-like system--which in
> effect it is--that removes the OS on the node as a single point of
> failure.  In their Numa architecture there was one instance of the OS
> running across many physical nodes, using special hardware to be able
> to address any page of memory across the complex making up a node.
> The problem is that one member of the complex could bring down the
> entire system.
> 
> What I wanted to do was to start where they were, but have a separate
> O/S image on each node with a cooperative process space--yes like
> Mosix, but totally transparently.  When a system becomes too busy,
> rather than swapping a process out to disk, it can be swapped to
> another node.  Sort of like SMP (Symmetrical Multi-Processing) in a
> network.  If a node dies, just those processes that had memory there
> die, but the OS (cooperatively speaking) just goes on running--and
> rather than waiting for a reboot, the dead processes just get
> restarted again.  With such a system, the five 9's should be very
> easy to reach.
 
  Very nice.
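
  Just to check that I follow the pageout angle, here's a rough sketch
(plain C, every name below is my own invention, not anything in your
code) of the kind of decision hook I imagine in that path: when memory
pressure would normally push a process out to swap, ask whether some
live peer node has room for it instead, and only fall back to the
normal swap path when none does.

/* Hypothetical sketch only -- none of these names exist anywhere. */
struct peer_node {
        int     pn_id;          /* cluster node id */
        int     pn_alive;       /* heartbeat says the node is up */
        long    pn_freepages;   /* free memory the peer advertises, in pages */
};

#define MAX_PEERS       16
static struct peer_node peers[MAX_PEERS];

/*
 * Called (in this sketch) when a candidate process would otherwise be
 * swapped to disk.  Returns the id of the peer with the most free
 * memory that can hold the process, or -1 meaning "no suitable peer,
 * swap to disk as usual".
 */
static int
pick_migration_target(long proc_pages)
{
        int i, best = -1;
        long best_free = 0;

        for (i = 0; i < MAX_PEERS; i++) {
                if (!peers[i].pn_alive)
                        continue;
                if (peers[i].pn_freepages < proc_pages)
                        continue;
                if (peers[i].pn_freepages > best_free) {
                        best_free = peers[i].pn_freepages;
                        best = peers[i].pn_id;
                }
        }
        return (best);
}

  Obviously the real work is in moving the address space and the kernel
state, not in picking the target, but is that roughly the shape of it?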

> That being said, there are a lot of challenges--especially with
> respect to the system scheduler and the VM system--that have to be
> addressed.  I have a rough concept that I have been going over through
> the last 4 years and have never had a chance to commit it to a
> document.  I suppose it is probably about time to do so.  I even came
> up with a way that network applications can survive a node move as
> well, though it requires a special protocol and a front-end device to
> achieve this.  Because of the front-end device and the potential
> single points of failure, we have phase-1 of the clustering
> software, but ultimately phase-2 should completely replace phase-1
> for everything else.
 
  I assume this will be targeted at the current -current track (5.x
release) to take advantage of the major rewrite of kernel scheduling
there?
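
  On the connection-survival piece: purely as a thought experiment (the
message layout and every name below are mine, not something you've
specified), I picture the node telling the front-end device something
like the following whenever a process carrying a TCP connection lands
on a new node, so the front end can re-steer the flow without the
client noticing:

#include <sys/types.h>
#include <stdint.h>
#include <netinet/in.h>

/* Hypothetical handoff notification, migrating node -> front-end device. */
struct conn_handoff {
        uint32_t        ch_version;      /* protocol version */
        struct in_addr  ch_client_addr;  /* client address of the flow */
        uint16_t        ch_client_port;  /* client port (network order) */
        uint16_t        ch_service_port; /* service port (network order) */
        uint32_t        ch_old_node;     /* node the process left */
        uint32_t        ch_new_node;     /* node the process landed on */
        uint32_t        ch_next_seq;     /* next expected TCP sequence number */
};

The front end would just rewrite its forwarding entry for that
(client address, port) pair to point at ch_new_node.  Is that anywhere
near what you have in mind with the special protocol?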

> While speaking about phase-1, the goal is simple generic failover of
> applications.  There is a small feature that didn't cost much in the
> implementation: adding a weight to the applications so that they can
> be started on nodes in a more intelligent manner, with respect to the
> resources on the machine.  For the moment the weights are static
> (i.e., the weights of the already-running applications are summed up
> to find out whether enough resources are present to start a new one).
> Instead of looking at the configured maximum weight, the actual
> application usage (by merit of the CSE patch) can be collected instead.
> From this, jobs can be shut down and restarted on other nodes when
> the statistics on a node change.

  Great explanation!
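
  If I've read the weighting scheme right, the static check boils down
to something like the following (again a strawman with my own names,
just to confirm I understand it): sum the weights of whatever is
already running on a node and only admit the new application if the
node's configured maximum isn't exceeded.

/* Strawman of the static weight check as I understand it. */
struct cluster_app {
        const char      *ca_name;       /* application name */
        int             ca_weight;      /* configured weight */
        int             ca_running;     /* nonzero if running on this node */
};

/*
 * Return nonzero if an application of weight 'new_weight' fits on a
 * node whose configured maximum weight is 'node_max', given the
 * applications already running there.  Swapping ca_weight for a
 * measured figure would give the dynamic version you describe.
 */
static int
node_can_start(const struct cluster_app *apps, int napps,
    int node_max, int new_weight)
{
        int i, used = 0;

        for (i = 0; i < napps; i++)
                if (apps[i].ca_running)
                        used += apps[i].ca_weight;

        return (used + new_weight <= node_max);
}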

  This sounds like a great project.  I'm in agreement with both your
goals and the stages you're talking about to reach them.  I've had
daydreams of trying to hammer together something similar, but don't
have the experience with clustering that you've brought to the project.
Addressing basic application failover in a structured way as you've
done does seem like the best place to start, and then go for the big
issues of moving processes, interacting with other kernels, etc.  I
agree that it wouldn't make much sense sinking too much effort into
load-balancing the *front* end (network connections) in software at
this stage, when there are hardware products out there that will do it
at a reasonable price.

  If you don't mind, maybe this sketch of the project you just gave
could be committed to the documentation as a starting point. 

  On the storage front, do you think it would be worthwhile to also
address in an upcoming phase the failover of shared storage access,
using something like the "Tertiary Disk project" node design at:
  <http://now.cs.berkeley.edu/Td/arch.html>

  -- Clifton

-- 
    Clifton Royston  --  LavaNet Systems Architect --  cliftonr@lava.net
"What do we need to make our world come alive?  
   What does it take to make us sing?
 While we're waiting for the next one to arrive..." - Sisters of Mercy




