Date:      Thu, 11 Jul 2002 10:56:30 +0200
From:      Andy Sporner <sporner@nentec.de>
To:        aaron g <click46@operamail.com>
Cc:        cliftonr@lava.net, freebsd-cluster@FreeBSD.ORG
Subject:   Re: SPREAD clusters
Message-ID:  <3D2D483E.4040100@nentec.de>
References:  <20020709212404.16403.qmail@operamail.com>

Hi,

I will try to answer both emails at once ;-) (wow! multitasking batch mode!)

I think a good clarification is in order. ;-)

My idea of a "perfect cluster" is one that applications don't realize
that they are on one.  That in addition to achieving the five 9's of
reliablity (99.999% uptime).

In my working experience I saw the clustering system at Fermi-Lab in
the early 90's and formed my own opinions about it.  I have worked with
clusters on Dynix/PTX (Sequent) and have even contributed some
enhancements back since 1995.  In the mid 90's I founded a company to
build load-balancing applications; the rights were later sold to a
company in Boston that makes 1U servers.  Now I am working for a German
company building a very high speed network control switch (one that can
make complex routing decisions about network traffic), which could be
used to front-end the kind of cluster I am proposing--though a software
solution will work equally well ;-)

While working on Sequent clusters I became familiar with the Numa-Q
product and its inner workings.  I came to the conclusion that it only
addressed the SMP bottleneck (Amdahl's law) but really didn't add much
in the way of reliability.

So the idea for Phase 2 was to make a 'Numa'-like system--which in
effect it is--that removes the OS on a node as a single point of
failure.  In their Numa architecture there was one instance of the OS
across many physical nodes, using special hardware so that any page of
memory across the complex making up a node could be addressed.  The
problem is that one member of the complex could bring down the entire
system.

What I wanted to do was start where they were, but with a separate O/S
image on each node and a cooperative process space--yes, like Mosix,
but totally transparent.  When a system becomes too busy, rather than
swapping a process out to disk, it can be swapped to another node--sort
of like SMP (Symmetrical Multi-Processing) across a network.  If a node
dies, only the processes that had memory there die; the OS
(cooperatively speaking) just goes on running, and rather than waiting
for a reboot, the dead processes simply get restarted.  With such a
system the five 9's should be very easy to reach.
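
To make the swap-versus-migrate decision a little more concrete, here
is a toy sketch in C.  None of the names (pick_target, LOAD_HIGH_WATER,
the peers table) exist anywhere yet--they are invented purely to
illustrate the idea and are not taken from any actual code:

    /*
     * Toy sketch only: every name here (struct node, pick_target,
     * LOAD_HIGH_WATER) is invented for illustration and exists in no
     * actual code.
     */
    #include <stddef.h>
    #include <stdio.h>

    #define LOAD_HIGH_WATER 8       /* arbitrary run-queue threshold */

    struct node {
        const char *name;
        int         load;           /* current run-queue length */
    };

    static struct node peers[] = {
        { "node1", 3 }, { "node2", 11 }, { "node3", 1 },
    };

    /* Pick the least-loaded peer, or NULL if everybody is busy too. */
    static struct node *
    pick_target(void)
    {
        struct node *best = NULL;
        size_t i;

        for (i = 0; i < sizeof(peers) / sizeof(peers[0]); i++)
            if (peers[i].load < LOAD_HIGH_WATER &&
                (best == NULL || peers[i].load < best->load))
                best = &peers[i];
        return best;
    }

    int
    main(void)
    {
        int local_load = 12;        /* pretend this node is overloaded */
        struct node *n;

        if (local_load > LOAD_HIGH_WATER) {
            n = pick_target();
            if (n != NULL)
                printf("migrate a process to %s instead of swapping\n",
                    n->name);
            else
                printf("no idle peer, fall back to swapping to disk\n");
        }
        return 0;
    }

The real thing would of course live in the kernel and cooperate with
the VM system and the scheduler, but the decision itself is about that
simple.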

That being said, there are a lot of challenges--especially with respect
to the system scheduler and the VM system--that have to be addressed.
I have a rough concept that I have been going over for the last 4 years
and have never had a chance to commit to a document.  I suppose it is
probably about time to do so.  I even came up with a way that network
applications can survive a node move as well, though it requires a
special protocol and a front-end device to achieve this.  Because of
that front-end device and its potential single points of failure, we
have phase-1 of the clustering software; ultimately, though, phase-2
should completely replace phase-1 for everything else.

Speaking of phase-1, the goal there is simple generic failover of
applications.  One small feature that didn't cost much in the
implementation is a weight attached to each application, so that
applications can be started on nodes in a more intelligent manner with
respect to the resources on each machine.  For the moment the weights
are static (i.e. the weights of the already running applications are
summed to find out whether enough resources are present to start a new
one).  Instead of looking at the configured maximum weight, the actual
application usage (by merit of the CSE patch) could be collected
instead.  From this, jobs can be shut down and restarted on other nodes
when the statistics on a node change.
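
As a toy example of the static weight check (again, every name and
number below is invented for illustration and not taken from the actual
configuration format):

    /*
     * Toy sketch only: the names and numbers are invented and do not
     * reflect the actual configuration format.
     */
    #include <stdio.h>

    struct app {
        const char *name;
        int         weight;        /* configured maximum weight */
        int         running;       /* 1 if already started on this node */
    };

    /* Sum of the weights of the applications already running here. */
    static int
    used_weight(const struct app *apps, int n)
    {
        int i, sum = 0;

        for (i = 0; i < n; i++)
            if (apps[i].running)
                sum += apps[i].weight;
        return sum;
    }

    int
    main(void)
    {
        struct app apps[] = {
            { "smtpd", 30, 1 }, { "imapd", 50, 1 }, { "httpd", 40, 0 },
        };
        int capacity = 100;        /* total weight this node may carry */
        int want = 2;              /* we would like to start "httpd" */

        if (used_weight(apps, 3) + apps[want].weight <= capacity)
            printf("enough headroom, start %s here\n", apps[want].name);
        else
            printf("%s does not fit, try another node\n", apps[want].name);
        return 0;
    }

With the CSE statistics the "weight" field would simply be replaced by
the measured usage of each running application, and the same comparison
would be re-run whenever the numbers change.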

Bye!



Andy

aaron g wrote:

>From my limited knowledge of the project I believe this is
>not out of the question. In fact it may have been Andy
>himself who made reference to the VMS-like clustering
>technology. I'm not really in a place to give a definitive
>answer but I too am interested in a solution for the
>situation you describe.
>
>- aarong
>
>----- Original Message -----
>From: Clifton Royston <cliftonr@lava.net>
>Date: Tue, 9 Jul 2002 09:51:32 -1000
>To: Andy Sporner <sporner@nentec.de>
>Subject: Re: SPREAD clusters
>
>>  Is there a document explaining the scope of the project, what kinds
>>of problems it's intended to address, and the overall outline or
>>roadmap?  I'm having a hard time getting that from the URL you posted. 
>>(I'm also new to this list, obviously.)
>>
>>  Is the current project aimed at application failover and load-balancing
>>for specific applications, i.e. providing the software equivalent of a
>>"layer 4" or load balancing Ethernet switch?  
>>
>>  Or does it generically instantiate all network applications into the
>>same failover and load-balancing environment?
>>
>>  Or is it more like Mosix, in which servers join a kind of "hive mind"
>>where any processor can vfork() a process onto a different server with
>>more RAM/CPU available, but processes have to remain on the original
>>machine to do device I/O?
>>
>>  Or is it like Digital (R.I.P.s) Vax VMS or "TrueUNIX" clustering,
>>where for most purposes the clustered servers behaved like a single
>>machine, with shared storage, unified access to file systems and
>>devices, etc.?
>>
>>  My main practical interest is in the nitty-gritty of building
>>practical highly reliable and highly scalable mail server clusters,
>>both for mail delivery (SMTP,LMTP) and mail retrieval (POP, IMAP.) The
>>main challenge in doing this right now is dealing with the need for all
>>servers to have a coherent common view of the file systems where mail
>>is stored.  This means the cluster solution needs to include shared
>>storage, either via NFS or via some better mechanism which provides
>>reliable sharing of file systems between multiple servers and allows
>>for server failure without interruption of data access.
>>
>>  Is this kind of question outside the scope of the current project?
>>
>>  -- Clifton
>>
>>-- 
>>    Clifton Royston  --  LavaNet Systems Architect --  cliftonr@lava.net
>>"What do we need to make our world come alive?  
>>   What does it take to make us sing?
>> While we're waiting for the next one to arrive..." - Sisters of Mercy
>>