Date:      Wed, 19 Mar 1997 08:40:17 -0700
From:      "Russell L. Carter" <rcarter@consys.com>
To:        Mike Pritchard <mpp@freefall.freebsd.org>
Cc:        jkh@time.cdrom.com (Jordan K. Hubbard), hackers@freebsd.org
Subject:   Re: dup3() - I've thought it over and decided... 
Message-ID:  <199703191540.IAA26553@conceptual.com>
In-Reply-To: Your message of "Wed, 19 Mar 1997 05:57:58 PST." <199703191357.FAA22301@freefall.freebsd.org> 


> 
> As for Cray's implementation, yes, it allows you to create a complete
> snapshot of the process, process group, or session.  At this point you
> could either kill the proc/pgrp/session for later restart, or allow 
> it to keep running and only use the snapshot in case of a system crash.
> I was involved in some work on this that allowed you to checkpoint the 
> process on one machine and then restart it on another for load leveling 
> purposes.
> 
> It was used mainly for checkpoint/restart of long running batch
> jobs submitted via NQS, but it was usable with interactive jobs
> to a degree.  There was on-going work for better interactive
> support when I left Cray (see below).

There are some other interesting things you can do with this if you have it.
Fault-tolerant ORBs, for instance.  If you've got a mission-critical,
long-running app that is simple enough, you can periodically checkpoint to
reliable storage and restart on another compatible system with a minimum of
fuss should your first platform develop any of a myriad of problems.  Deep
Pockets that have things that sustain damage are funding stuff like this
right now :-)
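
To make the "periodically checkpoint and restart" idea concrete, here is a
minimal sketch of the application-level variant.  It is not the kernel-level
process snapshot that Cray provides: it assumes the app is simple enough to
save its own state as JSON.  The names save_checkpoint, load_checkpoint and
run are hypothetical, and stop_after exists only to simulate a crash.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically: temp file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to storage before renaming
        os.replace(tmp, path)     # readers see the old or the new file, never half
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every=100, stop_after=None):
    """Toy long-running job: sum 0..total_steps-1, resuming from path.

    stop_after (hypothetical, for illustration) simulates a crash after
    that many steps in the current run; the function then returns None.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated crash: work since last checkpoint is lost
        state["acc"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["acc"]
```

If the checkpoint file lives on storage that a second compatible machine can
reach, calling run() there with the same path picks up from the last saved
step.  The work done since that checkpoint is simply redone.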

I've spent part of the last month looking somewhat superficially into the
issues.  For SGIs there's something called Hibernator that sorta works.
Cray does appear to be the current state of the art.

Couple checkpointing/process migration with a queuing system like Codine that
understands distributed environments like ORBs, PVM, MPI, etc.,
and you have the potential for a pretty fault tolerant, distributed computing 
resource based mainly on off-the-shelf hardware.

For long-running apps, that is; ISPs are a different problem.

-- 
Russell L. Carter

Voice:(520) 636-2600 FAX:(520) 636-2888          rcarter@consys.com
Conceptual Systems & Software,  P.O. Box 1129 Chino Valley AZ 86323
"Before sitting down, always look for ferrets."




