From owner-freebsd-arch Thu Jun 27 22:01:58 2002
From: "Gary Thorpe" <gat7634@hotmail.com>
To: nerd@xyz.com
Cc: arch@freebsd.org
Subject: Re: Larry McVoy's slides on cache coherent clusters
Date: Fri, 28 Jun 2002 01:01:50 -0400

>From: nerd@xyz.com
>To: "Gary Thorpe"
>CC: arch@FreeBSD.ORG
>Subject: Re: Larry McVoy's slides on cache coherent clusters
>Date: Thu, 27 Jun 2002 17:35:19 -0700
>
>So you know where I'm coming from, I used to be an engineer in the
>base OS group (I owned the disk driver) at Sequent, the company with
>the best NUMA product out there even if we went the way of Beta VCRs.
>
>[...]
>
>We (Sequent) were the first and best implementation out there with our
>NUMA-Q line... SGI & Sun both rely on huge memory backbones rather
>than finesse in software to achieve performance and they still fall
>short. DG tried too but I've heard nothing of them of late, sort of
>like the US vice presidents (quick, name the last 4).
>
>NUMA buys you no redundancy in the real sense of the word, that is,
>the hardware architecture is more complex and thus more likely to
>fail. Of course since you have a number of quads (or whatever an
>implementation may choose for the basic unit) once you've had a
>hardware fault you can easily remove a single quad and reboot.

This is by design. How about this scenario: if node A fails, its
CPUs/memory/resources are marked as unusable, any active tasks it was
running that can be restarted from checkpointed data are migrated to
other nodes and restarted, and the system continues.

>Unfortunately your uptime requirements have gone to hell the second a
>reboot is needed. As far as scaling goes, you are right, code with
>minimal SMP awareness (Oracle) running on a top notch OS will scale
>incredibly well.
>
>[...]
>
>*TO ME* clustering and single memory image are contradictory. You
>cluster for redundancy, that is to get rid of any and all single
>points of failure. If the janitor trips over a power cord thus taking
>a big bite out of your memory space you'll quickly realize that this
>is not redundancy.

From a hardware view, not really. Clusters typically use Ethernet or
some other network technology for inter-node communication. NUMA
machines use a custom, high-speed switching fabric (can this be
considered a network? maybe...) to connect node memories into a
hierarchy. The major difference is performance, but *in theory* (and
in my mind at least) they can be treated similarly. Nodes have to work
somewhat independently in a NUMA machine anyway, just like nodes in a
traditional cluster. NUMA just makes a high-speed cluster look like a
single machine (at least for SGI's machines, from what I can tell).

>At Sequent we found that the #1 key to scalability in a NUMA world was
>to NEVER move memory from one quad to the next. This means that
>programs should try to migrate between procs on the same quad if
>possible, only move off quad as a last resort.
>Memory allocation has
>to be very aware of the fact that it is running on a collection of SMP
>boxen with high costs to go from proc-to-proc and prohibitive costs to
>go from quad-to-quad. Of course it follows that I/O must never be
>allowed to move over the memory backplane if possible. We had quad
>aware routing at all layers of the I/O stack to achieve this.

This is analogous to how nodes in a cluster work! Same issues, except
I/O can never be migrated in a cluster because of awful performance.

>Of course YMMV. Last I looked neither Sun nor SGI had figured out how
>to squeeze the performance and scalability that we had. IBM who
>bought, chewed up, and then threw Sequent away didn't seem to have the
>corporate acuity to realize that there were lessons to be learned from
>small companies. Oh well, I'm bitter, sue me, no, forget that, IBM
>probably will.
>
>In another email on the same thread, Matt Dillon wrote:
>
> >NUMA then becomes just another, faster transport mechanism. That is
> >the direction I believe the BSDs will take... transparent clustering
> >with NUMA transport, network transport, or a hybrid of both.
>
>Matt: If you don't have a single memory image you don't have NUMA.
>If you do have it then the transport mechanism will be saturated just
>moving "RAM" around and will not be available for network, I/O or
>whatever else.

Both cases come down to nodes with local memory doing expensive
inter-node communication. In a cluster, not taking this into account
leads to network congestion and poor performance (but the hierarchy is
explicit to the OS and applications). In a NUMA machine, not taking it
into account leads to switching-fabric congestion and poor performance
(but the hierarchy is hidden from applications, though still explicit
to the OS). Note: it is possible to make a cluster transparent to
applications (MOSIX does this in part). If this is done, what makes it
hugely different from a NUMA 'machine' in terms of how they work,
besides speed?
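The node-failure scenario I described earlier (mark the dead node's
resources unusable, restart checkpointed tasks elsewhere, carry on) can
be sketched in a few lines. This is a toy Python model, not any real
cluster manager; the `Task`/`Cluster` names and the integer checkpoint
are invented purely for illustration:

```python
# Toy model of checkpoint/restart failover: when a node dies, its
# restartable tasks are migrated to surviving nodes and resumed from
# their last checkpoint; anything without a checkpoint is lost.
# All names here are made up for illustration.

class Task:
    def __init__(self, name, restartable=True):
        self.name = name
        self.restartable = restartable
        self.checkpoint = 0          # last saved unit of progress

class Cluster:
    def __init__(self, nodes):
        self.nodes = {n: [] for n in nodes}   # node -> tasks running there

    def submit(self, node, task):
        self.nodes[node].append(task)

    def fail_node(self, dead):
        # Mark the node's resources unusable, migrate what we can,
        # and report what could not be saved.
        orphans = self.nodes.pop(dead)
        survivors = list(self.nodes)
        lost = []
        for i, task in enumerate(orphans):
            if task.restartable and survivors:
                # Restart from checkpointed state on another node.
                target = survivors[i % len(survivors)]
                self.nodes[target].append(task)
            else:
                lost.append(task.name)
        return lost

cluster = Cluster(["A", "B", "C"])
cluster.submit("A", Task("db"))
cluster.submit("A", Task("scratch-job", restartable=False))
lost = cluster.fail_node("A")        # node A dies; system continues
```

The checkpointed "db" task survives on another node while the
non-checkpointed job is lost, which is exactly the uptime trade-off
michael is pointing at: availability degrades, but there is no reboot.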
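Sequent's placement rule quoted above (same proc, then same quad, then
off-quad only as a last resort) amounts to a cost comparison at
scheduling time. Here is a toy Python sketch; the cost numbers are
invented and real latency ratios vary by machine, but the ordering is
what the rule depends on:

```python
# Toy cost model for the quad-aware scheduling rule: local access is
# cheap, proc-to-proc on the same quad is costly, and quad-to-quad
# (over the memory backplane) is prohibitive. Numbers are invented.
LOCAL, SAME_QUAD, OFF_QUAD = 1, 10, 100

def access_cost(task_quad, task_cpu, mem_quad, mem_cpu):
    if task_quad != mem_quad:
        return OFF_QUAD              # crosses the memory backplane
    if task_cpu != mem_cpu:
        return SAME_QUAD             # different proc, same quad
    return LOCAL

def pick_cpu(mem_quad, mem_cpu, idle_cpus):
    # idle_cpus: list of (quad, cpu) pairs; choose the cheapest
    # placement relative to where the task's memory already lives.
    return min(idle_cpus,
               key=lambda qc: access_cost(qc[0], qc[1], mem_quad, mem_cpu))

# Memory lives on quad 0, proc 1. A same-quad proc beats an off-quad
# one even though both are idle; off-quad is taken only when nothing
# on the home quad is available.
best = pick_cpu(0, 1, [(1, 0), (0, 3)])      # same-quad proc wins
fallback = pick_cpu(0, 1, [(1, 0), (2, 2)])  # off-quad, last resort
```

The same comparison describes a cluster scheduler deciding whether to
run a job where its data already sits or to ship the data over the
network, which is the analogy I am drawing above.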
>
>-michael
>
>michael at michael dot galassi dot org