From owner-freebsd-arch Thu Jun 27 22:01:58 2002
From: "Gary Thorpe" <gat7634@hotmail.com>
To: nerd@xyz.com
Cc: arch@freebsd.org
Subject: Re: Larry McVoy's slides on cache coherent clusters
Date: Fri, 28 Jun 2002 01:01:50 -0400

>From: nerd@xyz.com
>To: "Gary Thorpe"
>CC: arch@FreeBSD.ORG
>Subject: Re: Larry McVoy's slides on cache coherent clusters
>Date: Thu, 27 Jun 2002 17:35:19 -0700
>
>So you know where I'm coming from, I used to be an engineer in the
>base OS group (I owned the disk driver) at Sequent, the company with
>the best NUMA product out there even if we went the way of Beta VCRs.
>
>[...]
>
>We (Sequent) were the first and best implementation out there with our
>NUMA-Q line... SGI & Sun both rely on huge memory backbones rather
>than finesse in software to achieve performance and they still fall
>short. DG tried too but I've heard nothing of them of late, sort of
>like the US vice presidents (quick, name the last 4).
>
>NUMA buys you no redundancy in the real sense of the word, that is,
>the hardware architecture is more complex and thus more likely to
>fail. Of course since you have a number of quads (or whatever an
>implementation may choose for the basic unit) once you've had a
>hardware fault you can easily remove a single quad and reboot.

This is by design. How about this scenario: if node A fails, its
CPUs/memory/resources are marked as unusable, any active tasks it was
running that can be restarted from checkpointed data are migrated to
other nodes and restarted, and the system continues.

>Unfortunately your uptime requirements have gone to hell the second a
>reboot is needed. As far as scaling goes, you are right, code with
>minimal SMP awareness (Oracle) running on a top notch OS will scale
>incredibly well.
>
>[...]
>
>*TO ME* clustering and single memory image are contradictory. You
>cluster for redundancy, that is to get rid of any and all single
>points of failure. If the janitor trips over a power cord thus taking
>a big bite out of your memory space you'll quickly realize that this
>is not redundancy.

From a hardware view, not really. Clusters typically use Ethernet or
some other network technology for inter-node communication. NUMA
machines use a custom, high-speed switching fabric (can this be
considered a network? maybe...) to connect node memories into a
hierarchy. The major difference is performance, but *in theory* (and
in my mind at least) they can be treated similarly. Nodes have to work
somewhat independently in a NUMA machine anyway, just like nodes in a
traditional cluster. NUMA just makes a high-speed cluster look like a
single machine (at least for SGI's machines, from what I can tell).

>At Sequent we found that the #1 key to scalability in a NUMA world was
>to NEVER move memory from one quad to the next. This means that
>programs should try to migrate between procs on the same quad if
>possible, only move off quad as a last resort.
>Memory allocation has
>to be very aware of the fact that it is running on a collection of SMP
>boxen with high costs to go from proc-to-proc and prohibitive costs to
>go from quad-to-quad. Of course it follows that I/O must never be
>allowed to move over the memory backplane if possible. We had quad
>aware routing at all layers of the I/O stack to achieve this.

This is analogous to how nodes in a cluster work! Same issues, except
I/O can never be migrated in a cluster because of awful performance.

>Of course YMMV. Last I looked neither Sun nor SGI had figured out how
>to squeeze the performance and scalability that we had. IBM who
>bought, chewed up, and then threw Sequent away didn't seem to have the
>corporate acuity to realize that there were lessons to be learned from
>small companies. Oh well, I'm bitter, sue me, no, forget that, IBM
>probably will.
>
>In another email on the same thread, Matt Dillon wrote:
>
> >NUMA then becomes just another, faster transport mechanism. That is
> >the direction I believe the BSDs will take... transparent clustering
> >with NUMA transport, network transport, or a hybrid of both.
>
>Matt: If you don't have a single memory image you don't have NUMA.
>If you do have it then the transport mechanism will be saturated just
>moving "RAM" around and will not be available for network, I/O or
>whatever else.

Both cases come down to nodes with local memory doing expensive
inter-node communication. In a cluster, not taking this into account
leads to network congestion and poor performance (but the hierarchy is
explicit to the OS and applications). In a NUMA machine, not taking it
into account leads to switching-fabric congestion and poor performance
(but the hierarchy is hidden from applications, though still explicit
to the OS). Note: it is possible to make a cluster transparent to
applications (MOSIX does this in part). If this is done, what makes it
hugely different from a NUMA 'machine' in terms of how they work,
besides speed?
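The node-failure scenario I described earlier (mark the dead node's
resources unusable, restart checkpointed tasks elsewhere, carry on) can
be sketched in a few lines. This is a toy Python model, not any real
cluster manager; the `Task`/`Cluster` names and the integer checkpoint
are invented purely for illustration:

```python
# Toy model of checkpoint/restart failover: when a node dies, its
# restartable tasks are migrated to surviving nodes and resumed from
# their last checkpoint; anything without a checkpoint is lost.
# All names here are made up for illustration.

class Task:
    def __init__(self, name, restartable=True):
        self.name = name
        self.restartable = restartable
        self.checkpoint = 0          # last saved unit of progress

class Cluster:
    def __init__(self, nodes):
        self.nodes = {n: [] for n in nodes}   # node -> tasks running there

    def submit(self, node, task):
        self.nodes[node].append(task)

    def fail_node(self, dead):
        # Mark the node's resources unusable, migrate what we can,
        # and report what could not be saved.
        orphans = self.nodes.pop(dead)
        survivors = list(self.nodes)
        lost = []
        for i, task in enumerate(orphans):
            if task.restartable and survivors:
                # Restart from checkpointed state on another node.
                target = survivors[i % len(survivors)]
                self.nodes[target].append(task)
            else:
                lost.append(task.name)
        return lost

cluster = Cluster(["A", "B", "C"])
cluster.submit("A", Task("db"))
cluster.submit("A", Task("scratch-job", restartable=False))
lost = cluster.fail_node("A")        # node A dies; system continues
```

The checkpointed "db" task survives on another node while the
non-checkpointed job is lost, which is exactly the uptime trade-off
michael is pointing at: availability degrades, but there is no reboot.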
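Sequent's placement rule quoted above (same proc, then same quad, then
off-quad only as a last resort) amounts to a cost comparison at
scheduling time. Here is a toy Python sketch; the cost numbers are
invented and real latency ratios vary by machine, but the ordering is
what the rule depends on:

```python
# Toy cost model for the quad-aware scheduling rule: local access is
# cheap, proc-to-proc on the same quad is costly, and quad-to-quad
# (over the memory backplane) is prohibitive. Numbers are invented.
LOCAL, SAME_QUAD, OFF_QUAD = 1, 10, 100

def access_cost(task_quad, task_cpu, mem_quad, mem_cpu):
    if task_quad != mem_quad:
        return OFF_QUAD              # crosses the memory backplane
    if task_cpu != mem_cpu:
        return SAME_QUAD             # different proc, same quad
    return LOCAL

def pick_cpu(mem_quad, mem_cpu, idle_cpus):
    # idle_cpus: list of (quad, cpu) pairs; choose the cheapest
    # placement relative to where the task's memory already lives.
    return min(idle_cpus,
               key=lambda qc: access_cost(qc[0], qc[1], mem_quad, mem_cpu))

# Memory lives on quad 0, proc 1. A same-quad proc beats an off-quad
# one even though both are idle; off-quad is taken only when nothing
# on the home quad is available.
best = pick_cpu(0, 1, [(1, 0), (0, 3)])      # same-quad proc wins
fallback = pick_cpu(0, 1, [(1, 0), (2, 2)])  # off-quad, last resort
```

The same comparison describes a cluster scheduler deciding whether to
run a job where its data already sits or to ship the data over the
network, which is the analogy I am drawing above.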
>
>-michael
>
>michael at michael dot galassi dot org