Date:      Wed, 25 Jul 2001 11:43:52 +0200
From:      "Christopher R. Bowman" <boch@pactcorp.com>
To:        Leo Bicknell <bicknell@ufp.org>
Subject:   Re: MPP and new processor designs.
Message-ID:  <3B5E94D8.91032125@pactcorp.com>
References:  <01072419441206.00416@antiproton.ChrisBowman.com>

"Leo Bicknell <bicknell@ufp.org>" wrote:
>
> A number of new chips have been released lately, along with some
> enhancements to existing processors that all fall into the same
> logic of parallelizing some operations.  Why, just today I ran
> across an article about http://www.theregister.co.uk/content/3/20576.html,
> which boasts 128 ALUs on a single chip.
> 
> This got me to thinking about an interesting way of using these
> chips.  Rather than letting the hardware parallelize instructions
> from a single stream, what about feeding it multiple streams of
> instructions.  That is, treat it like multiple CPU's running two
> (or more) processes at once.
> 
> I'm sure the hardware isn't quite designed for this at the moment
> and so it couldn't "just be done", but if you had say 128 ALU's
> most single user systems could dedicate one ALU to a process
> and never context switch, in the traditional sense.   For systems
> that run lots of processors the rate limiting on a single process
> wouldn't be a big issue, and you could gain lots of efficiencies
> in the global aspect by not context-switching in the traditional
> sense.
> 
> Does anyone know of something like this being tried?  Traditional
> 2-8 way SMP systems probably don't have enough processors (I'm
> thinking 64 is a minimum to make this interesting) and require
> other glue to make multiple independent processors work together.
> Has anyone tried this with them all in one package, all clocked
> together, etc?
> 
> --
> Leo Bicknell - bicknell@ufp.org
> Systems Engineer - Internetworking Engineer - CCIE 3440
> Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-hackers" in the body of the message

As I work for the above-mentioned processor company, I thought I might
jump in here rather quickly and dispel any notion that you will be
running any type of Linux or Unix on these processors any time soon.

This chip is a reconfigurable dataflow architecture with support for
control flow.  You really need to think about this chip in the dataflow
paradigm.  You also have to examine what the reporter actually said.
While it is true that there are 128 ALUs on the chip and that it can
perform in the neighborhood of 15 billion operations per second, these
are only ALUs; they are not full processors.  They don't run a program
as you would on a typical von Neumann processor.  The ALUs don't even
have a program counter (to say nothing of MMUs).  Instead, to program
one of these chips you tell each ALU what function to perform and how
its input and output ports are connected.  Then you sit back and watch
as the data streams through in a pipelined fashion.  Because all of the
ALUs operate in parallel you can get some spectacular operations-per-second
counts even at low clock frequencies.  Think of it: even at only 100 MHz,
100 ALUs operating in parallel give you 10 billion operations per second.

So let me move on to the real thrust of your argument, which is valid,
and point you at the kind of hardware you are really talking about.  IBM
is coming out with the POWER4, which is just ungodly huge (that is a
technical term in the trade): two processors per chip, four chips
together on a module, 680 million transistors in 20 sq. in., and, if I
remember correctly, something like 5,000 pins, with lots of
interprocessor bandwidth.  I don't know how they are going to get decent
yield on these things, and you certainly aren't going to find them at
your local CompUSA, but there you go.

Anyway, the concept of not moving a process off a processor is generally
called processor affinity, and it has benefits beyond reducing
context-switch cost.  Even if moving all the processor registers of a
process from one processor to another took zero time, you would still
see reduced performance: you have to flush the user-space part of the
TLBs (FreeBSD maps the kernel into each process's address space, so that
part doesn't need to be flushed), and on virtually addressed caches you
have to flush the caches as well.  Even on physically addressed caches,
where you don't have to flush, the new processor's cache doesn't yet
hold the right data, so you won't see the same throughput.

So in general affinity is good, but it raises other problems.  Suppose a
process has finished its quantum and the only other runnable process
hasn't been running on the now-free processor: do you break the
affinity, or do you hope that the process currently running on your
preferred processor will sleep soon, making it better to idle the other
processor and wait for your preferred one?  You also have to worry about
migrating processes across processors to load balance, in case a few
long-lived processes end up sharing one processor while the other
processors have only one process apiece.  Terry can probably point you
to the right place to read up on all this, but I think a company called
Sequent, running lots of 286s in parallel, had some good technical
success with this kind of thing.

Finally, I do think we may have hit the point of diminishing returns
with the current complexity of processors.  Part of the
Hennessy/Patterson approach to architecture that led to RISC was not a
reduction of instruction sets because that is a good goal in its own
right, but rather a reduction of complexity as an engineering design
goal, since that leads to faster product design cycles, which lets you
more aggressively target and take advantage of improving process
technology.  The time may come when we want to dump huge caches and
multi-way superscalar processing, since they take up lots of die space
while paying diminishing returns.  Perhaps in the future we would be
better off with 20 or 50 simple first-generation MIPS-style cores on a
chip.  In a large multi-user system with a high availability of jobs,
you might be able to leverage the design of that single core into truly
high aggregate performance.  It would, of course, do nothing for the
single-user workstation where you are only surfing or word processing,
but in a large commercial setting with lots of independent jobs you
might see better utilization of all that silicon by running more
processes slower.

---------
Christopher R. Bowman
crb@ChrisBowman.com

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message



