Date: Wed, 25 Jul 2001 11:43:52 +0200
From: "Christopher R. Bowman" <boch@pactcorp.com>
To: Leo Bicknell <bicknell@ufp.org>
Subject: Re: MPP and new processor designs.
Message-ID: <3B5E94D8.91032125@pactcorp.com>
References: <01072419441206.00416@antiproton.ChrisBowman.com>
"Leo Bicknell <bicknell@ufp.org>" wrote:
>
> A number of new chips have been released lately, along with some
> enhancements to existing processors, that all fall into the same
> logic of parallelizing some operations. Why, just today I ran
> across an article at http://www.theregister.co.uk/content/3/20576.html,
> which boasts 128 ALUs on a single chip.
>
> This got me to thinking about an interesting way of using these
> chips. Rather than letting the hardware parallelize instructions
> from a single stream, what about feeding it multiple streams of
> instructions? That is, treat it like multiple CPUs running two
> (or more) processes at once.
>
> I'm sure the hardware isn't quite designed for this at the moment,
> and so it couldn't "just be done", but if you had say 128 ALUs,
> most single-user systems could dedicate one ALU to a process
> and never context switch, in the traditional sense. For systems
> that run lots of processes, the rate limiting on a single process
> wouldn't be a big issue, and you could gain lots of efficiencies
> in the global aspect by not context-switching in the traditional
> sense.
>
> Does anyone know of something like this being tried? Traditional
> 2-8 way SMP systems probably don't have enough processors (I'm
> thinking 64 is a minimum to make this interesting) and require
> other glue to make multiple independent processors work together.
> Has anyone tried this with them all in one package, all clocked
> together, etc?
>
> --
> Leo Bicknell - bicknell@ufp.org
> Systems Engineer - Internetworking Engineer - CCIE 3440
> Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org

As I work for the above-mentioned processor company, I thought I might jump in here rather quickly and dispel any notion that you will be running any type of Linux or Unix on these processors any time soon.
This chip is a reconfigurable dataflow architecture with support for control flow, so you really need to think about it in the dataflow paradigm. In addition, you have to examine what the reporter said. While it is true that there are 128 ALUs on the chip, and that it can perform in the neighborhood of 15 billion operations per second, these are only ALUs; they are not full processors. They don't run a program as you would on a typical von Neumann processor. The ALUs don't even have a program counter (not to mention MMUs). Instead, to program one of these chips you tell each ALU what function to perform and tell the ALUs how their input and output ports are connected. Then you sit back and watch as the data streams through in a pipelined fashion. Because all the ALUs operate in parallel, you can get some spectacular operations-per-second counts even at low frequencies. Think of it: even at only 100 MHz, 100 ALUs operating in parallel give you 10 billion operations per second.

So now let me move on to the real thrust of your argument, which is valid, and point you at the kind of hardware you are really talking about. IBM is coming out with a POWER4 that is just ungodly huge (that is a technical term in the trade): 2 processors per chip, four chips all together on a module, 680 MILLION transistors in 20 sq. in., and, if I remember correctly, something like 5000 pins, with lots of interprocessor bandwidth. I don't know how they are going to get decent yield on these things, and you certainly aren't going to find them in your local CompUSA, but there you go. Anyway, the concept of not moving a process off a processor is in general called processor affinity, and it has other benefits aside from the reduction of context switches.
Even if it took you zero time to move all the processor registers of a process from one processor to another, you would still see reduced performance: you would have to flush the user-space part of the TLB (FreeBSD maps the kernel into each process address space, and that part doesn't need to be flushed), and on virtually addressed caches you would have to flush the cache as well (even on physically addressed caches, where you don't have to flush, you still don't have the right data loaded into the new processor's cache), so you won't see the same throughput.

So in general affinity is good, but there are other problems. Suppose a process has finished its quantum and the only other runnable process hasn't been running on the now-free processor: do you break the affinity, or do you hope that the process currently running on your preferred processor will sleep soon, so that it is better for you to idle the other processor and wait for your preferred one? You will also have to worry about migrating processes across the processors to load balance, in case you end up with a few long-lived processes all sharing one processor while the other processors have only one process apiece. Terry can probably point you to the right place to read up about all this, but I think a company called Sequent, running lots of 286s in parallel, had some good technical success with this kind of thing.

Finally, I do think that perhaps we have hit the point of diminishing returns with the current complexity of processors. Part of the Hennessy/Patterson approach to architecture that led to RISC was not reduction of instruction sets because that is good as a goal in its own right, but rather a reduction of complexity as an engineering design goal, since this leads to faster product design cycles, which allows you to more aggressively target and take advantage of improving process technology.
I think the time may come when we want to dump huge caches and multiway superscalar processing, since they take up lots of die space and pay diminishing returns. Perhaps in the future we would be better off with 20 or 50 simple first-generation MIPS-type cores on a chip. In a large multi-user system with a high availability of jobs, you might be able to leverage the design of the single core into truly high aggregate performance. It would, of course, do nothing for the single-user workstation where you are only surfing or word processing, but in a large commercial setting with lots of independent jobs you might see better utilization of all that silicon by running more processes slower.

---------
Christopher R. Bowman
crb@ChrisBowman.com

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message