Date:      Mon, 25 Aug 2003 13:35:09 -0400 (EDT)
From:      Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
To:        "Daniel C. Sobral" <dcs@tcoip.com.br>
Cc:        current@freebsd.org
Subject:   Re: HTT on current
Message-ID:  <200308251735.h7PHZ9bd094222@khavrinen.lcs.mit.edu>
In-Reply-To: <3F4A43EA.9090500@tcoip.com.br>
References:  <JCEIKJMCANNPGKFKGLKLOENEDJAA.mikej@trigger.net> <3F4A1CE2.6080806@freebsd.org> <20030825164907.GA17503@dragon.nuxi.com> <3F4A43EA.9090500@tcoip.com.br>

<<On Mon, 25 Aug 2003 14:14:18 -0300, "Daniel C. Sobral" <dcs@tcoip.com.br> said:

> There are two problems with HTT. First, L1/L2 cache issues. Second, the 
> virtual CPUs are not independent, and there are many cases where 
> instructions in one virtual CPU stall the other. So take, for example, 
> the case of a userland application on CPU0 stalling the kernel on CPU1.

I don't think that this is quite stated right.  The problem is that
the P4 is not very wide to begin with, and it's very hard to optimize
well for that 23-stage pipeline.[1]  So if you have a thread with lots
of latent ILP (either because you did a good job optimizing it for a
four-way superscalar, or because you did a bad job scheduling it and
are depending on the processor to make up for the naive optimization),
it is bound to run more slowly when some of the functional units it
could have used are taken by another thread of execution.  But some
sorts of applications can benefit, if the application can be
decomposed into threads that exercise different FUs (for example, one
thread that is memory intensive and one thread that is compute
intensive).  The challenge then is to make sure that they always get
scheduled on the same physical processor at the same time.

The key to getting good performance on an SMT architecture with an
arbitrary instruction mix is more functional units.  The never-built
Alpha EV8, which was to be an eight-way superscalar with four-way SMT
and a wide memory bus, would have been a much easier target for
achieving optimum performance.
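
To make the functional-unit argument concrete, here is a rough toy
model in Python.  It is only a sketch: the unit classes ("alu",
"mem"), the per-cycle numbers, and the function name are all invented
for illustration and do not describe the P4, the Athlon, or the EV8.
It assumes that co-scheduled threads are slowed uniformly until their
summed per-cycle demand fits the most oversubscribed class of unit,
and it ignores cache effects and pipeline depth entirely.

```python
def combined_throughput(threads, capacity):
    """Throughput of co-running threads, relative to one thread alone.

    threads:  list of dicts mapping a unit class to the ops per cycle
              the thread would issue if it had the core to itself
    capacity: dict mapping a unit class to the ops per cycle the core
              can issue on that class
    Assumes each thread's solo demand already fits within capacity.
    """
    # The most oversubscribed unit class sets the pace for every thread.
    slowdown = 1.0
    for unit, limit in capacity.items():
        demand = sum(t.get(unit, 0) for t in threads)
        slowdown = max(slowdown, demand / limit)
    # Each thread then runs at 1/slowdown of its solo speed.
    return len(threads) / slowdown


# Hypothetical cores and workloads (made-up numbers).
NARROW = {"alu": 2, "mem": 1}
WIDE = {"alu": 4, "mem": 2}
COMPUTE = {"alu": 2, "mem": 0}    # ILP-heavy, fills the narrow core's ALUs
MEMORY = {"alu": 0.5, "mem": 1}   # mostly loads and stores

if __name__ == "__main__":
    print("narrow, compute + memory :", combined_throughput([COMPUTE, MEMORY], NARROW))
    print("narrow, compute + compute:", combined_throughput([COMPUTE, COMPUTE], NARROW))
    print("wide,   compute + compute:", combined_throughput([COMPUTE, COMPUTE], WIDE))
```

Under these assumptions the complementary pair gets 1.6 times the
throughput of a single thread on the narrow core, two compute-bound
threads get 1.0 (no gain over running them one after the other), and
the same two threads get the full 2.0 on the wider core.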

-GAWollman

[1] That's why the Athlon gets more instructions per cycle: it has a
much shallower pipeline and more functional units, so it can execute
naively-optimized, ILP-heavy code much faster without stalling.


