Date:      Tue, 25 Apr 95 16:22:40 MDT
From:      terry@cs.weber.edu (Terry Lambert)
To:        bakul@netcom.com (Bakul Shah)
Cc:        mycroft@ai.mit.edu, hackers@FreeBSD.org
Subject:   Re: benchmark hell..
Message-ID:  <9504252222.AA02359@cs.weber.edu>
In-Reply-To: <199504251951.MAA27349@netcom9.netcom.com> from "Bakul Shah" at Apr 25, 95 12:50:59 pm

> > The idea here is that by default, processes are assumed to not use the
> > FPU, and thus by default the FPU trap results in code fixup rather than
> > FPU operations.  Thus it does not matter which FPU operation is used
> > first.
> 
> Isn't it easier to just assume that everyone may
> _eventually_ use FP and simply initialize the FP state on
> exec?  If they never use FP there is no loss.  One time
> initialization hit is not worth worrying about.  There are
> no FPU operations involved (you just copy the initial state
> from somewhere and set the `stale' bit).

No.  The difference is whether you have to save/restore state on
context switch or not.  If you know the process doesn't use the FPU,
you can leave it alone, hoping that the process that does use it will
be the next FPU using process to run.

This is typically the case, and it takes the FPU overhead out of each
of the processes not actually using the FPU.

Putting a stale state in there means you have to switch because you expect
that the process will use the FPU.

The difference is the difference between a startup initialization that
may not be necessary plus a context switch for every process that runs
vs a runtime fixup one time (that is no more expensive than the startup
hit) and which is only paid by processes actually using the FPU.

Here's the lowdown:

o	The FPU has a "last used by" state variable
o	When the first FPU using process is run, this variable is
	set to its PID.
o	When a context switch occurs, the FPU state is left unchanged
	if the process being switched to is a non FPU using process,
	or if the process being switched to is the one that "last used"
	the FPU.
o	A process has a flag that tells if it has ever used the FPU.
	If it has, the flag is set, if it hasn't, the flag is not set.
o	The flag is set by the linker based on FPU instructions being
	used (flagged by the object files or by linking the math lib
	or whatever).  The flag would be set on the process at exec time.
o	Alternately, the flag is only set on the first FPU use attempt.
	You cheat to get this information by making an FPU access in
	a process without the flag set trap to an "emulator" that
	isn't an emulator.  This means the trap vector needs switching
	between FPU using and non FPU using programs.
o	It's probably a tossup whether it is more expensive to incorrectly
	predict an FPU instruction containing program will execute one
	of the FPU instructions in an average code path, or whether the
	extra switching from presetting the "uses the FPU" flag on the
	process results in more overhead.
o	When a switch is done to an FPU using program that is not the
	last FPU using program, the last program's state is saved in
	the per process struct for that program; also the saved FPU
	state of the program being switched to is restored.
o	The cost is two additional compares on context switch for FPU
	using programs and one additional compare instead of an FPU
	state save and restore for each non-FPU using program.
o	If you use the preflagging method at link time (my preference),
	then there is no extra overhead of altering the trap vector
	on transition between FPU using and non-FPU using programs;
	otherwise, there is, but the overhead does not occur in normal
	usage (which is typically non-FPU using program to non-FPU
	using program context switching).
o	The benefit in defeating benchmarks (make no bones about it,
	that's what's going on) is that a single process benchmark that
	uses FPU instructions results in no FPU context switches because
	there is a single FPU using program.
o	The overall benefit in benchmarks not using the FPU and sensitive
	to the context switch overhead is that you pay a single compare
	instead of the overhead of a per process FPU context save and
	restore.
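
The list above can be sketched as a small user-space simulation.  This is
only an illustration of the preflagging variant, not code from any kernel:
struct proc, fpu_hw, fpu_last, fpu_context_switch() and run_schedule() are
all hypothetical names, and the FPU save and restore are stood in for by
memcpy() on a made-up register block.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the hardware FPU register contents. */
struct fpu_state { unsigned char regs[108]; };

/* Hypothetical per process struct; only the fields the list mentions. */
struct proc {
    int pid;
    int uses_fpu;               /* "uses the FPU" flag, set at exec time */
    struct fpu_state fpu_save;  /* per process FPU save area */
};

static struct fpu_state fpu_hw;       /* simulated FPU contents */
static struct proc *fpu_last = NULL;  /* the "last used by" variable */
static int fpu_loads = 0;             /* times state was loaded into the FPU */

/* Called on every context switch with the process about to run. */
static void fpu_context_switch(struct proc *next)
{
    assert(next != NULL);
    if (!next->uses_fpu)    /* compare 1: non FPU process, leave FPU alone */
        return;
    if (fpu_last == next)   /* compare 2: FPU still holds this one's state */
        return;
    if (fpu_last != NULL)   /* save the previous owner's state */
        memcpy(&fpu_last->fpu_save, &fpu_hw, sizeof fpu_hw);
    memcpy(&fpu_hw, &next->fpu_save, sizeof fpu_hw);  /* restore ours */
    fpu_last = next;
    fpu_loads++;
}

/* Run a sequence of context switches and report how many FPU loads it cost. */
static int run_schedule(struct proc **order, size_t n)
{
    size_t i;

    fpu_last = NULL;
    fpu_loads = 0;
    for (i = 0; i < n; i++)
        fpu_context_switch(order[i]);
    return fpu_loads;
}
```

With one FPU using process interleaved with any number of non FPU using
processes, run_schedule() reports a single load no matter how many switches
occur, which is the single process benchmark case described above.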

If you're worried about benchmarks, you should do this sort of thing.  For
non FPU using installations, the overall benefit will be there, but for
average FPU using installations, the benefit will be reduced by a single
additional compare.
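
The alternative from the list, where the flag is set only on the first FPU
use attempt, might look roughly like the following simulation.  Everything
here is an assumption made for illustration: struct task, switch_to() and
fpu_instruction() are hypothetical, and the switched trap vector is modelled
as a single enable flag rather than a real vector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process record; the flag starts clear at exec. */
struct task {
    int pid;
    int uses_fpu;   /* set by the first FPU access, never cleared */
};

static struct task *curtask;    /* process currently running */
static int fpu_trap_enabled;    /* stands in for the switched trap vector */
static int fixups;              /* how many one time fixups have run */
static int fpu_ops;             /* FPU instructions executed in total */

/* Context switch: arm the trap only for processes not yet flagged. */
static void switch_to(struct task *next)
{
    assert(next != NULL);
    curtask = next;
    fpu_trap_enabled = !next->uses_fpu;
}

/* Simulated FPU instruction issued by the current process. */
static void fpu_instruction(void)
{
    assert(curtask != NULL);
    if (fpu_trap_enabled) {
        /* The "emulator" that isn't an emulator: flag the process once. */
        curtask->uses_fpu = 1;
        fpu_trap_enabled = 0;
        fixups++;
    }
    fpu_ops++;  /* the instruction itself then runs on the FPU */
}
```

The fixup runs once per process that actually touches the FPU; a process
that never does so is never flagged and never pays for it.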




> This can even be considered a `cleaner" solution!  When
> someone else is using the FPU, _you_ don't have it!  So your
> choices are to either take it away from this someone else or
> emulate!!  Emulation may even be cheaper than save/restore
> of FPU state for some simple FP operation (once or twice)!
> Though, probably not worth microoptimizing like this.

I'd agree; I'd also say that this type of splitting, while possible in
Sequent hardware, is a *bad* precedent for SMP, which largely assumes
processor homogeneity.  I'd say that SMP is a goal and that ASMP is
not, and therefore unequal processor resources in a multiprocessor box
are a case that probably should not be handled by default.


					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.


