Date:      Sat, 12 Jan 2002 01:28:03 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Bruce Evans <bde@zeta.org.au>
Cc:        Peter Wemm <peter@wemm.org>, Alfred Perlstein <bright@mu.org>, Kelly Yancey <kbyanc@posi.net>, Nate Williams <nate@yogotech.com>, Daniel Eischen <eischen@pcnet1.pcnet.com>, Dan Eischen <eischen@vigrid.com>, Archie Cobbs <archie@dellroad.org>, arch@FreeBSD.ORG
Subject:   Re: Request for review: getcontext, setcontext, etc
Message-ID:  <3C4001A3.5ECCAEB9@mindspring.com>
References:  <20020112152622.W4598-100000@gamplex.bde.org>

Bruce Evans wrote:
> (*) It may not be all that good.  It was good on old machines when 108
> bytes was a lot of memory and moving the state in and out of the FPU
> was slow too.  It is possible that the logic to avoid doing the switch
> takes longer than always doing it, but not all that likely because logic
> speed is increasing faster than memory speed and new machines have more
> state to save (512 (?) bytes for SSE).

Correct me if my math is wrong, but let's run with this...

If I have a 2GHz CPU and 133MHz memory, then we are talking roughly
a 16:1 slowdown (2000/133 is about 15, rounded up to a power of two)
for a transfer of 512 bytes from register to L1 to L2 to main memory
for an FPU state spill.

Assuming a 64 bit data path, then we are talking a minimum of
3 * 512/(64/8) * 16 = 3072 clocks (3k) to save the damn FPU
state off to main memory (a store in a loop is 3 clocks ignoring
the setup and crap, right?).  Add another 3k clocks to bring it
back.

Best case, God loves us, and we spill and restore from L1
without an IPI or an invalidation, and without starting the
thread on a CPU other than the one where it was suspended, and
all spills are to cacheable write-through pages.  That's a 16
times speed increase because we get to ignore the bus speed
differential, or 3 * 512/(64/8) * 2 = 6k/16 = 384 clocks for
the save plus the restore.
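
The two estimates above are easy to check mechanically.  What
follows is a minimal Python sketch of that arithmetic, not a
measurement: the function and variable names are illustrative
(they do not appear in the original message), and it assumes the
same figures used in the text, i.e. 512 bytes of state, a 64 bit
data path, 3 clocks per store, and a 16:1 core-to-memory ratio.

```python
def store_count(state_bytes, bus_bits):
    """Number of bus-width stores needed to move the FPU state."""
    return state_bytes // (bus_bits // 8)


def transfer_clocks(state_bytes, bus_bits, clocks_per_store, multiplier):
    """Clocks to move the state, scaled by a multiplier.

    The multiplier is the core-to-memory ratio for a one-way trip
    to main memory, or 2 for a save plus a restore that stays in L1.
    """
    return clocks_per_store * store_count(state_bytes, bus_bits) * multiplier


# Assumed figures, taken from the text: 512-byte state, 64-bit path,
# 3 clocks per store.
main_memory_one_way = transfer_clocks(512, 64, 3, 16)  # 16:1 ratio
l1_round_trip = transfer_clocks(512, 64, 3, 2)         # save + restore in L1

print(main_memory_one_way)  # 3072
print(l1_round_trip)        # 384
```

The model ignores loop setup, cache line fills and write
buffering, so it is only a lower bound in the same spirit as the
estimate in the text.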

So it seems to me that it is *incredibly* expensive to do the
FPU save and restore, considering what *else* I could be doing
with those clock cycles.

With an average instruction time of 6 clocks (erring on the
side of caution), the question is "can we perform the logic
for the avoidance in 64 or fewer instructions (384/6)?"  I think
the answer is "yes", even if we throw in half a dozen uncached
memory references to main memory as part of the process and
take the 16:1 hit on each of them (that would be 96 clocks
in memory references, leaving us 288/6 = 48 instructions to
massage whatever we got back from those references).

Am I totally off base here?  If we assume L1 for all spills and
restores, and uncached main memory references for the checks, we
are even *better off* doing the checks when the ratio drops
to 8:1 (e.g. a 1GHz processor with 133MHz memory), since each
of those references then costs half as many clocks, right?
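
To make the budget argument concrete, here is a small Python
sketch of it.  The function name and its parameters are
hypothetical, chosen only for illustration; the inputs are the
assumptions stated in the text (a 384 clock L1 save plus restore,
six uncached references costing one ratio's worth of clocks each,
and 6 clocks per instruction).

```python
def avoidance_budget(round_trip_clocks, uncached_refs, ratio, clocks_per_insn):
    """Instructions left for the avoidance logic.

    Starts from the clocks an L1 save plus restore would cost,
    subtracts the uncached main memory references (each assumed
    to cost `ratio` clocks), and converts the rest to instructions.
    """
    ref_clocks = uncached_refs * ratio
    return (round_trip_clocks - ref_clocks) // clocks_per_insn


# Compare the 2GHz (16:1) and 1GHz (8:1) cases from the text.
for ratio in (16, 8):
    print(ratio, avoidance_budget(384, 6, ratio, 6))
# 16 48
# 8 56
```

Under these assumptions the break-even point moves in favor of
doing the checks as the ratio falls, because only the uncached
references scale with the ratio while the L1 spill cost does not.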

-- Terry
