Date:      Sat, 12 Jan 2002 21:40:20 +1100 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        Peter Wemm <peter@wemm.org>, Alfred Perlstein <bright@mu.org>, Kelly Yancey <kbyanc@posi.net>, Nate Williams <nate@yogotech.com>, Daniel Eischen <eischen@pcnet1.pcnet.com>, Dan Eischen <eischen@vigrid.com>, Archie Cobbs <archie@dellroad.org>, <arch@FreeBSD.ORG>
Subject:   Re: Request for review: getcontext, setcontext, etc
Message-ID:  <20020112205919.E5372-100000@gamplex.bde.org>
In-Reply-To: <3C4001A3.5ECCAEB9@mindspring.com>

On Sat, 12 Jan 2002, Terry Lambert wrote:

> Bruce Evans wrote:
> > (*) It may not be all that good.  It was good on old machines when 108
> > bytes was a lot of memory and moving the state in and out of the FPU
> > was slow too.  It is possible that the logic to avoid doing the switch
> > takes longer than always doing it, but not all that likely because logic
> > speed is increasing faster than memory speed and new machines have more
> > state to save (512 (?) bytes for SSE).
>
> Correct me if my math is wrong, but let's run with this...
>
> If I have a 2GHz CPU and 133MHz memory, then we are talking a 16:1
> slowdown for a transfer of 512 bytes from register to L1 to L2
> to main memory for an FPU state spill.
>
> Assuming a 64 bit data path, then we are talking a minimum of
> 3 * 512/(64/8) * (16:1) or 3k (3076) clocks to save the damn FPU
> state off to main memory (a store in a loop is 3 clocks ignoring
> the setup and crap, right?).  Add another 3k clocks to bring it
> back.
>
> Best case, God loves us, and we spill and restore from L1
> without an IPI or an invalidation, and without starting the
> thread on a CPU other than the one where it was suspended, and
> all spills are to cacheable write-through pages.  That's a 16
> times speed increase because we get to ignore the bus speed
> differential, or 3 * 512/(65/8) * 2 = (6k/16) = 384 clocks.

This seems to be off by a bit.  Actual timing on an Athlon1600
overclocked a little gives the following times for some critical
parts of context switching for each iteration of instructions in
a loop (not counting 2 cycles of loop overhead):

pushal; popal:             9 cycles
pushl %ds; popl %ds:      21 cycles
fxsave; fxrstor:         105 cycles
fnsave; frstor:          264 cycles

This certainly hits the L1 cache almost every time.  So the 512-byte L1
case "only" takes 105 cycles, not 384, but the 108-byte L1 case takes
much longer.  fxsave/fxrstor is so fast that I don't quite believe the
times -- it saves 16 times as much state as pushal/popal in less than
12 times as much time.
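
The ratios quoted above can be rechecked with a little arithmetic. This is only a sketch: the cycle counts are the measurements listed above, the 32-byte figure for pushal/popal assumes eight 32-bit general registers, and the variable names are invented for illustration.

```python
# Measured cycles per save/restore pair (from the timings above).
cycles = {"pushal/popal": 9, "fxsave/fxrstor": 105, "fnsave/frstor": 264}

# Bytes of state moved by each pair. 32 assumes eight 32-bit registers;
# 512 and 108 are the sizes mentioned in the thread.
state_bytes = {"pushal/popal": 32, "fxsave/fxrstor": 512, "fnsave/frstor": 108}

# The quoted best-case (L1-only) estimate: 3 clocks per 8-byte store,
# doubled to cover save plus restore.
estimate = 3 * (512 // (64 // 8)) * 2

state_ratio = state_bytes["fxsave/fxrstor"] / state_bytes["pushal/popal"]
time_ratio = cycles["fxsave/fxrstor"] / cycles["pushal/popal"]

print(f"estimated {estimate} cycles, measured {cycles['fxsave/fxrstor']}")
print(f"state ratio {state_ratio:.0f}x, time ratio {time_ratio:.1f}x")
for name in cycles:
    print(f"{name}: {state_bytes[name] / cycles[name]:.2f} bytes/cycle")
```

The bytes-per-cycle column makes the oddity visible: fxsave/fxrstor moves more state per cycle than pushal/popal, while fnsave/frstor moves far less.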

> So it seems to me that it is *incredibly* expensive to do the
> FPU save and restore, considering what *else* I could be doing
> with those clock cycles.

I agree that fnsave/frstor are still incredibly expensive if the
above times are correct.  fxsave/fxrstor is only credibly expensive.
However, the overheads for fnsave/frstor are small compared with
the overheads for the !*#*$% segment registers.  We switch 3 segment
registers explicitly and 2 implicitly on every switch to the kernel.
According to the above, this has the same overhead as 1 fxsave/fxrstor.
It gets done much more often than context switches.  I hoped to get
rid of the 2 explicit segment register switches, but couldn't keep
up with the forces of bloat that added a 3rd.  Now I don't notice
this bloat unless I count cycles and forget that a billion of them
is a lot :-).
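
The segment register comparison works out as follows. This is a rough sketch that assumes every segment register switch, explicit or implicit, costs the same as the measured "pushl %ds; popl %ds" pair, which may not hold for the implicit ones. The function name and the entries-per-switch parameter are hypothetical, since the thread gives no figure for how much more often kernel entries happen than context switches.

```python
SEG_PAIR_CYCLES = 21       # pushl %ds; popl %ds, measured above
FXSAVE_PAIR_CYCLES = 105   # fxsave; fxrstor, measured above
EXPLICIT_SEGS = 3          # switched explicitly on kernel entry
IMPLICIT_SEGS = 2          # switched implicitly on kernel entry

# Assumed cost of segment register switching for one kernel entry.
per_kernel_entry = (EXPLICIT_SEGS + IMPLICIT_SEGS) * SEG_PAIR_CYCLES


def segment_cycles_per_context_switch(entries_per_switch):
    """Segment register cycles accumulated between two context switches.

    entries_per_switch is a hypothetical workload parameter: how many
    kernel entries occur for each context switch.
    """
    return entries_per_switch * per_kernel_entry


print(per_kernel_entry, FXSAVE_PAIR_CYCLES)
print(segment_cycles_per_context_switch(10))
```

Under that assumption one kernel entry costs as much as one fxsave/fxrstor pair, so any workload with more kernel entries than context switches spends more cycles on segment registers than on FPU state.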

> With an average instruction time of 6 clocks (erring on the
> side of caution), the question is "can we perform the logic
> for the avoidance in 64 or less instructions?"  I think the
> answer is "yes", even if we throw in half a dozen uncached
> memory references to main memory as part of the process and
> take the 16:1 hit on each of them (that would be 96 clocks
> in memory references, leaving us 288/6 = 38 instructions to
> massage whatever we got back from those references).

The Xdna trap to load the state if we guessed wrong about the
next timeslice not using the FPU takes about 200 instructions
including several slow ones like iret, so we don't get near 38
instructions in all cases although we could (Xdna can be written
in about 10 instructions if it doesn't go through trap() and
other general routines).
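
One way to see how much the trap cost matters is a simple break-even model. This is a sketch under stated assumptions, not something taken from the kernel: it supposes that deferring the FPU switch costs nothing when the next timeslice leaves the FPU alone, and costs the trap plus the save/restore when it does not. It ignores the per-switch cost of the avoidance logic itself. The function names and the cycles-per-instruction values are illustrative.

```python
def lazy_cost(p_fpu, save_restore, trap):
    """Expected cycles per switch when the FPU switch is deferred.

    p_fpu is the (assumed) probability that the next timeslice uses the FPU.
    """
    return p_fpu * (trap + save_restore)


def break_even(save_restore, trap):
    """FPU-use probability above which always switching is cheaper."""
    return save_restore / (trap + save_restore)


FXSAVE_PAIR = 105   # cycles, measured above
FNSAVE_PAIR = 264   # cycles, measured above

# Trap length in instructions: about 200 today, about 10 if rewritten.
for trap_instructions in (200, 10):
    # 1 and 6 clocks per instruction are assumptions; 6 is the figure
    # used in the quoted estimate.
    for cpi in (1, 6):
        trap = trap_instructions * cpi
        for name, cost in (("fxsave", FXSAVE_PAIR), ("fnsave", FNSAVE_PAIR)):
            p = break_even(cost, trap)
            print(f"{trap_instructions} insns, cpi {cpi}, {name}: p = {p:.2f}")
```

In this model a shorter trap raises the break-even probability, so a 10-instruction Xdna would keep the deferred switch worthwhile for a much larger share of FPU-using timeslices.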

Bruce

