Date:      Sat, 12 Jan 2002 12:34:29 -0800
From:      Peter Wemm <peter@wemm.org>
To:        Bruce Evans <bde@zeta.org.au>
Cc:        Terry Lambert <tlambert2@mindspring.com>, Alfred Perlstein <bright@mu.org>, Kelly Yancey <kbyanc@posi.net>, Nate Williams <nate@yogotech.com>, Daniel Eischen <eischen@pcnet1.pcnet.com>, Dan Eischen <eischen@vigrid.com>, Archie Cobbs <archie@dellroad.org>, arch@FreeBSD.ORG
Subject:   Re: Request for review: getcontext, setcontext, etc 
Message-ID:  <20020112203429.EE98738CC@overcee.netplex.com.au>
In-Reply-To: <20020112205919.E5372-100000@gamplex.bde.org> 

Bruce Evans wrote:
> On Sat, 12 Jan 2002, Terry Lambert wrote:
> 
> > Bruce Evans wrote:
> > > (*) It may not be all that good.  It was good on old machines when 108
> > > bytes was a lot of memory and moving the state in and out of the FPU
> > > was slow too.  It is possible that the logic to avoid doing the switch
> > > takes longer than always doing it, but not all that likely because logic
> > > speed is increasing faster than memory speed and new machines have more
> > > state to save (512 (?) bytes for SSE).
> >
> > Correct me if my math is wrong, but let's run with this...
> >
> > If I have a 2GHz CPU and 133MHz memory, then we are talking a 16:1
> > slowdown for a transfer of 512 bytes from register to L1 to L2
> > to main memory for an FPU state spill.
> >
> > Assuming a 64 bit data path, then we are talking a minimum of
> > 3 * 512/(64/8) * (16:1) or 3k (3072) clocks to save the damn FPU
> > state off to main memory (a store in a loop is 3 clocks ignoring
> > the setup and crap, right?).  Add another 3k clocks to bring it
> > back.
> >
> > Best case, God loves us, and we spill and restore from L1
> > without an IPI or an invalidation, and without starting the
> > thread on a CPU other than the one where it was suspended, and
> > all spills are to cacheable write-through pages.  That's a 16
> > times speed increase because we get to ignore the bus speed
> > differential, or 3 * 512/(64/8) * 2 = (6k/16) = 384 clocks.
> 
> This seems to be off by a bit.  Actual timing on an Athlon1600
> overclocked a little gives the following times for some critical
> parts of context switching for each iteration of instructions in
> a loop (not counting 2 cycles of loop overhead):
> 
> pushal; popal:             9 cycles
> pushl %ds; popl %ds:      21 cycles
> fxsave; fxrstor:         105 cycles
> fnsave; frstor:          264 cycles
> 
> This certainly hits the L1 cache almost every time.  So the 512-byte L1
> case "only" takes 105 cycles, not 384, but the 108-byte L1 case takes
> much longer.  fxsave/fxrstor is so fast that I don't quite believe the
> times -- it saves 16 times as much state as pushal/popal in less than
> 12 times as much time.
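
The estimate and the measurements quoted above can be checked with a few lines of arithmetic. This is only a sketch: the constant and function names are invented for illustration, the 3-clocks-per-store and 16:1 figures are the assumptions stated in the quoted estimate, and the cycle counts are copied from the quoted table rather than re-measured.

```python
# Back-of-envelope model of the quoted estimate. All names are illustrative.
STATE_BYTES = 512        # size of the FXSAVE area
BUS_BYTES = 64 // 8      # assumed 64-bit data path
CLOCKS_PER_STORE = 3     # assumed cost of one store in a loop
CPU_TO_MEM_RATIO = 16    # the thread's 16:1 figure for 2GHz CPU vs 133MHz memory


def one_way_clocks(slowdown):
    """Clocks to move the whole state once at the given slowdown factor."""
    return CLOCKS_PER_STORE * (STATE_BYTES // BUS_BYTES) * slowdown


main_memory_save = one_way_clocks(CPU_TO_MEM_RATIO)  # spill to main memory
l1_save_restore = 2 * one_way_clocks(1)              # save plus restore from L1

# Cycle counts as reported in the quoted Athlon measurements.
MEASURED = {
    "pushal; popal": 9,
    "pushl %ds; popl %ds": 21,
    "fxsave; fxrstor": 105,
    "fnsave; frstor": 264,
}

# pushal saves eight 4-byte registers, i.e. 32 bytes.
state_ratio = STATE_BYTES / 32
time_ratio = MEASURED["fxsave; fxrstor"] / MEASURED["pushal; popal"]

print(main_memory_save, l1_save_restore, state_ratio, round(time_ratio, 1))
```

Under these assumptions the model gives about 3k clocks for a one-way spill to main memory and 384 clocks for an L1 save and restore, against a measured 105 cycles. The last two values correspond to the "16 times as much state in less than 12 times as much time" remark.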

Well, fxsave/fxrstor were specifically designed so that this could all be
done with burst transfers.  fxsave/fxrstor are possibly doing 256 bit wide
transfers to/from the L1 cache.  Also, don't forget that the fast save/
restore operations were designed with strict alignment requirements so that
a whole bunch of checks can be skipped at runtime that fnsave/frstor have
to still deal with.

> > So it seems to me that it is *incredibly* expensive to do the
> > FPU save and restore, considering what *else* I could be doing
> > with those clock cycles.
> 
> I agree that fnsave/frstor are still incredibly expensive if the
> above times are correct.  fxsave/fxrstor is only credibly expensive.
> However, the overheads for fnsave/frstor are small compared with
> the overheads for the !*#*$% segment registers.  We switch 3 segment
> registers explicitly and 2 implicitly on every switch to the kernel.
> According to the above, this has the same overhead as 1 fxsave/fxrstor.
> It gets done much more often than context switches.  I hoped to get
> rid of the 2 explicit segment register switches, but couldn't keep
> up with the forces of bloat that added a 3rd.  Now I don't notice
> this bloat unless I count cycles and forget that a billion of them
> is a lot :-).
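
The segment register comparison above follows directly from the measured numbers. A minimal sketch, assuming each of the five switches costs the same as the measured `pushl %ds; popl %ds` pair (the thread measures only the explicit form, so treating the two implicit switches as equally expensive is an assumption); the function name is made up for illustration.

```python
# Cost of segment register switching per kernel entry, using the quoted
# measurements. The per-switch cost for implicit switches is an assumption.
SEG_SWITCH_CYCLES = 21        # measured "pushl %ds; popl %ds" pair
FXSAVE_FXRSTOR_CYCLES = 105   # measured "fxsave; fxrstor" pair


def kernel_entry_segment_cycles(explicit=3, implicit=2,
                                per_switch=SEG_SWITCH_CYCLES):
    """Total cycles spent switching segment registers on one kernel entry."""
    return (explicit + implicit) * per_switch


total = kernel_entry_segment_cycles()
print(total, total == FXSAVE_FXRSTOR_CYCLES)
```

Dropping the two explicit switches mentioned in the quoted text would, on the same assumption, bring the figure down to `kernel_entry_segment_cycles(explicit=1)`, or 63 cycles.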

Heh.  That reminds me, I need to talk over some IPI vector tweaks
with you.  I had forgotten that segment register operations were so bad.

Hmm.  What are they again?  I see %ds, %es and %fs.  I assume the two
implicit ones were %cs and %ss.  Which had you hoped to remove?  What *is*
%es used for anyway?

> > With an average instruction time of 6 clocks (erring on the
> > side of caution), the question is "can we perform the logic
> > for the avoidance in 64 or less instructions?"  I think the
> > answer is "yes", even if we throw in half a dozen uncached
> > memory references to main memory as part of the process and
> > take the 16:1 hit on each of them (that would be 96 clocks
> > in memory references, leaving us 288/6 = 38 instructions to
> > massage whatever we got back from those references).
> 
> The Xdna trap to load the state if we guessed wrong about the
> next timeslice not using the FPU takes about 200 instructions
> including several slow ones like iret, so we don't get near 38
> instructions in all cases although we could (Xdna can be written
> in about 10 instructions if it doesn't go through trap() and
> other general routines).
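
The instruction budget argued over above can be worked through in the same way. This is a sketch under the quoted assumptions only (384 clocks to beat, 6 clocks per average instruction, six uncached references at a 16:1 penalty); the function name and return layout are hypothetical. Note that with these inputs 288/6 comes out as 48 rather than the 38 given in the quoted text.

```python
# Instruction budget for the "avoid the FPU switch" logic, per the quoted
# assumptions. Names are illustrative, not taken from any kernel source.
def avoidance_budget(total_clocks=384, clocks_per_insn=6,
                     uncached_refs=6, ref_penalty=16):
    """Return (max instructions, memory clocks, remaining clocks,
    instructions left after the memory references)."""
    memory_clocks = uncached_refs * ref_penalty
    remaining = total_clocks - memory_clocks
    return (total_clocks // clocks_per_insn,
            memory_clocks,
            remaining,
            remaining // clocks_per_insn)


max_insns, memory_clocks, remaining, insns_left = avoidance_budget()

# The Xdna path is quoted as roughly 200 instructions; at the same assumed
# 6 clocks per instruction that is well over the 384-clock budget.
XDNA_INSNS = 200
xdna_clocks = XDNA_INSNS * 6

print(max_insns, memory_clocks, remaining, insns_left, xdna_clocks)
```

On these numbers the avoidance logic pays off only when the guess is usually right, since a single wrong guess that goes through the full Xdna path costs several times the budget.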

Hmm, that is good to know too.

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5

