From owner-freebsd-arch  Sat Jan 12 12:34:35 2002
X-Mailer: exmh version 2.5 07/13/2001 with nmh-1.0.4
To: Bruce Evans <bde@zeta.org.au>
Cc: Terry Lambert <tlambert2@mindspring.com>,
	Alfred Perlstein <bright@mu.org>, Kelly Yancey <kbyanc@posi.net>,
	Nate Williams <nate@yogotech.com>,
	Daniel Eischen <eischen@pcnet1.pcnet.com>,
	Dan Eischen <eischen@vigrid.com>, Archie Cobbs <archie@dellroad.org>,
	arch@FreeBSD.ORG
Subject: Re: Request for review: getcontext, setcontext, etc 
In-Reply-To: <20020112205919.E5372-100000@gamplex.bde.org> 
Date: Sat, 12 Jan 2002 12:34:29 -0800
From: Peter Wemm <peter@wemm.org>
Message-Id: <20020112203429.EE98738CC@overcee.netplex.com.au>

Bruce Evans wrote:
> On Sat, 12 Jan 2002, Terry Lambert wrote:
> 
> > Bruce Evans wrote:
> > > (*) It may not be all that good.  It was good on old machines when 108
> > > bytes was a lot of memory and moving the state in and out of the FPU
> > > was slow too.  It is possible that the logic to avoid doing the switch
> > > takes longer than always doing it, but not all that likely because logic
> > > speed is increasing faster than memory speed and new machines have more
> > > state to save (512 (?) bytes for SSE).
> >
> > Correct me if my math is wrong, but let's run with this...
> >
> > If I have a 2GHz CPU and 133MHz memory, then we are talking a 16:1
> > slowdown for a transfer of 512 bytes from register to L1 to L2
> > to main memory for an FPU state spill.
> >
> > Assuming a 64 bit data path, then we are talking a minimum of
> > 3 * 512/(64/8) * (16:1) or 3k (3072) clocks to save the damn FPU
> > state off to main memory (a store in a loop is 3 clocks ignoring
> > the setup and crap, right?).  Add another 3k clocks to bring it
> > back.
> >
> > Best case, God loves us, and we spill and restore from L1
> > without an IPI or an invalidation, and without starting the
> > thread on a CPU other than the one where it was suspended, and
> > all spills are to cacheable write-through pages.  That's a 16
> > times speed increase because we get to ignore the bus speed
> > differential, or 3 * 512/(64/8) * 2 = (6k/16) = 384 clocks.
> 
> This seems to be off by a bit.  Actual timing on an Athlon1600
> overclocked a little gives the following times for some critical
> parts of context switching for each iteration of instructions in
> a loop (not counting 2 cycles of loop overhead):
> 
> pushal; popal:             9 cycles
> pushl %ds; popl %ds:      21 cycles
> fxsave; fxrstor:         105 cycles
> fnsave; frstor:          264 cycles
> 
> This certainly hits the L1 cache almost every time.  So the 512-byte L1
> case "only" takes 105 cycles, not 384, but the 108-byte L1 case takes
> much longer.  fxsave/fxrstor is so fast that I don't quite believe the
> times -- it saves 16 times as much state as pushal/popal in less than
> 12 times as much time.
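
To put the estimate and the measurements side by side, here is a minimal
sketch of the arithmetic.  The function name estimated_cycles and its
parameters are made up for illustration; the model itself (3 clocks per
store, 64-bit data path, 16:1 clock ratio) and the cycle counts are the ones
quoted above.  The 32-byte figure for pushal/popal assumes eight 4-byte
registers.

```python
def estimated_cycles(state_bytes, bus_bits=64, clocks_per_store=3, slowdown=1):
    """Clocks to move state_bytes one way, one bus-width store at a time."""
    stores = state_bytes // (bus_bits // 8)
    return clocks_per_store * stores * slowdown

# One-way spill to main memory at the assumed 16:1 clock ratio.
to_memory = estimated_cycles(512, slowdown=16)

# Save plus restore when everything stays in L1 (no bus penalty).
in_l1 = 2 * estimated_cycles(512)

# Measured save+restore pairs in cycles, as quoted above.
measured = {"pushal/popal": 9, "fxsave/fxrstor": 105, "fnsave/frstor": 264}

# fxsave area size versus the eight 4-byte registers moved by pushal.
state_ratio = 512 / 32
time_ratio = measured["fxsave/fxrstor"] / measured["pushal/popal"]

print("estimate to memory:", to_memory)
print("estimate in L1:    ", in_l1)
print("state ratio:", state_ratio, "time ratio:", round(time_ratio, 2))
```

Under this model the L1 estimate is several times the measured
fxsave/fxrstor figure, which is the gap being discussed.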

Well, fxsave/fxrstor were specifically designed so that this could all be
done with burst transfers.  fxsave/fxrstor are possibly doing 256 bit wide
transfers to/from the L1 cache.  Also don't forget that the fast save/
restore operations were designed with strict alignment requirements so that
a whole bunch of checks can be skipped at runtime that fnsave/frstor still
have to deal with.
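
As a rough illustration of the alignment point: fxsave/fxrstor take a
512-byte area that must sit on a 16-byte boundary, while fnsave/frstor
accept any address.  The sketch below is only address arithmetic, not a
model of the hardware; the helper names and the sample address are
hypothetical, and the 256-bit transfer width is the guess from the
paragraph above, not a documented figure.

```python
FXSAVE_AREA_BYTES = 512
FXSAVE_ALIGN = 16  # fxsave/fxrstor require a 16-byte aligned operand

def is_aligned(addr, align=FXSAVE_ALIGN):
    """True if addr is a multiple of align."""
    return addr % align == 0

def align_up(addr, align=FXSAVE_ALIGN):
    """Round addr up to the next multiple of align (a power of two)."""
    return (addr + align - 1) & ~(align - 1)

def transfers(state_bytes, width_bits):
    """Number of fixed-width transfers needed to move state_bytes."""
    return state_bytes // (width_bits // 8)

addr = 0x1004  # hypothetical unaligned buffer address
print(is_aligned(addr), hex(align_up(addr)))

# Transfer counts for a 64-bit path versus the guessed 256-bit path.
print(transfers(FXSAVE_AREA_BYTES, 64), transfers(FXSAVE_AREA_BYTES, 256))
```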

> > So it seems to me that it is *incredibly* expensive to do the
> > FPU save and restore, considering what *else* I could be doing
> > with those clock cycles.
> 
> I agree that fnsave/frstor are still incredibly expensive if the
> above times are correct.  fxsave/fxrstor is only credibly expensive.
> However, the overheads for fnsave/frstor are small compared with
> the overheads for the !*#*$% segment registers.  We switch 3 segment
> registers explicitly and 2 implicitly on every switch to the kernel.
> According to the above, this has the same overhead as 1 fxsave/fxrstor.
> It gets done much more often than context switches.  I hoped to get
> rid of the 2 explicit segment register switches, but couldn't keep
> up with the forces of bloat that added a 3rd.  Now I don't notice
> this bloat unless I count cycles and forget that a billion of them
> is a lot :-).
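
The segment register comparison can be checked with the measured numbers.
This sketch assumes every segment register switch costs the same as the
measured pushl %ds; popl %ds pair, which is a simplification.  The function
overhead_fraction, the 1 GHz clock and the 10,000 kernel entries per second
are hypothetical values chosen only to show the scale.

```python
SEG_PAIR_CYCLES = 21       # measured: pushl %ds; popl %ds
FXSAVE_PAIR_CYCLES = 105   # measured: fxsave; fxrstor

EXPLICIT_SWITCHES = 3
IMPLICIT_SWITCHES = 2

# Segment register cost paid on every switch to the kernel.
per_kernel_entry = (EXPLICIT_SWITCHES + IMPLICIT_SWITCHES) * SEG_PAIR_CYCLES

def overhead_fraction(entries_per_sec, hz=1e9):
    """Share of a hz-cycle second spent on segment register switches."""
    return per_kernel_entry * entries_per_sec / hz

print(per_kernel_entry, per_kernel_entry == FXSAVE_PAIR_CYCLES)
print(overhead_fraction(10_000))  # hypothetical kernel entry rate
```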

Heh.  That reminds me, I need to talk over some IPI vector tweaks
with you.  I had forgotten that segment register operations were so bad.

Hmm.  What are they again?  I see %ds, %es and %fs.  I assume the two
implicit ones were %cs and %ss.  Which had you hoped to remove?  What *is*
%es used for anyway?

> > With an average instruction time of 6 clocks (erring on the
> > side of caution), the question is "can we perform the logic
> > for the avoidance in 64 or fewer instructions?"  I think the
> > answer is "yes", even if we throw in half a dozen uncached
> > memory references to main memory as part of the process and
> > take the 16:1 hit on each of them (that would be 96 clocks
> > in memory references, leaving us 288/6 = 48 instructions to
> > massage whatever we got back from those references).
> 
> The Xdna trap to load the state if we guessed wrong about the
> next timeslice not using the FPU takes about 200 instructions
> including several slow ones like iret, so we don't get near 48
> instructions in all cases although we could (Xdna can be written
> in about 10 instructions if it doesn't go through trap() and
> other general routines).
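
A back-of-the-envelope version of this trade-off, using only numbers from
the thread: 384 clocks for an L1 save plus restore, 6 clocks per
instruction, a 16:1 hit per uncached reference, and roughly 200 versus 10
instructions for Xdna.  The cost model is an assumption made for
illustration, not how the kernel accounts for anything: eager switching
always pays the 384 clocks, lazy switching pays the trap plus the 384
clocks only when the next timeslice touches the FPU.  Charging 6 clocks per
instruction will understate slow instructions such as iret.

```python
CLOCKS_PER_INSN = 6      # pessimistic average from the thread
L1_SAVE_RESTORE = 384    # estimated L1 save+restore cost in clocks
MEM_REF_CLOCKS = 16      # one uncached reference at the 16:1 ratio

# Instruction budget that fits inside one avoided save+restore.
budget = L1_SAVE_RESTORE // CLOCKS_PER_INSN
# Budget left after half a dozen uncached memory references.
after_refs = (L1_SAVE_RESTORE - 6 * MEM_REF_CLOCKS) // CLOCKS_PER_INSN

def lazy_cost(p_fpu, trap_insns):
    """Expected clocks per switch when state is loaded only on a trap."""
    return p_fpu * (trap_insns * CLOCKS_PER_INSN + L1_SAVE_RESTORE)

def break_even(trap_insns):
    """FPU-use probability above which always switching is cheaper."""
    return L1_SAVE_RESTORE / (trap_insns * CLOCKS_PER_INSN + L1_SAVE_RESTORE)

print(budget, after_refs)
print(round(break_even(200), 3), round(break_even(10), 3))
```

With these inputs the budget is 64 instructions, or 48 after the memory
references.  A 200-instruction trap makes lazy switching lose once about a
quarter of timeslices use the FPU; a 10-instruction Xdna pushes that point
much higher.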

Hmm, that is good to know too.

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5

