Date:      Mon, 14 Jan 2002 13:31:20 +1100 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Peter Jeremy <peter.jeremy@alcatel.com.au>
Cc:        Terry Lambert <tlambert2@mindspring.com>, Peter Wemm <peter@wemm.org>, Alfred Perlstein <bright@mu.org>, Kelly Yancey <kbyanc@posi.net>, Nate Williams <nate@yogotech.com>, Daniel Eischen <eischen@pcnet1.pcnet.com>, Dan Eischen <eischen@vigrid.com>, Archie Cobbs <archie@dellroad.org>, <arch@FreeBSD.ORG>
Subject:   Re: Request for review: getcontext, setcontext, etc
Message-ID:  <20020114120026.S3794-100000@gamplex.bde.org>
In-Reply-To: <20020114074238.S561@gsmx07.alcatel.com.au>

On Mon, 14 Jan 2002, Peter Jeremy wrote:

> On 2002-Jan-12 21:40:20 +1100, Bruce Evans <bde@zeta.org.au> wrote:
> >This seems to be off by a bit.  Actual timing on an Athlon1600
> >overclocked a little gives the following times for some critical
> >parts of context switching for each iteration of instructions in
> >a loop (not counting 2 cycles of loop overhead):
> >
> >pushal; popal:             9 cycles
> >pushl %ds; popl %ds:      21 cycles
> >fxsave; fxrstor:         105 cycles
> >fnsave; frstor:          264 cycles
>
> I can think of a possible reason: The FPU knows when it has been used
> vs just having executed fninit.  In the latter case, all it needs to
> save is "I've been initialised".

There are a lot of reasons.  Peter Wemm mentioned some alignment issues
and not doing an (in)convenient implicit fninit.  The 512-byte SSE
context is more than half "reserved", and most of that half is not
written to by fxsave and presumably not read from by fxrstor (I didn't
check the manual for this; I just fxsave'ed over 0xff bytes and checked
what changed).
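
The experiment in the parenthesis above could be reproduced along the
following lines.  This is only a minimal sketch, assuming GCC or Clang
inline asm on x86; the names count_changed() and fxsave_footprint() are
made up for illustration and are not from the original test.  It will
undercount if the CPU happens to store a genuine 0xff byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FXSAVE_AREA_SIZE 512

/* Count the bytes in buf[0..len) that no longer hold the fill value. */
size_t
count_changed(const unsigned char *buf, size_t len, unsigned char fill)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++)
		if (buf[i] != fill)
			n++;
	return (n);
}

/*
 * Fill a 16-byte-aligned save area with 0xff, execute fxsave on it and
 * return the number of bytes that changed.  Returns -1 when not built
 * for x86 with a GNU-compatible compiler (hypothetical fallback).
 */
long
fxsave_footprint(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
	static struct {
		unsigned char b[FXSAVE_AREA_SIZE];
	} area __attribute__((aligned(16)));

	/* fxsave faults on a misaligned operand, so check first. */
	assert(((uintptr_t)area.b & 15) == 0);
	memset(area.b, 0xff, sizeof(area.b));
	__asm__ __volatile__("fxsave %0" : "=m" (area));
	return ((long)count_changed(area.b, sizeof(area.b), 0xff));
#else
	return (-1);
#endif
}
```

Calling fxsave_footprint() from a small main() and printing the result
shows how much of the 512 bytes the instruction touched on a given CPU.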

> Also the FPU architecture includes
> "used" flags associated with each register - possibly the f*save
> instructions don't flush unused registers.  Do the above numbers
> change when you push real data into the FP registers?

The manual seems to say that fxsave is clever about this, but it
doesn't seem to be.  Loading registers didn't make any difference to
the times with the following register contents:

    all 1.0's
    6 1.0's, one 0.0, one +Inf, and a masked DIVZ exception for creating
       the +Inf
    all xmm registers explicitly loaded with 0

OTOH, with the DIVZ unmasked (but never trapped for), fxsave/fxrstor
takes 187 cycles.  I think fnsave/frstor is normally slowed down by
(unmasked only?) exceptions too.
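
For anyone wanting to repeat this kind of measurement, a rough
user-mode sketch follows.  It is not the loop used for the numbers in
this thread (that code was not posted).  It assumes GCC or Clang on
x86, uses __rdtsc() from <x86intrin.h>, and subtracts the 2 cycles of
loop overhead quoted earlier as a fixed guess.  On CPUs where the TSC
runs at a constant rate the result is in TSC ticks, not core cycles.
The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <x86intrin.h>
#define HAVE_X86_FXSR 1
#endif

/* Average ticks per iteration, minus an assumed per-iteration overhead. */
double
ticks_per_iteration(uint64_t total_ticks, uint64_t iterations,
    double loop_overhead)
{
	assert(iterations > 0);
	return ((double)total_ticks / (double)iterations - loop_overhead);
}

/*
 * Time an fxsave; fxrstor pair on the same save area.  Returns -1.0
 * when not built for x86 (hypothetical fallback).
 */
double
time_fxsave_fxrstor(uint64_t iterations)
{
#ifdef HAVE_X86_FXSR
	static struct {
		unsigned char b[512];
	} area __attribute__((aligned(16)));
	uint64_t i, start, end;

	start = __rdtsc();
	for (i = 0; i < iterations; i++)
		__asm__ __volatile__("fxsave %0\n\tfxrstor %0" : "+m" (area));
	end = __rdtsc();
	return (ticks_per_iteration(end - start, iterations, 2.0));
#else
	(void)iterations;
	return (-1.0);
#endif
}
```

Interrupts and frequency scaling add noise, so taking the minimum over
several runs of time_fxsave_fxrstor() is probably more useful than a
single result.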

> Also, how expensive is a DNA trap?  Would it be cheaper overall to

I counted 200 kernel instructions for non-SMP in -current.  Half of
these are for recent pessimizations in -current: crhold() and crfree()
and locking for these take about 100 instructions.  I haven't measured
the expense in real time.  The basic (mostly unnecessary) overhead is
the same as for pagefault traps.  Mmap latency on this machine is
153nsec according to lmbench2.

> always load FPU context on a switch - this is more expensive for
> processes that don't use FP, but saves a DNA trap per context switch
> (assuming they use FP in that slice) for those that do.

Not overall, since most timeslices don't use the FPU (at least for
processes that I run :-).

> To add some further numbers, in December 1999, I did some measurements
> on FP switching by patching npx.c.  This was on a PII-266 running then
> -current.  (The original e-mail was sent to -arch on Mon, 20 Dec 1999
> 07:34:06 +1100 in a thread titled "Concrete plans for ucontext/
> mcontext changes around 4.0" - I don't have the message-id available).
>
>   ctxt     DNA    FP
>  swtch    traps  swtch
> 1754982  281557  59753  build world and a few CVS operations [1]
>   79044   18811  10341  gnuplot and xv in parallel [2]
>     800     138    130  parallel FP-intensive progs [3].
>
> In the above, `ctxt swtch' is the number of context switches counted
> via vm.stats.sys.v_swtch.  `DNA traps' is the number of device not
> available traps registered and `FP swtch' is the number of DNA traps
> where the FP context loaded is different to that saved on the
> preceding context switch.

That's a lot more DNA traps than I would have expected for buildworld
and a bit less than I would have expected for the others.  I guess many
of the ones for buildworld are for the FP in setjmp() for jumps that
are never taken.

220000 extra FP context switches at 264 cycles each would increase my
buildworld time by a whole 0.34 seconds or 0.025%.  There may be more
important things to optimize :-).
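
The ratios behind these remarks can be pulled out of the quoted table
with a few lines of C.  This is just a sketch: the struct and function
names are made up for illustration, and the counts are copied from the
table above.  For the buildworld row, DNA traps minus FP switches is
281557 - 59753 = 221804, which appears to be the figure rounded to
220000 here.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the quoted December 1999 table. */
struct fp_counts {
	const char *workload;
	unsigned long ctx_switches;	/* vm.stats.sys.v_swtch */
	unsigned long dna_traps;	/* device-not-available traps */
	unsigned long fp_switches;	/* traps that loaded a different context */
};

const struct fp_counts quoted_counts[] = {
	{ "build world and a few CVS operations", 1754982UL, 281557UL, 59753UL },
	{ "gnuplot and xv in parallel", 79044UL, 18811UL, 10341UL },
	{ "parallel FP-intensive progs", 800UL, 138UL, 130UL },
};

/* Fraction of context switches that were followed by a DNA trap. */
double
dna_trap_fraction(const struct fp_counts *c)
{
	assert(c->ctx_switches > 0);
	return ((double)c->dna_traps / (double)c->ctx_switches);
}

/* DNA traps where the FP context loaded was the one already saved. */
unsigned long
same_context_traps(const struct fp_counts *c)
{
	assert(c->dna_traps >= c->fp_switches);
	return (c->dna_traps - c->fp_switches);
}

/* Print both derived figures for each of the n rows. */
void
print_fp_counts(const struct fp_counts *rows, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		printf("%-38s %5.1f%% of switches trap, %lu same-context traps\n",
		    rows[i].workload, 100.0 * dna_trap_fraction(&rows[i]),
		    same_context_traps(&rows[i]));
}
```

Calling print_fp_counts(quoted_counts, 3) from main() prints one line
per workload.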

Bruce

