Date: Mon, 14 Jan 2002 13:31:20 +1100 (EST)
From: Bruce Evans
To: Peter Jeremy
Cc: Terry Lambert, Peter Wemm, Alfred Perlstein, Kelly Yancey, Nate Williams, Daniel Eischen, Dan Eischen, Archie Cobbs
Subject: Re: Request for review: getcontext, setcontext, etc
In-Reply-To: <20020114074238.S561@gsmx07.alcatel.com.au>
Message-ID: <20020114120026.S3794-100000@gamplex.bde.org>
Sender: owner-freebsd-arch@FreeBSD.ORG

On Mon, 14 Jan 2002, Peter Jeremy wrote:

> On 2002-Jan-12 21:40:20 +1100, Bruce Evans wrote:
> >This seems to be off by a bit.  Actual timing on an Athlon1600
> >overclocked a little gives the following times for some critical
> >parts of context switching for each iteration of instructions in
> >a loop (not counting 2 cycles of loop overhead):
> >
> >pushal; popal:         9 cycles
> >pushl %ds; popl %ds:  21 cycles
> >fxsave; fxrstor:     105 cycles
> >fnsave; frstor:      264 cycles
>
> I can think of a possible reason:  The FPU knows when it has been used
> vs just having executed fninit.  In the latter case, all it needs to
> save is "I've been initialised".

There are a lot of reasons.  Peter Wemm mentioned some alignment issues
and not doing an (in)convenient implicit fninit.
The 512-byte SSE context is more than half "reserved", and most of this
half is not written to by fxsave and presumably not read from by fxrstor
(I didn't check the manual for this; I just fxsave'ed over 0xff bytes
and checked what changed).

> Also the FPU architecture includes
> "used" flags associated with each register - possibly the f*save
> instructions don't flush unused registers.  Do the above numbers
> change when you push real data into the FP registers?

The manual seems to say that fxsave is clever about this, but it doesn't
seem to be.  Loading registers didn't make any difference to the times
with the following register contents:

    all 1.0's
    6 1.0's, one 0.0, one +Inf, and a masked DIVZ exception for creating
      the +Inf
    all xmm registers explicitly loaded with 0

OTOH, with the DIVZ unmasked (but never trapped for), fxsave/fxrstor
takes 187 cycles.  I think fnsave/frstor is normally slowed down by
(unmasked only?) exceptions too.

> Also, how expensive is a DNA trap?  Would it be cheaper overall to

I counted 200 kernel instructions for non-SMP in -current.  Half of
these are for recent pessimizations in -current: crhold() and crfree()
and locking for these take about 100 instructions.  I haven't measured
the expense in real time.  The basic (mostly unnecessary) overhead is
the same as for pagefault traps.  Mmap latency on this machine is
153nsec according to lmbench2.

> always load FPU context on a switch - this is more expensive for
> processes that don't use FP, but saves a DNA trap per context switch
> (assuming they use FP in that slice) for those that do.

Not overall, since most timeslices don't use the FPU (at least for
processes that I run :-).

> To add some further numbers, in December 1999, I did some measurements
> on FP switching by patching npx.c.  This was on a PII-266 running then
> -current.
> (The original e-mail was sent to -arch on Mon, 20 Dec 1999
> 07:34:06 +1100 in a thread titled "Concrete plans for ucontext/
> mcontext changes around 4.0" - I don't have the message-id available).
>
>    ctxt     DNA      FP
>   swtch   traps   swtch
> 1754982  281557   59753  build world and a few CVS operations [1]
>   79044   18811   10341  gnuplot and xv in parallel [2]
>     800     138     130  parallel FP-intensive progs [3].
>
> In the above, `ctxt swtch' is the number of context switches counted
> via vm.stats.sys.v_swtch.  `DNA traps' is the number of device not
> available traps registered and `FP swtch' is the number of DNA traps
> where the FP context loaded is different to that saved on the
> preceding context switch.

That's a lot more DNA traps than I would have expected for buildworld
and a bit less than I would have expected for the others.  I guess many
of the ones for buildworld are for the FP in setjmp() for jumps that are
never taken.

220000 extra FP context switches at 264 cycles each would increase my
buildworld time by a whole 0.34 seconds or 0.025%.  There may be more
important things to optimize :-).

Bruce
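
As a reading aid for the quoted December 1999 table, the per-workload
ratios can be recomputed with a short script.  This is only a sketch and
not part of the original exchange: the counters are copied from the
table, while the names WORKLOADS, summarize and SUMMARY are made up for
illustration.  Treating "DNA traps minus FP swtch" as the traps that
reloaded the context already saved is an assumed reading of where the
"220000 extra FP context switches" estimate comes from.

```python
# Counters copied from the quoted table (PII-266, December 1999):
# (context switches, DNA traps, FP context switches).
WORKLOADS = {
    "build world and a few CVS operations": (1754982, 281557, 59753),
    "gnuplot and xv in parallel": (79044, 18811, 10341),
    "parallel FP-intensive progs": (800, 138, 130),
}


def summarize(ctx_switches, dna_traps, fp_switches):
    """Derive ratios from one table row (hypothetical helper)."""
    return {
        # Fraction of context switches followed by a DNA trap.
        "dna_per_switch": dna_traps / ctx_switches,
        # Fraction of DNA traps that loaded a different FP context.
        "fp_per_dna": fp_switches / dna_traps,
        # Assumed reading: traps that reloaded the same FP context.
        "same_context_traps": dna_traps - fp_switches,
    }


SUMMARY = {name: summarize(*row) for name, row in WORKLOADS.items()}

if __name__ == "__main__":
    for name, s in SUMMARY.items():
        print(f"{name}:")
        print(f"  DNA traps per context switch: {s['dna_per_switch']:.1%}")
        print(f"  FP switches per DNA trap:     {s['fp_per_dna']:.1%}")
        print(f"  same-context DNA traps:       {s['same_context_traps']}")
```

For the buildworld row this gives 221804 same-context traps, with
roughly 16% of context switches followed by a DNA trap.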