Date:      Fri, 14 Mar 2008 00:34:27 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Jeff Roberson <jroberson@chesapeake.net>
Cc:        arch@freebsd.org, Peter Wemm <peter@wemm.org>, David Xu <davidxu@freebsd.org>
Subject:   Re: amd64 cpu_switch in C.
Message-ID:  <20080313230809.W32527@delplex.bde.org>
In-Reply-To: <20080312211834.T1091@desktop>
References:  <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> <e7db6d980803120125y41926333hb2724ecd07c0ac92@mail.gmail.com> <20080313124213.J31200@delplex.bde.org> <20080312211834.T1091@desktop>

On Wed, 12 Mar 2008, Jeff Roberson wrote:

> On Thu, 13 Mar 2008, Bruce Evans wrote:
>
>> On Wed, 12 Mar 2008, Peter Wemm wrote:
>> 
>>> On Tue, Mar 11, 2008 at 9:14 PM, David Xu <davidxu@freebsd.org> wrote:
>>>> Jeff Roberson wrote:
>>>> > http://people.freebsd.org/~jeff/amd64.diff
>>>> 
>>>>  This is a good idea.
>> 
>> I wouldn't have expected it to make much difference.  On i386 UP,
>> cpu_switch() normally executes only 48 instructions for in-kernel
>> context switches in my version of 5.2 and only 61 instructions in
>> -current.  ~5.2 differs from 5.2 here only in not having to
>> switch %eflags.  This saves 4 instructions but much more in cycles,
>> especially in P4 where accesses to %eflags are very slow.  5.2 would
>> take 52 instructions, and -current has bloated by 9 instructions
>> relative to 5.2.
>
> More expensive than the raw instruction count is:
>
> 1)  The mispredicted branches to deal with all of the optional state and 
> features that are not always saved.

This is unlikely to matter, and apparently doesn't, at least in simple
benchmarks, since the C version has even more branches.  Features that
are rarely used cause branches that are usually perfectly predicted.

> 2)  The cost of extra icache for getting over all of those unused 
> instructions, unaligned jumps, etc.

Again, if this were the cause of slowness then it would affect the C
version more, since the C version is larger.

In fact, the benchmark is probably too simple to show the cost of
branches.  Just doing sched_yield() in a loop gives the following
atypical behaviour, which may be enough to keep the larger branch
and cache costs of the C version from having much effect:
- it doesn't go near most of the special cases, so branches are
   predictable (always non-special) and are thus predicted provided
   (a) the CPU actually does reasonably good branch prediction, and
   (b) the branch predictions fit in the branch prediction cache
       (reasonably good branch prediction probably requires such a
       cache).
- it doesn't touch much icache or dcache or branch-cache, so
   everything probably stays cached.

If just the branch-cache were thrashed, then reasonably good dynamic
branch prediction would be impossible and things would be slow.  In the C
version, you use predict_true() and predict_false() a lot.  This
might improve static branch prediction but makes little difference
if the branch cache is working.
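
For reference, those hints just wrap gcc's __builtin_expect(), roughly
as follows (going from memory of sys/cdefs.h, so treat the exact
spelling as an approximation):

    #include <stdio.h>

    /* Approximately what the kernel's prediction-hint macros expand to. */
    #define predict_true(exp)   __builtin_expect((exp), 1)
    #define predict_false(exp)  __builtin_expect((exp), 0)

    int
    main(void)
    {
            int x = 0;

            if (predict_false(x != 0))      /* path hinted as cold */
                    printf("rare path\n");
            return (0);
    }

They only affect static prediction and block layout; a warm dynamic
predictor ignores them, which is why they don't buy much here.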

The C version uses lots of non-inline function calls.  Just the
branches for this would have a significant overhead if the branches
are mispredicted.  I think you are depending on gcc's auto-inlining
of static functions which are only called once to avoid the full
cost of the function calls.
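
What I mean is something like this (purely illustrative): with -O2, gcc
normally folds a static function that has a single call site into its
caller even without an inline keyword, so the call/ret pair and its
associated branch overhead disappear:

    #include <stdio.h>

    /* Single call site: gcc -O2 normally inlines this automatically. */
    static int
    called_once(int a, int b)
    {
            return (a + b);
    }

    int
    main(void)
    {
            printf("%d\n", called_once(1, 2));
            return (0);
    }

If that auto-inlining doesn't happen (say at -O0, or after the function
grows a second caller), the full call overhead comes back.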

> I haven't looked at i386 very closely lately but on amd64 the wrmsrs for 
> fs/gsbase are very expensive.  On my 2ghz dual core opteron the optimized 
> switch seems to take about 100ns.  The total switch from userspace to 
> userspace is about 4x that.

Probably avoiding these is the only significant difference between all
the versions.  You use predict_false() for executing them.  Are fsbase
and gsbase really usually constant across processes?
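
I assume the check amounts to something like the following (a sketch
only; the struct, field and constant names are placeholders here, not
the actual committed code):

    #include <stdint.h>

    #define MSR_FSBASE  0xc0000100      /* values assumed for illustration */
    #define MSR_GSBASE  0xc0000101

    struct base_pcb {                   /* stand-in for struct pcb */
            uint64_t pcb_fsbase;
            uint64_t pcb_gsbase;
    };

    static void
    wrmsr_stub(uint32_t msr, uint64_t v)    /* stand-in for the real wrmsr */
    {
            (void)msr; (void)v;
    }

    /* Only pay for the slow wrmsr when the base actually changes. */
    static void
    switch_bases(struct base_pcb *oldpcb, struct base_pcb *newpcb)
    {
            if (newpcb->pcb_fsbase != oldpcb->pcb_fsbase)
                    wrmsr_stub(MSR_FSBASE, newpcb->pcb_fsbase);
            if (newpcb->pcb_gsbase != oldpcb->pcb_gsbase)
                    wrmsr_stub(MSR_GSBASE, newpcb->pcb_gsbase);
    }

    int
    main(void)
    {
            struct base_pcb a = { 0x1000, 0x2000 }, b = { 0x1000, 0x3000 };

            switch_bases(&a, &b);       /* only the gsbase write happens */
            return (0);
    }

Whether this wins depends on how often the bases really are equal
across a switch, hence the question.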

400nS is about what I get for i386 on 2.2GHz A64 UP too (6.17 S for
./yield 1000000 10).  getpid() on this machine takes 180nS so it is
unreasonable to expect sched_yield() to take much less than a few hundred
nS.
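
(For scale: the benchmark is just sched_yield() in a loop, along the
lines of the sketch below.  Reading the two arguments as an iteration
count and a repeat count is an assumption.)

    #include <sched.h>
    #include <stdlib.h>

    /* Minimal stand-in for the ./yield microbenchmark. */
    int
    main(int argc, char **argv)
    {
            long iters = argc > 1 ? atol(argv[1]) : 1000000;
            long reps = argc > 2 ? atol(argv[2]) : 1;
            long i, r;

            for (r = 0; r < reps; r++)
                    for (i = 0; i < iters; i++)
                            sched_yield();
            return (0);
    }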

Some perfmon output for ./yield 100000 10:

% # s/kx-ls-microarchitectural-resync-by-self-mod-code 
% 0
% # s/kx-ls-buffer2-full 
% 909905
% # s/kx-ls-retired-cflush-instructions 
% 0
% # s/kx-ls-retired-cpuid-instructions 
% 0
% # s/kx-dc-accesses 
% 496436422
% # s/kx-dc-misses 
% 11102024

11 dcache misses per yield.  Probably the main cause of slowness (main
memory latency on this machine is 42 nsec, so 11 cache misses take
462 of the 617 nS per call?).

% # s/kx-dc-refills-from-l2 
% 0
% # s/kx-dc-refills-from-system 
% 0
% # s/kx-dc-writebacks 
% 0
% # s/kx-dc-l1-dtlb-miss-and-l2-dtlb-hits 
% 3459100
% # s/kx-dc-l1-and-l2-dtlb-misses 
% 2138231
% # s/kx-dc-misaligned-references 
% 87
% # s/kx-dc-microarchitectural-late-cancel-of-an-access 
% 73146415
% # s/kx-dc-microarchitectural-early-cancel-of-an-access 
% 236927303
% # s/kx-bu-cpu-clk-unhalted 
% 1303921314
% # s/kx-ic-fetches 
% 236207869
% # s/kx-ic-misses 
% 22988

Insignificant icache misses.

% # s/kx-ic-refill-from-l2 
% 18979
% # s/kx-ic-refill-from-system 
% 4191
% # s/kx-ic-l1-itlb-misses 
% 0
% # s/kx-ic-l1-l2-itlb-misses 
% 1619297
% # s/kx-ic-instruction-fetch-stall 
% 1034570822
% # s/kx-ic-return-stack-hit 
% 20822416
% # s/kx-ic-return-stack-overflow 
% 5870
% # s/kx-fr-retired-instructions 
% 701240247
% # s/kx-fr-retired-ops 
% 1163464391
% # s/kx-fr-retired-branches 
% 121636370
% # s/kx-fr-retired-branches-mispredicted 
% 2761910
% # s/kx-fr-retired-taken-branches 
% 93488548
% # s/kx-fr-retired-taken-branches-mispredicted 
% 2848315

2.8 branches mispredicted per call.

% # s/kx-fr-retired-far-control-transfers 
% 2000934

1 int0x80 and 1 iret per sched_yield(), and apparently not much else.

% # s/kx-fr-retired-resync-branches 
% 936968
% # s/kx-fr-retired-near-returns 
% 19008374
% # s/kx-fr-retired-near-returns-mispredicted 
% 784103

0.8 returns mispredicted per call.

% # s/kx-fr-retired-taken-branches-mispred-by-addr-miscompare 
% 721241
% # s/kx-fr-interrupts-masked-cycles 
% 658462615

Ugh, this is from spinlocks bogusly masking interrupts.  More than half
the cycles have interrupts masked.  This at least shows that lots of
time is being spent near cpu_switch() with a spinlock held.

% # s/kx-fr-interrupts-masked-while-pending-cycles 
% 9365

Since the CPU is reasonably fast, interrupts aren't masked for very long
each time.  This maximum is still 4.5 uS.

% # s/kx-fr-hardware-interrupts 
% 63
% # s/kx-fr-decoder-empty 
% 247898696
% # s/kx-fr-dispatch-stalls 
% 589228741
% # s/kx-fr-dispatch-stall-from-branch-abort-to-retire 
% 39894120
% # s/kx-fr-dispatch-stall-for-serialization 
% 44037193
% # s/kx-fr-dispatch-stall-for-segment-load 
% 134520281

134 cycles per call.  This may be mostly for the segment loads in
syscall() generally rather than in cpu_switch() itself.  I think each
segreg load still costs ~20 cycles.  Since this is on i386, there are
6 per call (%ds, %es and %fs save and restore), plus the %ss save and
restore, which might not be counted here.  134 is a lot -- about
60nS of the 180nS for getpid().

% # s/kx-fr-dispatch-stall-when-reorder-buffer-is-full 
% 18648001
% # s/kx-fr-dispatch-stall-when-reservation-stations-are-full 
% 121485247
% # s/kx-fr-dispatch-stall-when-fpu-is-full 
% 19
% # s/kx-fr-dispatch-stall-when-ls-is-full 
% 203578275
% # s/kx-fr-dispatch-stall-when-waiting-for-all-to-be-quiet 
% 63136307
% # s/kx-fr-dispatch-stall-when-far-xfer-or-resync-br-pending 
% 6994131

>> In-kernel switches are not a very typical case since they don't load
>> %cr3...
>
> We've been working on amd64 so I can't comment specifically about i386 costs. 
> However, I definitely agree that cpu_switch() is not the greatest overhead in 
> the path.  Also, you have to load cr3 even for kernel threads because the 
> page directory page or page directory pointer table at %cr3 can go away once 
> you've switched out the old thread.

I don't see this.  The switch is avoided if %cr3 wouldn't change, which
I think usually or always happens for switches between kernel threads.
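
To be explicit about what I mean (a sketch of the idea only, not the
actual swtch.s code; the names are made up):

    #include <stdint.h>

    struct cr3_pcb {                    /* stand-in for the real pcb/pmap */
            uint64_t pcb_cr3;           /* page directory base (32 bits on i386) */
    };

    static void
    load_cr3_stub(uint64_t cr3)         /* stand-in for the %cr3 load */
    {
            (void)cr3;
    }

    /*
     * Kernel threads all run on the kernel pmap, so for a switch
     * between them the two values should be equal and the
     * TLB-flushing %cr3 load is skipped entirely.
     */
    static void
    switch_cr3(struct cr3_pcb *oldpcb, struct cr3_pcb *newpcb)
    {
            if (newpcb->pcb_cr3 != oldpcb->pcb_cr3)
                    load_cr3_stub(newpcb->pcb_cr3);
    }

    int
    main(void)
    {
            struct cr3_pcb a = { 0x1000 }, b = { 0x1000 };

            switch_cr3(&a, &b);         /* same pmap: no %cr3 load */
            return (0);
    }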

>> The asm code already saves only call-saved registers for both i386 and
>> amd64.  It saves call-saved registers even when it apparently doesn't
>> use them (lots more of these on amd64, while on i386 it uses more
>> call-saved registers than it needs to, apparently since this is free
>> after saving all call-saved registers).  I think saving more than is
>> needed is the result of confusion about what needs to be saved and/or
>> what is needed for debugging.
>
> It has to save all of the callee saved registers in the PCB because they will 
> likely differ from thread to thread.  Failing to save and restore them could 
> leave you returning with the registers having different values and corrupt 
> the calling function.

Yes, I had forgotten the detail of how the non-local flow of control can
change the registers (the next call to the function in the context of
the switched-to-process may have different values in the registers due
to changes to the registers in callers).

All that can be done differently here is to save all the registers
(except %esp) on the stack in the usual way.  This would probably be
faster on old i386's using pushal or pushl, but pushal is not available
on amd64, and on Athlons generally (before Barcelona?) it is faster not
to use pushes at all, so on amd64 the registers should be saved using
plain mov stores, and then it is just as easy to put them in the pcb as
on the stack.
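
Concretely, the pcb side of that is just one slot per callee-saved
register (field names here are illustrative, not the real struct pcb):

    #include <stdint.h>

    /*
     * One slot per callee-saved register.  cpu_switch() can write
     * each slot with a plain 64-bit mov store, so saving into the
     * pcb costs no more than pushing onto the stack would, and the
     * restore side is the mirror image.
     */
    struct pcb_regs_sketch {
            uint64_t pcb_rbx;
            uint64_t pcb_rbp;
            uint64_t pcb_rsp;
            uint64_t pcb_r12;
            uint64_t pcb_r13;
            uint64_t pcb_r14;
            uint64_t pcb_r15;
            uint64_t pcb_rip;           /* resume address for the switched-in thread */
    };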

>>> The good news is that this tuning is finally being done.  It should
>>> have been done in 2003 though...
>> 
>> How is this possible with (according to my theory) most of the context
>> switch cost being for %cr3 and upper layers?  Unchanged amd64 has only
>> a few more costs than i386.  Mainly 3 unconditional wrmsr's and 2
>> unconditional rdmsr's for managing gsbase and fsbase.  I thought that
>> these were hard to avoid and anyway not nearly as expensive as %cr3 loads.
>
> %cr3 is actually a lot less expensive these days with page table flush 
> filters and the PG_G bit.  We were able to optimize away setting the msrs in 
> the case that the previous values match the new values.  Apparently the 
> hardware doesn't optimize this case so we have to do comparisons ourselves.
>
> That was a big chunk of the optimization.  Static branch hints, reordering 
> code, possibly reordering for better pipeline scheduling in peter's asm, etc. 
> provide the rest.

All the old i386 asm, and probably the clones of it on amd64, is
certainly not optimized globally for anything newer than an i386 (barely
even an i486).  This rarely matters, however.  It lost more on
Pentium-1's, but now out-of-order execution and better branch prediction
hide most inefficiencies.

Bruce


