Date:      Sat, 7 Jul 2018 03:55:35 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        John Baldwin <jhb@freebsd.org>
Cc:        rgrimes@freebsd.org, Warner Losh <imp@bsdimp.com>,  Hans Petter Selasky <hselasky@freebsd.org>,  src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org,  svn-src-head@freebsd.org
Subject:   Re: svn commit: r336025 - in head/sys: amd64/include i386/include
Message-ID:  <20180707031245.J2611@besplex.bde.org>
In-Reply-To: <1f87b7ba-3b59-e710-00b0-91a4b0e4e5b4@FreeBSD.org>
References:  <201807061552.w66Fq0FX052931@pdx.rh.CN85.dnsmgr.net> <1f87b7ba-3b59-e710-00b0-91a4b0e4e5b4@FreeBSD.org>

On Fri, 6 Jul 2018, John Baldwin wrote:

> On 7/6/18 8:52 AM, Rodney W. Grimes wrote:
>> ...
>> Trivial to fix this with
>> +#if defined(SMP) || !defined(_KERNEL) || defined(KLD_MODULE) || !defined(KLD_UP_MODULES)
>
> This is not worth it.  Note that we already use LOCK always in userland
> which is probably far more prevalent than the use in modules.
>
> Previously atomics in modules were _function calls_ just to avoid the LOCK.
> Having the LOCK prefix present even on UP is probably far more efficient
> than a function call.

No, the lock prefix is less efficient.

IIRC, on very old systems (~PPro), lock prefixes cost 20 cycles in the UP
case.  On AthlonXP, they cost about 19 cycles, but function calls (written
in C) only cost about 6 cycles.  This depends on pipelining, and my
test is perhaps too simple since it uses a loop where the pipelining
works especially well (it executes 2 or 3 function calls in parallel).

Actual timing on AthlonXP UP:
- asm loop: 2 cycles/iteration
- "incl mem" in asm loop: 5.85 cycles (but with less alignment, only 3.25
   cycles)
- "lock; incl mem" in asm loop: 18.9 cycles
- function call in C loop to C function doing "incl mem" in asm: 8.35 cycles
- function call in C loop to C function doing "lock; incl mem" in asm: 24.95
   cycles.

Newer CPUs have better pipelining.  On Haswell, this gives the strange
behaviour that the function call written in C is slightly faster than
inline code written in asm:

Actual timing on Haswell SMP:
- asm loop: 1.16 cycles/iteration
- "incl mem" in asm loop: 6.95 cycles
- "lock; incl mem" in asm loop: 19.00 cycles
- function call in C loop to C function doing "incl mem" in asm: 6 cycles
- function call in C loop to C function doing "lock; incl mem" in asm: 26.00
   cycles.
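
The comparison that matters for the quoted claim is inline "lock; incl mem" against a function call doing a plain "incl mem".  A small sketch of that arithmetic, using only the figures listed above (the struct, the function names and the STANDALONE guard are made up for illustration):

```c
#include <stdio.h>

/* Measured cycles per iteration, copied from the two lists above. */
struct timing {
	const char *cpu;
	double inline_plain;	/* "incl mem" in asm loop */
	double inline_locked;	/* "lock; incl mem" in asm loop */
	double call_plain;	/* C call to function doing "incl mem" */
	double call_locked;	/* C call to function doing "lock; incl mem" */
};

static const struct timing timings[] = {
	{ "AthlonXP UP", 5.85, 18.9, 8.35, 24.95 },
	{ "Haswell SMP", 6.95, 19.00, 6.00, 26.00 },
};

/* Extra cycles from adding the lock prefix to the inline increment. */
double
lock_cost(const struct timing *t)
{
	return (t->inline_locked - t->inline_plain);
}

/* Extra cycles from moving the unlocked increment into a function. */
double
call_cost(const struct timing *t)
{
	return (t->call_plain - t->inline_plain);
}

/* Inline locked increment minus function call to unlocked increment. */
double
inline_lock_vs_call(const struct timing *t)
{
	return (t->inline_locked - t->call_plain);
}

#ifdef STANDALONE
int
main(void)
{
	size_t i;

	for (i = 0; i < sizeof(timings) / sizeof(timings[0]); i++)
		printf("%s: lock %+.2f, call %+.2f, inline lock vs call %+.2f\n",
		    timings[i].cpu, lock_cost(&timings[i]),
		    call_cost(&timings[i]), inline_lock_vs_call(&timings[i]));
	return (0);
}
#endif
```

In these two simple loops the lock prefix adds about 12 to 13 cycles, while the function call adds at most about 2.5 cycles and is slightly negative on Haswell.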

The C code with the function call executes:

loop:
 	call	incl
 	incl:
 		pushl	%ebp
 		movl	%esp,%ebp
 		[lock;] incl mem
 		leave
 		ret
 	incl	%ebx
 	cmpl	$4080000000-1,%ebx
 	jbe	loop

I didn't even compile with -fomit-frame-pointer or try clang, which would
do excessive unrolling.  Keeping the frame pointer takes 3 extra
instructions in incl, but these take no extra time.
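
A minimal sketch of a test along these lines, assuming GCC or clang inline-asm syntax.  It is not the original test program: the names inc_plain, inc_locked, time_calls and BENCH_MAIN are made up for illustration, it reports wall-clock nanoseconds per call rather than cycles, and it calls through a function pointer instead of directly, so the absolute numbers will not match those above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Shared counter; volatile so every increment really touches memory. */
static volatile uint32_t counter;

/* Out-of-line increment without the lock prefix ("incl mem" on x86). */
__attribute__((noinline)) void
inc_plain(void)
{
#if defined(__i386__) || defined(__x86_64__)
	__asm__ __volatile__("incl %0" : "+m" (counter) : : "cc");
#else
	counter++;		/* fallback for non-x86 hosts */
#endif
}

/* Out-of-line increment with the lock prefix ("lock; incl mem" on x86). */
__attribute__((noinline)) void
inc_locked(void)
{
#if defined(__i386__) || defined(__x86_64__)
	__asm__ __volatile__("lock; incl %0" : "+m" (counter) : : "cc");
#else
	__atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
#endif
}

/* Call fn() n times and return the elapsed nanoseconds per call. */
double
time_calls(void (*fn)(void), uint32_t n)
{
	struct timespec t0, t1;
	uint32_t i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < n; i++)
		fn();
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (((t1.tv_sec - t0.tv_sec) * 1e9 +
	    (t1.tv_nsec - t0.tv_nsec)) / n);
}

#ifdef BENCH_MAIN
int
main(void)
{
	const uint32_t n = 100000000;

	printf("plain:  %.2f ns/call\n", time_calls(inc_plain, n));
	printf("locked: %.2f ns/call\n", time_calls(inc_locked, n));
	return (0);
}
#endif
```

Building with -DBENCH_MAIN gives a standalone program.  Multiplying the ns/call figures by the CPU clock rate gives a rough cycles-per-iteration estimate, provided the clock frequency stays fixed during the run.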

In non-benchmark use, there would be more args for the function call and
the scheduling would be very different, so the timing might be very
different.  I expect the function call would be insignificantly slower
except in micro-benchmarks.

Bruce


