Date:      Sat, 22 Jun 2013 12:35:33 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Gleb Smirnoff <glebius@FreeBSD.org>
Cc:        svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, Konstantin Belousov <kib@FreeBSD.org>, Bruce Evans <brde@optusnet.com.au>
Subject:   Re: svn commit: r252032 - head/sys/amd64/include
Message-ID:  <20130622110352.J2033@besplex.bde.org>
In-Reply-To: <20130621135427.GA1214@FreeBSD.org>
References:  <201306201430.r5KEU4G5049115@svn.freebsd.org> <20130621065839.J916@besplex.bde.org> <20130621081116.E1151@besplex.bde.org> <20130621090207.F1318@besplex.bde.org> <20130621064901.GS1214@FreeBSD.org> <20130621184140.G848@besplex.bde.org> <20130621135427.GA1214@FreeBSD.org>

On Fri, 21 Jun 2013, Gleb Smirnoff wrote:

> On Fri, Jun 21, 2013 at 09:02:36PM +1000, Bruce Evans wrote:
> B> Not if it is a 32-bit increment on 32-bit systems, as it should be.
> B>
> B> I said to use a daemon to convert small (16 or 32 bit) counters into
> B> larger (32 or 64 bit) ones.  It is almost as efficient to call the
> B> accumulation code directly:

Please trim quotes more.

>[... lots of technical details]

> May be you are right and an implementation with a "daemon" would be
> more efficient for i386. But daemon isn't needed at all on 64-bit
> platforms. So I doubt it is worth coding it.

Of course it would be more efficient on i386.  Probably it would be
more efficient on amd64 too.  The main counter array would be half
the size, so it takes less cache and fewer write buffers, and the
increments use smaller instructions.

> Frankly speaking, we already do not consider the i386 as a platform to
> build high performance solutions. The need for counter(9) in TCP/IP
> stack was demonstrated on modern amd64 hardware. Thus, our motivation
> was to make:

I need to complain about lost performance in i386 more, then :-).  i386
is faster for anything that doesn't need large memory.

>  - a lossless and faster implementation for amd64
>  - a lossless implementation for !amd64
>  - if we have an option to avoid critical section on some arch, good
>    let's avoid it.

I already gave an implementation to avoid the critical section.  But using
the critical section seems to be the second best method.  The best method
is using 32-bit counters and a daemon.
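To be concrete, the daemon scheme amounts to something like the following
(a rough sketch only -- the names here are invented for illustration and
this is not a patch): each CPU does a plain 32-bit add to its own slot,
and an accumulation routine (run from a daemon, or called directly when a
slot is about to carry, as in (6) below) folds the slots into the 64-bit
total before any slot can advance by 2^32.

#include <stdint.h>

#define	NCPU_SKETCH	32		/* stands in for the number of CPUs */

struct counter_sketch {
	uint32_t	pcpu[NCPU_SKETCH];	/* each written by its own CPU */
	uint32_t	seen[NCPU_SKETCH];	/* last value already folded in */
	uint64_t	total;			/* written only by the accumulator */
};

static inline void
counter_sketch_add(struct counter_sketch *c, int cpu, uint32_t inc)
{

	/* A single 32-bit add to per-CPU memory; no locked instruction. */
	c->pcpu[cpu] += inc;
}

static void
counter_sketch_accumulate(struct counter_sketch *c)
{
	uint32_t cur;
	int i;

	for (i = 0; i < NCPU_SKETCH; i++) {
		cur = c->pcpu[i];	/* aligned 32-bit load is atomic */
		/* Unsigned subtraction handles wraparound of the slot. */
		c->total += (uint32_t)(cur - c->seen[i]);
		c->seen[i] = cur;
	}
}

Reading the counter is then c->total plus the not-yet-folded deltas, but
for statistics the slightly stale c->total is usually good enough.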

Here are considerably expanded tests, with noninline tests dropped.
Summary of times on Athlon64:

simple increment:                               4-7 cycles (1)
simple increment preceded by feature test:      5-8 cycles (1)
simple 32-bit increment:                        4-7 cycles (2)
correct 32-bit increment (addl to mem):         5.5-7 cycles (3)
inlined critical section:                       8.5 cycles (4)
better inlined critical section:                7 cycles (5)
correct unsigned 32-bit inc of 64-bit counter:  4-7 cycles (6)
"improve" previous to allow immediate operand:  5+ cycles
correct signed 32-bit inc of 64-bit counter:    8.5-9 cycles (7)
correct 64-bit inc of 64-bit counter:           8-9 cycles (8)
-current method (cmpxchg8b):                   18 cycles

(1) Unusable due to races.
(2) It's no faster to do a simple increment of a 32-bit counter than a
     64-bit one!  This indicates delicate pipelining/memory access
     behaviour.  The Athlon64 timing for addl to memory is 4 cycles
     latency.  The tests show this being achieved.  But since it is only
     latency and an Athlon64 can do something like 1 64-bit store and
     1 64-bit load per cycle (or 2 64-bit loads but not 2 64-bit stores
     per cycle?), there is plenty of time to do the adcl to memory and
     also the loop overhead in not-quite-the-same 4 cycles, so that the
     loop throughput is 1 per 4 cycles.  In practical/non-loop use, you
     would care even less about the store latency.  Compilers often mess
     this up to get 7 cycles instead of 4.  -O2 generally gives the worst
     results, by unrolling the loop.  Since the rolled loop can execute
     in 4 cycles, unrolling it can only pessimize it.
(3) The single 32-bit addl, combined with a daemon, is what should be used.
     But compilers generated worse code for this than the similar but more
     branchy code in (6).
(4) The critical section method is quite fast when inlined.
(5) The critical section method is even faster when optimized.  This is
     what should be used if you don't want the complications for the
     daemon.
(6) Call the daemon directly.  (The "daemon" code is now just the current
     code in an extern function.  In the tests the extern function is a
     dummy that is never actually called; it exists just to simulate the
     branch prediction penalties.)  Some restrictions on the increment.
(7) Same as (6), with fewer restrictions on the increment.  This uses
     cmpxchg just like the -current method uses cmpxchg8b, provided a
     32-bit increment is sufficient.  Races are detected as the low
     bits changing.  We don't care if the high bits change while we
     are changing the low bits.
(8) Same as (7), with no restrictions on the increment.  The full version
     somehow turned out to have a better fastest case than (7).  This is
     without even an optimization for constant increments that would
     eliminate the runtime check in most cases.  The extra code for the
     runtime check makes things faster :-).

% #include <stdint.h>
% #include <stdio.h>
% 
% static inline void
% counter_64_inc_8b(volatile uint64_t *p, int64_t inc)
% {
% 
% 	__asm __volatile(
% 	"movl	%%ds:(%%esi),%%eax\n\t"
% 	"movl	%%ds:4(%%esi),%%edx\n"
% "1:\n\t"
% 	"movl	%%eax,%%ebx\n\t"
% 	"movl	%%edx,%%ecx\n\t"
% 	"addl	(%%edi),%%ebx\n\t"
% 	"adcl	4(%%edi),%%ecx\n\t"
% 	"cmpxchg8b %%ds:(%%esi)\n\t"
% 	"jnz	1b"
% 	:
% 	: "S" (p), "D" (&inc)
% 	: "memory", "cc", "eax", "edx", "ebx", "ecx");
% }

I also don't like the style of this asm.  A better style is used below.

The memory clobber pessimization in the above is not used below.  The
effects of memory clobbers are visible in the test as the loop counter
and 'inc' value not staying in registers.  cmpxchg8b uses so many
registers that it is hard to keep things in registers anyway, so the
pessimization is smaller in the above.  However, since the add* and/or
cmpxchg* instructions have such a large latency, there is plenty of
time to reload all the loop variables without increasing the loop
latency.  In practical use, spilling all the local variables may have
a higher cost, since the latency of the counter add is unimportant.
Anyway, the asms should be optimized for throughput, not for latency.
Unfortunately, loop tests emphasize latency.  The only general ways
that I know of to optimize throughput for non-loops are:
- use highest-throughput instructions, of course
- minimise the number of micro-instructions
- minimize dependencies.  The one for addl to memory is problematic here.
   Optimal code would schedule the load for this addl early in a free
   load slot.  Then do add and the store at leisure (their latency doesn't
   matter).  The addl to memory instruction does precisely the wrong thing
   here.  The CPU can only sometimes reschedule it well, by seeing it early
   enough to start the load early.  The addq used in the amd64 has the same
   problem.  If possible, separate non-volatile asms for a load and the
   rest should be used.  The compiler hopefully schedules these well.
   Then amd64 would need cmpxchg* too, to handle the rare case where the
   counter changes between the separate asms.  A rough sketch of this
   follows the list.
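Roughly like this for amd64 (untested, the function name is made up here,
and the per-CPU addressing is omitted): the load is plain C so that the
compiler can schedule it early, and an unlocked cmpxchgq catches the rare
case where the counter changed in between.

static inline void
split_counter_u64_add(volatile uint64_t *c, uint64_t inc)
{
	uint64_t exp, src;

	exp = *c;		/* schedulable early by the compiler */
	__asm __volatile("	\n\
1:				\n\
	movq	%1,%2		\n\
	addq	%3,%2		\n\
	cmpxchgq %2,%0		\n\
	jne	1b		\n\
	"
	: "+m" (*c),		/* 0 */
	  "+a" (exp),		/* 1 */
	  "=&r" (src)		/* 2 */
	: "r" (inc)		/* 3 */
	: "cc");
}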

Reducing latency from 4 cycles to 1-3, or not letting it increase to 20
on i386, may be of no practical importance, but I like to understand
the details and either use the perfect algorithm or a simple algorithm.
These tests indicate that the amd64 algorithm is simple but very
good, while the i386 algorithm is complicated yet gives negative benefits.
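(For comparison, the "simple" amd64 method is essentially one gs-relative
addq to the per-CPU slot; a rough sketch from memory, with the per-CPU
offset arithmetic elided, not a copy of the real header:)

static inline void
sketch_amd64_counter_u64_add(uint64_t *c, int64_t inc)
{

	/* Single addq to a %gs-relative per-CPU address. */
	__asm __volatile("addq	%1,%%gs:(%0)"
	    :
	    : "r" (c), "ri" (inc)
	    : "memory", "cc");
}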

% 
% uint32_t cpu_feature = 1;
% 
% /*
%  * These shouldn't be volatile.  Volatile just avoids breaking the simulation
%  * by preventing optimizations that aren't possible for pcpu variables.
%  */
% typedef volatile uint64_t *counter_u64_t;
% typedef volatile uint32_t *counter_u32_t;
% 
% static inline void
% counter_u64_add(counter_u64_t c, int64_t inc)
% {
% 
% #if 1
% 	/* 18 cycles on A64. */
% 	if ((cpu_feature & 1) == 1) {
% 		counter_64_inc_8b(c, inc);
% 	}
% #elif 0
% 	/* 5-8 cycles on A64 (gcc-4.2.1 much slower than gcc-3.3.3). */
% 	if ((cpu_feature & 1) == 1) {
% 		*c += inc;
% 	}
% #else
% 	/* 4-7 cycles on A64. */
% 	*c += inc;
% #endif
% }
% 
% static inline void
% counter_u32_add(counter_u32_t c, int32_t inc)
% {
% 
% #if 1
% 	/* 5.5-7 cycles on A64. */
% 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
% #else
% 	/* 4-7 cycles on A64. */
% 	*c += inc;
% #endif
% }
% 
% struct thread
% {
% 	uint32_t td_critnest;
% } td0;
% 
% volatile struct thread *td = &td0;
% 
% void
% dummy_accum(void)
% {
% }
% 
% static inline void
% alt_counter_u64_add(counter_u64_t c, int64_t inc)
% {
% #if 1
% 	/* 8.5 cycles on A64. */
% 	td->td_critnest++;
% 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
% 	td->td_critnest--;	/* leave the critical section */
% #elif 1
% 	/* 7 cycles on A64. */
% 	uint32_t ocritnest;
% 
% 	ocritnest = td->td_critnest;
% 	td->td_critnest = ocritnest + 1;
% 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
% 	td->td_critnest = ocritnest;
% #elif 0
% 	/* Support only 32-bit non-negative increments. */
% 	/* 4-7 cycles on A64 (varies with gcc version, -O* and -static). */
% 	__asm __volatile("	\n\
% 	addl	%1,%%ds:%0	\n\
% 	jnc	8f		\n\
% 	call	dummy_accum	\n\
% 8: ;				\n\
% 	"
% 	: "=m" (*c) : "r" ((uint32_t)inc));
% #elif 0
% 	/* As in previous, but allow immediate operand. */
% 	/* 4-7 cycles on A64 (varies with gcc version, -O* and -static). */
% 	/*
% 	 * Using the immediate operand tends to be slower!
% 	 *
% 	 * But note that we don't use the memory clobber pessimization.
% 	 * This allows the inc operand (and the loop counter) to stay
% 	 * in registers across the call here, so the register operand
% 	 * is naturally faster.  gcc should prefer the register operand
% 	 * anyway in this case, but is not that smart.  It even prefers
% 	 * the register operand when the immediate operand is disparaged.
% 	 * Sometimes the optimization (just loop alignment?) accidentally
% 	 * compensates for making the worst choice.
% 	 */
% 	/* 5 cycles on A64. */
% 	__asm __volatile("	\n\
% 	addl	%1,%%ds:%0	\n\
% 	jnc	8f		\n\
% 	call	dummy_accum	\n\
% 8: ;				\n\
% 	"
% 	: "=m,m" (*c) : "?i,r" ((uint32_t)inc));
% #elif 0
% 	/* As in previous, but go through a register.  Neg. incr. now work. */
% 	/* 8.5-9 cycles on A64. */
% 	uint32_t exp, src;
% 
% 	__asm __volatile("	\n\
% 	movl	%%ds:%4,%1	\n\
% 1:				\n\
% 	movl	%1,%2		\n\
% 	addl	%3,%2		\n\
% 	jnc	7f		\n\
% 	call	dummy_accum	\n\
% 	jmp	8f		\n\
% 7:				\n\
% 	cmpxchgl %2,%%ds:%0	\n\
% 	jne	1b		\n\
% 8: ;				\n\
% 	"
% 	: "=m" (*c),		/* 0 */
% 	  "=a" (exp),		/* 1 */
% 	  "=r" (src)		/* 2 */
% 	: "ir" ((uint32_t)inc),	/* 3 */ /* don't try to disparage "i" */
% 	  "m" (*c)		/* 4 */
% 	);
% #else
% 	/* Full 64-bit version. */
% 	/* 8-9 cycles on A64. */
% 	uint32_t exp, src;
% 
% 	__asm __volatile("	\n\
% 	/* Runtime testing of %4 is wasteful if inc is constant. */	\n\
% 	/* Especially with the high register pressure. */		\n\
% 	testl	%4,%4		\n\
% 	je	1f		\n\
% 	call	dummy_accum	\n\
% 	jmp	8f		\n\
% 1:				\n\
% 	movl	%%ds:%5,%1	\n\
% 1:				\n\
% 	movl	%1,%2		\n\
% 	addl	%3,%2		\n\
% 	jnc	7f		\n\
% 	call	dummy_accum	\n\
% 	jmp	8f		\n\
% 7:				\n\
% 	cmpxchgl %2,%%ds:%0	\n\
% 	jne	1b		\n\
% 8: ;				\n\
% 	"
% 	: "=m" (*c),		/* 0 */
% 	  "=a" (exp),		/* 1 */
% 	  "=r" (src)		/* 2 */
% 	: "r" ((uint32_t)inc),	/* 3 */
% 	  "r" ((uint32_t)(inc >> 32)),	/* 4 */
% 	  "m" (*c)		/* 5 */
% 	);
% #endif
% }
% 
% uint64_t mycounter[1];
% uint32_t my32counter[1];
% 
% int
% main(void)
% {
% 	unsigned i;
% 
% 	for (i = 0; i < 2010168339; i++)	/* sysctl -n machdep.tsc_freq */
% #if 0
% 		counter_u64_add(mycounter, 1);
% #elif 1
% 		alt_counter_u64_add(mycounter, 1);
% #else
% 		counter_u32_add(my32counter, 1);
% #endif
% 	printf("%ju\n", (uintmax_t)mycounter[0]);
% 	return (mycounter[0] == 2010168339 ? 0 : 1);
% }

Bruce


