Date:      Sat, 22 Jun 2013 13:37:58 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, Gleb Smirnoff <glebius@freebsd.org>, Konstantin Belousov <kib@freebsd.org>, src-committers@freebsd.org
Subject:   Re: svn commit: r252032 - head/sys/amd64/include
Message-ID:  <20130622124832.S2347@besplex.bde.org>
In-Reply-To: <20130622110352.J2033@besplex.bde.org>
References:  <201306201430.r5KEU4G5049115@svn.freebsd.org> <20130621065839.J916@besplex.bde.org> <20130621081116.E1151@besplex.bde.org> <20130621090207.F1318@besplex.bde.org> <20130621064901.GS1214@FreeBSD.org> <20130621184140.G848@besplex.bde.org> <20130621135427.GA1214@FreeBSD.org> <20130622110352.J2033@besplex.bde.org>

On Sat, 22 Jun 2013, I wrote:

> ...
> Here are considerably expanded tests, with noninline tests dropped.
> Summary of times on Athlon64:
>
> simple increment:                               4-7 cycles (1)
> simple increment preceded by feature test:      5-8 cycles (1)
> simple 32-bit increment:                        4-7 cycles (2)
> correct 32-bit increment (addl to mem):         5.5-7 cycles (3)
> inlined critical section:                       8.5 cycles (4)
> better inlined critical section:                7 cycles (5)
> correct unsigned 32-bit inc of 64-bit counter:  4-7 cycles (6)
> "improve" previous to allow immediate operand:  5+ cycles
> correct signed 32-bit inc of 64-bit counter:    8.5-9 cycles (7)
> correct 64-bit inc of 64-bit counter:           8-9 cycles (8)
> -current method (cmpxchg8b):                   18 cycles

corei7 (freefall) has about the same timing as Athlon64, but core2
(ref10-i386) is 3-4 cycles slower for the tests that use cmpxchg.
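
A stripped-down sketch of the kind of userland timing loop behind these
numbers (not the actual test program; it assumes an invariant TSC, ignores
warm-up and loop overhead, and just does a plain 64-bit add to one location):

#include <stdint.h>
#include <stdio.h>

#define	ITERS	100000000ULL

static inline uint64_t
rdtsc(void)
{
	uint32_t hi, lo;

	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}

int
main(void)
{
	static uint64_t counter;	/* stand-in for one pcpu counter slot */
	uint64_t start;
	uint64_t i;

	start = rdtsc();
	for (i = 0; i < ITERS; i++)
		__asm __volatile("addq %1,%0" : "+m" (counter) : "r" ((uint64_t)1));
	printf("%.2f cycles/increment\n", (rdtsc() - start) / (double)ITERS);
	return (0);
}

Loop overhead is folded into the result, so it is only good to within a
cycle or so.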

> (4) The critical section method is quite fast when inlined.
> (5) The critical section method is even faster when optimized.  This is
>    what should be used if you don't want the complications for the
>    daemon.

Oops, I forgot that critical sections are much slower in -current than
in my version.  They probably take 20-40 cycles for the best case, and
can't easily be tested in userland since they disable interrupts in
hardware.  My versions disable interrupts in software.

> ...
> %
> % static inline void
> % alt_counter_u64_add(counter_u64_t c, int64_t inc)
> % {
> % #if 1
> % 	/* 8.5 cycles on A64. */
> % 	td->td_critnest++;
> % 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
> % 	td->td_critnest++;

Oops, one increment should be a decrement.

> % #elif 1
> % 	/* 7 cycles on A64. */
> % 	uint32_t ocritnest;
> % 
> % 	ocritnest = td->td_critnest;
> % 	td->td_critnest = ocritnest + 1;
> % 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
> % 	td->td_critnest = ocritnest;
> % #elif 0

Even in my version, I have to check for unmasked interrupts when td_critnest
is reduced to 0.  At least the test for being reduced to 0 can be very fast,
since the reduced value is loaded early and can be tested early.
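
Spelled out, the exit-path check looks something like the following sketch
(the thread struct, the ipending flag and the unpend() stub are placeholders
so that it compiles in userland, not the real interfaces, and it uses a plain
addq instead of the 32-bit variants discussed above):

#include <stdint.h>

typedef uint64_t *counter_u64_t;

static struct thread {
	uint32_t td_critnest;
} td0, *td = &td0;

static volatile int ipending;	/* placeholder: soft-masked interrupts pending */

static void
unpend(void)
{
	/* placeholder: run or schedule the pending handlers */
}

static inline void
alt_counter_u64_add(counter_u64_t c, int64_t inc)
{
	uint32_t ocritnest;

	ocritnest = td->td_critnest;
	td->td_critnest = ocritnest + 1;
	__asm __volatile("addq %1,%0" : "+m" (*c) : "r" (inc));
	td->td_critnest = ocritnest;
	/*
	 * ocritnest was loaded before the add, so this test doesn't
	 * have to wait for a fresh load of td_critnest.
	 */
	if (ocritnest == 0 && ipending != 0)
		unpend();
}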

Further tests confirm that incl and incq are pipelined normally on at
least corei7 and core2.  In the loop test, freefall can do 4 independent
addq's to memory faster than it can do 1 :-).  It can do 6 independent
addq's to memory in the same time that it can do 1.  After that, the
loop overhead prevents getting the complete bandwidth of the memory
system.  However, 6 addq's to the same memory location take a little
more than 6 times longer than 1.  Multiple increments of the same counter
one after the other are probably rare, but the counter API makes it harder
to coalesce them if they occur, and the implementation using independent
asms ensures that the compiler cannot coalesce them.
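
To make the coalescing point concrete (hypothetical example, not part of the
test program): with plain C increments the compiler is free to merge adjacent
additions to the same counter into one add, but with one asm per call it must
emit one addq per call, and those serialize on the same memory location:

#include <stdint.h>

static inline void
asm_add(uint64_t *c, uint64_t inc)
{
	/* One independent asm per call; the compiler can't merge these. */
	__asm __volatile("addq %1,%0" : "+m" (*c) : "r" (inc));
}

void
add3_plain(uint64_t *c)
{
	/* gcc/clang typically fold these into a single add of 3. */
	*c += 1;
	*c += 1;
	*c += 1;
}

void
add3_asm(uint64_t *c)
{
	/* Three addq's to the same location, back to back. */
	asm_add(c, 1);
	asm_add(c, 1);
	asm_add(c, 1);
}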

Bruce
