Date:      Sun, 23 Jun 2013 10:30:04 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, Gleb Smirnoff <glebius@freebsd.org>, src-committers@freebsd.org
Subject:   Re: svn commit: r252032 - head/sys/amd64/include
Message-ID:  <20130623073004.GX91021@kib.kiev.ua>
In-Reply-To: <20130622124832.S2347@besplex.bde.org>
References:  <201306201430.r5KEU4G5049115@svn.freebsd.org> <20130621065839.J916@besplex.bde.org> <20130621081116.E1151@besplex.bde.org> <20130621090207.F1318@besplex.bde.org> <20130621064901.GS1214@FreeBSD.org> <20130621184140.G848@besplex.bde.org> <20130621135427.GA1214@FreeBSD.org> <20130622110352.J2033@besplex.bde.org> <20130622124832.S2347@besplex.bde.org>



On Sat, Jun 22, 2013 at 01:37:58PM +1000, Bruce Evans wrote:
> On Sat, 22 Jun 2013, I wrote:
>
> > ...
> > Here are considerably expanded tests, with noninline tests dropped.
> > Summary of times on Athlon64:
> >
> > simple increment:                               4-7 cycles (1)
> > simple increment preceded by feature test:      5-8 cycles (1)
> > simple 32-bit increment:                        4-7 cycles (2)
> > correct 32-bit increment (addl to mem):         5.5-7 cycles (3)
> > inlined critical section:                       8.5 cycles (4)
> > better inlined critical section:                7 cycles (5)
> > correct unsigned 32-bit inc of 64-bit counter:  4-7 cycles (6)
> > "improve" previous to allow immediate operand:  5+ cycles
> > correct signed 32-bit inc of 64-bit counter:    8.5-9 cycles (7)
> > correct 64-bit inc of 64-bit counter:           8-9 cycles (8)
> > -current method (cmpxchg8b):                   18 cycles
>
> corei7 (freefall) has about the same timing as Athlon64, but core2
> (ref10-i386) is 3-4 cycles slower for the tests that use cmpxchg.
You only tested 32-bit, right?  Note that Core 2-class machines have at
least a one-cycle penalty for decoding any instruction with a REX prefix.

>
> > (4) The critical section method is quite fast when inlined.
> > (5) The critical section method is even faster when optimized.  This is
> >    what should be used if you don't want the complications for the
> >    daemon.
>
> Oops, I forgot that critical sections are much slower in -current than
> in my version.  They probably take 20-40 cycles for the best case, and
> can't easily be tested in userland since they disable interrupts in
> hardware.  My versions disable interrupts in software.
The critical sections do not disable interrupts; only the thread-local
nesting counter is incremented.  Leaving the section could be
complicated, though.

>
> > ...
> > % % static inline void
> > % alt_counter_u64_add(counter_u64_t c, int64_t inc)
> > % {
> > % #if 1
> > % 	/* 8.5 cycles on A64. */
> > % 	td->td_critnest++;
> > % 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
> > % 	td->td_critnest++;
>
> Oops, one increment should be a decrement.
>
> > % #elif 1
> > % 	/* 7 cycles on A64. */
> > % 	uint32_t ocritnest;
> > %
> > % 	ocritnest = td->td_critnest;
> > % 	td->td_critnest = ocritnest + 1;
> > % 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
> > % 	td->td_critnest = ocritnest;
> > % #elif 0
>
> Even in my version, I have to check for unmasked interrupts when td_critnest
> is reduced to 0.  At least the test for being reduced to 0 can be very fast,
> since the reduced value is loaded early and can be tested early.
>
> Further tests confirm that incl and incq are pipelined normally on at
> least corei7 and core2.  In the loop test, freefall can do 4 independent
> addq's to memory faster than it can do 1 :-).  It can do 6 independent
> addq's to memory in the same time that it can do 1.  After that, the
> loop overhead prevents getting the complete bandwidth of the memory
> system.  However, 6 addq's to the same memory location take a little
> more than 6 times longer than 1.  Multiple increments of the same counter
> one after the other are probably rare, but the counter API makes it harder
> to coalesce them if they occur, and the implementation using independent
> asms ensures that the compiler cannot coalesce them.

I think that naive looping on Core i7+ simply does not work for
measuring instruction latencies.  Starting with Nehalem, the hardware
provides a loop detector for the instruction queue inside the decoder,
which essentially executes short loops from the decoded micro-op form.
We never use counters in tight loops anyway.



