Date:      Tue, 25 Jun 2013 23:58:26 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, Gleb Smirnoff <glebius@FreeBSD.org>, src-committers@FreeBSD.org
Subject:   Re: svn commit: r252032 - head/sys/amd64/include
Message-ID:  <20130625205826.GM91021@kib.kiev.ua>
In-Reply-To: <20130625190352.P986@besplex.bde.org>
References:  <20130621135427.GA1214@FreeBSD.org> <20130622110352.J2033@besplex.bde.org> <20130622124832.S2347@besplex.bde.org> <20130622174921.I3112@besplex.bde.org> <20130623073343.GY91021@kib.kiev.ua> <20130623181458.J2256@besplex.bde.org> <20130624170849.GH91021@kib.kiev.ua> <20130625102023.K899@besplex.bde.org> <20130625062039.GJ91021@kib.kiev.ua> <20130625190352.P986@besplex.bde.org>

On Tue, Jun 25, 2013 at 08:14:41PM +1000, Bruce Evans wrote:
> On Tue, 25 Jun 2013, Konstantin Belousov wrote:
> 
> > Updates to the counter cannot be done from the interrupt context.
> 
> This is fragile, however.  It prevents using counters for things like
> counting interrupts.  Most interrupt counting is now done directly
> and doesn't use PCPU_INC().  i386/i386 has just 1 use of PCPU_INC().
> It is to count traps in machdep.c.  Traps include nonmaskable
> interrupts.  Even hard-disabling of interrupts doesn't prevent these.
> Otherwise, PCPU is used mainly for vm counters.  E.g., for pagefaults.
> Now the trap is not an interrupt, so it shouldn't occur in the middle
> of the counter update and the PCPU_INC() can safely be replaced by
> a counter, but this is not clear.
Traps are not performance critical in the sense that there is no need to
count up to 1-10G traps per second.  Anyway, as Gleb said, there is no
point in optimizing the i386 kernel.

> 
> > The fetch problem of sometimes missing the carry bit is present on all
> > 32bit arches now, I think.
> 
> Also the zeroing problem.  Since there are no atomic ops in sight, zeroing
> has several races.  My transferring code uses atomic ops on only 1 CPU,
> so it has half of these races.  See below.
Right, zeroing should also use atomic 64-bit writes on i386.  An updated
patch is at the end.
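
To make the 32-bit fetch race concrete, here is a small illustrative
sketch (mine, not from the thread): a plain 64-bit load on i386 is two
32-bit loads, so a reader racing with an updater's two-word add can see
the low word from before the carry and the high word from after it, or
miss the carry entirely.

#include <stdint.h>

/*
 * Illustrative sketch only.  Suppose the counter is 0x00000000ffffffff
 * and the updater adds 1 (addl sets the low word to 0, adcl then sets
 * the high word to 1).  A reader doing two plain 32-bit loads can see
 * lo=0xffffffff with hi=1 (0x1ffffffff, a value that never existed),
 * or lo=0 with hi=0, missing the carry entirely.
 */
uint64_t
racy_read64(volatile uint32_t *lo, volatile uint32_t *hi)
{
	uint32_t l, h;

	l = *lo;	/* the updater's addl/adcl pair may run here */
	h = *hi;
	return ((uint64_t)h << 32 | l);
}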

> 
> >>     The arm implementation of atomic.h is also relevant and is especially
> >>     interesting.  I looked at it after mdf@ pointed out ARM_RAS_START.
> >>     This seems to only be used by arm in userland.  It is even larger and
> >>     slower and more complicated than critical sections, so it doesn't seem
> >>     to be useful for counters.  In the kernel, arm has several
> >>     implementations depending on machine capabilities.  The ifdefs are
> >>     hard to understand.  On some machines, it seems to use the equivalent
> >>     of a cmpxchg loop.  On others, it just disables interrupts.  Disabling
> >>     interrupts is presumably better than a critical section, and userland
> >>     has to use ARM_RAS_START for some arches because neither disabling
> >>     interrupts nor critical sections are available in userland.  None
> >>     of this care helps avoid the races in pcpu.h, since pcpu.h intentionally
> >>     avoids using atomic ops.  None of this care helps avoid the races in
> >>     counter.h or make counter.h efficient, since counter.h uses its own
> >>     methods.  To reach the same quality as atomic.h, counter.h would have
> >>     to copy half of the ifdefs and methods in atomic.h.
> > BTW, why do you claim that disabling interrupts is a faster and better method
> > to disable preemption than a critical section?
> 
> For arm, it is because if the critical section method were better, then
> arm should use it for atomic ops too.
Of course not.  Atomics must be atomic WRT interrupts as well.
Using critical sections only provides atomicity against preemption,
but not against interruption.
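
To spell out the distinction with a tiny sketch of mine (hypothetical
code, not something from the tree): a critical section only stops the
scheduler from switching threads, so an interrupt handler on the same
CPU can still run in the middle of the read-modify-write and its update
can be lost.

/*
 * Hypothetical sketch.  critical_enter() blocks preemption only;
 * an interrupt handler touching the same per-CPU counter can still
 * fire between the load and the store of *cnt, and the store below
 * then overwrites the handler's update.
 */
static inline void
pcpu_counter_add_unsafe(uint64_t *cnt, uint64_t inc)
{

	critical_enter();
	*cnt += inc;		/* load ... <interrupt here> ... store */
	critical_exit();
}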

The arm code looks quite deficient anyway.  From arm/include/atomic.h,
it seems that the arch did not have (usable) atomics until v6.  And
ARM still does not support SMP in stock HEAD.

> 
> For i386, it is because it reduces the window in which the value is invalid
> (other CPUs can see the invalid value), and is easier to use and shorter
> (2-3 bytes) and thus better suited to inline code.  My examples of inline
> code mostly used it in rarely-executed cases where its speed doesn't matter
> and it is just easier to use.  Actual testing shows that it is fairly
> efficient, so can be used in all cases.  On Athlon64, it is faster than
> cmpxchg8b but slower than an inline optimized critical section.  This
> testing is in userland, where there may be an extra penalty for using
> cli/sti/pushfl/popfl:
> 
> % static inline void
> % counter_u64_add(counter_u64_t c, int64_t inc)
> % {
> % 
> % #if 1
> % 	/* 20 cycles on A64. */
> % 	/* 31 cycles on A64 with lock prefix. */
> % 	/* 35 cycles on A64 with mfence prefix. */
> % 	if ((cpu_feature & 1) == 1) {
> % 		counter_64_inc_8b(c, inc);
> % 	}
> 
> Old tests said that this takes 18 cycles.  It now takes 20.
> 
> Adding a lock prefix or an mfence to the cmpxchg8b gives the times stated
> above.  This shows how bad using cmpxchg8b is.  We could use atomic ops
> for everything and only be 50% slower.  But we shouldn't use either
> cmpxchg8b or atomic ops.  We should use 32-bit addl, which takes about
> 4 cycles.
Use of cmpxchg8b provides soundness to the algorithm, by giving
atomic 64-bit loads and stores.

As I stated, I am not going to spend time micro-optimizing for i386.
The current implementation is good enough.
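
As a rough illustration of why that is sound (my sketch, not the
committed counter_64_inc_8b(), which does the same thing in inline
assembly against the pcpu region): the update is a cmpxchg8b retry
loop, and cmpset64() below is a hypothetical stand-in for an unlocked
cmpxchg8b.

/*
 * Illustrative sketch only.  cmpset64() stands in for an unlocked
 * cmpxchg8b: it atomically (with respect to interrupts on this CPU)
 * replaces *p with the new value iff *p still equals 'old' and
 * reports success.  Only the owning CPU ever writes its slot, so no
 * lock prefix is needed; an interrupt that bumps the counter between
 * the read and the cmpxchg8b just forces one more trip around the
 * loop, and a torn read of 'old' is caught the same way.
 */
static inline void
counter_inc_8b_sketch(uint64_t *p, int64_t inc)
{
	uint64_t old;

	do {
		old = *p;
	} while (!cmpset64(p, old, old + inc));
}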

> 
> % struct thread
> % {
> % 	uint32_t td_critnest;
> % 	uint32_t td_owepreempt;
> % } td0;
> % 
> % volatile struct thread *td = &td0;
> % 
> % void
> % dummy_accum(void)
> % {
> % }
> % 
> % void
> % fake_mi_switch(void)
> % {
> % }
> 
> Added more emulation for a complete critical section emulation.
> 
> % static inline void
> % alt_counter_u64_add(counter_u64_t c, int64_t inc)
> % {
> % #if 0
> % 	/* 8.5 cycles on A64. */
> % 	td->td_critnest++;
> % 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
> % 	td->td_critnest--;
> % #elif 0
> % 	/* 7 cycles on A64. */
> % 	uint32_t ocritnest;
> % 
> % 	ocritnest = td->td_critnest;
> % 	td->td_critnest = ocritnest + 1;
> % 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
> % 	td->td_critnest = ocritnest;
> 
> Old too-simple critical section emulation.
> 
> % #elif 0
> % 	/* 11 cycles on A64. */
> % 	uint32_t ocritnest;
> % 
> % 	ocritnest = td->td_critnest;
> % 	td->td_critnest = ocritnest + 1;
> % 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
> % 	if (ocritnest == 0) {
> % 		td->td_critnest = ocritnest;
> % 		if (td->td_owepreempt)
> % 			fake_mi_switch();
> % 	} else
> % 		td->td_critnest = ocritnest;
> 
> Full critical section emulation.  Everything inlined and optimized.  With
> the -current extern critical section functions, this would be much slower.
> Probably in between the slowness of unlocked cmpxchg8b and locked cmpxchg8b.
> 
> 
> % #elif 0
> % 	/* 13 cycles on A64. */
> % 	uint32_t ocritnest;
> % 
> % 	ocritnest = td->td_critnest;
> % 	td->td_critnest = ocritnest + 1;
> % 	__asm __volatile("addl %%eax,%%ds:%0; adcl %%edx,4+%0"
> % 	    : "=m" (*c) : "A" (inc));
> % 	if (ocritnest == 0) {
> % 		td->td_critnest = ocritnest;
> % 		if (td->td_owepreempt)
> % 			fake_mi_switch();
> % 	} else
> % 		td->td_critnest = ocritnest;
> 
> Same as above, except for a full 64-bit addition.  Unfortunately, the
> extra adcl doesn't seem to be done in parallel, since it takes 2 cycles
> longer.  The code is rather large and might be hard to fit well in
> pipelines.
> 
> % #elif 0
> % 	/* 9 cycles on A64. */
> % 	__asm __volatile("addl %%eax,%%ds:%0; adcl %%edx,4+%0"
> % 	    : "=m" (*c) : "A" (inc));
> 
> Another new test.  Do a full 64-bit add in a racy way.  This didn't
> pipeline well either.  In minor variations of this on the same A64 CPU,
> I saw phenomena like a single addl taking 5.5 cycles but this addl
> followed by an adcl $0 taking only 4 cycles.  4 cycles is what they
> should take, and 2 instructions being faster than 1 makes some sense
> since the 2 instructions fill a 64-bit word so the hardware will have
> to do less combining of stores with existing contents of write buffers.
> 
> Oops, I forgot one %%ds prefix here and below.
> 
> % #elif 0
> % 	/* 12 cycles on A64. */
> % 	__asm __volatile("cli; addl %%eax,%%ds:%0; adcl %%edx,4+%0; sti"
> % 	    : "=m" (*c) : "A" (inc));
> 
> "cli/sti" is very fast on Athlon64.  This is the fastest complete 64-bit
> update method found so far for Athlon64 that doesn't use a daemon or
> special cases.  cli/sti also took 12 cycles without the adcl.
> 
> % #elif 0
> % 	/* 20 cycles on A64. */
> % 	__asm __volatile(
> % 	    "pushfl; cli; addl %%eax,%%ds:%0; adcl %%edx,4+%0; popfl"
> % 	    : "=m" (*c) : "A" (inc));
> 
> Unfortunately, cli/sti is fragile, so we might need a reentrant switch
> like this, and pushfl/popfl is quite slow.
> 
> % #elif 1
> % 	/* 16 cycles on A64. */
> % 	__asm __volatile(
> % 	    "pushfl; cli; addl %%eax,%%ds:%0; adcl %%edx,4+%0; popl %%ecx; testl $0x200,%%ecx; je 8f; sti; 8:;"
> % 	    : "=m" (*c) : "A" (inc) : "ecx");
> 
> Not doing the full popfl is not so slow.  Hopefully this is faster in the
> kernel since it involves fewer privilege checks.
> 
> This method is 20% faster than the cmpxchg8b method.  On core2, the
> cmpxchg8b method only takes 14 cycles due to not having a store-to-load
> mismatch penalty, so it would be faster than disabling interrupts if
> disabling interrupts has the same slowness.
> 
> Full test program at the end.
> 
> >> The slowness that I care about is only in the counter update.  Anything
> >> complicated there will be slow.
> >>
> >> My current best design:
> >> - use ordinary mutexes to protect counter fetches in non-per-CPU contexts.
> >> - use native-sized or always 32-bit counters.  Counter updates are done
> >>    by a single addl on i386.  Fix pcpu.h on arches other than amd64 and
> >>    i386 and use the same method as there.
> >> - counter fetches add the native-sized pcpu counters to 64-bit non-pcpu
> >>    counters, when the native-size counters are in danger of overflowing
> >>    or always, under the mutex.  Transferring data uses an ordinary
> >>    atomic_cmpset.  To avoid ifdefs, always use u_int pcpu counters.
> >>    The 64-bit non-pcpu counters can easily be changed to pcpu counters
> >>    if the API is extended to support pcpu data.
> > Mutex cannot be used for counters, I believe this was already discussed
> > to the end.  The goal of the counters design is to avoid the unnecessary
> > snoop traffic in the SMP system; a mutex makes the hot line bounce.
> > This was well understood and benchmarked.
> 
> Yes it can.  See my code.  The mutex is only accessed infrequently.
> 
> >> - run a daemon every few minutes to fetch all the counters, so that
> >>    the native-sized counters are in no danger of overflowing on systems
> >>    that don't run statistics programs often enough to fetch the counters
> >>    to actually use.
> > A daemon would have to
> > 1. track all counters in some global registry
> > 2. run a separate thread on each cpu.
> 
> Not very difficult.
> 
> > I will add the missing MD bits to the current patch for i386, leaving
> > the known issue for fetches on other 32bit arches for now.
> 
> Also leaving the known issue of zeroing being broken for all arches.
> Using a daemon is a good way to fix this too.  You have to run the
> daemon on each CPU and wait for it to complete whenever a counter
> is zeroed.
Let me restate. If somebody wants to provide a complete working
implementation for i386 which is more optimized than cmpxchg8b, I
have no objections. I will not spend time on this myself, though, since I
consider such activity quite useless. And I should note that e.g. a
daemon-based implementation must be arch-specific, so as not to pessimize
amd64.

[Skipped the test program]
I retested several cases on a SandyBridge i7 machine. There, the cmpxchg8b
version without a barrier or lock prefix takes 12 cycles, the incomplete
emulation of the critical section executes in 7 cycles, and cli/sti
takes 24.

That said, I am completely happy with the performance of the cmpxchg8b.

diff --git a/sys/amd64/include/counter.h b/sys/amd64/include/counter.h
index b37a4b8..95e0c58 100644
--- a/sys/amd64/include/counter.h
+++ b/sys/amd64/include/counter.h
@@ -36,6 +36,37 @@ extern struct pcpu __pcpu[1];
 #define	counter_enter()	do {} while (0)
 #define	counter_exit()	do {} while (0)
 
+#ifdef IN_SUBR_COUNTER_C
+static inline uint64_t
+counter_u64_read_one(uint64_t *p, int cpu)
+{
+
+	return (*(uint64_t *)((char *)p + sizeof(struct pcpu) * cpu));
+}
+
+static inline uint64_t
+counter_u64_fetch_inline(uint64_t *p)
+{
+	uint64_t r;
+	int i;
+
+	r = 0;
+	for (i = 0; i < mp_ncpus; i++)
+		r += counter_u64_read_one((uint64_t *)p, i);
+
+	return (r);
+}
+
+static inline void
+counter_u64_zero_inline(counter_u64_t p)
+{
+	int i;
+
+	for (i = 0; i < mp_ncpus; i++)
+		*(uint64_t *)((char *)p + sizeof(struct pcpu) * i) = 0;
+}
+#endif
+
 #define	counter_u64_add_protected(c, i)	counter_u64_add(c, i)
 
 static inline void
diff --git a/sys/i386/include/counter.h b/sys/i386/include/counter.h
index 3e93b36..de90d85 100644
--- a/sys/i386/include/counter.h
+++ b/sys/i386/include/counter.h
@@ -67,6 +67,85 @@ counter_64_inc_8b(uint64_t *p, int64_t inc)
 	: "memory", "cc", "eax", "edx", "ebx", "ecx");
 }
 
+#ifdef IN_SUBR_COUNTER_C
+static inline uint64_t
+counter_u64_read_one_8b(uint64_t *p)
+{
+	uint32_t res_lo, res_high;
+
+	__asm __volatile(
+	"movl	%%eax,%%ebx\n\t"
+	"movl	%%edx,%%ecx\n\t"
+	"cmpxchg8b	(%2)"
+	: "=a" (res_lo), "=d"(res_high)
+	: "SD" (p)
+	: "cc", "ebx", "ecx");
+	return (res_lo + ((uint64_t)res_high << 32));
+}
+
+static inline uint64_t
+counter_u64_fetch_inline(uint64_t *p)
+{
+	uint64_t res;
+	int i;
+
+	res = 0;
+	if ((cpu_feature & CPUID_CX8) == 0) {
+		/*
+		 * Machines without cmpxchg8b are not SMP.
+		 * Disabling preemption provides atomicity for the
+		 * counter read, since the update is also done inside
+		 * a critical section.
+		 */
+		critical_enter();
+		for (i = 0; i < mp_ncpus; i++) {
+			res += *(uint64_t *)((char *)p +
+			    sizeof(struct pcpu) * i);
+		}
+		critical_exit();
+	} else {
+		for (i = 0; i < mp_ncpus; i++)
+			res += counter_u64_read_one_8b((uint64_t *)((char *)p +
+			    sizeof(struct pcpu) * i));
+	}
+	return (res);
+}
+
+static inline void
+counter_u64_zero_one_8b(uint64_t *p)
+{
+
+	__asm __volatile(
+	"movl	(%0),%%eax\n\t"
+	"movl	4(%0),%%edx\n"
+	"xorl	%%ebx,%%ebx\n\t"
+	"xorl	%%ecx,%%ecx\n\t"
+"1:\n\t"
+	"cmpxchg8b	(%0)\n\t"
+	"jnz	1b"
+	:
+	: "SD" (p)
+	: "memory", "cc", "eax", "edx", "ebx", "ecx");
+}
+
+static inline void
+counter_u64_zero_inline(counter_u64_t c)
+{
+	int i;
+
+	if ((cpu_feature & CPUID_CX8) == 0) {
+		critical_enter();
+		for (i = 0; i < mp_ncpus; i++)
+			*(uint64_t *)((char *)c + sizeof(struct pcpu) * i) = 0;
+		critical_exit();
+	} else {
+		for (i =3D 0; i < mp_ncpus; i++)
+			counter_u64_zero_one_8b((uint64_t *)((char *)c +
+			    sizeof(struct pcpu) * i));
+	}
+}
+#endif
+
 #define	counter_u64_add_protected(c, inc)	do {	\
 	if ((cpu_feature & CPUID_CX8) == 0) {		\
 		CRITICAL_ASSERT(curthread);		\
diff --git a/sys/kern/subr_counter.c b/sys/kern/subr_counter.c
index a98ed40..4c26ca9 100644
--- a/sys/kern/subr_counter.c
+++ b/sys/kern/subr_counter.c
@@ -29,34 +29,28 @@ __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
-#include <sys/counter.h>
 #include <sys/kernel.h>
 #include <sys/smp.h>
 #include <sys/sysctl.h>
 #include <vm/uma.h>
+
+#define IN_SUBR_COUNTER_C
+#include <sys/counter.h>
 
 static uma_zone_t uint64_pcpu_zone;
 
 void
 counter_u64_zero(counter_u64_t c)
 {
-	int i;
 
-	for (i = 0; i < mp_ncpus; i++)
-		*(uint64_t *)((char *)c + sizeof(struct pcpu) * i) = 0;
+	counter_u64_zero_inline(c);
 }
 
 uint64_t
 counter_u64_fetch(counter_u64_t c)
 {
-	uint64_t r;
-	int i;
 
-	r = 0;
-	for (i = 0; i < mp_ncpus; i++)
-		r += *(uint64_t *)((char *)c + sizeof(struct pcpu) * i);
-
-	return (r);
+	return (counter_u64_fetch_inline(c));
 }
 
 counter_u64_t
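
For completeness, the consumer side is unaffected by the patch; below is
a minimal usage sketch of mine, assuming the counter(9)-style interface
of the time (pkt_count and the example functions are made up, and exact
prototypes may differ slightly):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/counter.h>
#include <sys/malloc.h>

static counter_u64_t pkt_count;		/* hypothetical example counter */

static void
example_init(void)
{

	pkt_count = counter_u64_alloc(M_WAITOK);
}

static void
example_fast_path(void)
{

	/* Cheap per-CPU update; the cmpxchg8b loop on i386. */
	counter_u64_add(pkt_count, 1);
}

static uint64_t
example_report(void)
{

	/* Sums all per-CPU slots; on i386 each slot is read with cmpxchg8b. */
	return (counter_u64_fetch(pkt_count));
}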
