Date:      Wed, 26 Jun 2013 11:42:39 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, Gleb Smirnoff <glebius@FreeBSD.org>, src-committers@FreeBSD.org, Bruce Evans <brde@optusnet.com.au>
Subject:   Re: svn commit: r252032 - head/sys/amd64/include
Message-ID:  <20130626092955.B891@besplex.bde.org>
In-Reply-To: <20130625205826.GM91021@kib.kiev.ua>
References:  <20130621135427.GA1214@FreeBSD.org> <20130622110352.J2033@besplex.bde.org> <20130622124832.S2347@besplex.bde.org> <20130622174921.I3112@besplex.bde.org> <20130623073343.GY91021@kib.kiev.ua> <20130623181458.J2256@besplex.bde.org> <20130624170849.GH91021@kib.kiev.ua> <20130625102023.K899@besplex.bde.org> <20130625062039.GJ91021@kib.kiev.ua> <20130625190352.P986@besplex.bde.org> <20130625205826.GM91021@kib.kiev.ua>

On Tue, 25 Jun 2013, Konstantin Belousov wrote:

> On Tue, Jun 25, 2013 at 08:14:41PM +1000, Bruce Evans wrote:
>> On Tue, 25 Jun 2013, Konstantin Belousov wrote:
>>
>>> Updates to the counter cannot be done from the interrupt context.
>>
>> This is fragile, however.  It prevents using counters for things like
>> counting interrupts.  Most interrupt counting is now done directly
>> and doesn't use PCPU_INC().  i386/i386 has just 1 use of PCPU_INC().
>> It is to count traps in machdep.c.  Traps include nonmaskable
>> interrupts.  Even hard-disabling of interrupts doesn't prevent these.
>> ...
> Traps are not performance critical in the sense that there is no need to count
> up to 1-10G traps per second.

The trap count increment is not a performance problem, but a correctness one.
Disabling interrupts for a multi-word increment doesn't work for NMIs even
for !SMP or pcpu accesses with SMP.
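
To make the NMI problem concrete, the fragile method amounts to something
like this (just a sketch; intr_disable() is cli on i386 and does not mask
NMIs):

static void
fragile_u64_add(uint64_t *p, uint64_t inc)
{
	register_t s;

	s = intr_disable();	/* cli: masks only maskable interrupts */
	*p += inc;		/* may compile to a load/add-with-carry/store sequence */
	intr_restore(s);	/* sti */
}

An NMI (or trap) handler that increments the same counter between the load
and the store has its update overwritten, so cli/sti gives no protection
for the multi-word case even with !SMP and pcpu-only accesses.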

> Anyway, as Gleb said, there is no point in
> optimizing the i386 kernel.

I said that there is every point in optimizing the i386 kernel.  This
applies even more to other 32-bit arches.  Some CPUs are much slower
than modern x86's.  They shouldn't be slowed down more by inefficient
KPIs.

>>> The fetch problem of sometime missing the carry bit is present on all
>>> 32bit arches now, I think.
>>
>> Also the zeroing problem.  Since there are no atomic ops in sight, zeroing
>> has several races.  My transferring code uses atomic ops on only 1 CPU,
>> so it has half of these races.  See below.
> Right, zeroing should also use atomic 64bit writes on i386.  Updated
> patch is at end.

It's getting even more complicated and further from what I want.  It also
doesn't work.  Disabling interrupts doesn't work for the SMP case.  The
zeroing needs to run in a per-CPU daemon.

>>>>     The arm implementation of atomic.h is also relevant and is especially
>>>>     interesting.  I looked at it after mdf@ pointed out ARM_RAS_START.
>>>>     This seems to only be used by arm in userland.  It is even larger and
>>>>     slower and more complicated than critical sections, so it doesn't seem
>>>>     to be useful for counters.  In the kernel, arm has several
>>>>     implementations depending on machine capabilities.  The ifdefs are
>>>>     hard to understand.  On some machines, it seems to use the equivalent
>>>>     of a cmpxchg loop.  On others, it just disables interrupts.  Disabling
>>>>     interrupts is presumably better than a critical section, and userland
>>>>     has to use ARM_RAS_START for some arches because neither disabling
>>>>     interrupts nor critical sections are available in userland.  None
>>>>     of this care helps avoid the races in pcpu.h, since pcpu.h intentionally
>>>>     avoids using atomic ops.  None of this care helps avoid the races in
>>>>     counter.h or make counter.h efficient, since counter.h uses its own
>>>>     methods.  To reach the same quality as atomic.h, counter.h would have
>>>>     to copy half of the ifdefs and methods in atomic.h.
>>> BTW, why do you claim that disabling interrupts is faster and better method
>>> to disable preemption than critical section ?
>>
>> For arm, it is because if the critical section method were better, then
>> arm should use it for atomic ops too.
> Of course not.  Atomics must be atomic WRT the interrupts as well.
> Using critical sections only provides atomicity against preemption
> but not against interruption.

Indeed, atomics would be fragile if they couldn't be used in interrupt
handlers.  So in a non-optimized implementation it is safe to disable
interrupts in hardware.  An optimized solution would accept the fragility
like the counter increment does, and use special atomic ops that
do more for the few variables that are accessed by interrupt handlers.

> The arm looks quite deficient anyway.  From the arm/include/atomic.h,
> it seems that the arch did not have (usable) atomics until v6.  And
> ARM does not support SMP in the stock HEAD still.

Disabling interrupts only gives atomicity for non-SMP.  Thus the cases
where this method is used indicate an arch that doesn't support SMP.
More interestingly for the current discussion, arm and several
other arches apparently don't have interrupt-safe load-modify-store
instructions even for register-sized memory variables.  Thus they
need to either disable interrupts or use a cmpxchg-type instruction
even to add to register-sized variables.
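
The cmpset-style fallback for adding to a register-sized pcpu counter looks
something like this (a sketch using the standard MI atomic KPI, not any
particular arch's actual code):

static inline void
counter_u32_add_cmpset(volatile u_int *p, u_int inc)
{
	u_int old;

	do {
		old = *p;	/* a stale value is fine; the cmpset revalidates */
	} while (atomic_cmpset_int(p, old, old + inc) == 0);
}

Arches without a cmpset-type instruction have to disable interrupts around
a plain load/add/store instead, which is what their atomic.h already does
for the !SMP case.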

>> % static inline void
>> % counter_u64_add(counter_u64_t c, int64_t inc)
>> % {
>> %
>> % #if 1
>> % 	/* 20 cycles on A64. */
>> % 	/* 31 cycles on A64 with lock prefix. */
>> % 	/* 35 cycles on A64 with mfence prefix. */
>> % 	if ((cpu_feature & 1) == 1) {
>> % 		counter_64_inc_8b(c, inc);
>> % 	}
>>
>> Old tests said that this takes 18 cycles.  It now takes 20.
>>
>> Adding a lock or mfence prefix to the cmpxchg8b gives the times stated
>> above.  This shows how bad using cmpxchg8b is.  We could use atomic ops
>> for everything and only be 50% slower.  But we shouldn't use either
>> cmpxchg8b or atomic ops.  We should use 32-bit addl, which takes about
>> 4 cycles.
> Use of cmpxchg8b provides the soundness to the algorithm, by using
> atomic 64 bit loads and stores.

So does using a correct algorithm.

> As I stated, I am not going to spend time microoptimizing for i386.
> Current implementation is good enough.

I disagree.  I'm not really trying to micro-optimize it, but to avoid
pessimizing it.  To replace PCPU_ADD on register-sized variables, you
need to be at least as efficient as well as more correct.  Arches with
slower CPUs benefit even more from avoiding pessimizations imposed by MI
KPIs.

[>]*...

>>> I will add the missed MD bits to the current patch for i386, leaving
>>> the known issue for fetches on other 32bit arches for now.
>>
>> Also leaving the known issue of zeroing being broken for all arches.
>> Using a daemon is a good way to fix this too.  You have to run the
>> daemon on each CPU and wait for it to complete whenever a counter
>> is zeroed.
> Let me restate. If somebody wants to provide a complete working
> implementation for i386 which is more optimized than cmpxchg8b, I
> have no objections. I will not spend time on this, though, since I
> consider such activity quite useless. And, I should note that e.g. the
> daemon-based implementation must be arch-specific, to not pessimise
> amd64.

No, the daemon runs rarely, so its efficiency doesn't matter.

> [Skipped the test program]
> I retested several cases on SandyBridge i7 machine. There, the cmpxchg8b
> version without barrier or lock gives 12 cycles, the incomplete
> emulation of the critical section is executed in 7 cycles, and cli/sti
> in 24.

cli/sti seems to be slow on SandyBridge.  Not too surprising.  I checked
the old Athlon manual for timing.  It confirms what I measured:

     cli:     4 cycles
     sti:     4 cycles (so using cli/sti gives 12)
     pushfl:  4 cycles
     popfl:  15 cycles (so this is too slow to use; using it gave 19)

freefall (corei7) has much the same timing as you measured for the others.
It takes another cycle (8 altogether) for the complete emulation of the
critical section.

But a simple register-sized add-to-memory with daemon support is better.
It takes 4 cycles on modern and not-so-modern x86.  It also gives less
register pressure on 32-bit systems.  It also allows expansion or
shrinking to any size of counter (larger than the register or int size)
without changing the MD part of the KPI, with small modifications to
the daemon and fetching code.
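
Something like this for the MD increment (the 32-bit name and type are
invented; it just mirrors the existing amd64 counter_u64_add(), with %fs
instead of %gs for the i386 pcpu segment):

static inline void
counter_u32_add(counter_u32_t c, int32_t inc)
{

	__asm __volatile("addl\t%1,%%fs:(%0)"
	    : : "r" ((char *)c - (char *)&__pcpu[0]), "ri" (inc)
	    : "memory", "cc");
}

The daemon periodically drains each 32-bit pcpu counter into a wider
accumulator before it can wrap, so the MI fetch still returns a 64-bit
value.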

> diff --git a/sys/amd64/include/counter.h b/sys/amd64/include/counter.h
> index b37a4b8..95e0c58 100644
> --- a/sys/amd64/include/counter.h
> +++ b/sys/amd64/include/counter.h
> @@ -36,6 +36,37 @@ extern struct pcpu __pcpu[1];
> + ...
> +static inline void
> +counter_u64_zero_inline(counter_u64_t p)
> +{
> +	int i;
> +
> +	for (i = 0; i < mp_ncpus; i++)
> +		*(uint64_t *)((char *)p + sizeof(struct pcpu) * i) = 0;
> +}
> +#endif
> +
> #define	counter_u64_add_protected(c, i)	counter_u64_add(c, i)

This is the same as the MI zeroing that it replaces, so it has the same
bugs:
- it stores to pcpu memory from a non-pcpu context (unless it is called
   from a pcpu daemon).  Since nothing uses locked instructions, the other
   CPUs might not see the changes here.  When the changes are made
   concurrently, there is a race to decide the final value.  It may be
   the value stored here, or the old value plus an increment stored in
   pcpu context.  This gives an error of up to 2**64-1.
- it assumes that the compiler generates a single 64-bit access for each
   64-bit store.  Implement and use the -fperverse-stores option to check for
   such bugs :-).  It does things like storing multiple times with various
   sizes and orders and many misaligned sub-objects.

counter_u64_read_one_8b() on amd64 depends on -fno-perverse-loads.
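
For the store half, something like this would make the access size explicit
on amd64 instead of trusting the compiler (the name is invented, and it
still has the cross-CPU race described above):

static inline void
counter_u64_zero_one_amd64(uint64_t *p)
{

	__asm __volatile("movq\t%1,%0" : "=m" (*p) : "r" ((uint64_t)0));
}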

> diff --git a/sys/i386/include/counter.h b/sys/i386/include/counter.h
> index 3e93b36..de90d85 100644
> --- a/sys/i386/include/counter.h
> +++ b/sys/i386/include/counter.h
> @@ -67,6 +67,85 @@ counter_64_inc_8b(uint64_t *p, int64_t inc)
> 	: "memory", "cc", "eax", "edx", "ebx", "ecx");
> }
>
> +#ifdef IN_SUBR_COUNTER_C
> +static inline uint64_t
> +counter_u64_read_one_8b(uint64_t *p)
> +{
> +	uint32_t res_lo, res_high;
> +
> +	__asm __volatile(
> +	"movl	%%eax,%%ebx\n\t"
> +	"movl	%%edx,%%ecx\n\t"
> +	"cmpxchg8b	(%2)"
> +	: "=a" (res_lo), "=d"(res_high)
> +	: "SD" (p)
> +	: "cc", "ebx", "ecx");
> +	return (res_lo + ((uint64_t)res_high << 32));
> +}

This is sufficiently MD, so it works with -fperverse-loads.

We accept the minor races that it gives.  Locking for read is nonsense
in general since values may become stale immediately after they are
read.  But we especially don't care about reading stale values for
counters.  I think x86 memory ordering guarantees that the stale values
don't go backwards by becoming staler.  We care about them having this
property.

> +static inline uint64_t
> +counter_u64_fetch_inline(uint64_t *p)
> +{
> +	uint64_t res;
> +	int i;
> +
> +	res = 0;
> +	if ((cpu_feature & CPUID_CX8) == 0) {
> +		/*
> +		 * The machines without cmpxchg8b are not SMP.

Should assert this somewhere.
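Something like this, e.g. in the counter initialization (placement is a
guess):

	KASSERT((cpu_feature & CPUID_CX8) != 0 || mp_ncpus == 1,
	    ("counter(9): SMP machine without cmpxchg8b"));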

> +		 * Disabling the preemption provides atomicity of the
> +		 * counter reading, since update is done in the
> +		 * critical section as well.
> +		 */
> +		critical_enter();
> +		for (i = 0; i < mp_ncpus; i++) {
> +			res += *(uint64_t *)((char *)p +
> +			    sizeof(struct pcpu) * i);
> +		}
> +		critical_exit();

Seems OK.  Works with -fperverse-loads.

> +static inline void
> +counter_u64_zero_one_8b(uint64_t *p)
> +{
> +
> +	__asm __volatile(
> +	"movl	(%0),%%eax\n\t"
> +	"movl	4(%0),%%edx\n"
> +	"xorl	%%ebx,%%ebx\n\t"
> +	"xorl	%%ecx,%%ecx\n\t"
> +"1:\n\t"
> +	"cmpxchg8b	(%0)\n\t"
> +	"jnz	1b"
> +	:
> +	: "SD" (p)
> +	: "memory", "cc", "eax", "edx", "ebx", "ecx");
> +}

Better than for amd64.  It works with -fperverse-stores, but has the
race accessing pcpu data from another CPU.

> +
> +static inline void
> +counter_u64_zero_inline(counter_u64_t c)
> +{
> +	int i;
> +
> +	if ((cpu_feature & CPUID_CX8) == 0) {
> +		critical_enter();
> +		for (i = 0; i < mp_ncpus; i++)
> +			*(uint64_t *)((char *)c + sizeof(struct pcpu) * i) = 0;
> +		critical_exit();

Works for the !SMP case, as for fetches.

> +	} else {
> +		for (i = 0; i < mp_ncpus; i++)
> +			counter_u64_zero_one_8b((uint64_t *)((char *)c +
> +			    sizeof(struct pcpu) * i));
> +	}

Inherits the races in counter_u64_zero_one_8b().

This should also try harder to set all the counters to 0 at the same time.
Loop a few times if necessary.  Not very important, since the result of
losing this race is much the same as reading a stale counter.  But there
is one case that matters, and is not really a race: suppose the first
counter increment after zeroing, or even the first counter increment
after initialization, is a decrement, say by 1.  Suppose that all the
other races are fixed, so that the increment occurs strictly after
the zeros.  Then it gives a counter value of 0xffffffffffffffff.  As
usual, the best fix is to not use unsigned types for anything.  Let
the decrement give the value -1.  Anything reporting this value should
use a signed format and be less surprised by it being -1 than by it
being 0xffffffffffffffff or 18446744073709551615.  Another fix is to
add a "small" bias to counters.  Instead of zeroing, set each pcpu counter
to the "small" value 0x10000000000000000 divided by the number of CPUs.
This should prevent each counter overflowing on decrements.  After adding
up the counters, convert results that are less than the combined bias to 0.

> diff --git a/sys/kern/subr_counter.c b/sys/kern/subr_counter.c
> ...

The full MD fix will be much larger.  Critical sections only work for
the !SMP case.  Loads and stores need something like x86 cmpxchg8b
and that might not exist on arches that support SMP.  It is easiest
to use atomic ops for the loads and stores.  Efficiency is unimportant
for these accesses since they are rare.  By using the arch's atomic ops,
you get the best that it can do.  All arches that support SMP must
have atomic ops for 32-bit loads and stores, but not necessarily for
64-bit loads and stores.  This is another reason not to use 64-bit
counters in the MI part of the KPI.

I already gave an implementation (that doesn't quite work, since it
has the same pcpu race as above) using only standard KPIs.  Here it is
again with minor updates:

%  	struct mtx *mp;
%  	uint64_t *ap;
%  	uint64_t r;
%  	volatile u_int *p;
%  	u_int v;
%  	int i;
% 
%  	mp = ctomtxp(c);		/* might not need 1 per counter */
%  	ap = ctoap(c);			/* accumulation counter for c */

The accumulation counter is locked by the counter mutex.

%  	mtx_lock(mp);
%  	r = *ap;
%  	for (i = 0; i < mp_ncpus; i++) {
%  		p = (volatile u_int *)((char *)c + sizeof(struct pcpu) * i);
%  		v = *p;			/* don't care if it is stale */

Buggy.  Broken with -fperverse-loads.  This needs to be a MD access
in asm, like you did above, but since it is only 32 bits, it can be a
single load instruction.
   (BTW, the i386 implementation of atomic.h is similarly broken with
    -fperverse-loads in the !SMP case.  It uses asm in the SMP cases,
    but in the !SMP case it lets the compiler decide the load
    instruction[s].  It also uses a volatile pointer.)
Volatile should stop -fperverse-loads from doing anything, but there
are no guarantees.  I don't like volatile, since using it mainly hides
the bugs that -fperverse-* would find.  Here only suitable MD accesses
are guaranteed to work.  I depend on the cmpset having a memory clobber
to tell the compiler to reload *p later.

The easiest fix is to use atomic_load_int().  This is MI and the best
currently available but unnecessarily slow.  Not extremely slow, since
this code is rarely executed.  Easier than adding pcpu_atomic_load_int()
to all arches.  PCPU_GET() has an inaccessible KPI, and is known to
be buggy.

Except for the above bugs, read accesses to *p are "locked" by it being
pcpu and by non-pcpu readers like here not caring about its exact
value.

%  		if (v >= 0x80000000) {
%  			/* Transfer only when above critical level. */
%  			while (atomic_cmpset_rel_int(p, v, 0) == 0)
%  				v = *p;	/* still don't care if it is stale */

Write accesses to *p need more careful locking.  This was supposed to be
locked by the atomic op combined with *p being pcpu.  But that doesn't
work, since this is a non-pcpu access and the pcpu accesses for increments
don't use atomic ops.

To work, the transfer needs to be done in pcpu context by the daemon.  Then
it doesn't need atomic ops, but using the same atomic cmpset as above is
convenient and not too slow since it rarely executes.  It saves having
to write pcpu_atomic_cmpset_int() for all arches.

%  			*ap += v;
%  		}
%  		r += v;
%  	}
%  	mtx_unlock(mp);
%  	return (r);

Zeroing is similar to transferring.  The daemon will have to be kicked
so that it runs on each CPU and transfers all counters.  This sets
all pcpu counters to 0.  The settings are transient and the pcpu counters
may become nonzero before completion, but usually they won't become very
far from 0.  Wait for the daemon to complete.  Clear the top-level
counter (now the API only supports clearing 1) while holding the mutex.
Add up all the pcpu counters for the counter being cleared normally, and
repeat until the full count is close enough to zero.
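
A sketch of the per-CPU transfer step that such a daemon would run (all
names invented; the daemon sched_bind()s to each CPU in turn so the pcpu
counter is only written from pcpu context, and the usual kernel headers
are assumed):

struct counter_desc {
	char		*cd_base;	/* CPU 0's 32-bit pcpu counter */
	uint64_t	cd_accum;	/* 64-bit accumulator */
	struct mtx	cd_mtx;		/* locks cd_accum */
};

static void
counter_transfer_one(struct counter_desc *cd)
{
	volatile u_int *p;
	u_int v;

	/* This CPU's 32-bit counter; we are bound to curcpu. */
	p = (volatile u_int *)(cd->cd_base + sizeof(struct pcpu) * curcpu);
	do {
		v = *p;		/* stale is fine; the cmpset revalidates */
	} while (atomic_cmpset_rel_int(p, v, 0) == 0);
	mtx_lock(&cd->cd_mtx);
	cd->cd_accum += v;
	mtx_unlock(&cd->cd_mtx);
}

The cmpset is not strictly needed in pcpu context, but as above it is
convenient and rare enough not to matter.  Zeroing then clears cd_accum
under the mutex once the transfer has completed on every CPU.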

Bruce


