Date:      Sat, 22 Jun 2013 18:58:15 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, Gleb Smirnoff <glebius@freebsd.org>, Konstantin Belousov <kib@freebsd.org>, src-committers@freebsd.org
Subject:   Re: svn commit: r252032 - head/sys/amd64/include
Message-ID:  <20130622174921.I3112@besplex.bde.org>
In-Reply-To: <20130622124832.S2347@besplex.bde.org>
References:  <201306201430.r5KEU4G5049115@svn.freebsd.org> <20130621065839.J916@besplex.bde.org> <20130621081116.E1151@besplex.bde.org> <20130621090207.F1318@besplex.bde.org> <20130621064901.GS1214@FreeBSD.org> <20130621184140.G848@besplex.bde.org> <20130621135427.GA1214@FreeBSD.org> <20130622110352.J2033@besplex.bde.org> <20130622124832.S2347@besplex.bde.org>

On Sat, 22 Jun 2013, I wrote:

> On Sat, 22 Jun 2013, I wrote:
>
>> ...
>> Here are considerably expanded tests, with noninline tests dropped.
>> Summary of times on Athlon64:
>> 
>> simple increment:                               4-7 cycles (1)
>> simple increment preceded by feature test:      5-8 cycles (1)
>> simple 32-bit increment:                        4-7 cycles (2)
>> correct 32-bit increment (addl to mem):         5.5-7 cycles (3)
>> inlined critical section:                       8.5 cycles (4)
>> better inlined critical section:                7 cycles (5)
>> correct unsigned 32-bit inc of 64-bit counter:  4-7 cycles (6)
>> "improve" previous to allow immediate operand:  5+ cycles
>> correct signed 32-bit inc of 64-bit counter:    8.5-9 cycles (7)
>> correct 64-bit inc of 64-bit counter:           8-9 cycles (8)
>> -current method (cmpxchg8b):                   18 cycles
>
> corei7 (freefall) has about the same timing as Athlon64, but core2
> (ref10-i386) is 3-4 cycles slower for the tests that use cmpxchg.
>
>> (4) The critical section method is quite fast when inlined.
>> (5) The critical section method is even faster when optimized.  This is
>>    what should be used if you don't want the complications for the
>>    daemon.
>
> Oops, I forgot that critical sections are much slower in -current than
> in my version.  They probably take 20-40 cycles for the best case, and
> can't easily be tested in userland since they disable interrupts in
> hardware.  My versions disable interrupts in software.

Re-oops, I was looking at the wrong tree.  -current is similar to my version
here.

Anyway, almost all this is moot.  The counter fetch is inherently
non-atomic on i386, so it is pointless to do the counter update
atomically at the per-CPU level.  Losing the race from non-atomic
counter updates is very unlikely, since the race window is about 2 nsec
wide and occurs only on approximately every 4 billionth update.  Losing
the race in the fetch is a little more likely.  The latter might be
handled by a heuristic, and anything that handles it works for the
update too.  It seems to be good enough to detect counters going
backwards and to ignore the new value then.  That takes more space for
the record of the previous values than I like.
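
To make the heuristic concrete, here is a minimal sketch (none of these
names exist in the tree; the per-CPU values are passed as a plain array
instead of being found at sizeof(struct pcpu) offsets, and MAXCPU is a
placeholder):

#include <stdint.h>

#define MAXCPU 64                       /* placeholder for the sketch */

static uint64_t prev[MAXCPU];           /* last value seen for each CPU;
                                           a real version needs one such
                                           array per counter, which is
                                           the space cost noted above */

uint64_t
counter_fetch_heuristic(const volatile uint64_t *pcpu_vals, int ncpus)
{
        uint64_t r, v;
        int i;

        r = 0;
        for (i = 0; i < ncpus; i++) {
                v = pcpu_vals[i];       /* non-atomic, possibly torn */
                if (v < prev[i])
                        v = prev[i];    /* went backwards: assume a torn
                                           read or lost race and keep the
                                           previously seen value */
                else
                        prev[i] = v;
                r += v;
        }
        return (r);
}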

Non-atomic fetches from subr_counter.c:

% void
% counter_u64_zero(counter_u64_t c)
% {
% 	int i;
% 
% 	for (i = 0; i < mp_ncpus; i++)
% 		*(uint64_t *)((char *)c + sizeof(struct pcpu) * i) = 0;
% }

Non-atomic stores are more of a problem.  Races in stores can create
permanently invalid counter values.  Perhaps i386 memory ordering is
strong enough to help, but I think it doesn't help much.  It would take
a locked cmpxchg8b in the i386 update, and maybe here too, to avoid the
races, and the lock would defeat the point of pcpu counters.
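
For what it's worth, a race-free 64-bit store on i386 would have to look
something like the sketch below.  GCC's __sync_val_compare_and_swap() on
a uint64_t compiles to a lock cmpxchg8b loop when targeting Pentium or
later; the function name here is made up, and the lock prefix is exactly
the cost that pcpu counters try to avoid.

#include <stdint.h>

static void
atomic_store_64(volatile uint64_t *p, uint64_t v)
{
        uint64_t old, prev;

        old = *p;                       /* may be torn; the CAS below
                                           rejects a torn value */
        for (;;) {
                prev = __sync_val_compare_and_swap(p, old, v);
                if (prev == old)
                        break;          /* stored v atomically */
                old = prev;             /* raced or read a torn value;
                                           retry with what cmpxchg8b
                                           reported */
        }
}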

% 
% uint64_t
% counter_u64_fetch(counter_u64_t c)
% {
% 	uint64_t r;
% 	int i;
% 
% 	r = 0;
% 	for (i = 0; i < mp_ncpus; i++)
% 		r += *(uint64_t *)((char *)c + sizeof(struct pcpu) * i);
% 
% 	return (r);
% }

This code depends on 64-bit loads being atomic.  They just aren't on
32-bit arches.  The stores for the updates may go out in any order,
and might be stale.  The reads here may be in any order, especially
since this is written in C.  We don't care much about stale values, but
we do want monotonically increasing values.  Negative counter increments
really shouldn't be supported, since they cause complications
everywhere.  Here they break the heuristic that a negative-going fetched
value means a lost race.
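
For contrast, one way to get an untorn 64-bit load on i386 is again a
locked cmpxchg8b, e.g. the sketch below (same GCC builtin as above, so
it needs a cmpxchg8b-capable target; the name is made up).  Comparing
against 0 and "storing" 0 leaves the value unchanged unless it was
already 0, so it behaves as a pure atomic read, but every per-CPU slot
then pays for a locked instruction.

#include <stdint.h>

static uint64_t
atomic_load_64(volatile uint64_t *p)
{

        return (__sync_val_compare_and_swap(p, (uint64_t)0, (uint64_t)0));
}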

So the i386 version should simply be "addl; adcl" to memory.  Each store in
this is atomic at the per-CPU level.  If there is no carry, then the
separate stores are equivalent to adding separate nonnegative values and
the counter value is valid at all times.  If there is carry, then the
separate stores are equivalent to adding a negative value followed by
a larger positive value.  The counter transiently goes backwards, but
no information is lost and the counter is eventually set to the correct
forward-going value.
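
As a sketch only (not committed code), the update could be written with
GCC inline asm roughly as follows.  The real kernel version would also
need the per-CPU addressing (e.g. %fs-relative on i386) and preemption
protection that are elided here; the "ri" constraints allow either a
register or an immediate increment.

#include <stdint.h>

static inline void
counter_64_inc_i386(uint64_t *p, uint64_t inc)
{
        /* Low 32 bits sit at the lower address on little-endian x86. */
        uint32_t *p32 = (uint32_t *)p;

        __asm__ __volatile__(
                "addl   %2, %0\n\t"     /* add low word, may set carry */
                "adcl   %3, %1"         /* add high word plus carry */
                : "+m" (p32[0]), "+m" (p32[1])
                : "ri" ((uint32_t)inc), "ri" ((uint32_t)(inc >> 32))
                : "cc");
}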

Bruce


