Date:      Tue, 2 Apr 2013 16:57:22 +0200
From:      Luigi Rizzo <rizzo@iet.unipi.it>
To:        Gleb Smirnoff <glebius@FreeBSD.org>
Cc:        arch@FreeBSD.org, kib@FreeBSD.org
Subject:   Re: [CFR][CFT] counter(9): new API for faster and raceless counters
Message-ID:  <20130402145722.GA9161@onelab2.iet.unipi.it>
In-Reply-To: <20130402141717.GK76816@glebius.int.ru>
References:  <20130401115128.GZ76816@FreeBSD.org> <20130402013758.GA96904@onelab2.iet.unipi.it> <20130402141717.GK76816@glebius.int.ru>

On Tue, Apr 02, 2013 at 06:17:17PM +0400, Gleb Smirnoff wrote:
>   Luigi,
> 
> On Tue, Apr 02, 2013 at 03:37:58AM +0200, Luigi Rizzo wrote:
> L> API:
...
> L> - there are several places which use multiple counters
> L>   (e.g. packet and byte counters, global and per flow/socket),
> L>   so i wonder if it may help to provide a "protected" version of
> L>   counter_u64_add() that requires the critical_enter/exit
> L>   only once. Something like
...
> Here is patch for review. It adds 4 more primitives:
> 
> counter_enter();
> counter_u64_add_prot(c, x);
> counter_u64_subtract_prot(c, x);
> counter_exit();
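
(so for the packet+byte case i mentioned, usage would presumably be
along the lines of the sketch below; the two counter names and "len"
are made up for the example:)

	/* sketch: update two related counters in one protected section */
	counter_enter();
	counter_u64_add_prot(pkt_count, 1);	/* per-flow packet count */
	counter_u64_add_prot(byte_count, len);	/* per-flow byte count */
	counter_exit();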

thanks for the patch. I have three more comments:

- is it really necessary to have the "subtract" version?
  Couldn't one just make "x" an int64_t and pass a negative value,
  or would that generate too many warnings?

- (this can be fixed later) in the i386 version, counter_enter()
  and counter_exit() have an if statement which may become quite
  expensive if mispredicted. Also, the test is repeated 3 times in
  counter_u64_add() (enter/add/exit). Hopefully the compiler
  optimizes out the extra calls, but the previous version seemed
  more readable. Anyway, at some point we should figure out
  whether putting likely/unlikely annotations on the result of
  (cpu_feature & CPUID_CX8) may improve performance where it matters.
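  Something like the sketch below is what i have in mind; the helper
  name counter_64_inc_8b() is only my guess at what the cmpxchg8b
  path in your counter.h is called:

	static inline void
	counter_u64_add(counter_u64_t c, int64_t inc)
	{
		if (__predict_true(cpu_feature & CPUID_CX8)) {
			/* expected case: lockless 64-bit add */
			counter_64_inc_8b(c, inc);
		} else {
			/* pre-cx8 fallback: pin to the CPU and do a
			 * plain per-CPU add */
			critical_enter();
			counter_u64_add_prot(c, inc);
			critical_exit();
		}
	}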

- do you plan to provide an API to initialize a counter to 0 or a
  specific value? I suppose this is done implicitly on allocation,
  but there are cases (e.g. ipfw) where the application explicitly
  zeroes counters.
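  What i have in mind is something along the lines of the sketch
  below; counter_u64_zero() is a name i am making up here, whatever
  you pick is fine:

	counter_u64_t c;

	c = counter_u64_alloc(M_WAITOK);	/* presumably starts at 0 */
	counter_u64_add(c, 1);			/* accumulate for a while */
	counter_u64_zero(c);			/* explicit reset, e.g. "ipfw zero" */
	counter_u64_free(c);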

> L> BENCHMARK:
> L> 
> L> > I've got a simple benchmark. A syscall that solely updates a counter is
> L> > implemented. A number of processes equal to the number of cores is spawned;
> L> > each process binds itself to a dedicated CPU, calls the syscall 10^6
> L> > times and exits. The parent wait(2)s for them all and I measure the real time of
> L> 
> L> - I am under the impression that these benchmarks are dominated
> L>   by the syscall time, and the new counters would exhibit a lot
> L>   better relative performance (compared to racy or atomic)
> L>   by doing 10..100 counter ops per syscall. Any chance to retry
> L>   the test in this configuration?
> 
> Ok, as you wish.
> 
> Apparently the compiler optimises away things like:
> 
> 	for (int i = 0; i < rounds; i++)
> 		the_counter += v;
> 
> To avoid optimisations here I declared the_counter as volatile. Is the
> benchmark fair after that? Anyway, here are results for (rounds == 2000000000):
> 
> x counter_u64_add(), result == 2000000000 * ncpus
> + racy increment, result == 2022232241 (!!!)
> +------------------------------------------------------------------------------+
> |  x                                                                   +       |
> |  x                                                                   +       |
> |  x                                                                   ++      |
> |  x                                                                   ++      |
> |  x        x                                                         +++     +|
> ||_MA__|                                                             |_MA_|    |
> +------------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x  10             5          5.44          5.01         5.053    0.13605963
> +  10          8.16          8.53           8.2         8.235    0.10721215
> Difference at 95.0% confidence
>         3.182 +/- 0.115089
>         62.9725% +/- 2.27764%
>         (Student's t, pooled s = 0.122488)
> 
> So a 63% speedup, not to mention the fact that in such a tight loop 98% of
> the parallel updates are lost on the racy counter :)
> 
> A tight loop with atomic_add() is 22 times (twenty-two times) slower than
> the new counter. I didn't bother to run ministat :)

yes, i think this really does justice to the improvements of the new code
(i am a bit unclear on what actual test you ran / how many counter_u64_add()
calls per syscall you have, but i do expect the racy counters to be much
slower and much less reliable, and the 20x slowdown with atomics is
completely expected.)
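
my guess is that the syscall body was essentially the loop below,
with the racy variant bumping a volatile global instead; please
correct me if the setup was different:

	/* guessed shape of the benchmark body; "rounds" and the
	 * volatile racy counter are from your mail, the rest is my
	 * assumption */
	volatile uint64_t the_counter;		/* shared by all CPUs, racy */
	counter_u64_t c;			/* from counter_u64_alloc() */

	for (int i = 0; i < rounds; i++)
		counter_u64_add(c, 1);		/* new API */

	for (int i = 0; i < rounds; i++)
		the_counter += 1;		/* racy increment */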

cheers
luigi


