Date:      Thu, 7 Jun 2012 07:14:06 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Dag-Erling Smørgrav <des@des.no>
Cc:        Gianni <gianni@FreeBSD.org>, John Baldwin <jhb@FreeBSD.org>, Alan Cox <alc@rice.edu>, Alexander Kabaev <kan@FreeBSD.org>, Attilio Rao <attilio@FreeBSD.org>, Konstantin Belousov <kib@FreeBSD.org>, freebsd-arch@FreeBSD.org, Konstantin Belousov <kostikbel@gmail.com>
Subject:   Re: Fast vs slow syscalls (Re: Fwd: [RFC] Kernel shared variables)
Message-ID:  <20120607064951.C1106@besplex.bde.org>
In-Reply-To: <864nqovoek.fsf@ds4.des.no>
References:  <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2%2BoYo%2BwwT4ipA@mail.gmail.com> <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> <20120605171446.GA28387@onelab2.iet.unipi.it> <20120606040931.F1050@besplex.bde.org> <864nqovoek.fsf@ds4.des.no>

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--0-925939591-1339017246=:1106
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Wed, 6 Jun 2012, Dag-Erling Smørgrav wrote:

> Bruce Evans <brde@optusnet.com.au> writes:
>> Dag-Erling Smørgrav <des@des.no> writes:
>>> getpid(): 10,000,000 iterations in 24,400 ms
>>> gettimeofday(0, 0): 10,000,000 iterations in 54,104 ms
>>> raise(0): 10,000,000 iterations in 1,284,593 ms
>> That's one slow system or broken units.
>
> Broken units, these are microseconds not milliseconds.  Sorry.
>
>> After adjusting by factors of 1000 here and there, this format is still
>> hard to parse.  I like the format of nsec/operation.  24400 10 million
>> operations in 24400 moroseconds seems to scale to 2.44 nsec/call (if 1
>> moro = 1 micro).  But that is impossibly fast, unless getpid() is
>> inlined to a load of the shared variable (it may also need the load to
>> be moved outside the loop).  I can't see any reasonable adjustment that
>> gives 24.4 nsec/call.
>
> #define ITERATIONS 10000000
>
>    struct timeval start, end;
>    int i;
>
>    gettimeofday(&start, NULL);
>    for (i = 0; i < ITERATIONS; ++i)
>        getpid();
>    gettimeofday(&end, NULL);

Now 2.44 nsec/call makes sense, but you really should add some volatiles
here to ensure that getpid() is not optimized away.  I get 3.48-3.49
nsec/call on an Athlon64 2GHz (the ratio of the times is almost exactly
proportional to the clock frequencies, so the times in cycles must be
almost identical).
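
A minimal sketch of the "add some volatiles" suggestion together with the
nsec/call conversion used above.  The names sink, ns_per_call() and
time_getpid_us() are hypothetical and appear nowhere in this thread.  Storing
the result in a volatile object forces a store on every iteration; it is
assumed here that the compiler does not treat getpid() as free of side
effects, since otherwise the call itself could still be hoisted out of the
loop.

```c
#include <stddef.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sink: volatile, so every store in the loop must happen. */
volatile pid_t sink;

/* Convert a total elapsed time in microseconds to nanoseconds per call. */
double
ns_per_call(double elapsed_us, long iterations)
{
	return elapsed_us * 1000.0 / (double)iterations;
}

/* Time `iterations` calls of getpid(); return the elapsed microseconds. */
double
time_getpid_us(long iterations)
{
	struct timeval start, end;
	long i;

	gettimeofday(&start, NULL);
	for (i = 0; i < iterations; i++)
		sink = getpid();	/* result is used, so it is not dead code */
	gettimeofday(&end, NULL);
	return (end.tv_sec - start.tv_sec) * 1e6 +
	    (end.tv_usec - start.tv_usec);
}
```

With the figures quoted above, ns_per_call(24400, 10000000) works out to
2.44 nsec/call and ns_per_call(54104, 10000000) to 5.4104 nsec/call.  A
caller would pass the result of time_getpid_us(n) together with the same n.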

> On Linux, gcc 4.4.6 compiles this to:
>
>   # gettimeofday(&start, NULL)
>   0x000000000040064b <+23>:    lea    -0x20(%rbp),%rax
>   0x000000000040064f <+27>:    mov    $0x0,%esi
>   0x0000000000400654 <+32>:    mov    %rax,%rdi
>   0x0000000000400657 <+35>:    callq  0x400500 <gettimeofday@plt>
>
>   # i = 0
>   0x000000000040065c <+40>:    movl   $0x0,-0x4(%rbp)
>   0x0000000000400663 <+47>:    jmp    0x40066e <main+58>
>
>   # getpid()
>   0x0000000000400665 <+49>:    callq  0x400520 <getpid@plt>
>
>   # ++i
>   0x000000000040066a <+54>:    addl   $0x1,-0x4(%rbp)
>
>   # i < ITERATIONS
>   0x000000000040066e <+58>:    cmpl   $0x98967f,-0x4(%rbp)
>   0x0000000000400675 <+65>:    jle    0x400665 <main+49>
>
>   # gettimeofday(&end, NULL)
>   0x0000000000400677 <+67>:    lea    -0x30(%rbp),%rax
>   0x000000000040067b <+71>:    mov    $0x0,%esi
>   0x0000000000400680 <+76>:    mov    %rax,%rdi
>   0x0000000000400683 <+79>:    callq  0x400500 <gettimeofday@plt>
>
> The code generated by gcc 4.2.1 on FreeBSD is almost identical:
> ...

So it loops OK, but we can't see what getpid() does.  It must not be
doing much.

> I don't know why gcc 4.4.6 loads &start / &end into %rax before copying
> it to %esi instead of loading it directly into %esi like 4.2.1 does.  I
> used the same command line (gcc -Wall -Wextra syscall.c) in both cases.

Probably unimportant (buried in loop overhead).

Program for 3.48-3.49 nsec:

% volatile int gpid;

It isn't volatile, but declaring it volatile prevents gcc-3.3.1 optimizing
away the whole call to getpid() (optimizing it away reduces the time to
0.99 nsec = 2 cycles; 2 cycles is the minimum loop overhead on most
current x86).

%
% int
% getpid(void)
% {
% 	return gpid;
% }
%
% main()
% {
% 	int i;
%
% 	for (i = 0; i < 1000000000; i++)
% 		getpid();
% }

Compiling with cc -O -fomit-frame-pointer gives:

% 08048520 <getpid>:
%  8048520:	a1 0c 97 04 08       	mov    0x804970c,%eax
%  8048525:	c3                   	ret
%  8048526:	89 f6                	mov    %esi,%esi
%
% 08048528 <main>:
%  8048528:	55                   	push   %ebp
%  8048529:	89 e5                	mov    %esp,%ebp
%  804852b:	53                   	push   %ebx
%  804852c:	83 ec 04             	sub    $0x4,%esp
%  804852f:	83 e4 f0             	and    $0xfffffff0,%esp
%  8048532:	bb 00 00 00 00       	mov    $0x0,%ebx
%  8048537:	90                   	nop
%  8048538:	e8 e3 ff ff ff       	call   8048520 <getpid>
%  804853d:	43                   	inc    %ebx
%  804853e:	81 fb ff c9 9a 3b    	cmp    $0x3b9ac9ff,%ebx
%  8048544:	7e f2                	jle    8048538 <main+0x10>
%  8048546:	8b 5d fc             	mov    0xfffffffc(%ebp),%ebx
%  8048549:	c9                   	leave
%  804854a:	c3                   	ret
%  804854b:	90                   	nop

-fomit-frame-pointer gives nicer object code but has no effect on the
runtime.

gettimeofday() needs several branches for null pointers, so it is much slower
even before it does useful work.  Your system has an indirection or 2
for shared libraries (1 for the function call and maybe more for the global
pid), so it is doing well for getpid() to be no slower in cycles.  kib's
version has lots of layering (function calls and indirections inherited from
the kernel version where they are more needed) that might make it get to the
useful work at about the same time Linux has done it and returned.

5.4104 nsec/call for gettimeofday() is impossible if there is any
rdtsc() hardware call or much layering.  rdtsc() takes 9-12 cycles on
AthlonXP and Athlon64, but 40+ cycles on Phenom+ and on most (?) Intel
CPUs and on most CPUs where it is P-state invariant (it is apparently
as hard or harder to synchronize in hardware as in software).  So Linux
can't be calling it to get 5.4104 nsec/call.  But calling and using
it should only take another 13-20 nsec at 3 GHz.  Excessive generality
in the software parts probably adds 10-20 nsec to this.  ISTR measuring
29 nsec (60+ cycles) for binuptime() on an Athlon XP.  That's with the hardware
part taking about 12 cycles.  gettimeofday()'s poor API adds a lot to
this.
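
The cycle arithmetic behind these estimates can be sketched as below.  The
helper names cycles_to_ns() and ns_to_cycles() are hypothetical.  The 3 GHz
clock is the figure used in the text above, and taking 40 to 60 cycles as the
range for a slow rdtsc is an assumption made only to reproduce the 13-20 nsec
estimate.

```c
/* Nanoseconds taken by `cycles` CPU cycles at a clock of `ghz` GHz. */
double
cycles_to_ns(double cycles, double ghz)
{
	return cycles / ghz;
}

/* CPU cycles that elapse in `ns` nanoseconds at a clock of `ghz` GHz. */
double
ns_to_cycles(double ns, double ghz)
{
	return ns * ghz;
}
```

Under those assumptions, cycles_to_ns(40, 3.0) is about 13.3 nsec and
cycles_to_ns(60, 3.0) is 20 nsec, while ns_to_cycles(5.4104, 3.0) is only
about 16 cycles, fewer than a single 40+ cycle rdtsc.  On the 2 GHz Athlon64,
ns_to_cycles(0.99, 2.0) is about 2 cycles and ns_to_cycles(3.48, 2.0) is
about 7 cycles.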

Bruce
--0-925939591-1339017246=:1106--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120607064951.C1106>