From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 28 13:23:07 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Date: Sat, 29 Dec 2012 00:22:57 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: UPDATE Re: making use of userland dtrace on FreeBSD
In-Reply-To: <20121227190904.GL82219@kib.kiev.ua>
Message-ID: <20121228224312.X1054@besplex.bde.org>
References: <50D49DFF.3060803@ixsystems.com> <50DBC7E2.1070505@mu.org>
 <CAGE5yCq46NFKKzSUZq=jz0NwEnWdjPTK_0fpZ+wWV9FA0BSQCg@mail.gmail.com>
 <50DBD193.7080505@mu.org>
 <CAGE5yCrnoNhOh3VaYU3bO6BwA=bpxD5QzkZvD+HaUwvXNQ+Ufw@mail.gmail.com>
 <50DBE0DB.6090804@ixsystems.com> <20121227214354.V965@besplex.bde.org>
 <20121227190904.GL82219@kib.kiev.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: "arch@freebsd.org" <arch@freebsd.org>

On Thu, 27 Dec 2012, Konstantin Belousov wrote:

> On Thu, Dec 27, 2012 at 11:39:44PM +1100, Bruce Evans wrote:
>> After working around these bugs by putting the functions in separate files
>> (and removing the now-unneeded volatiles):
>>
>> main.c:
>> % void foo(void);
>> %
>> % int
>> % main(void)
>> % {
>> % 	int i;
>> %
>> % 	for (i = 0; i < 100000000; i++)
>> % 		foo();
>> % }
>>
>> foo.c:
>> % void bar(void);
>> %
>> % void
>> % foo(void)
>> % {
>> % 	bar();
>> % }
>>
>> bar.c:
>> % void
>> % bar(void)
>> % {
>> % }
>>
>> we can see how much the frame pointer optimization is saving: this
>> now takes 0.43 seconds with clang and 0.87 seconds with gcc.  It
>> is weird that the gcc time increased from 0.65 seconds to 0.87
>> despite doing less.  After adding back the volatiles, the times
>> are 0.43 seconds with clang and 0.85 seconds with gcc -- doing
>> more gave a small optimization, but didn't recover 0.65 seconds.
>> There is apparently some magic alignment or misalignment which
>> costs or saves about the same as omitting the frame pointer.
>> Finally, with gcc -O -fomit-frame-pointer, the program takes 0.60
>> seconds, and with gcc -O2 -fomit-frame-pointer, it takes 0.49
>> seconds, and with gcc -O2, it takes 0.49 seconds (this really doesn't
>> omit frame pointers, so omitting the frame pointer saves nothing).
>> With cc -O -fno-omit-frame-pointer, it takes 0.43 seconds, but this
>> case is just broken -- the -fno-omit-frame-pointer is silently ignored :-(.

> I do not believe this measurement is indicative.

Yes, since this program is too simple to be representative.

> i386 is
> register-starved architecture. Using the frame pointer means that
> you are left with only 6 registers instead of 7. For the PIC code,
> there are 5 vs. 6. It is real code that does something more than
> incrementing the same variable which could get the performance hit with
> -fno-omit-frame-pointer for i386. But on i386 use of the frame pointer
> is ABI mandated.

Register starvation is another thing that makes very little difference.
But here is another non-representative program that goes to the opposite
extreme to get register starvation.  The result is that omitting the
frame pointer is a small pessimization on i386 and makes no difference
on amd64:

main.c:
% int foo(int, int, int, int, int, int, int, int, int, int);
% 
% volatile int mf;
% 
% int
% main(void)
% {
% 	int i;
% 
% 	for (i = 0; i < 100; i++)
% 		mf += foo(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
% }

bar.c:
% int
% bar(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j)
% {
% 	int n, r;
% 
% 	r = 0;
% 	for (n = 0; n < 1000; n++) {
% 		r += a + b + c + d + e + f + g + h + i + j;
% 		a += b;
% 		b = -b;
% 		c += d;
% 		d = -d;
% 		e += f;
% 		f = -f;
% 		g += h;
% 		h = -h;
% 		i += j;
% 		j = -j;
% 	}
% 	return (r);
% }

foo.c:
% int bar(int, int, int, int, int, int, int, int, int, int);
% 
% int
% foo(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j)
% {
% 	int n, r;
% 
% 	r = 0;
% 	for (n = 0; n < 1000; n++)
% 		r += bar(a, b, c, d, e, f, g, h, i, j);
% 	return (r);
% }

i386 times on ref10-i386:
gcc -O2 -o f main.c bar.c foo.c:                         0.81 seconds
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer: 0.81 seconds
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:    0.83 seconds
 cc -O2 -o f main.c bar.c foo.c:                         1.11 seconds
 cc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer: 1.11 seconds
 cc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:    1.11 seconds

0.81 seconds is 15.08 cycles/iteration in the inner loop.   0.83 seconds
is 15.45 cycles/iteration.  1.11 seconds is 20.67 cycles/iteration.
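
The conversion from seconds to cycles/iteration can be sketched as below.
This is only an illustration: the clock rate is not stated anywhere above,
so CLOCK_HZ is an assumption back-computed from the quoted figures (about
1.862 GHz), and the iteration count is the product of the three loop bounds
(100 in main(), 1000 in foo(), 1000 in bar()).

```c
/*
 * Sketch: convert a measured run time to cycles per inner-loop iteration.
 * CLOCK_HZ is an assumption inferred from the figures in the text
 * (15.08 cycles / (0.81 s / 1e8 iterations) ~= 1.862 GHz); the real
 * clock rate of the test machine is not given.
 */
#define	CLOCK_HZ	1.862e9				/* assumed */
#define	ITERATIONS	(100.0 * 1000.0 * 1000.0)	/* main * foo * bar */

static double
cycles_per_iteration(double seconds)
{
	return (seconds * CLOCK_HZ / ITERATIONS);
}
```

With that assumed clock rate, the three times above come out within 0.01
of the quoted cycle counts.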

The inner loop has 12 variables.  Since i386 has only 6 or 7 integer
registers, these can't be kept in registers.  Checking the generated
code in the inner loop in bar() shows that the source code is
complicated enough to prevent significant optimizations.  It has the
following number of memory references:

gcc -O2 -o f main.c bar.c foo.c:                         7(r) 1(w) 4(r+w) = 16
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer: not counted
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:    7(r) 1(w) 6(r+w) = 20
 cc -O2 -o f main.c bar.c foo.c:                         13(r) 13(w) 2(r+w) = 30
 cc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer: not counted
 cc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:    not counted

Although gcc -fomit-frame-pointer gives 4 fewer memory references, it
runs 0.37 cycles/iteration slower.  Apparently the extra memory
references are always executed in parallel, so they are free. References
relative to %ebp are 1 byte smaller than ones relative to %esp.  Apparently
this is enough to avoid 0.37 cycles of penalties for the larger instructions.
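
The 1-byte difference comes from the i386 instruction encoding: a memory
operand based on %esp always needs a SIB byte, while one based on %ebp does
not.  A small sketch of the two encodings follows.  The instruction, the
register and the offset 8 are picked only for illustration and are not
taken from the generated code.

```c
#include <stddef.h>

/*
 * addl %eax, 8(%ebp): opcode 0x01, ModRM 0x45 (mod=01, reg=%eax,
 * rm=%ebp), then the 8-bit displacement.
 */
static const unsigned char add_ebp_rel[] = { 0x01, 0x45, 0x08 };

/*
 * addl %eax, 8(%esp): rm=100 in the ModRM byte (0x44) means "SIB byte
 * follows", so a SIB byte (0x24: base %esp, no index) is required
 * before the displacement.
 */
static const unsigned char add_esp_rel[] = { 0x01, 0x44, 0x24, 0x08 };

/* Extra bytes needed to address the same slot through %esp. */
static size_t
esp_encoding_penalty(void)
{
	return (sizeof(add_esp_rel) - sizeof(add_ebp_rel));
}
```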

cc (clang) generates remarkably bad code for this.  This is shown by all
metrics:
- more instructions
- less use of read-modify-write instructions.  Where gcc generates lots of
   addl's to memory variables, clang likes to load the memory variables,
   add to them, and write them back.  In theory this is no slower unless
   the code becomes too large, since it takes the same number
   of memory references and should reduce to almost the same micro-ops.
   But the way clang does it somehow also gives almost twice as many memory
   references (30 instead of 16).
- many more memory references
- lots of spills which are carefully annotated by clang.  It documents 15
   spills and 13 reloads.

amd64 times on ref10-amd64:
gcc -O2 -o f main.c bar.c foo.c:                         0.37 seconds
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer: 0.37 seconds
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:    0.37 seconds
 cc -O2 -o f main.c bar.c foo.c:                         0.41 seconds
 cc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer: 0.41 seconds
 cc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:    0.41 seconds

Now both compilers generate very nice code for bar(), with the variables
in the inner loop all in registers, and not so nice code for foo() (it
takes just 1 more instruction for all the arithmetic in the inner loop
in bar() than for calling bar()).  clang somehow finds a way to be
slower here too.  -fomit-frame-pointer now makes no difference at all,
since it only takes about 3 extra instructions for every call to bar().
I made the inner loop in bar() too heavyweight to test it (oops).
Another oops is that I forgot to modify the modification of the
variables in bar().  It gets optimized away, and the code in bar() is
only so very nice because bar() reduces to adding up the 10 variables.

Now with the inner loop in bar() removed and the number of iterations in
foo() multiplied by 1000 to compensate:

amd64 times on ref10-amd64:
gcc -O2 -o f main.c bar.c foo.c:                         0.64 seconds
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer: 0.76 seconds
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:    0.64 seconds
 cc -O2 -o f main.c bar.c foo.c:                         0.67 seconds
 cc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer: 0.67 seconds
 cc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:    0.61 seconds

-fomit-frame-pointer is finally giving an optimization.  It does so even
for clang, and this is weird because for clang it only affects the
non-leaf functions main() and foo() for which there are only 1+100
frame pointer initializations and finalizations.  Having a frame pointer
is apparently pessimizing foo() by changing the instructions that it uses
to call bar().
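
For reference, the modified benchmark (inner loop in bar() removed,
iteration count in foo() multiplied by 1000) presumably looks something
like the sketch below.  This is a reconstruction from the description, not
the exact source that was timed.  Both functions are shown together here
only for brevity; as before, they have to stay in separate files so that
the compilers cannot inline the calls.

```c
/*
 * bar.c (reconstructed): the loop is gone, so the body runs once per
 * call.  The updates to the parameters are now dead and get optimized
 * away, leaving just the sum of the 10 arguments.
 */
int
bar(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j)
{
	int r;

	r = a + b + c + d + e + f + g + h + i + j;
	a += b;
	b = -b;
	c += d;
	d = -d;
	e += f;
	f = -f;
	g += h;
	h = -h;
	i += j;
	j = -j;
	return (r);
}

/*
 * foo.c (reconstructed): 1000 times more iterations than before, to
 * compensate for the loop removed from bar().
 */
int
foo(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j)
{
	int n, r;

	r = 0;
	for (n = 0; n < 1000 * 1000; n++)
		r += bar(a, b, c, d, e, f, g, h, i, j);
	return (r);
}
```

main.c is unchanged, so bar() is still called 100 * 1000 * 1000 times,
while main() and foo() together set up a frame only 1+100 times.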

i386 times on ref10-i386:
gcc -O2 -o f main.c bar.c foo.c:                         1.24 seconds
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer: 1.24 seconds
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:    1.16 seconds
 cc -O2 -o f main.c bar.c foo.c:                         1.11 seconds
 cc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer: 1.11 seconds
 cc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:    1.23 seconds

Now the frame pointer in bar() has obvious costs for gcc.  But the
change in the time for clang is even weirder than the changes above:
for clang, the frame pointer is always omitted in the leaf function
bar(); there are only 1+100 frame pointer initializations and
finalizations, as for clang on amd64.  On amd64, omitting the frame
pointer in foo() gave an optimization from 0.67 to 0.61 seconds, but on
i386 it gives a pessimization from 1.11 seconds to 1.23 seconds.

> For amd64, there is no so high pressure on the register file, but I do
> not know that much debugging tools which expect the frame pointer on
> amd64 or could detect and use it if present. It is only ddb for our
> kernel and dtrace for solaris and freebsd, gdb definitely does not.

I was originally going to say that the number of registers is almost
as irrelevant for performance as -fomit-frame-pointer :-).  amd64 is
typically slightly slower than i386 despite its extra registers and
"optimized" ABI with parameters passed in registers, since the
optimizations aren't quite enough to recover from the bloat of 64-bit
pointers and function parameters.  But my non-representative examples
made amd64 about twice as fast.  And to demonstrate loss due to the
frame pointer increasing register pressure significantly, I should
have tried an example where everything fits in registers with
-fomit-frame-pointer but not without.  Such an example would be even
more non-representative.

Bruce