Date: Sat, 29 Dec 2012 00:22:57 +1100 (EST)
From: Bruce Evans
To: Konstantin Belousov
Cc: "arch@freebsd.org"
Subject: Re: UPDATE Re: making use of userland dtrace on FreeBSD
Message-ID: <20121228224312.X1054@besplex.bde.org>
In-Reply-To: <20121227190904.GL82219@kib.kiev.ua>
List-Id: Discussion related to FreeBSD architecture

On Thu, 27 Dec 2012, Konstantin Belousov wrote:

> On Thu, Dec 27,
2012 at 11:39:44PM +1100, Bruce Evans wrote:
>> After working around these bugs by putting the functions in separate files
>> (and removing the now-unneeded volatiles):
>>
>> main.c:
>> % void foo(void);
>> %
>> % int
>> % main(void)
>> % {
>> %     int i;
>> %
>> %     for (i = 0; i < 100000000; i++)
>> %         foo();
>> % }
>>
>> foo.c:
>> % void bar(void);
>> %
>> % void
>> % foo(void)
>> % {
>> %     bar();
>> % }
>>
>> bar.c:
>> % void
>> % bar(void)
>> % {
>> % }
>>
>> we can see how much the frame pointer optimization is saving: this
>> now takes 0.43 seconds with clang and 0.87 seconds with gcc. It
>> is weird that the gcc time increased from 0.65 seconds to 0.87
>> despite doing less. After adding back the volatiles, the times
>> are 0.43 seconds with clang and 0.85 seconds with gcc -- doing
>> more gave a small optimization, but didn't recover 0.65 seconds.
>> There is apparently some magic alignment or misalignment which
>> costs or saves about the same as omitting the frame pointer.
>> Finally, with gcc -O -fomit-frame-pointer, the program takes 0.60
>> seconds, and with gcc -O2 -fomit-frame-pointer, it takes 0.49
>> seconds, and with gcc -O2, it takes 0.49 seconds (this really doesn't
>> omit frame pointers, so omitting the frame pointer saves nothing).
>> With cc -O -fno-omit-frame-pointer, it takes 0.43 seconds, but this
>> case is just broken -- the -fno-omit-frame-pointer is silently ignored :-(.

> I do not believe this measurement is indicative.

Yes, since this program is too simple to be representative.

> i386 is
> register-starved architecture. Using the frame pointer means that
> you are left with only 6 registers instead of 7. For the PIC code,
> there are 5 vs. 6. It is real code that does something more than
> incrementing the same variable which could get the performance hit with
> -fno-omit-frame-pointer for i386. But on i386 use of the frame pointer
> is ABI mandated.

Register starvation is another thing that makes very little difference.
But here is another non-representative program that goes to the opposite
extreme to get register starvation. The result is that omitting the
frame pointer is a small pessimization on i386 and makes no difference
on amd64:

main.c:
% int foo(int, int, int, int, int, int, int, int, int, int);
%
% volatile int mf;
%
% int
% main(void)
% {
%     int i;
%
%     for (i = 0; i < 100; i++)
%         mf += foo(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
% }

bar.c:
% int
% bar(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j)
% {
%     int n, r;
%
%     r = 0;
%     for (n = 0; n < 1000; n++) {
%         r += a + b + c + d + e + f + g + h + i + j;
%         a += b;
%         b = -b;
%         c += d;
%         d = -d;
%         e += f;
%         f = -f;
%         g += h;
%         h = -h;
%         i += j;
%         j = -j;
%     }
%     return (r);
% }

foo.c:
% int bar(int, int, int, int, int, int, int, int, int, int);
%
% int
% foo(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j)
% {
%     int n, r;
%
%     r = 0;
%     for (n = 0; n < 1000; n++)
%         r += bar(a, b, c, d, e, f, g, h, i, j);
%     return (r);
% }

i386 times on ref10-i386:
gcc -O2 -o f main.c bar.c foo.c:                          0.81 seconds
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  0.81 seconds
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:     0.83 seconds
cc -O2 -o f main.c bar.c foo.c:                           1.11 seconds
cc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:   1.11 seconds
cc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:      1.11 seconds

0.81 seconds is 15.08 cycles/iteration in the inner loop.
0.83 seconds is 15.45 cycles/iteration.
1.11 seconds is 20.67 cycles/iteration.

The inner loop has 12 variables. Since i386 has only 6 or 7 integer
registers, these can't all be kept in registers. Checking the generated
code in the inner loop in bar() shows that the source code is complicated
enough to prevent significant optimizations.
It has the following number of memory references:

gcc -O2 -o f main.c bar.c foo.c:                          7(r) 1(w) 4(r+w) = 16
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  not counted
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:     7(r) 1(w) 6(r+w) = 20
cc -O2 -o f main.c bar.c foo.c:                           13(r) 13(w) 2(r+w) = 30
cc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:   not counted
cc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:      not counted

Although gcc -fomit-frame-pointer gives 4 fewer memory references, it
runs 0.37 cycles/iteration slower. Apparently the extra memory references
are always executed in parallel, so they are free. References relative
to %ebp are 1 byte smaller than ones relative to %esp. Apparently this
is enough to avoid 0.37 cycles of penalties for the larger instructions.

cc (clang) generates remarkably bad code for this. This is shown by
all metrics:
- more instructions
- less use of read-modify-write instructions. Where gcc generates lots
  of addl's to memory variables, clang likes to load the memory variables,
  add to them, and write them back. In theory this is no slower unless
  the larger instruction space is too large, since it takes the same
  number of memory references and should reduce to almost the same
  micro-ops. But the way clang does it somehow also gives almost twice
  as many memory references (30 instead of 16).
- many more memory references
- lots of spills which are carefully annotated by clang. It documents
  15 spills and 13 reloads.
amd64 times on ref10-amd64:
gcc -O2 -o f main.c bar.c foo.c:                          0.37 seconds
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  0.37 seconds
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:     0.37 seconds
cc -O2 -o f main.c bar.c foo.c:                           0.41 seconds
cc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:   0.41 seconds
cc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:      0.41 seconds

Now both compilers generate very nice code for bar(), with the variables
in the inner loop all in registers, and not so nice code for foo() (it
takes just 1 more instruction for all the arithmetic in the inner loop
in bar() than for calling bar()). clang somehow finds a way to be
slower here too.

-fomit-frame-pointer now makes no difference at all, since it only takes
about 3 extra instructions for every call to bar(). I made the inner
loop in bar() too heavyweight to test it (oops). Another oops is that
I forgot to modify the modification of the variables in bar(). It gets
optimized away, and the code in bar() is only so very nice because bar()
reduces to adding up the 10 variables.

Now with the inner loop in bar() removed and the number of iterations in
foo() multiplied by 1000 to compensate:

amd64 times on ref10-amd64:
gcc -O2 -o f main.c bar.c foo.c:                          0.64 seconds
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  0.76 seconds
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:     0.64 seconds
cc -O2 -o f main.c bar.c foo.c:                           0.67 seconds
cc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:   0.67 seconds
cc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:      0.61 seconds

-fomit-frame-pointer is finally giving an optimization. It does so even
for clang, and this is weird because for clang it only affects the
non-leaf functions main() and foo() for which there are only 1+100 frame
pointer initializations and finalizations. Having a frame pointer is
apparently pessimizing foo() by changing the instructions that it uses
to call bar().
i386 times on ref10-i386:
gcc -O2 -o f main.c bar.c foo.c:                          1.24 seconds
gcc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:  1.24 seconds
gcc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:     1.16 seconds
cc -O2 -o f main.c bar.c foo.c:                           1.11 seconds
cc -O2 -o f main.c bar.c foo.c -fno-omit-frame-pointer:   1.11 seconds
cc -O2 -o f main.c bar.c foo.c -fomit-frame-pointer:      1.23 seconds

Now the frame pointer in bar() has obvious costs for gcc. But the change
in the time for clang is even weirder than the changes above: for clang,
the frame pointer is always omitted in the leaf function bar(); there
are only 1+100 frame pointer initializations and finalizations, as for
clang on amd64. On amd64, the frame pointer in foo() gave an optimization
from 0.67 to 0.61 seconds, but on i386 it gives a pessimization from 1.11
seconds to 1.23 seconds.

> For amd64, there is no so high pressure on the register file, but I do
> not know that much debugging tools which expect the frame pointer on
> amd64 or could detect and use it if present. It is only ddb for our
> kernel and dtrace for solaris and freebsd, gdb definitely does not.

I was originally going to say that the number of registers is almost as
irrelevant for performance as -fomit-frame-pointer :-). amd64 is
typically slightly slower than i386 despite its extra registers and
"optimized" ABI with parameters passed in registers, since the
optimizations aren't quite enough to recover from the bloat of 64-bit
pointers and function parameters. But my non-representative examples
made amd64 about twice as fast. And to demonstrate loss due to the
frame pointer increasing register pressure significantly, I should have
tried an example where everything fits in registers with
-fomit-frame-pointer but not without. Such an example would be even
more non-representative.

Bruce