Date:      Sun, 5 Mar 2017 15:40:45 +0100
From:      Jilles Tjoelker <jilles@stack.nl>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        Konstantin Belousov <kostikbel@gmail.com>, John Baldwin <jhb@freebsd.org>, Pedro Giffuni <pfg@freebsd.org>, Slawa Olhovchenkov <slw@zxy.spb.ru>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r314669 - head/sys/i386/conf
Message-ID:  <20170305144045.GA15347@stack.nl>
In-Reply-To: <20170305111842.K1472@besplex.bde.org>
References:  <201703041504.v24F4HMh023937@repo.freebsd.org> <D81029FA-61CF-4648-A2A8-8570DEF28B14@FreeBSD.org> <20170304211611.GW2092@kib.kiev.ua> <1951800.W2d2k3eamI@ralph.baldwin.cx> <20170304231822.GA30979@kib.kiev.ua> <20170305111842.K1472@besplex.bde.org>

On Sun, Mar 05, 2017 at 11:35:26AM +1100, Bruce Evans wrote:
> On Sun, 5 Mar 2017, Konstantin Belousov wrote:

> > On Sat, Mar 04, 2017 at 02:54:56PM -0800, John Baldwin wrote:
> >> On Saturday, March 04, 2017 11:16:11 PM Konstantin Belousov wrote:
> >>> On Sat, Mar 04, 2017 at 03:49:52PM -0500, Pedro Giffuni wrote:
> >>>> The number came out from an old posting involving buildworld times, which I can't find now :(.
> >>>> Things seem to have changed a lot: it was surely using GCC back then, I don't believe clang does much distinction about 486 at all.
> >>>>
> >>>> BTW, does it make sense to keep i586 in the configuration still? Both i486 and i586 were once removed but later re-instated in r205336.
> >>>>
> >>> What did make a significant impact on 32bit shared libraries some time ago
> >>> was to compile them with -mtune=i686. The default PIC prologue effectively
> >>> neutered the return stack predictor, adding unnecessary overhead to already
> >>> expensive PIC code. I think that this is even measurable, i.e. it might
> >>> give >= 5% of difference.

> I now use -mtune=athlon-xp (and no other -m CFLAGS) to get the same effect
> with minor additional optimizations/pessimizations for athlon-xp.  I forget
> if I switched to this before or after getting jhb to not use
> -mtune=pentiumpro.  Maybe we didn't know at the time that the pentiumpro
> optimizations affected little more than this PIC problem.

> >>> I did not recheck modern compilers WRT the generated PIC code,
> >>> but I doubt that this has changed recently.

> >>> Several notes: -mtune is not -march, i.e. the code would still be targeted
> >>> for the 486 instruction set, but scheduling is optimized for more modern CPUs.
> >>> Also, recent gcc puts specific meaning into -mtune=i686, interpreting it
> >>> as request for scheduling for generic modern CPUs.  We already compile
> >>> 32bit compat libs on amd64 with -march=i686.

> >>> Working on this stuff would be much more useful than tweaking
> >>> kernel config for CPU detection.

> >> Hmm, I originally wanted to use -mtune=i686 (spelled as
> >> -mcpu=pentiumpro) on i386 builds for this reason, but I removed it
> >> at bde@'s request in r125252. I would be happy to go back to adding
> >> -mtune for i386 when CPUTYPE isn't specified.

> > I just rechecked.
> > gcc, at least 4.9 and 6.3, generates the 'right' prologue, i.e.
> > 	call	__x86.get_pc_thunk.cx (ecx or whatever register
> > 					which is used to address GOT)
> > __x86.get_pc_thunk.cx:
> > 	movl	(%esp), %ecx
> > 	ret
> > even when compiling for -march=i486.

> > OTOH, clang 3.9.1 uses
> > 	calll	.L0
> > .L0:	popl	%eax
> > to get the base even for native nehalem and newer CPUs.

> > So indeed there is no reason to bother. gcc has become too good to require any
> > tuning, and clang generates suboptimal code even when hinted.  I did not
> > check 4.0.

> The old method might actually be best for the original i386.  It is
> 1 byte larger per call, but 1 instruction shorter by dynamic count.
> Original i386 has poor instruction fetch bandwidth and no caches
> to help or harm.  Even the (%esp) address mode can cost a cycle on
> original i386, and setting up a frame pointer to access the stack would
> be much worse, while on modern x86 the frame pointer might cost nothing
> since it can be done in parallel.

> Maybe some newer CPUs have better return address predictors so the old
> method is best for them too.

> It is only clear that generic optimizations should use the new method,
> since CPU manufacturers won't pessimize it since it looks like a normal
> function call.

I tried a naive benchmark of a million iterations on a

CPU: Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (2993.26-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x306a9  Family=0x6  Model=0x3a  Stepping=9

and a subroutine containing the old method is almost as fast as an empty
subroutine (3.8 ns including loop overhead), while a subroutine
containing the new method is slower (6.3 ns including loop overhead).
Apparently this CPU knows that a call to the next instruction will not
be returned to.
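
A benchmark of this kind could be sketched roughly as follows. This is
not the program that produced the numbers above. The names sub_empty,
sub_old, sub_new and ns_per_call are made up for illustration, and the
sketch assumes GCC or Clang on an ELF x86 system (i386 or amd64). On
other architectures the three subroutines fall back to empty C
functions, so the comparison is only meaningful on x86.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical names throughout; only the two instruction sequences
 * come from the thread. Register width depends on the target. */
#if defined(__i386__)
#define POP_AX  "pop %eax\n"
#define LOAD_CX "mov (%esp), %ecx\n"
#elif defined(__x86_64__)
#define POP_AX  "pop %rax\n"
#define LOAD_CX "mov (%rsp), %rcx\n"
#endif

void sub_empty(void);
void sub_old(void);
void sub_new(void);

#if defined(POP_AX)
/* Written in top-level assembly so the compiler cannot alter them. */
__asm__(
    ".pushsection .text\n"
    ".globl sub_empty\n"
    "sub_empty:\n"              /* baseline: just return */
    "    ret\n"
    ".globl sub_old\n"
    "sub_old:\n"                /* old method: call next insn, pop PC */
    "    call 1f\n"
    "1:  " POP_AX
    "    ret\n"
    ".globl sub_new\n"
    "sub_new:\n"                /* new method: call a thunk that reads */
    "    call 2f\n"             /* its own return address and returns  */
    "    ret\n"
    "2:  " LOAD_CX
    "    ret\n"
    ".popsection\n"
);
#else
/* Non-x86 fallback so the file still builds; timings mean nothing here. */
void sub_empty(void) {}
void sub_old(void) {}
void sub_new(void) {}
#endif

/* Average nanoseconds per call over iters calls, loop overhead included. */
double ns_per_call(void (*fn)(void), long iters)
{
    struct timespec t0, t1;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
        (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}
```

Calling ns_per_call(sub_old, 1000000) and ns_per_call(sub_new, 1000000)
and comparing both against ns_per_call(sub_empty, 1000000) gives the
kind of per-call figures quoted above. The absolute values depend on
the CPU, the clock source and the loop the compiler generates.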

Likewise, the old method is faster on a

hw.model: Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz

On the other hand, the old method is slower on an old

hw.model: Intel(R) Core(TM)2 CPU          6400  @ 2.13GHz

Apparently, LLVM has decided to trade considerably worse performance on
CPUs such as Core2 for better performance on newer and older CPUs. What
is somewhat surprising is that there is no way to use the GCC method.
Then again, LLVM does not support bypassing the PLT either (loading the
address from the GOT directly).

Given that the old method is quite commonly used, it does not seem very
likely that CPU manufacturers will pessimize it in future.

-- 
Jilles Tjoelker


