Date:      Thu, 26 Mar 2015 12:50:48 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Mateusz Guzik <mjguzik@gmail.com>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Bruce Evans <brde@optusnet.com.au>
Subject:   Re: svn commit: r280407 - head/sys/kern
Message-ID:  <20150326103908.G993@besplex.bde.org>
In-Reply-To: <20150325201524.GB14280@dft-labs.eu>
References:  <201503240010.t2O0ACZb089906@svn.freebsd.org> <20150324135350.N1665@besplex.bde.org> <20150325201524.GB14280@dft-labs.eu>

On Wed, 25 Mar 2015, Mateusz Guzik wrote:

> On Tue, Mar 24, 2015 at 03:58:14PM +1100, Bruce Evans wrote:
>> On Tue, 24 Mar 2015, Mateusz Guzik wrote:
>>
>>> Log:
>>> filedesc: microoptimize fget_unlocked by getting rid of fd < 0 branch
>>
>> This has no effect.  Compilers optimize to the equivalent of the the
>> unsigned cast hack if this is good.  On x86, it is good since no
>> instructions are needed for the conversion, and the only difference
>> between the code generated by
>>
>> 	if (fd >= fdt->fdt_nfiles)
>>
>> and
>>
>> 	if (fd < 0 || fd >= fdt->fdt_nfiles)
>>
>> is to change from jl (jump if less than) to jb (jump if below) (both
>> jumps over the if clause).  Negative values satisfy jl but not jb.
>
> I would not commit the change if it did not affect generated assembly at
> least on amd64.

Interesting.  All compilers that I tested (gcc-4.2, gcc-4.8 and clang)
pessimize this.  clang generates especially fancy "optimizations" which
actually give pessimizations of 1-4 cycles for time and 15-25 bytes for
space (with larger time penalties from the space pessimization in real
use, where everything doesn't fit in the I-cache).  gcc-4.8 only gives
a 1-cycle time pessimization plus 5-10 bytes of space pessimization for
the extra instructions, with less fancy scheduling that is easier to
untangle to remove the 1-cycle pessimization.  Micro-benchmark program
below.

> if (fd < 0 || fd >= fdt->fdt_nfiles):
> 0xffffffff807d147d <fget_unlocked+13>:	sub    $0x38,%rsp
> 0xffffffff807d1481 <fget_unlocked+17>:	test   %esi,%esi
> 0xffffffff807d1483 <fget_unlocked+19>:	js     0xffffffff807d15b8 <fget_unlocked+328>

objdump output is of low quality, especially on 64-bit arches where the
0xff... numbers are twice as bloated.

This is the extra branch.  In the micro-benchmark, clang pre-loads a
pointer to the volatile return values and jumps to common code to load
the value.  This is its main pessimization relative to gcc.  clang also
generates code that runs into itself.  It generates 2 loads of one of
the pointers.  This is mostly a space pessimization, except it may be
necessary for the rest of the fancy scheduling to cost only 1 cycle.
Here the return value for the error case is constant and the return
value for the other case is variable, so this pessimization is
inhibited.  In the microbenchmark with the return values changed to
non-volatile, clang does a fancier "optimization" involving pre-loading
the values, then conditional moves.  This gives an additional
pessimization of 1 cycle.

> 0xffffffff807d1489 <fget_unlocked+25>:	mov    (%rdi),%rbx
> 0xffffffff807d148c <fget_unlocked+28>:	cmp    %esi,(%rbx)

Here rbx is naturally a pointer, so this is the best comparison instruction.
In the micro-benchmark, at least with the volatile limit, both clang and
gcc-4.8 load a pointer to the limit instead of doing a cmp on the memory
variable.  Old versions of gcc know that this is usually bad, and have
options -fforce-addr and -fforce-mem to change the default of not doing
this.  In the micro-benchmark, this just wastes space.  In real use,
loading the pointer to volatile data or loading non-volatile data is
best if the variable is accessed more than once and the register pressure
is not large.

> 0xffffffff807d148e <fget_unlocked+30>:	jle    0xffffffff807d15bf <fget_unlocked+335>

>
> if ((u_int)fd >= fdt->fdt_nfiles):
> 0xffffffff807d147d <fget_unlocked+13>:	sub    $0x38,%rsp
> 0xffffffff807d1481 <fget_unlocked+17>:	mov    (%rdi),%rbx
> 0xffffffff807d1484 <fget_unlocked+20>:	cmp    %esi,(%rbx)
> 0xffffffff807d1486 <fget_unlocked+22>:	jbe    0xffffffff807d15a8 <fget_unlocked+312>
>
> I did not check other archs prior to the commit.
>
> This is clang 3.6 as present in head. Sources compiled with -O2. Also
> see below for other compiler test.
>
>>> Casting fd to an unsigned type simplifies fd range comparison to mere checking
>>> if the result is bigger than the table.
>>
>> No, it obfuscates the range comparison.
>
> It is a standard hack which is hard to misread and which seems to add a
> slight benefit (see below).

Because the compiler doesn't actually do it.  I'm sure compilers sometimes
do it.  Apparently just when the top limit is constant.
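
To make the constant-limit case concrete, here is a minimal sketch (not
the kernel code).  With a constant upper bound the two forms are
equivalent, and compilers typically compile both to a single unsigned
comparison:

X #define	NFILES	20		/* hypothetical constant limit */
X 
X /* Range check written as the signed pair of tests. */
X static int
X range_ok_signed(int fd)
X {
X 	return (fd >= 0 && fd < NFILES);
X }
X 
X /* Range check written with the unsigned cast hack. */
X static int
X range_ok_unsigned(int fd)
X {
X 	return ((unsigned)fd < NFILES);
X }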

> ...
> It affects assembly on all arm, powerpc64 and mips64 as well. Both arm
> and powerpc just get rid of zero-test and use the same instructions to
> perform the other comparison. I only found a difference on mips64 which
> used sltu instead of slt (but still got rid of the zero-check).

Getting rid of the zero test should give what I want.  The code might
differ due to register allocation.  Are arm and powerpc still using
gcc?

> [..]
>> - fget_locked() seems to be unchanged recently, so it doesn't have the
>>   obfuscation.  fget_unlocked() is newer than the unobfuscated version
>>   of fget_locked().  It is in a different file, but may have copied the
>>   unobfuscated range check from fget_locked().  This commit restores the
>>   obfuscation to it alone.
>
> This definitely should be synced one way or the other, thanks for
> pointing this out.

To the unobfuscated version :-).

> I wrote a toy program and checked e.g. gcc5 and it still did not
> optimise zero-test away.
>
> In fact I would argue the optimisation in question is impossible unless
> upper limit check is against a constant in (0, INT_MAX) range or against
> a var whose range is known at compile time.

Argh.  That must be it, or possibly -fno-wrapv.  I get the same results
with and without wrapv.

> In particular this is problematic for negative values.
>
> Consider:
> int fd, limit;
> ............
> if (fd < 0 || fd >= limit)
>
> Let's have fd = -5 and limit = -4.
>
> Should the fd < 0 check be dropped by the compiler and the expression
> turned into (u_int)fd >= limit, the comparison would be false thus
> changing the outcome.

Wrapv doesn't affect this.  Compilers can use wrapping behaviour internally
for some things, but not for signed comparisons.  E.g., on x86 for
(fd >= limit) they can check the flags after (fd - limit).  The flags are
the same as after cmp(fd, limit).  But the non-wrapping flag (ge) must
be used, not the wrapping flag (ae).
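
Spelling out that fd = -5, limit = -4 example as a standalone sketch
(not code from the tree):

X #include <stdio.h>
X 
X int
X main(void)
X {
X 	int fd = -5, limit = -4;
X 
X 	/* 1: the signed pair of tests rejects the fd. */
X 	printf("signed:   %d\n", fd < 0 || fd >= limit);
X 	/*
X 	 * 0: the unsigned comparison (which is what (u_int)fd >= limit
X 	 * does after the usual arithmetic conversions) accepts it.
X 	 */
X 	printf("unsigned: %d\n", (unsigned)fd >= (unsigned)limit);
X 	return (0);
X }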

> As such, I prefer to keep the cast.

Now I dislike the cast even more.  The sign conversions are especially
delicate.

If the limit were u_int, then the sign conversions would happen without
the cast, and are even more confusing since they are implicit.  The limits
for fd's are intentionally signed, the same as for fd's, since otherwise
there would be lots of confusing automatic conversions and compiler warnings
for some of them.

BTW, sysctl limits like maxfiles and maxfilesperproc are signed.  Foot-
shooting is correctly allowed for these like for most sysctls, so they
can be set to preposterous negative limits as well as to preposterously
large positive ones, unlike for the corresponding rlimit.  So the
comparison (fd < (limit = -4)) may happen in practice.  In practice,
other sign extension bugs usually occur first.  The first reference to
maxfiles* in kern_descrip.c has 3 and a half sign/extension overflow
bugs including one for maxfilesperproc.  It uses min() to convert
maxfilesperproc to u_int.  It does the same for the other arg after
first truncating the arg's value from 64 bits signed to 32 bits signed
(not even to 32 bits unsigned to give defined behaviour and to match
min()'s type).  min() gives a 32-bit unsigned value.  This is assigned
to a td_retval[0], which is 32 bits signed on 32-bit arches and 64-bit
signed on 64-bit arches, so there is only a bug on half of the arches
for the final step.  Problems don't occur in practice because the
rlimit is clamped to well below 0x7fffffff.  Values near RLIM_INFINITY
are allowed for most limits but not the files limit.  Pseudo-infinite
values like 0x7fffffff00000001 would become 1 after truncation.
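
The truncation step is easy to demonstrate in isolation (a standalone
sketch, not the kern_descrip.c code):

X #include <stdint.h>
X #include <stdio.h>
X 
X int
X main(void)
X {
X 	/* Pseudo-infinite 64-bit limit. */
X 	int64_t lim = 0x7fffffff00000001LL;
X 	/*
X 	 * Conversion of an out-of-range value to a signed type is
X 	 * implementation-defined; the usual 2's complement behaviour
X 	 * keeps the low 32 bits, giving 1 here.
X 	 */
X 	int32_t truncated = (int32_t)lim;
X 
X 	printf("%jd -> %d\n", (intmax_t)lim, truncated);
X 	return (0);
X }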

Micro-benchmark program:

X #include <stdio.h>
X 
X volatile int err = 1;
X volatile int ok = 0;
X volatile int toplimit = 20;
X 
X int __noinline
X foo(int x)
X {
X 	if (x < 0 || x >= toplimit)
X 		return (err);
X 	return (ok);
X }
X 
X int
X main(void)
X {
X 	int errcount, i;
X 
X 	errcount = 0;
X 	for (i = 0; i < 266681953; i++)	/* the magic number is tsc_freq/10 */
X 		errcount += foo(i);
X 	printf("errcount = %d\n", errcount);
X }

Since the optimization that I wanted is invalid, this is just a benchmark
of how far away the compiler is from generating optimal code.

Timing on freefall:
clang:   7 cycles
clang:   7 cycles   (with manual editing... didn't help, tended to harm)
clang:   6 cycles   (with above changed to use unsigned hack)
gcc-4.8: 6 cycles
gcc-4.8: 5 cycles   (with manual editing of asm output to get the invalid opt)
gcc-4.8: 5 cycles   (with above changed to use unsigned hack)

This is with -O.  Other "optimizations" are mostly pessimizations:

clang -O2:   no change
gcc-4.8 -O2: 1 cycle slower

clang  with __predict_true():  1 cycle slower
clang  with __predict_false(): 2 cycles slower (slower with the correct pred!)
gcc-4.8 with __predict_true() or __predict_false() in above: no change

Adding __predict_*() to the above with the unsigned hacks makes no further
change.

The bad code generated by clang for the above is:

X 	movl	$err, %eax
X 	testl	%edi, %edi
X 	js	.LBB0_3
X # BB#1:                                 # %lor.lhs.false
X 	movl	toplimit(%rip), %ecx
X 	movl	$ok, %eax
X 	cmpl	%edi, %ecx
X 	jg	.LBB0_3
X # BB#2:                                 # %select.mid
X 	movl	$err, %eax
X .LBB0_3:                                # %return
X 	movl	(%rax), %eax
X 	popq	%rbp
X 	retq

This uses fancy scheduling of a bad method to lose by only 1 cycle
relative to gcc.  The loss is from pre-loading pointers so as to
jump to a common load at the end.  It seems stupid to load $err
twice, but this saves 1 cycle.

__predict_false() gives the following changes:

X --- z.s~	2015-03-26 01:11:17.825573000 +0000
X +++ z.s	2015-03-26 01:10:47.728912000 +0000
X @@ -19,10 +19,10 @@
X  	js	.LBB0_3
X -# BB#1:                                 # %lor.lhs.false
X +# BB#1:                                 # %lor.rhs
X  	movl	toplimit(%rip), %ecx
X -	movl	$ok, %eax
X +	movl	$err, %eax
X  	cmpl	%edi, %ecx
X -	jg	.LBB0_3
X +	jle	.LBB0_3
X  # BB#2:                                 # %select.mid
X -	movl	$err, %eax
X -.LBB0_3:                                # %return
X +	movl	$ok, %eax
X +.LBB0_3:                                # %lor.end
X  	movl	(%rax), %eax

This costs 2 cycles.  It leaves a doubled load of $err which makes no
sense now since the code no longer runs into itself and the second
load is redundant.  The extra instruction might be useful as padding,
but is actually negatively useful.  Removing it reduces the loss to 1
cycle relative to the unpredicted version, by giving essentially the
same code as the unpredicted version with my editing to avoid the
apparently-unnecessary doubled load.

__predict_true() loses only 1 cycle, essentially by producing the above
without the extra instruction.  That is, it takes 8 cycles: 2 more than
gcc and 3 more than the best seen with the unsigned hack.  These cycle
counts intentionally include many cycles of loop and function call
overhead.  I used a function and many volatile variables to prevent most
things being calculated at compile time.  With __inline instead of __noinline,
most or all variants take about 2.7 cycles with clang (I think 2 cycles
is the minimum loop overhead and other operations run in parallel with
the loop and themselves).  gcc-4.8 then takes 2.7 cycles with the
unsigned hack but 3.5 cycles without it.  gcc's not so fancy scheduling
loses instead of wins when it is inlined.  In real code, the optimizer
has even more chances to do fancy scheduling to hide the latency of
extra branches, but 1 more branch may thrash branch prediction caches.
Anyway, you won't be able to measure 1-cycle optimizations in real code.

Bruce


