Date:      Mon, 5 Sep 2016 01:56:48 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        src-committers@FreeBSD.org, svn-src-all@FreeBSD.org,  svn-src-head@FreeBSD.org
Subject:   Re: svn commit: r305382 - in head/lib/msun: amd64 i387
Message-ID:  <20160905012859.L6221@besplex.bde.org>
In-Reply-To: <20160904144859.GC83214@kib.kiev.ua>
References:  <201609041222.u84CMEdM033135@repo.freebsd.org> <20160904144859.GC83214@kib.kiev.ua>

On Sun, 4 Sep 2016, Konstantin Belousov wrote:

> On Sun, Sep 04, 2016 at 12:22:14PM +0000, Bruce Evans wrote:
> ...
>> Log:
>>   Add asm versions of fmod(), fmodf() and fmodl() on amd64.  Add asm
>>   versions of fmodf() and fmodl() on i387.
>> ...
> It seems that wrong version of i387/f_fmodf.S, it is identical to the
> amd64 version.

Indeed.  Fixed.

>> Added: head/lib/msun/amd64/e_fmod.S
>> ==============================================================================
>> --- /dev/null	00:00:00 1970	(empty, because file is newly added)
>> +++ head/lib/msun/amd64/e_fmod.S	Sun Sep  4 12:22:14 2016	(r305382)
>> +ENTRY(fmod)
>> +	movsd	%xmm0,-8(%rsp)
>> +	movsd	%xmm1,-16(%rsp)
>> +	fldl	-16(%rsp)
>> +	fldl	-8(%rsp)
>> +1:	fprem
>> +	fstsw	%ax
>> +	testw	$0x400,%ax
>> +	jne	1b
>> +	fstpl	-8(%rsp)
>> +	movsd	-8(%rsp),%xmm0
>> +	fstp	%st
>> +	ret
>> +END(fmod)
>
> I see that this is not a new approach in the amd64 subdirectory, to use
> x87 FPU on amd64.  Please note that it might have non-obvious effects on
> the performance, in particular, on the speed of the context switches and
> handling of #NM exception.

For long double functions, the i387 gets used anyway.

This function is very slow even with the i387.  It takes about 500
cycles per call on args uniformly distributed in double precision
space, but this distribution is very non-average since it gives many
huge args.  The loop iterates many times on huge args.
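
To make "iterates many times" concrete, here is a rough sketch in C.  It
assumes the documented fprem behaviour that one execution reduces the
exponent of the dividend by at most 63, so it only gives a lower bound
on the number of passes through the loop; real hardware may take more.
The function name is made up for illustration, and the args are assumed
to be finite and nonzero.

```c
#include <math.h>

/*
 * Lower bound on the number of fprem executions needed for fmod(x, y).
 * Assumption: each fprem reduces the exponent difference by at most 63
 * and finishes once the difference is below 64.
 * Only meaningful for finite, nonzero x and y.
 */
int
fprem_min_passes(double x, double y)
{
	int ex, ey, d;

	(void)frexp(x, &ex);	/* |x| = m * 2**ex, 0.5 <= m < 1 */
	(void)frexp(y, &ey);
	d = ex - ey;		/* exponent difference */
	if (d < 64)
		return (1);	/* one pass completes the reduction */
	return ((d - 1) / 63 + 1);
}
```

For double args the exponent difference can exceed 2000, so even this
lower bound reaches a few dozen passes for the worst cases.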

This is still better than the C code, which takes 3 or more times longer,
or > 1500 cycles.  It does a loop on the bits using integer code.  The
C code is relatively even slower when there are fewer bits (something
like 9 times slower for args uniformly distributed in float precision
space).
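
The bit-at-a-time idea can be sketched as follows.  This is not the
msun code: it is a simplified shift-and-subtract loop written with
floating point operations instead of integer code, restricted to
x >= 0 and y > 0 (both finite), and the function name is hypothetical.
It does show why the cost grows with the exponent difference: the loop
runs once per bit of that difference.

```c
#include <assert.h>
#include <math.h>

/*
 * Simplified shift-and-subtract remainder (illustration only).
 * Assumes x >= 0, y > 0, both finite.  Every subtraction and halving
 * below is exact, so the result equals the exact remainder.
 */
double
fmod_shift_subtract(double x, double y)
{
	double d;
	int ex, ey, i;

	assert(isfinite(x) && isfinite(y) && x >= 0 && y > 0);
	if (x < y)
		return (x);
	(void)frexp(x, &ex);
	(void)frexp(y, &ey);
	d = ldexp(y, ex - ey);	/* y scaled up to the exponent of x */
	for (i = ex - ey; i >= 0; i--) {
		if (x >= d)
			x -= d;	/* one quotient bit is 1 */
		d *= 0.5;	/* move on to the next lower bit */
	}
	return (x);
}
```

For example, fmod_shift_subtract(10.0, 3.0) steps d through 12, 6 and 3
and returns 1.0.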

> Newer Intel and possibly AMD CPUs have an optimization which allows
> coprocessor code to save and restore state to not save and restore state
> which was not changed.  In other words, for typical amd64 binary which
> uses %xmm register file but did not touched %st nor %ymm, only %xmm
> bits are spilled and then loaded.  Touching %st defeats the optimization,
> possible for the whole lifetime of the thread.
>
> This feature (XSAVEOPT) is available at least starting from Haswell
> microarchitecture, not sure about IvyBridge.

Isn't the i387 space too small to matter much?  There should be the
same number of #NM's and just 100 bytes extra to save.  Avoiding use
of larger register sets by using only the i387 might save more :-).

The other amd64 asm uses of the i387 for floats and doubles are:
- 3 files for remainder and 3 files for remquo.  Needed for the same
   reason as for fmod
- s_scalbn.S, s_scalbnf.S.  To use i387 fscale.  Probably a mistake.
   The functions themselves are too slow to be very useful too.  libm
   almost never uses them internally, and in optimized functions like
   exp* the exponent scaling is done inline using special integer code.
   I have spent many hours fighting the compiler to stop it pessimizing
   the memory accesses to give pipeline stalls for this integer code.
   Using fscale probably tends to give another type of pipeline stall.
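
As an illustration of exponent scaling with integer code (not the code
actually used in exp*), a minimal sketch assuming IEEE 754 binary64
doubles: build 2**k by writing the biased exponent directly into the
bit pattern, then multiply.  The function names are hypothetical, and k
is assumed to be in the normal range -1022 <= k <= 1023.

```c
#include <stdint.h>
#include <string.h>

/*
 * Build 2**k as a double by setting only the exponent field.
 * Assumes IEEE 754 binary64 and -1022 <= k <= 1023 (normal range).
 */
double
pow2_int(int k)
{
	uint64_t bits;
	double d;

	bits = (uint64_t)(k + 1023) << 52;	/* biased exponent, zero mantissa */
	memcpy(&d, &bits, sizeof(d));		/* type-pun without aliasing problems */
	return (d);
}

/* Scale x by 2**k with one multiplication instead of fscale or scalbn(). */
double
scale_by_pow2(double x, int k)
{
	return (x * pow2_int(k));
}
```

The memcpy is the portable way to express the reinterpretation; whether
the compiler turns it into a direct register move or a store and reload
is exactly the kind of memory access issue mentioned above.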

I plan to remove many more i387 uses on i386, but there aren't many more
on amd64.

Bruce


