Date:      Wed, 17 May 2017 12:49:45 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Steve Kargl <sgk@troutmask.apl.washington.edu>
Cc:        freebsd-numerics@freebsd.org
Subject:   Re: Implementation of half-cycle trignometric functions
Message-ID:  <20170517094848.A52247@besplex.bde.org>
In-Reply-To: <20170516224618.GA40855@troutmask.apl.washington.edu>
References:  <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428175851.A1386@besplex.bde.org> <20170516224618.GA40855@troutmask.apl.washington.edu>

On Tue, 16 May 2017, Steve Kargl wrote:

> ...
> I have removed my new kernels and written shims to
> use the fdlibm trig kernels.  I know that in some
> cases this leads to slower code, but a small (if not
> tiny) improvement in accuracy.  This also leads to
> a significant reduction in code size.  A couple of
> notes:

Still quite large.

> Index: lib/msun/ld128/k_cospil.c

This is a header file, not a .c file or even a self-sufficient kernel,
since it only compiles when included by C files which declare the things
that it uses.

My kernels for exp are header files and are also self-sufficient (except
they only define static functions and objects, so are useless unless they
are included).

> +static inline long double
> +__kernel_cospil(long double x)
> +{
> +	long double hi, lo;
> +	hi = (double)x;

Style (missing blank line after declarations).

> Index: lib/msun/ld128/k_sinpil.c
> ...
> +	long double hi, lo;
> +	hi = (double)x;

Style.

> Index: lib/msun/ld128/s_cospil.c
> ...
> +static const long double
> +pihi = 3.14159265358979322702026593105983920e+00L,
> +pilo = 1.14423774522196636802434264184180742e-17L;

These are not in normal format, and are hard to read.  I can't see if
pihi has the correct number of zero bits for exact multiplication.

These don't have the normal spelling.  fdlibm never uses pihi or pio2hi,
or pi_hi.  It often uses pio2_hi and other pio2_*.  My s_sinpi.c uses
pi_hi.


> +long double
> +cospil(long double x)
> +{
> +	volatile static const double vzero = 0;
> +	long double ax, c, xf;
> +	uint32_t ix;
> +
> +	ax = fabsl(x);
> +

Style (lots of extra blank lines).

> +	if (ax < 0x1p112) {
> +
> +		xf = floorl(ax);
> +
> +		ax -= xf;
> +

These extra blank lines are especially bad, since they separate related
code.

> +		if (xf > 0x1p50) xf -= 0x1p50;
> +		if (xf > 0x1p30) xf -= 0x1p30;

Style.

> Index: lib/msun/ld128/s_sinpil.c

Similarly.

> Index: lib/msun/ld128/s_tanpil.c

Similarly.

> + * FIXME: This should use a polynomial approximation in [0,0.25], but the
> + * FIXME: increase in max ULP is probably small, so who cares.

Doesn't it already use the standard kernel, which uses a poly approx?

Splitting up [0, Pi/4] doesn't work very well for reducing the degree of
the poly.  The current poly has degree about 56 (maybe 52 or 54), and I
couldn't reduce it to below 40 using a reasonably large interval.  I think
it can be reduced to degree about 10 using intervals of length 1/128 as
for logl(), but unlike for logl() this obvious way of doing this needs
about 128 different polys.

> +static inline long double
> +__kernel_tanpi(long double x)
> +{
> +	long double hi, lo, t;
> +
> +	if (x < 0.25) {
> +		hi = (double)x;
> +		lo = x - hi;
> +		lo = lo * (pilo + pihi) + hi * pilo;
> +		hi *= pihi;
> +		_2sumF(hi, lo);
> +		t = __kernel_tanl(hi, lo, -1);

Oops, I forgot that this 0.25 is really Pi/4 when scaled.  It is impossible
to use 1 poly over the whole range, since the poly would need to have even
higher degree than 50+, and would still have bad numeric properties, so
__kernel_tan*() uses special methods.

> +long double
> +tanpil(long double x)
> +{
> +	volatile static const double vzero = 0;
> +	long double ax, xf;
> +	uint32_t ix;
> +
> +	ax = fabsl(ax);
> +
> +	if (ax < 1) {
> +		if (ax < 0.5) {
> +			if (ax < 0x1p-60) {
> +				if (x == 0)
> +					return (x);
> +				ax = (double)x;

Style (unnecessary cast).

> +				x -= ax;
> +				t = pilo * x + pihi * x + pilo * ax
> +				    + pihi * ax;

Style:
- unnecessary line splitting
- gnu style for placement of the operator on a new line.  If not following
   fdlibm style, use strict KNF.

> +				return (t);
> +			}
> +			t = __kernel_tanpil(ax);

For most long double functions, we put everything except the kernels in
the same files, with ifdefs for LDBL_MANT_DIG.  Here even the hi+lo
decomposition and multiplication by Pi can be in a kernel.  The style
bugs are especially painful to read when sextuplicated.  (The ld80
variants are quite different due to being better optimized, but this
is a different style bug.)

> Index: lib/msun/ld80/k_cospil.c

No more comments on sextuplicated style bugs.

> Index: lib/msun/ld80/s_cospil.c
> ...
> +		if (ix < 0x3ffe)			/* |x| < 0.5 */
> +			c = __kernel_sinpil(0.5 - ax);
> +		else if (lx < 0xc000000000000000ull) {	/* |x| < 0.75 */

Style bug (spelling of the long long abomination using "ull").  The
explicit abomination is unfortunately needed by old versions of gcc
with CFLAGS asking for nearly C90.

> +			if (ax == 0.5)
> +				RETURNI(0);

Perhaps pessimal to not test for 0.5 in bits.

It is pessimal to avoid generating inexact for 0.5 at all.  I don't see
a much better way.

> +	if (ix < 0x403e) {		/* 1 <= |x| < 0x1p63 */
> +		/* Determine integer part of ax. */
> +		j0 = ix - 0x3fff + 1;
> +		if (j0 < 32) {
> +			lx = (lx >> 32) << 32;
> +			lx &= ~(((lx << 32)-1) >> j0);
> +		} else {
> +			uint64_t m = (uint64_t)-1 >> (j0 + 1);
> +			if (lx & m) lx &= ~m;

Style:
- nested declaration
- initialization in declaration
- half the code in the initialization
- no newline for statement after 'if'

> +		}
> +		INSERT_LDBL80_WORDS(x, ix, lx);
> +
> +		ax -= x;
> +		EXTRACT_LDBL80_WORDS(ix, lx, ax);

I don't know if this is a good method.  INSERT/EXTRACT are slow, so I
doubt that it is.  In e_lgamma*.c:sin_pi*(), we use only a single GET
of the high word.  This function already used EXTRACT/INSERT to classify,
and 2 more here.  (e_lgamma*.c used GET to classify before calling
sin_pi*(), then 1 more GET.)  INSERT is especially slow.

Care is needed near the 0x1p63 boundary, since the fast methods require
adding approx 0x1p63.  At or above the boundary, these methods don't
obviously work.  Perhaps they give the correct result starting at 2 times
the boundary, except for setting inexact.  Ugh.  Do you avoid the fast
methods since they set inexact for most half-integers?  That costs more
than avoiding inexact for 0.5.

tanpi(0.25) is exact too, so quarter-integers need special treatment too
if you really want to avoid all spurious inexacts.

I now see that some of the above complications are from inlining floor().
floor() can't use the fast method since it is careful about inexact, and
the spec may require this.  It actually sets inexact whenever the result
is not an integer.  So to use it for sinpi() without getting spurious
inexact, the arg must be multiplied by 2 first, and for tanpi(), multiply
by 4 first.  This is excessive.

> Index: lib/msun/ld80/s_sinpil.c

Similarly.

> Index: lib/msun/ld80/s_tanpil.c

Similarly.

> Index: lib/msun/src/k_cospi.c
> ...
> +/*
> + * The basic kernel for x in [0,0.25].  To exploit the kernel for cos(x),
> + * the argument to __kernel_cospi() must be multiply by pi.  As pi is an
> + * irrational number, this allows the splitting of pi*x into high and low
> + * parts.
> + */

Allows?

> +
> +static inline double
> +__kernel_cospi(double x)
> +{
> +	double hi, lo;
> +	hi = (float)x;

_2sumF is extensively documented to not work with double.  It requires
double_t here.

Many style bugs are actually duodecimally (?) tuplicated (3 copies for ld128,
3 for ld80, 3 for double, 3 for float).

> Index: lib/msun/src/k_sinpi.c

Similarly.

> Index: lib/msun/src/math.h
> ===================================================================
> --- lib/msun/src/math.h	(revision 318383)
> +++ lib/msun/src/math.h	(working copy)
> @@ -500,6 +500,15 @@
>
> #if __BSD_VISIBLE
> long double	lgammal_r(long double, int *);
> +double		cospi(double);
> +float		cospif(float);
> +long double	cospil(long double);
> +double		sinpi(double);
> +float		sinpif(float);
> +long double	sinpil(long double);
> +double		tanpi(double);
> +float		tanpif(float);
> +long double	tanpil(long double);
> #endif

Should be sorted into the double, float and long double sections.

> Index: lib/msun/src/s_cospi.c

> +/**
> + * cospi(x) computes cos(pi*x) without multiplication by pi (almost).  First,

You mean with multiplication by pi in extra precision.

See s_sinpi.c for more.

> Index: lib/msun/src/s_cospif.c

Similarly.

> Index: lib/msun/src/s_cospif.c

> +#define	INLINE_KERNEL_SINDF
> +#define	INLINE_KERNEL_COSDF

Possibly not worth inlining.

> +#define  __kernel_cospif(x)   ((float)__kernel_cosdf(M_PI * (x)))
> +#define  __kernel_sinpif(x)   ((float)__kernel_sindf(M_PI * (x)))

Style bugs (not even corrupt tabs, but 2 spaces after #define and 3 spaces
instead of a second tab).

Bogus casts.  The kernels already return float, although that is pessimal
and destroys the extra precision that they naturally have.

The cost of avoiding inexact is clearer here.  Returning the non-kernel
sin(M_PI * (x)), etc. with no classification would probably be faster than
the slow reduction done here (except for large x).

See s_sinpi.c for more.

> Index: lib/msun/src/s_sinpi.c
> ...
> +double
> +sinpi(double x)
> +{
> +	volatile static const double vzero = 0;
> +	double ax, s;
> +	uint32_t hx, ix, j0, lx;
> +
> +	EXTRACT_WORDS(hx, lx, x);
> +	ix = hx & 0x7fffffff;
> +	INSERT_WORDS(ax, ix, lx);

A slow way to set ax to fabs(x).

This is otherwise much slower than sin_pi() for lgamma().  That started
with just a GET of hx, while this EXTRACTs 2 words, like floor(), apparently
to avoid spurious inexact.

Note that lgamma(0.5) is naturally inexact.  lgamma(integer) should be exact,
but I doubt that it is.  Maybe we broke it.

> +
> +	if (ix < 0x3ff00000) {			/* |x| < 1 */
> +		if (ix < 0x3fd00000) {		/* |x| < 0.25 */
> +			if (ix < 0x3e200000) {	/* |x| < 0x1p-29 */
> +				if (x == 0)
> +					return (x);
> +				INSERT_WORDS(ax, hx, 0);
> +				x -= ax;

Flipping minus signs to reduce the number of cases for odd functions
is a bad way to do things.  I speeded up several functions by 4-10
cycles (5-20%) by avoiding such flipping.  Here there is a dependency
on ax and a slower than usual calculation of it.  (On modern x86,
fabs would take ~2 cycles latency and ~1/2 cycle throughput, while
going through these extractions and insertions might take 10 cycles
latency at best and 50 or more with store-to-load mismatch penalties
on not so modern x86.)

Apart from being slower, sign flipping gives different rounding in
non-default modes.  C99 says that sin() is odd, so may require the
flipping to break the rounding modes intentionally.  If so, then the
bug is in C99.  I believe that it intentionally doesn't require
avoiding spurious inexact because this would be onerous.  It also
doesn't require sin() to be rounded in the current rounding mode.
Requiring sin() to be perfectly odd is similarly onerous.

On Haswell, fdlibm floor() takes about 33 cycles and fdlibm sin() about
44 cycles below 2*Pi, so any sinpi() that uses floor() or its internals
is going to be about twice as slow as sin() because its classification
is too slow.

> +	if (ix < 0x43300000) {			/* 1 <= |x| < 0x1p52 */
> +		/* Determine integer part of ax. */
> +		j0 = ((ix >> 20) & 0x7ff) - 0x3ff;
> +		if (j0 < 20) {
> +			ix &= ~(0x000fffff >> j0);
> +			lx = 0;
> +		} else {
> +			lx &= ~((uint32_t)0xffffffff >> (j0 - 20));
> +		}
> +		INSERT_WORDS(x, ix, lx);
> +
> +		ax -= x;
> +		EXTRACT_WORDS(ix, lx, ax);

Much slower than what lgamma()'s sin_pi() does.  After our optimizations,
that does basically rint() inline using ax + 0x1p52 - 0x1p52, then a
further += 0x1p50 to get octants instead of half-cycles.  This takes
approx 16 cycles latency on modern x86, which is a lot.  I think amd64
can do the above in 10-20 cycles, while i386 might take 20-40 or much
longer with store-to-load mismatch penalties.

I think we can avoid the spurious inexacts with a little care.  When
ax is an integer (and < 0x1p52), we can add at least 0x1p51 to it exactly.
We also want to avoid inexact for half-integers for sinpi and cospi, and
quarter-integers for tanpi, and it is good to reduce to quadrants anyway.
So we might initially multiply by 4 and do the += 0x1p52 trick only if
4*ax < 0x1p52 (or whatever the safe threshold is).  Efficiency doesn't
matter for larger ax, so slow methods using things like floor(16*ax)
are adequate, or we can reduce by a few powers of 2 using special cases:

 	v = 4 * ax;
 	if (v >= 0x1p54)
 		;		/* ax is an even integer (check factor) */
 	else {
 		if (v >= 0x1p53)
 			v -= 0x1p53;
 		if (v >= 0x1p52)
 			v -= 0x1p52;
 	}

I will finish this for lgamma*().  The fdlibm version does multiplications
by 2 and we seem to have removed a bit too much of that method.

> +	/*
> +	 * |x| >= 0x1p52 is always an integer, so return +-0.
> +	 * FIXME: should this raise FE_INEXACT or FE_INVALID?
> +	 */

The result is always +0, exact and valid (maybe -0 in unsupported
unusual rounding modes).  We just did a lot of work to avoid inexact.

> +	if (ax + 1 > 1)
> +		ax = copysign(0, x);

An even slower way of setting the sign than INSERT, especially if
the compiler doesn't inline copysign().  Using the correct +0 fixes
this.

We might have classified slightly smaller values as odd integers and
have to select the sign for them.  This is done differently now.

> Index: lib/msun/src/s_sinpif.c
> ...
> Index: lib/msun/src/s_tanpi.c
> ...
> Index: lib/msun/src/s_tanpif.c
> ...

Similarly.

Bruce


