Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 21 May 2017 09:52:39 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Steve Kargl <sgk@troutmask.apl.washington.edu>
Cc:        freebsd-numerics@freebsd.org
Subject:   Re: Implementation of half-cycle trignometric functions
Message-ID:  <20170521082808.T1142@besplex.bde.org>
In-Reply-To: <20170520145813.GA31395@troutmask.apl.washington.edu>
References:  <20170517094848.A52247@besplex.bde.org> <20170517180927.GA54431@troutmask.apl.washington.edu> <20170518072636.E951@besplex.bde.org> <20170518014226.GA63819@troutmask.apl.washington.edu> <20170518154820.A8280@besplex.bde.org> <20170518175434.GA74453@troutmask.apl.washington.edu> <20170519074512.M884@besplex.bde.org> <20170518234722.GB77471@troutmask.apl.washington.edu> <20170519153326.A20099@besplex.bde.org> <20170520105748.X1017@besplex.bde.org> <20170520145813.GA31395@troutmask.apl.washington.edu>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 20 May 2017, Steve Kargl wrote:

> On Sat, May 20, 2017 at 10:58:03AM +1000, Bruce Evans wrote:
>> On Fri, 19 May 2017, Bruce Evans wrote:
>
>> XX  	uint32_t hx, ix, j0, lx;
>> XX
>> XX @@ -87,12 +86,15 @@
>> XX  	if (ix < 0x3ff00000) {			/* |x| < 1 */
>> XX  		if (ix < 0x3fd00000) {		/* |x| < 0.25 */
>> XX -			if (ix < 0x3e200000) {	/* |x| < 0x1p-29 */
>> XX +			if (ix < 0x3de00000) {	/* |x| < 2**-33 */
>> XX  				if (x == 0)
>> XX  					return (x);
>> XX -				INSERT_WORDS(ax, hx, 0);
>> XX -				x -= ax;
>> XX -				s = pilo * x + pihi * x + pilo * ax
>> XX -				    + pihi * ax;
>> XX -				return (s);
>> XX +				if ((int)x == 1) /* gen. inexact if x != 0 */
>> XX +					return (x);	/* NOTREACHED */
>
> Is this simply to guarantee that inexact is raised?  I had
> assumed that inexact would be raised by one or more if the
> * and + operation in s.

I was confused, but accidentally correct.  Let x be a power of 2.  Then
all the multiplications are exact, but the additions are usually inexact.
However, the additions may be exact if there is extra precision, and other
cases are not clear either.

Simpler example: suppose we approximate cos(x) by 1 - x*x/2.  Then the
result is obviously exact for must powers of 2.  It is unclear if poly
coeffs have enough low bits nonzero to give inexact for _all_ args in
general.

Relevant cases here:

(1) (M_PI * x), where x is float and a small power of 2, is exact in
double precision, but inexact when rounded to float precision.  The C
standard requires destruction of the extra precision on return, since
the return type has lower rank than the nominal expression type.  (I
use a hack to prevent this.)  It is lots of low bits 1 in M_PI that
give the extra precision and the destruction of them is what sets
inexact.

(2) Expressions like:
 				s = pilo * x + pihi * x + pilo * ax
 				    + pihi * ax;
C allows the rvalue to be evaluated in extra precision.  This may be
exact, depending on how much extra precision there is.

(2a) C requires discarding the extra precision on assignment.  (This is
now discarding and not destruction, since there must be ways to remove
the extra precision intentionally and assignment is the most natural
way -- the user should have used float_t to avoid loss.)

(2b) However, (2a) is too slow when there actually is extra precision,
so x86 non-C compilers don't do it.  gcc starting with about gcc-4.8
attempts to be a C99 compiler with certain CFLAGS.  clang doesn't support
these flags.

So s may be exact too.

(2c) C11 requires destroying the extra precision on return.  (This is now
perverse destruction and incompatible with C99.  It defeats FLT_EVAL_METHOD
giving extra precision for parts of expressions using functions, at least
if the functions are written in C.

So if s is exact, C11 requires it to become inexact on return.

(2d) Howver, (2c) is too slow and broken, so x86 non-C compilers don't
do it.  gcc starting with about gcc-4.8 attempts to be a C11 compiler
with with just -std=c11 in CFLAGS and the user must use the reverse
of the more nonstandard CFLAGS to get C99 conformats to avoid this bug.
clang doesn't attempt to support this bug.

Callers can see the extra precision even if they are compiled with a C
compiler:

(3a)
 	float f = sinpif(x);
C compilers remove any extra precision and set inxact if there was any.

(3b)
 	float_t f = sinpif(x);
C compilers keep any extra precision, and code should usually be written
like this to keep it intentionally.  But if sinpif() was compiled by a
C11 compiler, it is not allowed to return extra precision.  It is unclear
if callers compiled with a C11 compiler can depend on sinpif() being
compiled with a C11 compiler so that it can't have extra precision.

(3c)
 	float_t f or float f;
 	STRICT_ASSIGN(float, f, sinpif(x));

My code to not see any extra precision when testing sinpif() does this.

>> XX diff -u2 s_sinpif.c~ s_sinpif.c
>> ...
>> XX +	 		if (ix < 0x34800000)	/* |x| < 2**-22 */
>> XX +				if ((int)x == 0) /* gen. inexact if x != 0 */
>> XX +					return (M_PI * x);
>> XX  			s = __kernel_sinpif(ax);
>> XX  			return ((hx & 0x80000000) ? -s : s);
>>
>> This is simpler because it can just use double precision.
>
> It defeats the use of the same algorithm in all precisions.
> In particular, it is easier to develop an algorithm in float,
> test it, and then generalize to other precisions.

The algorithm is already very different for the usual case since it has to
map to the kernels which use double precision.  It would be a bit much to
restore all the unoptimized kernels that use float precision just to
develop the sinpif() family.  But I have considered going back to pure
float precision to get further optimizations.  Switching precisions costs
about 2-4 cycles per switch on x86, which is a lot when some of the functions
take less than 20 cycles.  On i386, the switch should be free, but
destruction of extra precision on return costs more than 4 cycles.  On
and64, there is a switch to promote to double precision on entry and
another to demote to float precision on return.

>> XX static inline double
>> XX __kernel_mpi(double x)
>> XX {
>> XX 	double hi, lo;
>> XX 	int32_t hx, lx;
>> XX
>> XX 	EXTRACT_WORDS(hx, lx, x);
>> XX 	hi = x;
>> XX 	SET_LOW_WORD(hi, 0);
>> XX 	hi *= 0x1p1000;
>> XX 	lo = 0x1p1000 * x  - hi;
>> XX 	return ((0x1p1000 * pi_lo * x + pi_hi * lo + pi_hi * hi) * 0x1p-1000);
>> XX }
>>
>> This is cleaner, but not yet suitable for a kernel in a header file since
>> it does a pessimal second extraction (which should be only of the high word
>> again).
>
> Why the 2nd extraction?  Neither hx nor lx are used.  Also,

It is a bug.  Larger than I remembered, but the compiler will optimize
away the extra extraction.  If there is an extraction, then it should
match the one in the caller so that it can be optimized away.

> int32_t should be uint32_t (or to be in line with math_private.h
> perhaps u_int32_t).

u[_]int32_t is technically correct for raw bits, but int32_t works too
since it is pure 2's complement.  fdlibm uses int32_t to do signed
comparisons on the result or when it is just sloppy, and I must have
copied one of the sloppy places.

>> Since this is inline, perhaps the compile can optimize away the
>> second extraction.  It can do this easily only if the extractions are the
>> same.  If the compiler can't do this (perhaps because this isn't inlined),
>> then hx (and lx if available) should be passed to the kernel.
>
> Why?  Neither is used in the kernel.

They need to be the same to avoid repeated insertions or extractions.
Since my s_sinpi() only GETs the high word and __kernel_mpi() only does
a null extraction followed by a SET of the low word, the operations appear
to be independent.  However, the only efficient way to do them is 64 bits
at a time for each, with operations merged.  This would be more needed to
clear 53/2 instead of 32 low bits in 'hi'.  Then lx would be needed
explicitly.  But efficient code has to extract and insert 64 bits at a
time anyway (but on x86, keep it in unnamed SSE registers unless the
bits are used explicitly).

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20170521082808.T1142>