From owner-freebsd-numerics@FreeBSD.ORG  Thu Sep  6 11:42:38 2012
Return-Path: <owner-freebsd-numerics@FreeBSD.ORG>
Delivered-To: freebsd-numerics@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 4E1F9106564A
	for <freebsd-numerics@freebsd.org>;
	Thu,  6 Sep 2012 11:42:38 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail17.syd.optusnet.com.au (mail17.syd.optusnet.com.au
	[211.29.132.198])
	by mx1.freebsd.org (Postfix) with ESMTP id D64D58FC0C
	for <freebsd-numerics@freebsd.org>;
	Thu,  6 Sep 2012 11:42:36 +0000 (UTC)
Received: from c122-106-171-246.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-246.carlnfd1.nsw.optusnet.com.au [122.106.171.246])
	by mail17.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q86BgQKH022338
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 6 Sep 2012 21:42:27 +1000
Date: Thu, 6 Sep 2012 21:42:26 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Stephen Montgomery-Smith <stephen@missouri.edu>
In-Reply-To: <503ABAC4.3060503@missouri.edu>
Message-ID: <20120906200420.H1056@besplex.bde.org>
References: <5017111E.6060003@missouri.edu> <501C361D.4010807@missouri.edu>
	<20120804165555.X1231@besplex.bde.org> <501D51D7.1020101@missouri.edu>
	<20120805030609.R3101@besplex.bde.org> <501D9C36.2040207@missouri.edu>
	<20120805175106.X3574@besplex.bde.org> <501EC015.3000808@missouri.edu>
	<20120805191954.GA50379@troutmask.apl.washington.edu>
	<20120807205725.GA10572@server.rulingia.com>
	<20120809025220.N4114@besplex.bde.org>
	<5027F07E.9060409@missouri.edu> <20120814003614.H3692@besplex.bde.org>
	<50295F5C.6010800@missouri.edu> <20120814072946.S5260@besplex.bde.org>
	<50297CA5.5010900@missouri.edu> <50297E43.7090309@missouri.edu>
	<20120814201105.T934@besplex.bde.org> <502A780B.2010106@missouri.edu>
	<20120815223631.N1751@besplex.bde.org> <502C0CF8.8040003@missouri.edu>
	<503265E8.3060101@missouri.edu> <5036EFDB.3010304@missouri.edu>
	<503A9A1A.1050004@missouri.edu> <503ABAC4.3060503@missouri.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-numerics@freebsd.org
Subject: Re: Complex arg-trig functions
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
	<freebsd-numerics.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
	<mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
	<mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 06 Sep 2012 11:42:38 -0000

I'm back from a holiday.

On Sun, 26 Aug 2012, Stephen Montgomery-Smith wrote:

> On 08/26/2012 04:50 PM, Stephen Montgomery-Smith wrote:
>> On 08/23/2012 10:07 PM, Stephen Montgomery-Smith wrote:
>>> I just found out that the boost libraries implement the complex asin
>>> function.  I think their implementation is more faithful to the paper by
>>> Hull et al than my implementation is.  It does seem to have a BSD style
>>> license.  The only problem with it is that it is written in C++.

In extremely ugly C++.

>> http://www.boost.org/doc/libs/1_51_0/boost/math/complex/asin.hpp
>
>> Their asin seems to be about 10-15% faster than mine.  Their error is
>> slightly higher (4.5 ULP instead of 4ULP).
>
> But the speed improvement are because they use their own log1p function. 
> When I replace it with the standard log1p, it goes 5% slower than mine.

It seems unlikely that C++ is faster :-).  In float and long double
precision, it might benefit from using builtins for isnan(), etc., but
your double version uses private versions of these (except when it
calls csqrt() etc., and the callee uses the slow non-builtins).

I tested your latest version and found regressions for NaNs.  (Also,
the float version is less accurate now that it doesn't call double
functions, but that is a progression.)
1. casin*() abuses x and y for |x| and |y| and thus loses the sign of
    NaNs.  The quick fix is to reload the original values.  The correct
    fix is to use separate variables for the absolute values, as is
    done for cacosh*() and catanh*().
2. NaNs are now filtered out early, but in casin*() and cacos*() this is
    not before handling large values for the other arg.  Perhaps the
    resulting NaN is correct except for (1).  I didn't try applying the
    fix for (1) in the large-arg case, and just moved moved the NaN checks
    earlier.  Later I noticed that this has other advantages.
3. You were missing my hack to give uniform behaviour in catanh().  The
    other functions seem to need it more than before.  I applied it to
    all the functions.

% diff -u2 catrig.c~ catrig.c
% --- catrig.c~	2012-08-28 02:22:51.000000000 +0000
% +++ catrig.c	2012-09-06 09:09:36.086949000 +0000
% @@ -30,9 +30,13 @@
%  #include "math_private.h"
% 
% +/* XXX should use normal fdlibm access macros GET_HIGH_WORD(), etc. */
%  #define ISNAN(x)	((x)!=(x))
%  #define SIGNBIT(bx)	(((bx).parts.msw & 0x80000000) != 0)

This is not quite optimal, but better than the non-builtin isnan() etc.
that are used in float and long double precision.

The above isnan() check is fairly efficient and correct, unlike the
extern versions (which are broken and especially slow for long double
precision), but normal fdlibm code doesn't do it this way (it checks
everything as bits).

The above SIGNBIT() check would be written similarly using bits, but
with an access macro to read the bits.  gcc tends to generate slow code
for the direct bitfield access, especially for the sign bit, especially
in long doubles, and this is not easy to change without a macro to hide
the details.

% +/*
% + * XXX now that we filter out NaNs as the first step, ISINF() can be
% + * simplified to checking that the (biased) exponent is 0x7ff.
% + */
%  #define ISINF(bx)	(((((bx).parts.msw & 0x7fffffff) - 0x7ff00000) | \
%  			(bx).parts.lsw) == 0)

When you do do the bit checks explicitly, it is more obvious that the
mantissa bits have already been checked by ISNAN(), so they don't
need to be checked again.  Optimized fdlibm code would also combine
the exponent checking for ISNAN() and ISINF().

% -#define ISFINITE(bx)	(((bx).parts.msw & 0x7ff00000) != 0x7ff00000)

Previous changes made this unused.  ISINF(bx) could be spelled as
!ISFINITE(bx) to implement the above optimization, but this is unclear
since it only works in contexts where !ISNAN(bx).

% 
%  #define MAX_4TH_ROOT		1e75	/* approx pow(DBL_MAX,0.25) */
% @@ -272,13 +276,4 @@
%  	ay = fabs(y);
% 
% -	if (ax > RECIP_SQRT_EPSILON_100 || ay > RECIP_SQRT_EPSILON_100)
% -		if (ISINF(bx) || ISINF(by) || (int)(1+tiny)==1) { /* raise inexact */
% -			if (SIGNBIT(bx) == 0)
% -				w = clog_for_large_values(z) + M_LN2;
% -			else
% -				w = clog_for_large_values(-z) + M_LN2;
% -			return (cpack(copysign(creal(w), x), copysign(cimag(w), y)));
% -		}
% -
%  	if (ISNAN(x) || ISNAN(y)) {
%  		/* casinh(NaN + I*0) = NaN + I*0 */

Nothing changed here.  The diff shows the wrong block of code being moved.

% @@ -289,7 +284,16 @@
%  		 * the arguments is not NaN, so we opt not to raise it.
%  		 */
% -		return (cpack(x+y, x+y));
% +		return (cpack(x+0.0L+(y+0), x+0.0L+(y+0)));
%  	}

I moved this to the start and added uniform NaN mixing, for all functions
in all precisions.  The NaN mixing is now localized to a single line
for each function.  It would be easy to hide its details in a macro,
without any efficiency problems if you change the macro back to
cpack(x+y, x+y).

% 
% +	if (ax > RECIP_SQRT_EPSILON_100 || ay > RECIP_SQRT_EPSILON_100)
% +		if (ISINF(bx) || ISINF(by) || (int)(1+tiny)==1) { /* raise inexact */
% +			if (SIGNBIT(bx) == 0)
% +				w = clog_for_large_values(z) + M_LN2;
% +			else
% +				w = clog_for_large_values(-z) + M_LN2;
% +			return (cpack(copysign(creal(w), x), copysign(cimag(w), y)));
% +		}
% +

Not really changed.

Reducing the ISINF() tests to just checking the exponent depends on all
cases with NaNs being filtered out earlier.  Similarly for the other
functions, in all precisions.

%  	if (ax < SQRT_EPSILON_100 && ay < SQRT_EPSILON_100)
%  		if ((int)ax==0 && (int)ay==0) /* raise inexact */
% @@ -345,15 +349,11 @@
%  	y = fabs(y);
% 
% -	if (x > RECIP_SQRT_EPSILON_100 || y > RECIP_SQRT_EPSILON_100)
% -		if (ISINF(bx) || ISINF(by) || (int)(1+tiny)==1) { /* raise inexact */
% -			w = clog_for_large_values(z);
% -			rx = fabs(cimag(w));
% -			ry = creal(w) + M_LN2;
% -			if (sy == 0)
% -				ry = -ry;
% -			return (cpack(rx, ry));
% -		}
% -

Not changed.  Diff messed up the move again.

%  	if (ISNAN(x) || ISNAN(y)) {
% +		/*
% +		 * XXX unclobber x and y.  We should not have abused x and
% +		 * y to hold |x| and |y|.
% +		 */
% +		x = creal(z);
% +		y = cimag(z);

Quick fix.

If you actually want to toggle the sign[s] of the returned x and y to
match the symmetry properties of this and other functions even for
NaNs, then more work is needed.  copysign() would have to be used,
probably in a different way for the real and imaginary parts.  C99
doesn't require this much care even for +-0 in combination with a
NaN (the sign of the zero is unspecified, presumably because it
could reasonably depend on the sign of a NaN arg and nothing specifies
what the sign of a NaN does).

%  		/* cacos(0 + I*NaN) = PI/2 + I*NaN */
%  		if (x == 0) return (cpack(M_PI_2, y+y));
% @@ -363,11 +363,26 @@
%  		 * the arguments is not NaN, so we opt not to raise it.
%  		 */
% -		return (cpack(x+y, x+y));
% +		return (cpack(x+0.0L+(y+0), x+0.0L+(y+0)));

Usual uniformization.

%  	}
% 
% +	if (x > RECIP_SQRT_EPSILON_100 || y > RECIP_SQRT_EPSILON_100)
% +		if (ISINF(bx) || ISINF(by) || (int)(1+tiny)==1) { /* raise inexact */
% +			w = clog_for_large_values(z);
% +			rx = fabs(cimag(w));
% +			ry = creal(w) + M_LN2;
% +			if (sy == 0)
% +				ry = -ry;
% +			return (cpack(rx, ry));
% +		}
% +

Usual diff mess.

%  	if (x < SQRT_EPSILON_100 && y < SQRT_EPSILON_100)
%  		if ((int)x==0 && (int)y==0) { /* raise inexact */
% +			/* XXX we should use unclobbered x and y here too. */
%  			if (sy == 0)
%  				return (cpack(M_PI_2 - creal(z), copysign(cimag(z), -1)));
% +			/*
% +			 * XXX and here, except we should use the clobbered
% +			 * y and not spell it fabs(cimag(z)).
% +			 */
%  			return (cpack(M_PI_2 - creal(z), fabs(cimag(z))));

creal(z) must be reloaded, but fabs(cimag(z)) is already in y, but this y
should be spelled ay.

%  		}
% @@ -547,5 +562,5 @@
%  		 * the arguments is not NaN, so we opt not to raise it.
%  		 */
% -		return (cpack(x+y, x+y));
% +		return (cpack(x+0.0L+(y+0), x+0.0L+(y+0)));

catanh() didn't need recovering x and y or moving the check for NaNs
earlier to be correct.  Checking for NaNs earlier might still be an
optimization.


%  	}
% 
% diff -c2 catrigf.c~ catrigf.c
% --- catrigf.c~	2012-08-22 02:29:06.000000000 +0000
% +++ catrigf.c	2012-09-06 08:57:14.029522000 +0000
% @@ -167,4 +167,9 @@
%  	ay = fabsf(y);
% 
% +	if (isnan(x) || isnan(y)) {
% +		if (y == 0) return (cpackf(x+x, y));
% +		return (cpackf(x+0.0L+(y+0), x+0.0L+(y+0)));
% +	}
% +
%  	if (ax > RECIP_SQRT_EPSILON_100 || ay > RECIP_SQRT_EPSILON_100)
%  		if (isinf(x) || isinf(y) || (int)(1+tiny)==1) { /* raise inexact */
% @@ -176,9 +181,4 @@
%  		}
% 
% -	if (isnan(x) || isnan(y)) {
% -		if (y == 0) return (cpackf(x+x, y));
% -		return (cpackf(x+y, x+y));
% -	}
% -
%  	if (ax < SQRT_EPSILON_100 && ay < SQRT_EPSILON_100)
%  		if ((int)ax==0 && (int)ay==0) /* raise inexact */
% @@ -215,4 +215,11 @@
%  	y = fabsf(y);
% 
% +	if (isnan(x) || isnan(y)) {
% +		x = crealf(z);
% +		y = cimagf(z);
% +		if (x == 0) return (cpackf(M_PI_2, y+y));
% +		return (cpackf(x+0.0L+(y+0), x+0.0L+(y+0)));
% +	}
% +
%  	if (x > RECIP_SQRT_EPSILON_100 || y > RECIP_SQRT_EPSILON_100)
%  		if (isinf(x) || isinf(y) || (int)(1+tiny)==1) { /* raise inexact */
% @@ -225,9 +232,4 @@
%  		}
% 
% -	if (isnan(x) || isnan(y)) {
% -		if (x == 0) return (cpackf(M_PI_2, y+y));
% -		return (cpackf(x+y, x+y));
% -	}
% -
%  	if (x < SQRT_EPSILON_100 && y < SQRT_EPSILON_100)
%  		if ((int)x==0 && (int)y==0) { /* raise inexact */
% @@ -336,5 +338,5 @@
%  		if (x == 0)
%  			return (cpackf(x, y+y));
% -		return (cpackf(x+y, x+y));
% +		return (cpackf(x+0.0L+(y+0), x+0.0L+(y+0)));
%  	}
% 
% diff -c2 catrigl.c~ catrigl.c
% --- catrigl.c~	2012-08-22 02:29:48.000000000 +0000
% +++ catrigl.c	2012-09-05 16:50:10.684867000 +0000
% @@ -139,4 +139,9 @@
%  	ay = fabsl(y);
% 
% +	if (isnan(x) || isnan(y)) {
% +		if (y == 0) return (cpackl(x+x, y));
% +		return (cpackl(x+0.0L+(y+0), x+0.0L+(y+0)));
% +	}
% +
%  	if (ax > RECIP_SQRT_EPSILON_100 || ay > RECIP_SQRT_EPSILON_100)
%  		if (isinf(x) || isinf(y) || (int)(1+tiny)==1) { /* raise inexact */
% @@ -148,9 +153,4 @@
%  		}
% 
% -	if (isnan(x) || isnan(y)) {
% -		if (y == 0) return (cpackl(x+x, y));
% -		return (cpackl(x+y, x+y));
% -	}
% -
%  	if (ax < SQRT_EPSILON_100 && ay < SQRT_EPSILON_100)
%  		if ((int)ax==0 && (int)ay==0) /* raise inexact */
% @@ -187,4 +187,11 @@
%  	y = fabsl(y);
% 
% +	if (isnan(x) || isnan(y)) {
% +		x = creall(z);
% +		y = cimagl(z);
% +		if (x == 0) return (cpackl(L_PI_2, y+y));
% +		return (cpackl(x+0.0L+(y+0), x+0.0L+(y+0)));
% +	}
% +
%  	if (x > RECIP_SQRT_EPSILON_100 || y > RECIP_SQRT_EPSILON_100)
%  		if (isinf(x) || isinf(y) || (int)(1+tiny)==1) { /* raise inexact */
% @@ -197,9 +204,4 @@
%  		}
% 
% -	if (isnan(x) || isnan(y)) {
% -		if (x == 0) return (cpackl(L_PI_2, y+y));
% -		return (cpackl(x+y, x+y));
% -	}
% -
%  	if (x < SQRT_EPSILON_100 && y < SQRT_EPSILON_100)
%  		if ((int)x==0 && (int)y==0) { /* raise inexact */
% @@ -308,5 +310,5 @@
%  		if (x == 0)
%  			return (cpackl(x, y+y));
% -		return (cpackl(x+y, x+y));
% +		return (cpackl(x+0.0L+(y+0), x+0.0L+(y+0)));
%  	}
%

Same changes in all precision.  Diff only messed up the move in double
precision.

More details on the pessimizations for isnan() etc.:
- gcc-4.2 (but not gcc-3.3) has builtins for isnan() and signbit(), but
   apparently not for isinf() or isfinite() (it has builtin even for the
   last 2, but these just make libcalls
- for isnan(x) on at least i386, gcc-4.2 generates the same as for
   (x) != (x).  Efficiency of this is mediocre, except possibly for long
   doubles (you can often beat it using bit tests, but for long doubles
   there are lots of bits and special cases.  The extern version is broken
   for long doubles since it doesn't understand the special cases)
- for signbit(x) on at least i386, gcc-4.2 generates very good code in
   float and double precisions, but not so good code in lond double
   precision.  The extern versions are pessimized using a bit-field, in
   an unusual way: signbit() only returns nonzero if the sign bit is
   set, but bit-field semantics require a value of +-1, and several
   operations are needed to convert to this.  The builtins are missing
   these extra operations.  But gcc-4.2 pessimizes long doubles even
   more for the builtin than for bit-fields:

@ f:
@ 	pushl	%ebp
@ 	movl	%esp, %ebp
@ 	movl	16(%ebp), %eax

This loads the high 16-bit word of the long double, and also 16-bits
of garbage.  The garbage is usually not even written to, and may
cause a cache miss or write buffer penalty.  Versions with bit-fields
tend to cause related penalties but not this one.

@ 	andl	$32768, %eax
@ 	leave
@ 	ret

- only the builtin for isnan() is ever used under FreeBSD, because
   macros in math.h translate to other names which gcc doesn't
   understand, except isnan() alone is translated to itself so gcc
   can still see its name.  E.g., from math.h:

@ #define	isnan(x)					\
@     ((sizeof (x) == sizeof (float)) ? __isnanf(x)	\
@     : (sizeof (x) == sizeof (double)) ? isnan(x)	\
@     : __isnanl(x))

The double isnan() has a null translation for historical reasons, and this
accident allows the builtin to work.

@ ...
@ #ifdef __MATH_BUILTIN_RELOPS
@ #define	isgreater(x, y)		__builtin_isgreater((x), (y))
@ #define	isgreaterequal(x, y)	__builtin_isgreaterequal((x), (y))
@ #define	isless(x, y)		__builtin_isless((x), (y))
@ #define	islessequal(x, y)	__builtin_islessequal((x), (y))
@ #define	islessgreater(x, y)	__builtin_islessgreater((x), (y))
@ #define	isunordered(x, y)	__builtin_isunordered((x), (y))
@ #else
@ #define	isgreater(x, y)		(!isunordered((x), (y)) && (x) > (y))
@ #define	isgreaterequal(x, y)	(!isunordered((x), (y)) && (x) >= (y))
@ #define	isless(x, y)		(!isunordered((x), (y)) && (x) < (y))
@ #define	islessequal(x, y)	(!isunordered((x), (y)) && (x) <= (y))
@ #define	islessgreater(x, y)	(!isunordered((x), (y)) && \
@ 					((x) > (y) || (y) > (x)))
@ #define	isunordered(x, y)	(isnan(x) || isnan(y))
@ #endif /* __MATH_BUILTIN_RELOPS */

Builtins are used for these macros, but not for any other similar macros.

Bruce

From owner-freebsd-numerics@FreeBSD.ORG  Thu Sep  6 13:51:08 2012
Return-Path: <owner-freebsd-numerics@FreeBSD.ORG>
Delivered-To: freebsd-numerics@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 9DB77106566C
	for <freebsd-numerics@freebsd.org>;
	Thu,  6 Sep 2012 13:51:08 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail07.syd.optusnet.com.au (mail07.syd.optusnet.com.au
	[211.29.132.188])
	by mx1.freebsd.org (Postfix) with ESMTP id 20BB88FC1E
	for <freebsd-numerics@freebsd.org>;
	Thu,  6 Sep 2012 13:51:07 +0000 (UTC)
Received: from c122-106-171-246.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-246.carlnfd1.nsw.optusnet.com.au [122.106.171.246])
	by mail07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q86DowsF008413
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 6 Sep 2012 23:51:00 +1000
Date: Thu, 6 Sep 2012 23:50:58 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Stephen Montgomery-Smith <stephen@missouri.edu>
In-Reply-To: <502C0CF8.8040003@missouri.edu>
Message-ID: <20120906221028.O1542@besplex.bde.org>
References: <5017111E.6060003@missouri.edu> <501C361D.4010807@missouri.edu>
	<20120804165555.X1231@besplex.bde.org> <501D51D7.1020101@missouri.edu>
	<20120805030609.R3101@besplex.bde.org> <501D9C36.2040207@missouri.edu>
	<20120805175106.X3574@besplex.bde.org> <501EC015.3000808@missouri.edu>
	<20120805191954.GA50379@troutmask.apl.washington.edu>
	<20120807205725.GA10572@server.rulingia.com>
	<20120809025220.N4114@besplex.bde.org>
	<5027F07E.9060409@missouri.edu> <20120814003614.H3692@besplex.bde.org>
	<50295F5C.6010800@missouri.edu> <20120814072946.S5260@besplex.bde.org>
	<50297CA5.5010900@missouri.edu> <50297E43.7090309@missouri.edu>
	<20120814201105.T934@besplex.bde.org> <502A780B.2010106@missouri.edu>
	<20120815223631.N1751@besplex.bde.org> <502C0CF8.8040003@missouri.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-numerics@freebsd.org
Subject: Re: Complex arg-trig functions
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
	<freebsd-numerics.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
	<mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
	<mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 06 Sep 2012 13:51:08 -0000

On Wed, 15 Aug 2012, Stephen Montgomery-Smith wrote:

> On 08/15/2012 08:35 AM, Bruce Evans wrote:
>> An up-front check may still be simpler, and gives more control.  In
>> csqrt*(), I needed an explicit check and special expressions to get
>> uniform behaviour.
>
> I still like it the way I have it.  There is a definite logic in the way infs 
> and nans come out of casinh, etc.

Changed now :-).

> There is only one place I disagree with C99:
> catanh(1) = Inf + 0*I
>
> I think mpc gets it correct:
> atanh(1) = Inf + nan*I

C99 is trying be compatible with the real function to a fault.  Real
atanh(1) must be Inf, and any imaginary part would make catanh(1)
different.  C99 requires similar beahaviour for many other functions.

>> I added this to the NaN mixing in catan[h]*(),
>> and now all my tests pass:
>> 
>> % diff -c2 catrig.c~ catrig.c
>> % *** catrig.c~    Sun Aug 12 17:29:18 2012
>> % --- catrig.c    Wed Aug 15 11:57:02 2012
>> % ***************
>> % *** 605,609 ****
>> %        */
>> %       if (ISNAN(x) || ISNAN(y))
>> % !         return (cpack(x+y, x+y));
>> % %       /* (x,inf) and (inf,y) and (inf,inf) -> (0,PI/2) */
>> % --- 609,613 ----
>> %        */
>> %       if (ISNAN(x) || ISNAN(y))
>> % !         return (cpack((x+0.0L)+(y+0), (x+0.0L)+(y+0)));
>> % %       /* (x,inf) and (inf,y) and (inf,inf) -> (0,PI/2) */
>> 
>> Use this expression in all precisions.
>
> Would this work?
>
>       if (ISNAN(x) || ISNAN(y))
>                return (cpack((x+x)+(y+y), (x+x)+(y+y)));

No, since this would be evaluated by a different ALU than my version,
if the ALU for long doubles is different from the ALU for doubles, as
happens on amd64.  However, I could use y+y instead of y+0 in some
cases.  The point of using y+0 is to use the same code and not change
the result much if y happens to be finite (only -0 gets changed, to 0).
ALUs:
   (x+0.0L)+(y+0):
     x+0.0L:
       long double + long double, after promotion of x.  Always in the i387.
     y+0 (or y+y):
       typeof(y) + typeof(y), after promotion of x.  In the i387 on i386,
       but in SSE on amd64.  y+0 ends up quasi-promoted on i386.
     final addition:
       long double + long double, after promotion of y+0.  Always in the i387.
       Always long double result.  Returning this then converts to double.

   (x+x)+(y+y):
     x+x:
       as above for y+y.  Only in the i387 on i386.
     y+y:
       as above
     final addition:
       still only in the i387 on i386.  So the result of casin() on 2 NaNs is
       quite different from that of casinl() on the same 2 NaNs (initially),
       since the i387 ALU has different semantics for NaNs than the SSE ALU
       despite the ALUs sharing the FPU hardware.  Only quieting of NaNs works
       perfectly in this version.

>> I forgot to comment it.  Adding 0 quietens signaling NaNs before mixing
>> NaNs.  I should have tried y+y.  Adding 0.0L promotes part of the
>> expression to long double together with quietening signaling NaNs.
>> The rest of the expression is promoted to match.  I should try the
>> old way again: of (long double)x+x.

Another explanation.  Without testing it, I think it has the same speed
with the x+0[.0L] versions, but is slightly better for the x+x version,
since x+x can be done without another load for 0 in SSE only.  But the
extra load is likely to free on x86 since it can be hidden in scheduling
latencies.

>> ...
>> % Index: ../s_csqrt.c
>> % ===================================================================
>> % RCS file: /home/ncvs/src/lib/msun/src/s_csqrt.c,v
>> % retrieving revision 1.4
>> % diff -u -2 -r1.4 s_csqrt.c
>> % --- ../s_csqrt.c    8 Aug 2008 00:15:16 -0000    1.4
>> % +++ ../s_csqrt.c    14 Aug 2012 20:34:07 -0000
>> % @@ -34,14 +34,5 @@
>> %  #include "math_private.h"
>> % % -/*
>> % - * gcc doesn't implement complex multiplication or division correctly,
>> % - * so we need to handle infinities specially. We turn on this pragma to
>> % - * notify conforming c99 compilers that the fast-but-incorrect code that
>> % - * gcc generates is acceptable, since the special cases have already
>> been
>> % - * handled.
>> % - */
>> % -#pragma    STDC CX_LIMITED_RANGE    ON
>> 
>> Remove this.  There was only 1 complex expression, and it depended on the
>> negation of this pragma to work.  Since gcc doesn't support this pragma,
>> the expression only worked accidentally when it was optimized.
>
> I removed it.  (I copied it verbatim from csqrt without really understanding 
> it.)

Good.

> The part that follows - is this all referencing csqrt?

Depends on whether you clipped an "Index" header :-).

>> 
>> % -
>> % -/* We risk spurious overflow for components >= DBL_MAX / (1 +
>> sqrt(2)). */
>> % +/* For avoiding overflow for components >= DBL_MAX / (1 + sqrt(2)). */
>> %  #define    THRESH    0x1.a827999fcef32p+1022
>> 
>>   ..............   snip

Looks like csqrt.

I didn't check that THRESH is actually correct.  It seems to have more
bits than are strictly necessary.

>> This is like a fix in clog().  hypot() handles denormals OK, but
>> necessarily loses accuracy when it returns a denormal result, so
>> the expression (a + hypot(a, b)) is more inaccurate than necessary.
>
> Which code is being referenced here?  I use expressions like this catrig. 
> Although I think when I use it, I am somewhat certain that neither a nor b 
> are denormal.

Now "This" is the change in csqrt() in the snipped quote and is being
compared to related code in clog() that wasn't quoted recently in this
thread.

Snipped quote:

% +		scale = 2;
%  	} else {
% -		scale = 0;
% +		scale = 1;
% +	}
% +
% +	/* Scale to reduce inaccuracies when both components are denormal. */
% +	if (fabs(a) <= 0x1p-1023 && fabs(b) <= 0x1p-1023) {
% +		a *= 0x1p54;
% +		b *= 0x1p54;
% +		scale = 0x1p-27;

Related code in clog():

% 	/* Reduce inaccuracies and avoid underflow when ax is denormal. */
% 	if (kx <= MIN_EXP - 2)
% 		return (cpack(log(hypot(x * 0x1p1023, y * 0x1p1023)) +
% 		    (MIN_EXP - 2) * ln2_lo + (MIN_EXP - 2) * ln2_hi, v));
% 
% 	/* Avoid remaining underflows (when ax is small but not denormal). */
% 	if (ky < (MIN_EXP - 1) / 2 + MANT_DIG)
% 		return (cpack(log(hypot(x, y)), v));

This code in catrig.c seems to be related:

@ /*
@  * sum_squares(x,y) = x*x + y*y.
@  * Assumes x*x and y*y will not overflow.
@  * Assumes x and y are finite.
@  * Assumes fabs(x) >= DBL_EPSILON.
@  */
@ inline static double
@ sum_squares(double x, double y)
@ {
@ 	/*
@ 	 * The following code seems to do better than
@ 	 * return (hypot(x, y) * hypot(x, y));
@ 	 */
@ 
@ 	if (fabs(y) < MIN_4TH_ROOT)
@ 		if ((int)y==0) /* raise inexact */
@ 			return (x*x);
@ 	return (x*x + y*y);
@ }

but it isn't really -- it is more like the case of a large difference
of exponents, so x*x+y*y reduces to x*x (at a smaller exponent difference
than for hypot(x, y) reducing to |x|).  hypot() sets inexact by returning
|x|+|y| in that case, but the above avoids using y*y since it would give
spurious underflow.

Hmm, MIN_4TH_ROOT seems to be logically wrong here, and physically
wrong for float precision.  In double precision, MIN_4TH_ROOT/DBL_EPSILON
is ~0.5e-59, but in float precision, MIN_4TH_ROOT/FLT_EPSILON is only
~1e-2, so when x = FLT_EPSILON, y = MIN_4TH_ROOT is very far from giving
a discardable y*y.

The logically correct threshold is something like y/x < 2**-MANT_DIG/2
or y < 2**-MANT_DIG/2 * EPSILON to avoid the division.  clog*() uses
essentially y/x < 2**-MANT_DIG for some reason (it actually uses a fuzzy
check based on exponent differences: kx - ky > MANT_DIG; MANT_DIG/2
would be a little too small here, but doubling it seems wasteful).
hypot() uses essentially y/x < 2**-60, where the magic 60 is DBL_MANT_DIG
plys a safety margin.  hypotf() uses a magic 30.  The non-magic integer
exponent thresholds work very well in clog*(), but depend on having the
exponents handy for efficiency, which may be possible via optimization
for an inline function like the above but takes more source code.

I don't see how the above avoids cancelation problems like the ones in
clog().  You never subtract 1 from sum_squares(), but there are open-coded
expressions like '(1-ax)*(1+ax) - ay*ay' to reduce the cancelation error.
I found that for clog() it was worth using extra precision out to about
|z| = 4 (or |z| < 1/4) to reduce the cancelation error.  But somehow my
tests didn't show large errors.  Perhaps because I'm happy with < 4 ulps
instead of < 1.

Bruce

From owner-freebsd-numerics@FreeBSD.ORG  Thu Sep  6 16:32:23 2012
Return-Path: <owner-freebsd-numerics@FreeBSD.ORG>
Delivered-To: freebsd-numerics@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 7C31810656D4
	for <freebsd-numerics@freebsd.org>;
	Thu,  6 Sep 2012 16:32:23 +0000 (UTC)
	(envelope-from stephen@missouri.edu)
Received: from wilberforce.math.missouri.edu (wilberforce.math.missouri.edu
	[128.206.184.213])
	by mx1.freebsd.org (Postfix) with ESMTP id 1408C8FC15
	for <freebsd-numerics@freebsd.org>;
	Thu,  6 Sep 2012 16:32:22 +0000 (UTC)
Received: from [127.0.0.1] (wilberforce.math.missouri.edu [128.206.184.213])
	by wilberforce.math.missouri.edu (8.14.5/8.14.5) with ESMTP id
	q86GWBZh069525; Thu, 6 Sep 2012 11:32:14 -0500 (CDT)
	(envelope-from stephen@missouri.edu)
Message-ID: <5048D00B.8010401@missouri.edu>
Date: Thu, 06 Sep 2012 11:32:11 -0500
From: Stephen Montgomery-Smith <stephen@missouri.edu>
User-Agent: Mozilla/5.0 (X11; Linux i686;
	rv:15.0) Gecko/20120827 Thunderbird/15.0
MIME-Version: 1.0
To: Bruce Evans <brde@optusnet.com.au>
References: <5017111E.6060003@missouri.edu> <501C361D.4010807@missouri.edu>
	<20120804165555.X1231@besplex.bde.org>
	<501D51D7.1020101@missouri.edu>
	<20120805030609.R3101@besplex.bde.org>
	<501D9C36.2040207@missouri.edu>
	<20120805175106.X3574@besplex.bde.org>
	<501EC015.3000808@missouri.edu>
	<20120805191954.GA50379@troutmask.apl.washington.edu>
	<20120807205725.GA10572@server.rulingia.com>
	<20120809025220.N4114@besplex.bde.org>
	<5027F07E.9060409@missouri.edu>
	<20120814003614.H3692@besplex.bde.org>
	<50295F5C.6010800@missouri.edu>
	<20120814072946.S5260@besplex.bde.org>
	<50297CA5.5010900@missouri.edu> <50297E43.7090309@missouri.edu>
	<20120814201105.T934@besplex.bde.org>
	<502A780B.2010106@missouri.edu>
	<20120815223631.N1751@besplex.bde.org>
	<502C0CF8.8040003@missouri.edu>
	<20120906221028.O1542@besplex.bde.org>
In-Reply-To: <20120906221028.O1542@besplex.bde.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-numerics@freebsd.org
Subject: Re: Complex arg-trig functions
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
	<freebsd-numerics.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
	<mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
	<mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 06 Sep 2012 16:32:23 -0000

Life is busy, so it will be a few days before I can process most of what 
you have said.  But let me respond to a few of the points.

On 09/06/2012 08:50 AM, Bruce Evans wrote:

> This code in catrig.c seems to be related:
>
> @ /*
> @  * sum_squares(x,y) = x*x + y*y.
> @  * Assumes x*x and y*y will not overflow.
> @  * Assumes x and y are finite.
> @  * Assumes fabs(x) >= DBL_EPSILON.
> @  */
> @ inline static double
> @ sum_squares(double x, double y)
> @ {
> @     /*
> @      * The following code seems to do better than
> @      * return (hypot(x, y) * hypot(x, y));
> @      */
> @ @     if (fabs(y) < MIN_4TH_ROOT)
> @         if ((int)y==0) /* raise inexact */
> @             return (x*x);
> @     return (x*x + y*y);
> @ }
>
> but it isn't really -- it is more like the case of a large difference
> of exponents, so x*x+y*y reduces to x*x (at a smaller exponent difference
> than for hypot(x, y) reducing to |x|).  hypot() sets inexact by returning
> |x|+|y| in that case, but the above avoids using y*y since it would give
> spurious underflow.
>
> Hmm, MIN_4TH_ROOT seems to be logically wrong here, and physically
> wrong for float precision.  In double precision, MIN_4TH_ROOT/DBL_EPSILON
> is ~0.5e-59, but in float precision, MIN_4TH_ROOT/FLT_EPSILON is only
> ~1e-2, so when x = FLT_EPSILON, y = MIN_4TH_ROOT is very far from giving
> a discardable y*y.
>
> The logically correct threshold is something like y/x < 2**-MANT_DIG/2
> or y < 2**-MANT_DIG/2 * EPSILON to avoid the division.  clog*() uses
> essentially y/x < 2**-MANT_DIG for some reason (it actually uses a fuzzy
> check based on exponent differences: kx - ky > MANT_DIG; MANT_DIG/2
> would be a little too small here, but doubling it seems wasteful).
> hypot() uses essentially y/x < 2**-60, where the magic 60 is DBL_MANT_DIG
> plys a safety margin.  hypotf() uses a magic 30.  The non-magic integer
> exponent thresholds work very well in clog*(), but depend on having the
> exponents handy for efficiency, which may be possible via optimization
> for an inline function like the above but takes more source code.

The code is different for catrigf.c for precisely the reason you stated. 
  This is what I was talking about a while back when I said that the 
float code was a little harder than the double and long code.

In the days or weeks to come, I might go over all these kinds of 
conditions over again.  The paper by Hull et al, and the code used by 
boost, is written in such a way that the code for float is the same as 
the code for double.  My code was designed just enough to work, whereas 
they put more thought into it.

> I don't see how the above avoids cancelation problems like the ones in
> clog().  You never subtract 1 from sum_squares(), but there are open-coded
> expressions like '(1-ax)*(1+ax) - ay*ay' to reduce the cancelation error.
> I found that for clog() it was worth using extra precision out to about
> |z| = 4 (or |z| < 1/4) to reduce the cancelation error.  But somehow my
> tests didn't show large errors.  Perhaps because I'm happy with < 4 ulps
> instead of < 1.

Because for clog, when x^2+y^2 is close to one, the real part of clog is 
very close to zero.

Whereas when x^2+y^2 is close to 1, the imaginary part of catanh is 
close to Pi/4.

So even though the absolute error in computing Im(catanh) might be much 
higher than Re(clog), the relative error is comparable.