From owner-freebsd-numerics@FreeBSD.ORG  Mon Jun 10 15:43:28 2013
Return-Path: <owner-freebsd-numerics@FreeBSD.ORG>
Delivered-To: freebsd-numerics@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 5D5C4C9B;
 Mon, 10 Jun 2013 15:43:28 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail36.syd.optusnet.com.au (mail36.syd.optusnet.com.au
 [211.29.133.76])
 by mx1.freebsd.org (Postfix) with ESMTP id D7DFC1F74;
 Mon, 10 Jun 2013 15:43:26 +0000 (UTC)
Received: from c122-106-156-23.carlnfd1.nsw.optusnet.com.au
 (c122-106-156-23.carlnfd1.nsw.optusnet.com.au [122.106.156.23])
 by mail36.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id r5AFhHfe009078
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Tue, 11 Jun 2013 01:43:18 +1000
Date: Tue, 11 Jun 2013 01:43:17 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: David Schultz <das@FreeBSD.org>
Subject: Re: Implementation for coshl.
In-Reply-To: <20130610055834.GA68643@zim.MIT.EDU>
Message-ID: <20130610235038.D790@besplex.bde.org>
References: <20130610003645.GA16444@troutmask.apl.washington.edu>
 <20130610110740.V24058@besplex.bde.org> <20130610055834.GA68643@zim.MIT.EDU>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.0 cv=K8x6hFqI c=1 sm=1 a=LM0AswAWfpYA:10
 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=MtpAjnDH-vAA:10
 a=o_Yc0WgG-dlwfkC-MugA:9 a=CjuIK1q_8ugA:10 a=ebeQFi2P/qHVC0Yw9JDJ4g==:117
Cc: freebsd-numerics@FreeBSD.org,
 Steve Kargl <sgk@troutmask.apl.washington.edu>,
 Bruce Evans <brde@optusnet.com.au>
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 10 Jun 2013 15:43:28 -0000

On Sun, 9 Jun 2013, David Schultz wrote:

> On Mon, Jun 10, 2013, Bruce Evans wrote:
>> On Sun, 9 Jun 2013, Steve Kargl wrote:
>
> This is fine for now.  We might eventually want to improve the
> accuracy of the float, double, and long double implementations,
> but that ought to come later.  It's much easier to start with
> float and double when making improvements in the algorithm.

Yes, it would have been better to improve them before cloning their
bugs.

>>> The former would require a refactoring of
>>> s_expl.c into a kernel __kernel_expl(x, hi, lo).  I have no plans on
>>> pursuing this at the this time.
>>
>> But you need this soon for __ldexp_exp() and __ldexp_cexp(), which are
>> needed for hyperbolic functions (the large args already fixed for float
>> and double precision) and for cexp() and trig and hyperbolic complex
>> functions.  It is much easier to implement these using a kernel.  I do
>> this only for float precision.
>
> That's fine for now.  We need a better k_exp, too.  I believe
> Bruce was working on one.

This is especially easy for long double precision, since expl() only
has 1 return that needs to be changed:

@ 	/* Evaluate expl(endpoint[n2] + r1 + r2) = tbl[n2] * expl(r1 + r2). */
@ 	z = r * r;
@ 	q = r2 + z * (A2 + r * A3) + z * z * (A4 + r * A5) + z * z * z * A6;
@ 	t = (long double)tbl[n2].lo + tbl[n2].hi;
@ 	t = tbl[n2].lo + t * (q + r1) + tbl[n2].hi;

Return the hi and low parts tbl[n2].hi and tbl[n2].lo + t * (q + r1) here
and add them up in callers.

@ 
@ 	/* Scale by 2**k. */
@ 	if (k >= LDBL_MIN_EXP) {
@ 		if (k == LDBL_MAX_EXP)
@ 			RETURNI(t * 2 * 0x1p16383L);
@ 		RETURNI(t * twopk);
@ 	} else {
@ 		RETURNI(t * twopkp10000 * twom10000);
@ 	}

Return k and/or twopk here and multiply by them in callers.

Remove all other returns (don't support special values in the kernel).

The only difficulty is that twopk doesn't give the full power for cases
near of after overflow and underflow, and those cases are precisely the
critical ones for __ldexp_[c]exp*().

I have this sort of working for sinhf() in float precision, for the
range [1, 9+].  There is no problem with twopk in this range, and I
multiply by it in the kernel and don't return it.  A simplified
version of adding up the values in the caller is:

 	_k_exp(fabs(x), &hi1, &lo1);
 	_k_exp(-fabs(x), &hi2, &lo2);
 	return h * (lo2 + lo1 + hi2 + hi1);

This summation is sloppy and loses about hi2/hi1 ulps unnecessarily.
hi2/hi1 <= exp(-2*|x|) < exp(-2) so this loss is never absolutely
large, and is only relatively large if exp() it very accurate.
This error is in addition to the error in exp(|x|) ~= lo1 + h1.

> I have a few comments on the code... mostly a subset of Bruce's
> comments:
> ...
> - I think the way you handle the tiny-x case is an improvement
>  over the double version.  However, float-to-int conversion is
>  very slow on x86, so you might instead try something like
>  RETURNI(1.0L + tiny), where tiny is volatile and e.g., 0x1p-1000.
>  That will also return the correct result in all rounding modes.

No, it is an unimprovement.  It micro-optimizes for an unusual case,
and breaks the rounding in that case.  We were careful about this in
expl() and expm1l().  expl(x) returns 1+x, and expm1l(x) returns a
very complicated expression to get the rounding right for x + x*x/2.
This is quite difficult for expm1l() since the signs alternate.  In
the complicated expression, we needed an explicit test for x == 0,
since the rest of the complicated expression would mess up the signed
for +-0 in some rounding modes.

This particular float-to-int conversion isn't very slow on i386.  In
previous lectures, I explained why it is very efficient even on (out
of order execution) i386's: no operand depend on its result, so it can
be and is done in parallel.  It is magic complicated expressions that
are inefficient, since even a simple expression that uses the arg gives
1 dependency and complicated ones give more.

I tried to get the rounding for values of +-0 and +-pi/2 right in
catrig*.c.  But fdlibm generally isn't very careful about this.  I
noticed the following types of errors:
- expressions using pi/2 in such a way that it doesn't work right in
   all rounding modes.  Returning a const pi/2 can never work in all
   rounding modes, so exprssions like pi/2 + tiny should be used
- often, tiny isn't volatile, so compiler bugs prevent pi/2 + tiny
   working
- sometimes, optimization broke pi/2 + tiny by replacing it by pi/2
- often, pi/2 is written as pio2_hi + pio2_lo.  This (and pi for pi/2)
   should be rounded down so that adding tiny to it works right.  IIRC,
   rounding to nearest gives the same result as rounding down in double
   precision, so everyone gets the same result, but in float precision
   rounding to nearest rounds up.  This delicacy was not understood when
   the functions were converted to float precision, so there are some
   misrounded spittings of pi/2.  Now I'm not sure if any fixed splitting
   can work in all rounding modes.  The complications from double rounding
   are espcially large since compile-time rounding of the constant is mixed
   with runtime rounding of expressions using it.

   Typical code using this from e_asin.c:

 	    if(((ix-0x3ff00000)|lx)==0)
 		    /* asin(1)=+-pi/2 with inexact */
 		return x*pio2_hi+x*pio2_lo;

   The comment is bad and should say asin(-1) (for asin(1), we wouldn't
   use x in the expression).  The result of this is pio2_hi + pio2_lo
   for x = 1 and -pio2_hi - pio2_lo for x = -1.  This sets inexact but
   seems to be quite broken for non-default rounding modes.  It seems
   too hard to round to pio/2 correctly in all cases (unless delicate
   rounding in the splitting does it), but the aboves breaks even asin()
   being an odd function when the rounding mode is towards +-infinity.
   Most odd functions avoid this problem by calculating func(|x|) and
   adjusting the sign at the end.

   The bugs in this were moved by dubious optimizations in e_asinf.c:

 	    if(ix==0x3f800000)		/* |x| == 1 */
 		return x*pio2;		/* asin(+-1) = +-pi/2 with inexact */

   Now asinf() is odd at +-1 although not for general args; there is no
   longer any chance of correct rounding at +-1 in all rounding modes;
   inexact is no longer set in the code, but is still claimed to be set
   in the comment.  I implemented most of these bugs :-(.  Fortunately
   they were not copied to the long double version.

- constant folding of pio2_hi + pio2_lo may cause problems even without
   compiler bugs.  The above expression with x's instead of +-1's hopefully
   prevents compilers reducing to constant expressions and then folding.
   If we wrote the classification as (fabs(x) == 1), then the optimizer
   wouldn't have to be very smart to notice that x is 1 or -1 and doing
   the folding.  That would only be an invalid optimization with FENV on.

Bruce