From owner-freebsd-numerics@FreeBSD.ORG  Mon Jun 10 19:14:54 2013
Return-Path: <owner-freebsd-numerics@FreeBSD.ORG>
Delivered-To: freebsd-numerics@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 9285C585;
 Mon, 10 Jun 2013 19:14:54 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail106.syd.optusnet.com.au (mail106.syd.optusnet.com.au
 [211.29.132.42])
 by mx1.freebsd.org (Postfix) with ESMTP id 41E0A1BF2;
 Mon, 10 Jun 2013 19:14:53 +0000 (UTC)
Received: from c122-106-156-23.carlnfd1.nsw.optusnet.com.au
 (c122-106-156-23.carlnfd1.nsw.optusnet.com.au [122.106.156.23])
 by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id B7BDA3C0EA5;
 Tue, 11 Jun 2013 05:14:51 +1000 (EST)
Date: Tue, 11 Jun 2013 05:14:46 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: David Schultz <das@freebsd.org>
Subject: Re: Implementation for coshl.
In-Reply-To: <20130610170603.GA72597@zim.MIT.EDU>
Message-ID: <20130611040059.A17928@besplex.bde.org>
References: <20130610003645.GA16444@troutmask.apl.washington.edu>
 <20130610110740.V24058@besplex.bde.org> <20130610055834.GA68643@zim.MIT.EDU>
 <20130610162927.GA20856@troutmask.apl.washington.edu>
 <20130610170603.GA72597@zim.MIT.EDU>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.0 cv=Q6eKePKa c=1 sm=1 a=LM0AswAWfpYA:10
 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=MtpAjnDH-vAA:10
 a=HMF5KdcwFu_V6rpZWRkA:9 a=CjuIK1q_8ugA:10 a=ebeQFi2P/qHVC0Yw9JDJ4g==:117
Cc: freebsd-numerics@freebsd.org,
 Steve Kargl <sgk@troutmask.apl.washington.edu>
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 10 Jun 2013 19:14:54 -0000

On Mon, 10 Jun 2013, David Schultz wrote:

> On Mon, Jun 10, 2013, Steve Kargl wrote:
>> On Sun, Jun 09, 2013 at 10:58:34PM -0700, David Schultz wrote:
> ...
>>> - I think the way you handle the tiny-x case is an improvement
>>>   over the double version.  However, float-to-int conversion is
>>>   very slow on x86, so you might instead try something like
>>>   RETURNI(1.0L + tiny), where tiny is volatile and e.g., 0x1p-1000.
>>>   That will also return the correct result in all rounding modes.
>>
>> I originally had
>>
>>                 if (ix < BIAS - 33) {   /* |x| < 0x1p-33 */
>>                         /* cosh(+-0) = 1. */
>>                         if (x == 0)
>>                                 return (one);
>>                         /* cosh(tiny) = one */
>>                         return (one + tiny * tiny);
>>                 }
>>
>> because C99 requires cosh(+-0) to be exact.  For |x| != the
>> return value will raise inexact.
>
> Aah, right, what I proposed won't handle 0 correctly.  In that
> case, I think what you have is fine; it should handle everything
> except for the rounding mode issue, which we don't really support
> anyway.  Alternatively, you could just do what fdlibm does, which
> is hard to argue with given that it has been acceptable for
> several decades.

No, the above is quite broken.  tiny * tiny gives underflow, and
when the underflow is to 0, adding 1 to it doesn't give inexact.
Also, the 'tiny' in the comment is confusing, since it refers to
a different variable than the 'tiny' in the code.

Just copy the known-good fdlibm code.  It uses one+t, where t is
expm1(fabs(x)).  This has many subtleties:
- it avoids pessimizing the usual case
- it moves all the complications for setting inexact to expm1().  These
   include:
   - the expression 1+x doesn't work for tiny x since it rounds incorrectly
     in some modes.  It also requires x >= 0 for correct rounding, so
     we may have to caclulate |x| earlier than we want.
   - the expression 1+x*x doesn't work for tiny x since it may underflow
     (just like 1+tiny*tiny)
   - the expression x doesn't work for tiny x in expm1() since it doesn't
     set inexact.
   - the expression x+x*x doesn't work for tiny x in expm1() since it may
     underflow.

fdlibm expm1() avoids the last 2 problems using the strange expressions
't = huge+x; return x - (t-(huge+x));'.  That is unlikely to be best,
but it does avoid a branch for testing x == 0.  We may have looked at
this before.  If so, we didn't like it, and use a different rather
strange expression in expm1l(), and we still need a branch.  fdlibm
itself uses a different method for the same problem in log1p().  It
uses 'if (two54+x>zero && ax<3c90000)...'.  I didn't like that, and
used the more common ((int)x == 0) method to set inexact after classifying
x as being tiny using a test like (ax<3c90000).  Benchmarks showed that
the ((int)x = 0) works well.
   I just thought of another possible reason for this: on i386 the cast
   requires large code with an fenv switch.  Perhaps the compiler moves
   the code out of the way, the same as if we used a __predict_false().
   This is good since the tiny-arg case is special.

   The ENTERI() and RETURNI() macros generate similar large code, but
   much larger and uglier because of excessive carew taken to avoid
   trapping in fpsetprec().  gcc generates almost 100 bytes per macro
   invocation, so average small functions explode in object code if
   they have many returns in them.  But compilers seem to do a good
   job of moving this code out of the way.

>>> - Please check whether volatile hacks are needed to induce the
>>>   compiler to do the right thing.  I'm increasingly finding that
>>>   these are needed with clang.
>>
>> I suppose I need to either look at the generated assembly or
>> add testing for signals to my test programs.  How do you go
>> about looking at this?
>
> #include <fenv.h>
> volatile long double x = ..., y;
> feclearexcept(FE_ALL_EXCEPT);
> y = coshl(x);
> printf("%x\n", fetestexcept(FE_ALL_EXCEPT));
>
> The main cases to check are:
>
> - cosh(+-0) = 0, no exceptions
> - cosh(large) = inf, overflow exception
>  (On some architectures, this also raises inexact, which is allowed.)
> - cosh(inf or qnan) raises no exceptions
> - cosh(anything else) raises an inexact exception

Or in a debugger, put breakpoints at interesting points and look at
the exception mask.  This is easier on i386, or with long doubles.
"info float" decodes the i387 exception mask.  "p $mxcsr" displays
the SSE mask.  Later versions of gdb decode it too.  I don't know any
easy way to write to the i387 exception mask in gdb.  $mxcsr can be
assigned to.

I now have the kernel expf working for sinhf like I expected.  It worked
better when didn't corrupt hi-lo to hi_lo, doh.

To emulate the eventual high-precision version, I use this:

% void
% k_expf(float x, float *hip, float *lop)
% {
%     long double r;
% 
%     r = expl(x);
%     *hip = r;
%     *lop = r - *hip;
% }

This gives 24+24 bits of accuracy, and sinhf() ends up being perfectly
rounded in the range where it uses this.  A better enulation would
destroy many low bits to get only about 24+6 bits of accuracy.

The version created from expf()'s internals has about 24+3 bits.

Then in sinhf(x), for 1 <= x <= 12:

% 	    float hi1, hi2, lo1, lo2;
% 	    extern void k_expf(float x, float *hip, float *lop);
% 
% 	    t = fabsf(x);
% 	    k_expf(t, &hi1, &lo1);

Returning h * (lo1 + 1 / (hi1 + lo1) + hi1) as this point works OK.
Then the final error is under 1 ulp.  Without this, sophisticated
expressions were needed to keep it under 2 ulps.

% 	    k_expf(-t, &hi2, &lo2);
% 	    hi2 = -hi2;
% 	    lo2 = -lo2;
% 	    _2sumF(hi1, hi2);
% 	    return h * (lo1 + hi2 + lo2 + hi1);

Adding up the terms like this retains almost all the accuracy provided
by the kernel.  Calculating expf(-x) as 1/expf(x) gives an extra
inaccuracy of about 1/exp(2) = 0.135 ulps.

Efficiency is currently too low when there are 2 calls to non-inline
kernels, so this method is slightly slower than the old method
(except using the faster expl() kernel on non-i386.  On i386 it and
also i387 exp are too slow due to non-null precision switches).  The
correct fix is unclear.  Hopefully just speed up the kernels.

Bruce