From owner-freebsd-numerics@FreeBSD.ORG Mon Jun 10 19:14:54 2013 Return-Path: Delivered-To: freebsd-numerics@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 9285C585; Mon, 10 Jun 2013 19:14:54 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail106.syd.optusnet.com.au (mail106.syd.optusnet.com.au [211.29.132.42]) by mx1.freebsd.org (Postfix) with ESMTP id 41E0A1BF2; Mon, 10 Jun 2013 19:14:53 +0000 (UTC) Received: from c122-106-156-23.carlnfd1.nsw.optusnet.com.au (c122-106-156-23.carlnfd1.nsw.optusnet.com.au [122.106.156.23]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id B7BDA3C0EA5; Tue, 11 Jun 2013 05:14:51 +1000 (EST) Date: Tue, 11 Jun 2013 05:14:46 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: David Schultz Subject: Re: Implementation for coshl. In-Reply-To: <20130610170603.GA72597@zim.MIT.EDU> Message-ID: <20130611040059.A17928@besplex.bde.org> References: <20130610003645.GA16444@troutmask.apl.washington.edu> <20130610110740.V24058@besplex.bde.org> <20130610055834.GA68643@zim.MIT.EDU> <20130610162927.GA20856@troutmask.apl.washington.edu> <20130610170603.GA72597@zim.MIT.EDU> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=Q6eKePKa c=1 sm=1 a=LM0AswAWfpYA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=MtpAjnDH-vAA:10 a=HMF5KdcwFu_V6rpZWRkA:9 a=CjuIK1q_8ugA:10 a=ebeQFi2P/qHVC0Yw9JDJ4g==:117 Cc: freebsd-numerics@freebsd.org, Steve Kargl X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 10 Jun 2013 19:14:54 -0000 On Mon, 10 Jun 2013, David Schultz wrote: > On Mon, Jun 10, 2013, Steve Kargl wrote: >> On Sun, Jun 09, 2013 at 10:58:34PM -0700, David Schultz wrote: > ... >>> - I think the way you handle the tiny-x case is an improvement >>> over the double version. However, float-to-int conversion is >>> very slow on x86, so you might instead try something like >>> RETURNI(1.0L + tiny), where tiny is volatile and e.g., 0x1p-1000. >>> That will also return the correct result in all rounding modes. >> >> I originally had >> >> if (ix < BIAS - 33) { /* |x| < 0x1p-33 */ >> /* cosh(+-0) = 1. */ >> if (x == 0) >> return (one); >> /* cosh(tiny) = one */ >> return (one + tiny * tiny); >> } >> >> because C99 requires cosh(+-0) to be exact. For |x| != the >> return value will raise inexact. > > Aah, right, what I proposed won't handle 0 correctly. In that > case, I think what you have is fine; it should handle everything > except for the rounding mode issue, which we don't really support > anyway. Alternatively, you could just do what fdlibm does, which > is hard to argue with given that it has been acceptable for > several decades. No, the above is quite broken. tiny * tiny gives underflow, and when the underflow is to 0, adding 1 to it doesn't give inexact. Also, the 'tiny' in the comment is confusing, since it refers to a different variable than the 'tiny' in the code. Just copy the known-good fdlibm code. It uses one+t, where t is expm1(fabs(x)). This has many subtleties: - it avoids pessimizing the usual case - it moves all the complications for setting inexact to expm1(). These include: - the expression 1+x doesn't work for tiny x since it rounds incorrectly in some modes. It also requires x >= 0 for correct rounding, so we may have to caclulate |x| earlier than we want. - the expression 1+x*x doesn't work for tiny x since it may underflow (just like 1+tiny*tiny) - the expression x doesn't work for tiny x in expm1() since it doesn't set inexact. - the expression x+x*x doesn't work for tiny x in expm1() since it may underflow. fdlibm expm1() avoids the last 2 problems using the strange expressions 't = huge+x; return x - (t-(huge+x));'. That is unlikely to be best, but it does avoid a branch for testing x == 0. We may have looked at this before. If so, we didn't like it, and use a different rather strange expression in expm1l(), and we still need a branch. fdlibm itself uses a different method for the same problem in log1p(). It uses 'if (two54+x>zero && ax<3c90000)...'. I didn't like that, and used the more common ((int)x == 0) method to set inexact after classifying x as being tiny using a test like (ax<3c90000). Benchmarks showed that the ((int)x = 0) works well. I just thought of another possible reason for this: on i386 the cast requires large code with an fenv switch. Perhaps the compiler moves the code out of the way, the same as if we used a __predict_false(). This is good since the tiny-arg case is special. The ENTERI() and RETURNI() macros generate similar large code, but much larger and uglier because of excessive carew taken to avoid trapping in fpsetprec(). gcc generates almost 100 bytes per macro invocation, so average small functions explode in object code if they have many returns in them. But compilers seem to do a good job of moving this code out of the way. >>> - Please check whether volatile hacks are needed to induce the >>> compiler to do the right thing. I'm increasingly finding that >>> these are needed with clang. >> >> I suppose I need to either look at the generated assembly or >> add testing for signals to my test programs. How do you go >> about looking at this? > > #include > volatile long double x = ..., y; > feclearexcept(FE_ALL_EXCEPT); > y = coshl(x); > printf("%x\n", fetestexcept(FE_ALL_EXCEPT)); > > The main cases to check are: > > - cosh(+-0) = 0, no exceptions > - cosh(large) = inf, overflow exception > (On some architectures, this also raises inexact, which is allowed.) > - cosh(inf or qnan) raises no exceptions > - cosh(anything else) raises an inexact exception Or in a debugger, put breakpoints at interesting points and look at the exception mask. This is easier on i386, or with long doubles. "info float" decodes the i387 exception mask. "p $mxcsr" displays the SSE mask. Later versions of gdb decode it too. I don't know any easy way to write to the i387 exception mask in gdb. $mxcsr can be assigned to. I now have the kernel expf working for sinhf like I expected. It worked better when didn't corrupt hi-lo to hi_lo, doh. To emulate the eventual high-precision version, I use this: % void % k_expf(float x, float *hip, float *lop) % { % long double r; % % r = expl(x); % *hip = r; % *lop = r - *hip; % } This gives 24+24 bits of accuracy, and sinhf() ends up being perfectly rounded in the range where it uses this. A better enulation would destroy many low bits to get only about 24+6 bits of accuracy. The version created from expf()'s internals has about 24+3 bits. Then in sinhf(x), for 1 <= x <= 12: % float hi1, hi2, lo1, lo2; % extern void k_expf(float x, float *hip, float *lop); % % t = fabsf(x); % k_expf(t, &hi1, &lo1); Returning h * (lo1 + 1 / (hi1 + lo1) + hi1) as this point works OK. Then the final error is under 1 ulp. Without this, sophisticated expressions were needed to keep it under 2 ulps. % k_expf(-t, &hi2, &lo2); % hi2 = -hi2; % lo2 = -lo2; % _2sumF(hi1, hi2); % return h * (lo1 + hi2 + lo2 + hi1); Adding up the terms like this retains almost all the accuracy provided by the kernel. Calculating expf(-x) as 1/expf(x) gives an extra inaccuracy of about 1/exp(2) = 0.135 ulps. Efficiency is currently too low when there are 2 calls to non-inline kernels, so this method is slightly slower than the old method (except using the faster expl() kernel on non-i386. On i386 it and also i387 exp are too slow due to non-null precision switches). The correct fix is unclear. Hopefully just speed up the kernels. Bruce