Date: Tue, 11 Jun 2013 01:43:17 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: David Schultz <das@FreeBSD.org>
Cc: freebsd-numerics@FreeBSD.org, Steve Kargl <sgk@troutmask.apl.washington.edu>, Bruce Evans <brde@optusnet.com.au>
Subject: Re: Implementation for coshl.
Message-ID: <20130610235038.D790@besplex.bde.org>
In-Reply-To: <20130610055834.GA68643@zim.MIT.EDU>
References: <20130610003645.GA16444@troutmask.apl.washington.edu> <20130610110740.V24058@besplex.bde.org> <20130610055834.GA68643@zim.MIT.EDU>


On Sun, 9 Jun 2013, David Schultz wrote:

> On Mon, Jun 10, 2013, Bruce Evans wrote:
>> On Sun, 9 Jun 2013, Steve Kargl wrote:
>
> This is fine for now.  We might eventually want to improve the
> accuracy of the float, double, and long double implementations,
> but that ought to come later.  It's much easier to start with
> float and double when making improvements in the algorithm.

Yes, it would have been better to improve them before cloning their bugs.

>>> The former would require a refactoring of
>>> s_expl.c into a kernel __kernel_expl(x, hi, lo).  I have no plans on
>>> pursuing this at this time.
>>
>> But you need this soon for __ldexp_exp() and __ldexp_cexp(), which are
>> needed for hyperbolic functions (the large args already fixed for float
>> and double precision) and for cexp() and complex trig and hyperbolic
>> functions.  It is much easier to implement these using a kernel.  I do
>> this only for float precision.
>
> That's fine for now.  We need a better k_exp, too.  I believe
> Bruce was working on one.

This is especially easy for long double precision, since expl() only has
1 return that needs to be changed:

@ 	/* Evaluate expl(endpoint[n2] + r1 + r2) = tbl[n2] * expl(r1 + r2). */
@ 	z = r * r;
@ 	q = r2 + z * (A2 + r * A3) + z * z * (A4 + r * A5) + z * z * z * A6;
@ 	t = (long double)tbl[n2].lo + tbl[n2].hi;
@ 	t = tbl[n2].lo + t * (q + r1) + tbl[n2].hi;

Return the hi and lo parts tbl[n2].hi and tbl[n2].lo + t * (q + r1) here
and add them up in callers.

@
@ 	/* Scale by 2**k. */
@ 	if (k >= LDBL_MIN_EXP) {
@ 		if (k == LDBL_MAX_EXP)
@ 			RETURNI(t * 2 * 0x1p16383L);
@ 		RETURNI(t * twopk);
@ 	} else {
@ 		RETURNI(t * twopkp10000 * twom10000);
@ 	}

Return k and/or twopk here and multiply by them in callers.  Remove all
other returns (don't support special values in the kernel).  The only
difficulty is that twopk doesn't give the full power for cases near or
after overflow and underflow, and those cases are precisely the critical
ones for __ldexp_[c]exp*().
I have this sort of working for sinhf() in float precision, for the range
[1, 9+].  There is no problem with twopk in this range, and I multiply by
it in the kernel and don't return it.  A simplified version of adding up
the values in the caller is:

	_k_exp(fabs(x), &hi1, &lo1);
	_k_exp(-fabs(x), &hi2, &lo2);
	return h * (lo2 + lo1 + hi2 + hi1);

This summation is sloppy and loses about hi2/hi1 ulps unnecessarily.
hi2/hi1 <= exp(-2*|x|) < exp(-2), so this loss is never absolutely large,
and is only relatively large if exp() is very accurate.  This error is in
addition to the error in exp(|x|) ~= lo1 + hi1.

> I have a few comments on the code... mostly a subset of Bruce's
> comments:
> ...
> - I think the way you handle the tiny-x case is an improvement
>   over the double version.  However, float-to-int conversion is
>   very slow on x86, so you might instead try something like
>   RETURNI(1.0L + tiny), where tiny is volatile and e.g., 0x1p-1000.
>   That will also return the correct result in all rounding modes.

No, it is an unimprovement.  It micro-optimizes for an unusual case, and
breaks the rounding in that case.  We were careful about this in expl()
and expm1l().  expl(x) returns 1+x, and expm1l(x) returns a very
complicated expression to get the rounding right for x + x*x/2.  This is
quite difficult for expm1l() since the signs alternate.  In the
complicated expression, we needed an explicit test for x == 0, since the
rest of the complicated expression would mess up the sign for +-0 in some
rounding modes.

This particular float-to-int conversion isn't very slow on i386.  In
previous lectures, I explained why it is very efficient even on (out of
order execution) i386's: no operand depends on its result, so it can be
and is done in parallel.  It is magic complicated expressions that are
inefficient, since even a simple expression that uses the arg gives 1
dependency and complicated ones give more.

I tried to get the rounding for values of +-0 and +-pi/2 right in
catrig*.c.
But fdlibm generally isn't very careful about this.  I noticed the
following types of errors:
- expressions using pi/2 in such a way that it doesn't work right in all
  rounding modes.  Returning a const pi/2 can never work in all rounding
  modes, so expressions like pi/2 + tiny should be used
- often, tiny isn't volatile, so compiler bugs prevent pi/2 + tiny
  working
- sometimes, optimization broke pi/2 + tiny by replacing it by pi/2
- often, pi/2 is written as pio2_hi + pio2_lo.  This (and pi for pi/2)
  should be rounded down so that adding tiny to it works right.  IIRC,
  rounding to nearest gives the same result as rounding down in double
  precision, so everyone gets the same result, but in float precision
  rounding to nearest rounds up.  This delicacy was not understood when
  the functions were converted to float precision, so there are some
  misrounded splittings of pi/2.  Now I'm not sure if any fixed splitting
  can work in all rounding modes.  The complications from double rounding
  are especially large since compile-time rounding of the constant is
  mixed with runtime rounding of expressions using it.

Typical code using this from e_asin.c:

	if(((ix-0x3ff00000)|lx)==0)
		/* asin(1)=+-pi/2 with inexact */
		return  x*pio2_hi+x*pio2_lo;

The comment is bad and should say asin(-1) too (for asin(1) alone, we
wouldn't use x in the expression).  The result of this is
pio2_hi + pio2_lo for x = 1 and -pio2_hi - pio2_lo for x = -1.  This sets
inexact but seems to be quite broken for non-default rounding modes.  It
seems too hard to round to pi/2 correctly in all cases (unless delicate
rounding in the splitting does it), but the above breaks even asin()
being an odd function when the rounding mode is towards +-infinity.  Most
odd functions avoid this problem by calculating func(|x|) and adjusting
the sign at the end.
The bugs in this were moved by dubious optimizations in e_asinf.c:

	if(ix==0x3f800000)		/* |x| == 1 */
	    return x*pio2;		/* asin(+-1) = +-pi/2 with inexact */

Now asinf() is odd at +-1 although not for general args; there is no
longer any chance of correct rounding at +-1 in all rounding modes;
inexact is no longer set in the code, but is still claimed to be set in
the comment.  I implemented most of these bugs :-(.  Fortunately they
were not copied to the long double version.

- constant folding of pio2_hi + pio2_lo may cause problems even without
  compiler bugs.  The above expression with x's instead of +-1's
  hopefully prevents compilers reducing to constant expressions and then
  folding.  If we wrote the classification as (fabs(x) == 1), then the
  optimizer wouldn't have to be very smart to notice that x is 1 or -1
  and do the folding.  That would only be an invalid optimization with
  FENV on.

Bruce
