Date: Mon, 10 Jun 2013 12:32:05 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Steve Kargl
Cc: freebsd-numerics@FreeBSD.org
Subject: Re: Implementation for coshl.
In-Reply-To: <20130610003645.GA16444@troutmask.apl.washington.edu>
Message-ID: <20130610110740.V24058@besplex.bde.org>
List-Id: "Discussions of high quality implementation of libm functions."

On Sun, 9 Jun 2013, Steve Kargl wrote:

> I suspect that there will be some nits with the implementation.

Quite a few gnats :-).
> Anyway, testing gives
>
>  Arch      | Interval            | #calls | Time (s) | Max ULP | Compiler | Value
> -----------+---------------------+--------+----------+---------+----------+-------
>  i386 [1]  | [    0.00:    0.35] |  100M  | 15.0198  | 0.58583 | gcc      |   1
>  i386 [1]  | [    0.35:   24.00] |  100M  | 15.1858  | 1.01504 | gcc      |   2
>  i386 [1]  | [   24.00:11356.52] |  100M  | 12.9591  | 0.51198 | gcc      |   3
>  i386 [1]  | [11356.52:11357.22] |  100M  | 13.3328  | 1.90988 | gcc      |   4
> -----------+---------------------+--------+----------+---------+----------+-------

Quite large errors, unfortunately much the same as in double precision.

>  amd64 [2] | [    0.00:    0.35] |  100M  | 11.7811  | 0.63075 | clang    |   5
>  amd64 [2] | [    0.35:   24.00] |  100M  | 11.0662  | 1.01535 | clang    |   6
>  amd64 [2] | [   24.00:11356.52] |  100M  | 9.97704  | 0.50852 | clang    |   7
>  amd64 [2] | [11356.52:11357.22] |  100M  | 10.8221  | 1.90931 | clang    |   8
> -----------+---------------------+--------+----------+---------+----------+-------

Likely a bug for the ULPs for the first domain to be so different.  I
suspect even much smaller differences are due to bugs.

> ...
> 1. The max ulp for the intervals [0.35:24] and [0.35:41] of 1.xxx is
>    due to the division in the expression half*exp(x) + half/exp(x).

That's the one with the large MD difference.

>    Bruce and I exchanged emails a long time ago about possible ways
>    to reduce the ulp in this range by either computing exp(x) with
>    extra precision or using a table with cosh(x) = cosh(x_i) * cosh(d)
>    + sinh(x_i) * sinh(d) with d = x - x_i.  I tried the latter with
>    disappointing results.

The latter may be good for trig functions, but it is bad for hyperbolic
functions.  It is technically difficult to splice the functions.

>    The former would require a refactoring of
>    s_expl.c into a kernel __kernel_expl(x, hi, lo).  I have no plans on
>    pursuing this at this time.
But you need this soon for __ldexp_exp() and __ldexp_cexp(), which are
needed for hyperbolic functions (the large args are already fixed for
float and double precision) and for cexp() and the complex trig and
hyperbolic functions.  It is much easier to implement these using a
kernel.  I do this only for float precision.

The hyperbolic functions are also much easier with a kernel.  Most of
the thresholds become unnecessary.  Above about |x| = 0.5, all cases
are handled uniformly using code like:

	k_expl(x, &hi1, &lo1, &k1);
	k_expl(-x, &hi2, &lo2, &k2);	/* KISS slow initially */
	/*
	 * Bah, that's too uniform.  I don't want to deal with k.  So
	 * use a threshold for large |x| (case handled by
	 * __ldexp_expl()).  Now for 0.5 <= |x| < thresh:
	 */
	k_expl(x, &hi1, &lo1);
	k_expl(-x, &hi2, &lo2);		/* different API includes 2**k */
	_2sumF(hi1, lo1);		/* a bit sloppy */
	return (0.5 * (lo2 + lo1 + hi1 + hi2));

Error analysis: since |x| >= 0.5, the ratio exp(x)/exp(-x) is >=
exp(1).  Already if we add exp(x) + exp(-x), the error is at most ~1
ulp (certainly less than 2).  But our hi+lo approximations give 6-10
extra bits, so the ~1 ulp error is scaled by 2**-6 or better until the
final addition.

Note that this doesn't need the complexities of expm1l(x) or a kernel
for that.  Adding exp(-x) to exp(x) without increasing the error
significantly is similar to adding -1 to exp(x) (subtracting exp(-x)
for sinh() is even more similar).  The additional complications in
expm1l() are because:
- it wants to give 6-10 extra bits and not lose any relative to expl()
- it cannot assume |x| >= 0.5, so large cancelations can occur.

For |x| <= 0.5, use a poly approx.  0.5 can be reduced significantly
if necessary to get fewer terms in the poly.  This is less needed than
for expm1l(), since the power series about 0 converges much faster for
coshl().

Similarly for sinhl().  I don't know of a similar method for tanhl().
A division seems to be necessary, and hi+lo decompositions only work
well for additions.
> /*
>  * ====================================================
>  * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
>  *
>  * Developed at SunSoft, a Sun Microsystems, Inc. business.
>  * Permission to use, copy, modify, and distribute this
>  * software is freely granted, provided that this notice
>  * is preserved.
>  * ====================================================
>  *
>  * Converted to long double by Steven G. Kargl
>  */

It changes the style completely, so diffs with the double version are
unreadable.

> #if LDBL_MANT_DIG == 64
> ...
> #else
> #error "Unsupported long double format"
> #endif

The complications for ld80/128 and i386 seem reasonable.

> long double
> coshl(long double x)
> {
> 	long double t, w;
> 	uint16_t hx, ix;
>
> 	ENTERI();
>
> 	GET_LDBL_EXPSIGN(hx, x);
> 	ix = hx & 0x7fff;
> 	SET_LDBL_EXPSIGN(x, ix);

This sign frobbing is pessimal, and is not done by the double version.
Signs are better cleared using fabs*() unless you are sure that
clearing them in bits is more optimal.  But don't optimize before
getting it right.  Any optimizations should be made to the double
version first.

> 	/* x is +-Inf or NaN. */
> 	if (ix == BIAS + LDBL_MAX_EXP)
> 		RETURNI(x * x);

The sign frobbing also clobbers the result here.

> 	if (x < log2o2) {

The threshold comparisons are painful and probably inefficient to do
in bits.  However, you have most of the pain of using bits by using
long doubles and LD80C() to declare precise thresholds.  Most or all
of the thresholds are fuzzy and don't need more than float precision
(or maybe just the exponent).  The double version uses a fuzzy
threshold here.  It only tests the upper 21 bits of the mantissa, so
it uses less than float precision.

> 		if (ix < BIAS + EXP_TINY) {	/* |x| < 0x1pEXP_TINY */
> 			/* cosh(x) = 1 exactly iff x = +-0. */
> 			if ((int)x == 0)
> 				RETURNI(1.0L);
> 		}

Unnecessary algorithm change and micro-optimization.  The double
version uses the general case, doing 1+t to set inexact here.
> 		t = expm1l(x);
> 		w = 1 + t;
> 		RETURNI(1 + t * t / (w + w));
> 	}
> ...
> 	if (x < o_threshold2) {
> 		t = expl(half * x);
> 		RETURNI(half * t * t);
> 	}

This is missing use of __ldexp_expl().

Going back to the painful threshold declarations:

> #if LDBL_MANT_DIG == 64
> static const union IEEEl2bits
> #define EXP_TINY -32

Strange placement of the macro in the middle of a declaration.  This
should be simply LDBL_MANT_DIG / 2.

> #define s_threshold 24

I don't understand the magic for this now.  It is not quite
LDBL_MANT_DIG / 3.

> /* log(2) / 2 */
> log2o2u = LD80C(0xb17217f7d1cf79ac, -2, 0.346573590279972654714L),
> #define log2o2 (log2o2u.e)
> /* x = log(LDBL_MAX - 0.5) */
> o_threshold1u = LD80C(0xb17217f7d1cf79ac, 13, 11356.5234062941439497L),
> #define o_threshold1 (o_threshold1u.e)
> /* log(LDBL_MAX - 0.5) + log(2) */
> o_threshold2u = LD80C(0xb174ddc031aec0ea, 13, 11357.2165534747038951L);
> #define o_threshold2 (o_threshold2u.e)
> #elif LDBL_MANT_DIG == 113
> #define EXP_TINY -56
> #define s_threshold 41
> static long double
> log2o2 = 0.346573590279972654708616060729088288L,
> o_threshold1 = 11356.5234062941439494919310779707650L,
> o_threshold2 = 11357.2165534747038948013483100922230L;
> #else
> #error "Unsupported long double format"
> #endif

No need for any long doubles or LD80C()'s.  The double version uses
only 21 bits for all thresholds.  We had to be more careful about the
thresholds for technical reasons in expl().  IIRC, we needed precise
thresholds to do a final filtering step after a fuzzy initial
classification.  This isn't needed here, since we use calculations
that can't give spurious overflow.

Going back a bit more:
- 'huge' is missing a volatile qualifier to work around clang bugs.
  expl() is already out of date relative to exp2l(), since it is
  missing the recent fix to add this qualifier.  This fix is missing
  in the double and float versions of all the hyperbolic functions
  too.
- 'half' doesn't need to be long double.
- the double version has a variable 'one'.
You changed that to '1', but didn't change 'half' to 0.5.

Bruce