Date: Sat, 29 Apr 2017 17:54:21 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Steve Kargl
cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions

On Fri, 28 Apr 2017, Steve Kargl wrote:

> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
>>
>> I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.

Comments on this below.

This is all rather over-engineered.  Optimizing these functions is
unimportant compared with finishing cosl() and sinl() and optimizing all
of the standard trig functions better, but we need correctness.  But I
now see many simplifications and improvements:

(1) There is no need for new kernels.  The standard kernels already
handle extra precision using approximations like:

	sin(x+y) ~= sin(x) + (1-x*x/2)*y.

Simply reduce x and write Pi*x = hi+lo.  Then sin(Pi*x) =
__kernel_sin(hi, lo, 1).

I now see how to do the extra-precision calculations without any
multiplications.  For Pi*x, suppose that x is near 1/4.  Choose a nice
constant value x0 as far below 1/4 as works and reduce x further using
this:

	x = x0 + (x - x0)                  // plain '=' means exact
	y0 ~= Pi*x0 = hi0 + lo0            // in extra prec at compile time
	y ~= Pi*(x-x0)                     // small, so doesn't need extra prec
	Pi*x ~= hi0 + lo0 + y = hi1 + lo1  // use 2sum
	sin(Pi*x) ~= __kernel_sin(hi1, lo1, 1).
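To make that concrete, here is a minimal sketch of the reduction (my own
illustration, not a drop-in kernel): real code would hand hi1 and lo1 to
msun's __kernel_sin(hi1, lo1, 1); the stand-in below uses sin(hi) +
cos(hi)*lo, which is the same first-order idea.  x0 = 1/4 is used only
so that hi0 + lo0 is the familiar Pi/4 split; as described above, a nice
value somewhat below 1/4 would be chosen in practice.

#include <math.h>
#include <stdio.h>

/* 2sum (Fast2Sum): a + b = hi + lo exactly, assuming |a| >= |b|. */
static void
fast2sum(double a, double b, double *hi, double *lo)
{
	double s = a + b;

	*lo = b - (s - a);
	*hi = s;
}

static double
sinpi_near_quarter(double x)
{
	static const double x0  = 0.25;			/* placeholder split point */
	static const double hi0 = 0x1.921fb54442d18p-1;	/* Pi*x0 rounded (Pi/4) */
	static const double lo0 = 0x1.1a62633145c07p-55;	/* tail of Pi*x0 */
	double y, hi1, lo1;

	y = M_PI * (x - x0);		/* small, so plain precision is enough */
	fast2sum(hi0, y, &hi1, &lo1);	/* |hi0| >= |y| holds for x near 1/4 */
	lo1 += lo0;			/* now Pi*x ~= hi1 + lo1 */
	return (sin(hi1) + cos(hi1) * lo1);	/* i.e., __kernel_sin(hi1, lo1, 1) */
}

int
main(void)
{
	printf("%.17g %.17g\n", sinpi_near_quarter(0.26), sin(M_PI * 0.26));
	return (0);
}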
But it is better to use a special polynomial approximation in y,
expanded about y0.  We have reduced x to the significantly smaller y,
so the polynomial can have fewer terms and doesn't need any extra
precision except for the additive term.  In infinite precision,

	sin(y+y0) = S0 + S1*y + ...

Write S0 = hi2+lo2 and choose x0 so that lo2 can be replaced by 0
without significant loss of accuracy (see some exp2's for this method).
At worst, lo2 is nonzero and we have to do just 1 extra-precision
operation.  Otherwise, we have a normal polynomial evaluation.

Unlike for sin() near 0, there are even terms and not nice fractional
coefficients, but since y is small there aren't so many terms.  It is
important to avoid needing extra precision for many of the
coefficients.  Higher ones don't need it since y is small, and S0 might
not need it.

The Intel ia64 math library uses the reduction x = y+y0 at several
interval endpoints below Pi/4 for ordinary sin().  Previously, I
couldn't get this method to work as well as using the single large
interval [-Pi/4, Pi/4], since the even terms take a while to calculate
and I didn't understand the hi+lo decomposition methods so well.

Reminder on why cos(x) needs extra precision but sin(x) doesn't when we
use the brute force expansion up to Pi/4: we use cos(x) = 1 - x*x/2 +
... and the only problem is to subtract these terms accurately enough.
There is a problem since Pi/4 is slightly larger than sqrt(1/2), so
x*x/2 can be slightly larger than 1/4.  If it were larger than 1/2,
then the subtraction would lose a full bit to cancellation, so the
extra inaccuracy could be more than 1 ulp, but we want it to be less
than 0.5 ulps so that the combined inaccuracy is less than 1 ulp.
Similarly, when x*x/2 is larger than 1/4 but below 1/2, the extra
inaccuracy can be larger than 0.5 ulps though smaller than 1 ulp.
Above 0.5 is still too high, so we need some extra precision.  This
problem doesn't affect sin(x), since the ratio of the terms is -x*x/6,
so the extra inaccuracy is 3 times smaller and below 1/3 ulps.

For cos(y+y0) ~= C0 + C1*y + C2*y*y, y must be small enough for the
extra inaccuracy from the first addition to be a little smaller than
0.5 ulps.  This is not as easy as I was thinking, since the linear term
is relatively large.  But there is no problem for cos(y+1/2):
cos(y+1/2) ~= 0.88 - 0.48*y, and the ratio of terms with y = Pi/4-1/2
is ~0.16.  However, sin(y+1/2) ~= 0.48 + 0.88*y, so the maximum ratio
of terms is ~0.53 -- too much.
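The coefficients and ratios quoted in the last two paragraphs are easy
to check numerically.  Here is a tiny standalone check (my own
illustration): it evaluates the leading coefficients cos(1/2) and
sin(1/2) and the worst-case y = Pi/4 - 1/2, giving ratios of about 0.16
and 0.52, in line with the figures above.

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double y = M_PI / 4 - 0.5;	/* worst case for a split point of 1/2 */

	/* cos(y+1/2) ~= cos(1/2) - sin(1/2)*y; sin(y+1/2) ~= sin(1/2) + cos(1/2)*y */
	printf("cos(1/2) = %.6f, sin(1/2) = %.6f\n", cos(0.5), sin(0.5));
	printf("cos term ratio |C1*y/C0| = %.3f\n", sin(0.5) * y / cos(0.5));
	printf("sin term ratio |S1*y/S0| = %.3f\n", cos(0.5) * y / sin(0.5));
	return (0);
}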
>> static const float
>> s0hi =  3.14062500e+00f,	/* 0x40490000 */
>> s0lo =  9.67653585e-04f,	/* 0x3a7daa22 */
>> s1   = -5.16771269e+00f,	/* 0xc0a55de7 */
>> s2   =  2.55016255e+00f,	/* 0x402335dd */
>> s3   = -5.99202096e-01f,	/* 0xbf19654f */
>> s4   =  8.10018554e-02f;	/* 0x3da5e44d */
>>
>> static inline float
>> __kernel_sinpif(float x)
>> {
>> 	double s;
>> 	float p, x2;
>> 	x2 = x * x;
>> 	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
>> 	s = x * (x2 * (double)p + s0hi + s0lo);
>> 	return ((float)s);
>> }

I don't like the style being completely different.  Why lower-case
constants?

...

> Well, starting with above and playing the splitting game
> with x, x2, and p.  I've manage to reduce the max ULP in
> the region testd.  Testing against MPFR with sin(pi*x)
> computed in 5*24-bit precision gives

It already cheats by using (double)p, but is limited by using float to
calculate p.  p should have an inaccuracy of not much more than 1 ulp,
and we don't care if this is a little over 1.  Then when we do the
final calculations in either extra precision or using the double hack,
we don't get any more inaccuracy.

Extra precision is not very useful for x2 * p -- doing that
multiplication in only float precision increases the error by just
another half an ulp.  So the errors for that part are 0.5 ulps for x2,
~1 ulp for p and 0.5 for a float multiplication -- 2.0 altogether --
and the double multiplication reduces this to 1.5.  This gets scaled
down as usual by the term ratio x*x/6 ~= 0.2.  0.2 of 1.5 = 0.3 is
below the magic 0.5, but so is 0.2 of 2.0 = 0.4.  But I would prefer
below the magic 0.25.

> MAX ULP: 0.73345101
> Total tested: 33554427
> 0.7 < ULP <= 0.8: 90
> 0.6 < ULP <= 0.7: 23948

You can even reverse this to calculate the maximum of the term ratio
fairly accurately.  The rounding error is 0.233 (the 0.733 total minus
the final 0.5).  We had an error of about 1.5 before the final
addition.  The ratio 0.233/1.5 = 0.16 gives the reduction ratio for the
final addition.  It approximates the term ratio.  (I'm surprised that
it is lower than the term ratio instead of higher.  In fact, I don't
believe the 0.733.  According to the term ratio, the worst case must be
at least 0.800.)

> Exhaustive testing with my older sinpi(x) as the reference
> gives
>
> ./testf -S -m 0x1p-14 -M 0.25 -d -e
> MAX ULP: 0.73345101
> Total tested: 100663296
> 0.7 < ULP <= 0.8: 45
> 0.6 < ULP <= 0.7: 11977

Perhaps it really is 0.733.  I estimated an error of 1 ulp for p, but
its term ratio is much lower than for the first 2 terms -- 20 times
lower from the factorial in the denominator -- so 0.55 ulps would be a
better estimate.  This gives the estimate term_ratio * 1.05 = 0.21
extra ulps, which matches 0.233 almost perfectly.  Now it seems too
perfect to be true -- I usually forget to add up all the half-ulps in
calculations like this.

> The code is slightly slower than my current best kernel.
> sinpif time is 0.0147 usec per call (60 cycles).

It still seems slow :-).  Full sin(x) takes 43 cycles on amd64
(freefall xeon) for x where its extra precision is used (iy != 0), and
calculating only Pi*x in extra precision shouldn't add much to that,
and the simpler range reduction should subtract a lot (sin(x) takes 23
cycles below 2Pi where its range reduction is simple).

> static inline float
> __kernel_sinpif(float x)
> {
> 	float p, phi, x2, x2hi, x2lo, xhi, xlo;
> 	uint32_t ix;
>
> 	x2 = x * x;
> 	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
>
> 	GET_FLOAT_WORD(ix, p);
> 	SET_FLOAT_WORD(phi, (ix >> 14) << 14);
>
> 	GET_FLOAT_WORD(ix, x2);
> 	SET_FLOAT_WORD(x2hi, (ix >> 14) << 14);

I expect that these GET/SET's are the slowest part.  They are quite
fast in float prec, but in double prec on old i386 CPUs compilers
generate bad code which can have penalties of 20 cycles per GET/SET.

Why the strange reduction?  The double shift is just a manual
optimization or pessimization (usually the latter) for clearing low
bits.  Here it is used to clear 14 low bits instead of the usual 12.
This is normally written using just a mask like 0xfffff000, unless you
want a different number of bits in the hi terms for technical reasons.
Double precision can benefit more from asymmetric splitting of terms
since 53 is not divisible by 2; 1 hi term must have less than 26.5 bits
and the other term can hold an extra bit.
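For reference, the conventional mask-based split looks like this (my
own sketch, not from the code under discussion: the memcpy punning
stands in for msun's GET_FLOAT_WORD/SET_FLOAT_WORD macros, and the
12-bit cut and the helper name are only for illustration):

#include <stdint.h>
#include <string.h>

/*
 * Split a float (assumed 32 bits) into hi with the low 12 fraction
 * bits cleared plus an exact remainder lo.  Clearing bits with the
 * mask is equivalent to the (ix >> 12) << 12 double shift.  hi then
 * has at most 12 significant bits, so a product of hi with another
 * 12-bit value is exact in float, and lo = x - hi is exact as well.
 */
static void
split_float(float x, float *hi, float *lo)
{
	uint32_t ix;

	memcpy(&ix, &x, sizeof(ix));	/* GET_FLOAT_WORD(ix, x) */
	ix &= 0xfffff000;		/* clear the 12 low bits */
	memcpy(hi, &ix, sizeof(*hi));	/* SET_FLOAT_WORD(*hi, ix) */
	*lo = x - *hi;
}

For double precision the same idea applies, except that the cut is made
asymmetrically (about 26/27 bits) since 53 is odd, as noted above.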
Here is an example of fairly refined splitting and use of 2sum with it:

X float
X log10f(float x)
X {
X 	struct Float r;
X 	float_t hi, lo;
X 	float fhi;

hi+lo decompositions need float_t in general, for both correctness and
efficiency.

X 	int32_t hx;
X
X 	DOPRINT_START(&x);
X 	k_logf(x, &r);
X 	if (!r.lo_set)
X 		RETURNP(r.hi);
X 	if (sizeof(float_t) > sizeof(float))
X 		RETURNP((invln10_lo + (float_t)invln10_hi) * (r.lo + r.hi));
X 	_2sumF(r.hi, r.lo);
X 	fhi = r.hi;
X 	GET_FLOAT_WORD(hx, fhi);
X 	SET_FLOAT_WORD(fhi, hx & 0xfffff000);

It is hard to do the splitting better than this.  This method subtly
masks bugs if float_t is not used.  On i386/i387 it forces much the
same slow loads and stores that are needed to discard any extra
precision when float_t is not used.

X 	hi = fhi;
X 	lo = r.lo + (r.hi - hi);

The code tries to optimize things with careful assignments between
float_t and float.  This has further subtleties from float_t being
used.  When there is extra precision, float_t should be different from
float, and compilers should not be broken enough to elide conversions
between float and float_t.  The above 2 lines are basic splitting, and
the subtleties are what is needed for this to work.  struct Float is
just hi and lo in a struct, each with type float, but a further
subtlety is that you want the compiler to keep all variables in
registers, with floats in extended precision in the registers; then it
must be arranged that hi+lo decompositions in registers have no extra
precision...

X 	RETURN2P(invln10_hi * hi,
X 	    (invln10_lo + invln10_hi) * lo + invln10_lo * hi);
X }

... but when the hi+lo values are used in expressions like this, we
want the extra precision back.  Here the hi*hi term is exact (that is
the point of the decomposition), and extra precision gives some free
extra accuracy for the rest of the calculation.

>
> 	x2lo = s0lo + x2 * (p - phi) + (x2 - x2hi) * phi;
> 	x2hi *= phi;
>
> 	GET_FLOAT_WORD(ix, x);
> 	SET_FLOAT_WORD(xhi, (ix >> 14) << 14);
> 	xlo = x - xhi;
> 	xlo = xlo * (x2lo + x2hi) + (xlo * s0hi + xhi * x2lo);
>
> 	return (xlo + xhi * x2hi + xhi * s0hi);
> }

Yet another splitting is going to be slow.  This one is necessary.  Are
the first 2 really necessary?  The above analysis might show that they
really aren't -- didn't we get an error of 0.733 ulps (which seems too
low) using both methods?

Otherwise, this code is very similar to my log10f().  k_logf() does
most of the work.  It evaluates logf(x) in extra precision.  log10f(x)
is just C*logf(x) in extra precision.  k_logf() arranges to return the
extra precision.  For sinpi*(), you need to scale the arg instead of
the result.  This is actually easier, since the arg is not in extra
precision and the extra precision is very easy to handle in the kernel
(and even already handled in standard kernels).

The above log10f() runs acceptably fast using a kernel not especially
optimized for it.  Significantly slower than my logf().  Only slightly
faster than fdlibm+cygnus log10f(), but it is much more accurate.

Bruce