From owner-freebsd-numerics@freebsd.org Thu May 18 22:56:41 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B4D4AD71A0F for ; Thu, 18 May 2017 22:56:41 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail109.syd.optusnet.com.au (mail109.syd.optusnet.com.au [211.29.132.80]) by mx1.freebsd.org (Postfix) with ESMTP id 529461D4B for ; Thu, 18 May 2017 22:56:41 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 4FCCBD67DD8; Fri, 19 May 2017 08:56:28 +1000 (AEST) Date: Fri, 19 May 2017 08:56:27 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl cc: freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170518175434.GA74453@troutmask.apl.washington.edu> Message-ID: <20170519074512.M884@besplex.bde.org> References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428175851.A1386@besplex.bde.org> <20170516224618.GA40855@troutmask.apl.washington.edu> <20170517094848.A52247@besplex.bde.org> <20170517180927.GA54431@troutmask.apl.washington.edu> <20170518072636.E951@besplex.bde.org> <20170518014226.GA63819@troutmask.apl.washington.edu> <20170518154820.A8280@besplex.bde.org> <20170518175434.GA74453@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=VbSHBBh9 c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=30vkLdqRwj5xFZukw2YA:9 a=2GCknW8xBUceK4Wk:21 a=Tm1mFJjD2G5J-NMC:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 May 2017 22:56:41 -0000 On Thu, 18 May 2017, Steve Kargl wrote: > On Thu, May 18, 2017 at 04:34:57PM +1000, Bruce Evans wrote: >> On Wed, 17 May 2017, Steve Kargl wrote: >> >>> On Thu, May 18, 2017 at 09:25:58AM +1000, Bruce Evans wrote: >>>> On Wed, 17 May 2017, Steve Kargl wrote: >>> ... >>>>> As such, I've added a comment >>>>> >>>>> /* >>>>> * pi_hi contains the leading 56 bits of a 169 bit approximation for pi. >>>>> */ >>>> >>>> Why 56? 53 + 56 = 109. >>> >>> This is ld128. >> >> ld128 has precision 113. > > Yes. I know. > >>> static const long double >>> pi_hi = 3.14159265358979322702026593105983920e+00L, >>> pi_lo = 1.14423774522196636802434264184180742e-17L; >>> >>> pi_hi has 113/2 = 56 bits. >>> pi_lo has 113 bit. >>> >>> 56 + 113 = 169 >> >> But hi is set intentionally sloppily by converting to double, so it has >> 53 bits instead of the 57 that 56 would go with (57 = 113/2 rounded up). > > A 53-bit entity times a 56-bit entity needs 109-bits for the result. > A 113-bit entity can hold a 109-bit entity. Or am I missing something? It wastes 4 bits. This reader has to guess if this is intentional for magic elsewhere, or just sloppy. Also, the rounded value might naturally have some extra low zero bits. So looking at the bits, it is not easy to see if there are the right number. This is not worth a comment. The reader should trust that the right (maximal) number of bits are used, and only check it approximately. The double precision pi*hi and pi*lo also seemed to be sloppily rounded. My version has less precision (28 low zero bits in pi_hi instead of the minimal 24), while yours has 27 low zero bits in pihi. The explanation for this is something like: - I copied pi_hi from another file with slightly different requirements - the choice makes pi_lo positive but pilo negative. The other file might have need pi_lo positive, or just extra zero bits in pi_hi. Most likely just the latter. - your 27 is apparently from the calculation 53 - 53 / 2 for almost even splitting of x, but the splitting of x is uneven (24 + 29). >>> I'll need to wait until the weekend to study your improved sin_pi. >> .. >> However, both versions fail tests. Min has a max error of 2048 ulps >> and an everage error of 928 ulps and yours has a maximum error of many >> giga-ups for the sign bit wrong and an average error of 561 ulps. The >> enormous errors are probably for denormals or NaNs. The were mostly incomplete setup of the tests. After fixing this, many real bugs were found: - 2 or 3 wrong magic numbers in your version (different ones for sinpi() and sinpil()) - wrong quadrant in my version - incompatible signs in my version (sign errors are reported as value errors in the exa-ulp range, so my tests can't pass with even 1) - the non-use of double_t breaks sinpi() completely, and gives errors of up to about 1.4 ulps (slightly worse than a simple multiplication by M_PI). Using double_t is no better, since the kernels don't support it. hi+lo ends up with extra precision in hi either way, and this gets destroyed on calling the kernel. The tests find this by switching the mode to extra precision to actually test it. The bug would be more obvious using the same algorithm in float precision, since then the default precision would be extra. This can be fixed using STRICT_ASSIGN() or volatile hacks in _2sumF(), but that would turn _2sumF() into back into one of its slow development versions. Options for controlling this were intentionally left out to get an easy to ise API for _2sumF(). The quick fix of declaring hi and lo volatile seems to work, but is probably slower than doing it in _2sumnvF() and is certainly worse without extra precision. Only 1 more strict assignment is needed. The quick fix of compiling with -ffloat-store works. > You're correct that I did not give much consideration to subnormal > number. For float this, can be fixed by > > if (ix < 0x38800000) { /* |x| < 0x1p-14 */ > if (x == 0) > return (x); > SET_FLOAT_WORD(hi, hx & 0xffff0000); > if (ix <= 0x0b7f0001) { /* Avoid spurios underflow. */ > hi *= 0x1p23F; > lo = x * 0x1p23F - hi; > s = (pi_lo + pi_hi) * lo + pi_lo * hi + > pi_hi * hi; > return (s * 0x1p-23F); > } > lo = x - hi; > s = (pi_lo + pi_hi) * lo + pi_lo * hi + > pi_hi * hi; > return (s); > } Far too verbose. The unfixed version was bad enough. > or if you want to avoid the extra if-statement. Simply scale > > if (ix < 0x38800000) { /* |x| < 0x1p-14 */ > if (x == 0) > return (x); > SET_FLOAT_WORD(hi, hx & 0xffff0000); > hi *= 0x1p23F; > lo = x * 0x1p23F - hi; > s = (pi_lo + pi_hi) * lo + pi_lo * hi + > pi_hi * hi; > return (s * 0x1p-23F); > } Better, but this is bogus for float. Just multiply by M_PI in double precision, as is done everywhere else in the function. My version for double precision moves this multiplication to a kernel to defend against n-tuplication, but botches the scaling by splitting x before scaling it: XX static inline double XX __kernel_mpi(double x) XX { XX double hi, lo; XX XX hi = (float)x; XX lo = x - hi; XX return (0x1p1000 * pi_lo * x + 0x1p1000 * pi_hi * lo + XX 0x1p1000 * pi_hi * hi) * 0x1p-1000; XX } XX ... XX if (ix < 0x3fd00000) { /* |x| < 0.25 */ XX if (ix < 0x3e200000) { /* |x| < 2**-29 */ XX if (x == 0) XX return x; XX if ((int)x == 0) /* generate inexact if x != 0 */ XX return __kernel_mpi(x); XX } XX return __kernel_sinpi(x); XX } The hi+lo decomposition can also be done in a kernel. This requires returning hi and lo indirectly, which is fast enough with inlining to make the indirections not actually indirect. My exp kernels return hi, lo, and also (approx log2(x)) indirectly, where hi and lo are closer to the final result. Other exp kernels re-assemble with several variations of (hi + lo) * 2**k and return that directly. My sinpif() now passes all tests and gives the same results as yours, except possibly near 0. The maximum error in float precision is about 0.54 ulps. This is much larger than 0.5009 ulps for sin() and cos(). I have no explanation. The approx. to pi should give only 12 bits of inaccuracy, so the errror shouldn't be much more than 1/1024 more. XX float XX sinpif(float x) XX { XX float s,y,z; XX int32_t hx,ix; XX int n; XX XX GET_FLOAT_WORD(hx,x); XX ix = hx & 0x7fffffff; XX XX if (ix < 0x3e800000) { /* |x| < 0.25 */ XX if (ix < 0x38800000) /* |x| < 2**-14 */ XX if(((int)x)==0) return M_PI*x; /* pi*x with inexact if x != 0 */ XX return __kernel_sindf(M_PI*x); XX } We multiply by M_PI for the usual case, so it is silly not to multiply by it when x is small. XX XX if (ix>=0x7f800000) return x-x; /* Inf or NaN -> NaN */ XX XX if (ix >= 0x4b800000) return (x < 0 ? -0.0F : 0); /* even int -> +-0 */ You said that n1950.pdf documents returning +0 at even integers and -0 at odd intgers. I couldn't find n1950.pdf, and your version doesn't do anything like that. It just returns +0 for all nonnegative integers and -0 for all negative integers, as actually makes sense. The threshold in the above is to classify even integers so as to possibly return +0 for them and -0 as part of the general classification below. It can be reduced to possibly give special cases for odd and even large integers above, as in your versions. XX XX y = fabs(x); XX XX STRICT_ASSIGN(float,z,y+0x1p21F); XX GET_FLOAT_WORD(n,z); /* bits for rounded y (units 0.25) */ XX z -= 0x1p21F; /* y rounded to a multiple of 0.25 */ XX if (z > y) { XX z -= 0.25F; /* adjust to round down */ XX n--; XX } XX n &= 7; /* octant of y mod 2 */ XX y -= z; /* y mod 0.25 */ XX XX if (n >= 3 && n < 7) XX s = -1; XX else XX s = 1; XX if (x < 0) XX s = -s; XX n &= 3; XX if (y == 0 && n == 0) XX return (x < 0 ? -0.0F : 0); I don't like the complications for signs, but they seem to execute fast enough using direct logic. On modern OOE x86, 's' can be calculated in parallel, so it only costs about 4 cycles of latency to assemble with it at the end, and my tests de-emphasize latency a bit too much. XX if (n == 0) XX return s*__kernel_sindf(M_PI*y); XX else if (n == 1) XX return s*__kernel_cosdf(M_PI*(y-0.25F)); XX else if (n == 2) XX return s*__kernel_cosdf(M_PI*y); XX else XX return s*__kernel_sindf(M_PI*(y-0.25F)); XX } Bruce