Date: Sat, 29 Apr 2017 20:19:23 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Bruce Evans
cc: Steve Kargl, freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trigonometric functions
In-Reply-To: <20170429151457.F809@besplex.bde.org>
Message-ID: <20170429194239.P3294@besplex.bde.org>

On Sat, 29 Apr 2017, Bruce Evans wrote:

> On Fri, 28 Apr 2017, Steve Kargl wrote:
>
>> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
>>>
>>> I was just backtracking with __kernel_sinpi. This gets a max ULP < 0.61.
>
> Comments on this below.
>
> This is all rather over-engineered. Optimizing these functions is
> unimportant compared with finishing cosl() and sinl() and optimizing
> all of the standard trig functions better, but we need correctness.
> But I now see many simplifications and improvements:
>
> (1) There is no need for new kernels. The standard kernels already
> handle extra precision using approximations like:
>
>         sin(x+y) ~= sin(x) + (1-x*x/2)*y.
>
> Simply reduce x and write Pi*x = hi+lo. Then
>
>         sin(Pi*x) = __kernel_sin(hi, lo, 1).
>
> I now see how to do the extra-precision calculations without any
> multiplications. But that is over-engineered too.
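The hi+lo approach depends on the splitting arithmetic being exact.
Here is a standalone sketch that checks the two exactness properties
the implementation below relies on; the value of x is an arbitrary
illustration, and strict IEEE binary64 evaluation (e.g., SSE2, not
i387 extended precision) is assumed:

XX #include <math.h>
XX #include <stdio.h>
XX
XX static const double
XX pi_hi = 3.1415926218032837e+00,         /* 25 significant bits */
XX pi_lo = 3.1786509547050787e-08;         /* ~(Pi - pi_hi), extra bits */
XX
XX int
XX main(void)
XX {
XX         double x = 0.2173913043478261;  /* arbitrary arg in [0, 0.25] */
XX         double hi = (float)x;           /* <= 24-bit head of x */
XX         double lo = x - hi;             /* exact: the low bits of x */
XX
XX         /*
XX          * pi_hi has 25 significant bits and hi has at most 24, so the
XX          * product pi_hi * hi needs at most 49 < 53 bits and is exact.
XX          * All of the rounding error is thus pushed into the small lo
XX          * terms, which the kernels absorb as a linear correction.
XX          */
XX         printf("hi + lo == x:  %d\n", hi + lo == x);
XX         printf("fma residual:  %g\n", fma(pi_hi, hi, -pi_hi * hi));
XX         return 0;
XX }

Both checks should report exactness (1 and 0 respectively).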
Using the standard kernels is easy and works well:

XX #include <float.h>
XX #include <math.h>
XX
XX #include "math_private.h"
XX
XX static const double
XX pi_hi = 3.1415926218032837e+00,         /* 0x400921fb 0x50000000 */
XX pi_lo = 3.1786509547050787e-08;         /* 0x3e6110b4 0x611a5f14 */
XX
XX /* Only for |x| <= ~0.25 (missing range reduction). */
XX
XX double
XX cospi(double x)
XX {
XX         double_t hi, lo;
XX
XX         hi = (float)x;
XX         lo = x - hi;
XX         lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX         hi = pi_hi * hi;
XX         _2sumF(hi, lo);
XX         return __kernel_cos(hi, lo);
XX }
XX
XX double
XX sinpi(double x)
XX {
XX         double_t hi, lo;
XX
XX         hi = (float)x;
XX         lo = x - hi;
XX         lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX         hi = pi_hi * hi;
XX         _2sumF(hi, lo);
XX         return __kernel_sin(hi, lo, 1);
XX }
XX
XX double
XX tanpi(double x)
XX {
XX         double_t hi, lo;
XX
XX         hi = (float)x;
XX         lo = x - hi;
XX         lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX         hi = pi_hi * hi;
XX         _2sumF(hi, lo);
XX         return __kernel_tan(hi, lo, 1);
XX }

I only did a sloppy accuracy test for sinpi(). It was 0.03 ulps less
accurate than sin(), on the range [0, 0.25] for sinpi() and [0, Pi/4]
for sin().

Efficiency is very good in some cases, but anomalous in others (all
times in cycles, on i386, on the range [0, 0.25]):

        athlon-xp, gcc-3.3     Haswell, gcc-3.3     Haswell, gcc-4.2.1
cos:    61-62                  44                   43
cospi:  69-71 (8-9 extra)      78 (anomalous...)    42 (faster to do more!)
sin:    59-60                  51                   37
sinpi:  67-68 (8 extra)        80                   42
tan:    136-172                93-195               67-94
tanpi:  144-187 (8-15 extra)   145-176              61-189

That was a throughput test. Latency is not so good. My latency test
doesn't use serializing instructions, but uses random args and the
partial serialization of making each result depend on the previous one
(see the sketch in the PS):

        athlon-xp, gcc-3.3     Haswell, gcc-3.3     Haswell, gcc-4.2.1
cos:    84-85                  69                   79
cospi:  103-104 (19-21 extra)  117                  94
sin:    75-76                  89                   77
sinpi:  105-106 (30 extra)     116                  90
tan:    168-170                167-168              147
tanpi:  191-194 (23-24 extra)  191                  154

This also indicates that the longest times for tan in the throughput
test are what happens when the function doesn't run in parallel with
itself. The high-degree polynomial and other complications in tan()
are too complicated for much cross-function parallelism.

Anyway, it looks like the cost of using the kernel is at most 8-9
cycles in the parallel case and at most 30 in the serial case. The
extra-precision code has about 10 dependent instructions, so it is
doing OK to take 30.

Bruce
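PS: the serial (latency) measurement described above amounts to a loop
like this sketch. The clock_gettime() timing, the [0, 0.25] arg range,
and the iteration count are illustrative stand-ins, not the actual
test harness:

XX #include <math.h>
XX #include <stdio.h>
XX #include <stdlib.h>
XX #include <time.h>
XX
XX #define N       1000000
XX
XX static double a[N];
XX
XX int
XX main(void)
XX {
XX         struct timespec t0, t1;
XX         double sum;
XX         int i;
XX
XX         for (i = 0; i < N; i++)
XX                 a[i] = 0.25 * rand() / (double)RAND_MAX;
XX         sum = 0;
XX         clock_gettime(CLOCK_MONOTONIC, &t0);
XX         for (i = 0; i < N; i++) {
XX                 /*
XX                  * sum * 1e-40 is far below 1 ulp of a typical arg, so
XX                  * it doesn't change the results, but it makes each
XX                  * call's arg depend on the previous call's result, so
XX                  * the calls cannot overlap: the loop times latency.
XX                  */
XX                 sum = sin(a[i] + sum * 1e-40);
XX         }
XX         clock_gettime(CLOCK_MONOTONIC, &t1);
XX         printf("%.1f ns/call (sum = %g)\n",
XX             ((t1.tv_sec - t0.tv_sec) * 1e9 +
XX             (t1.tv_nsec - t0.tv_nsec)) / N, sum);
XX         return 0;
XX }

Printing sum keeps the compiler from eliding the loop. Dropping the
feedback term (accumulating sin(a[i]) directly) turns this into the
throughput variant, where the out-of-order machinery overlaps many
calls.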