Date: Sat, 29 Apr 2017 20:19:23 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Bruce Evans
cc: Steve Kargl, freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trigonometric functions
In-Reply-To: <20170429151457.F809@besplex.bde.org>
Message-ID: <20170429194239.P3294@besplex.bde.org>

On Sat, 29 Apr 2017, Bruce Evans wrote:

> On Fri, 28 Apr 2017, Steve Kargl wrote:
>
>> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
>>>
>>> I was just backtracking with __kernel_sinpi. This gets a max ULP < 0.61.
>
> Comments on this below.
>
> This is all rather over-engineered. Optimizing these functions is
> unimportant compared with finishing cosl() and sinl() and optimizing
> all of the standard trig functions better, but we need correctness.
> But I now see many simplifications and improvements:
>
> (1) There is no need for new kernels. The standard kernels already
> handle extra precision using approximations like:
>
>         sin(x+y) ~= sin(x) + (1-x*x/2)*y.
>
> Simply reduce x and write Pi*x = hi+lo. Then
>
>         sin(Pi*x) = __kernel_sin(hi, lo, 1).
>
> I now see how to do the extra-precision calculations without any
> multiplications. But that is over-engineered too.
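The hi+lo approach depends on the splitting arithmetic being exact.
Here is a standalone sketch that checks the two exactness properties
the implementation below relies on; the value of x is an arbitrary
illustration, and strict IEEE binary64 evaluation (e.g., SSE2, not
i387 extended precision) is assumed:

XX #include <math.h>
XX #include <stdio.h>
XX
XX static const double
XX pi_hi = 3.1415926218032837e+00,         /* 25 significant bits */
XX pi_lo = 3.1786509547050787e-08;         /* ~(Pi - pi_hi), extra bits */
XX
XX int
XX main(void)
XX {
XX         double x = 0.2173913043478261;  /* arbitrary arg in [0, 0.25] */
XX         double hi = (float)x;           /* <= 24-bit head of x */
XX         double lo = x - hi;             /* exact: the low bits of x */
XX
XX         /*
XX          * pi_hi has 25 significant bits and hi has at most 24, so the
XX          * product pi_hi * hi needs at most 49 < 53 bits and is exact.
XX          * All of the rounding error is thus pushed into the small lo
XX          * terms, which the kernels absorb as a linear correction.
XX          */
XX         printf("hi + lo == x:  %d\n", hi + lo == x);
XX         printf("fma residual:  %g\n", fma(pi_hi, hi, -pi_hi * hi));
XX         return 0;
XX }

Both checks should report exactness (1 and 0 respectively).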
Using the standard kernels is easy and works well:

XX #include <float.h>
XX #include <math.h>
XX
XX #include "math_private.h"
XX
XX static const double
XX pi_hi = 3.1415926218032837e+00,         /* 0x400921fb 0x50000000 */
XX pi_lo = 3.1786509547050787e-08;         /* 0x3e6110b4 0x611a5f14 */
XX
XX /* Only for |x| <= ~0.25 (missing range reduction). */
XX
XX double
XX cospi(double x)
XX {
XX         double_t hi, lo;
XX
XX         hi = (float)x;
XX         lo = x - hi;
XX         lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX         hi = pi_hi * hi;
XX         _2sumF(hi, lo);
XX         return __kernel_cos(hi, lo);
XX }
XX
XX double
XX sinpi(double x)
XX {
XX         double_t hi, lo;
XX
XX         hi = (float)x;
XX         lo = x - hi;
XX         lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX         hi = pi_hi * hi;
XX         _2sumF(hi, lo);
XX         return __kernel_sin(hi, lo, 1);
XX }
XX
XX double
XX tanpi(double x)
XX {
XX         double_t hi, lo;
XX
XX         hi = (float)x;
XX         lo = x - hi;
XX         lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX         hi = pi_hi * hi;
XX         _2sumF(hi, lo);
XX         return __kernel_tan(hi, lo, 1);
XX }

I only did a sloppy accuracy test for sinpi(). It was 0.03 ulps less
accurate than sin(), on the range [0, 0.25] for sinpi() and [0, Pi/4]
for sin().

Efficiency is very good in some cases, but anomalous in others (all
times in cycles, on i386, on the range [0, 0.25]):

        athlon-xp, gcc-3.3     Haswell, gcc-3.3     Haswell, gcc-4.2.1
cos:    61-62                  44                   43
cospi:  69-71 (8-9 extra)      78 (anomalous...)    42 (faster to do more!)
sin:    59-60                  51                   37
sinpi:  67-68 (8 extra)        80                   42
tan:    136-172                93-195               67-94
tanpi:  144-187 (8-15 extra)   145-176              61-189

That was a throughput test. Latency is not so good. My latency test
doesn't use serializing instructions, but uses random args and the
partial serialization of making each result depend on the previous one
(see the sketch in the PS):

        athlon-xp, gcc-3.3     Haswell, gcc-3.3     Haswell, gcc-4.2.1
cos:    84-85                  69                   79
cospi:  103-104 (19-21 extra)  117                  94
sin:    75-76                  89                   77
sinpi:  105-106 (30 extra)     116                  90
tan:    168-170                167-168              147
tanpi:  191-194 (23-24 extra)  191                  154

This also indicates that the longest times for tan in the throughput
test are what happens when the function doesn't run in parallel with
itself. The high-degree polynomial and other complications in tan()
are too complicated for much cross-function parallelism.

Anyway, it looks like the cost of using the kernel is at most 8-9
cycles in the parallel case and at most 30 in the serial case. The
extra-precision code has about 10 dependent instructions, so it is
doing OK to take 30.

Bruce
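PS: the serial (latency) measurement described above amounts to a loop
like this sketch. The clock_gettime() timing, the [0, 0.25] arg range,
and the iteration count are illustrative stand-ins, not the actual
test harness:

XX #include <math.h>
XX #include <stdio.h>
XX #include <stdlib.h>
XX #include <time.h>
XX
XX #define N       1000000
XX
XX static double a[N];
XX
XX int
XX main(void)
XX {
XX         struct timespec t0, t1;
XX         double sum;
XX         int i;
XX
XX         for (i = 0; i < N; i++)
XX                 a[i] = 0.25 * rand() / (double)RAND_MAX;
XX         sum = 0;
XX         clock_gettime(CLOCK_MONOTONIC, &t0);
XX         for (i = 0; i < N; i++) {
XX                 /*
XX                  * sum * 1e-40 is far below 1 ulp of a typical arg, so
XX                  * it doesn't change the results, but it makes each
XX                  * call's arg depend on the previous call's result, so
XX                  * the calls cannot overlap: the loop times latency.
XX                  */
XX                 sum = sin(a[i] + sum * 1e-40);
XX         }
XX         clock_gettime(CLOCK_MONOTONIC, &t1);
XX         printf("%.1f ns/call (sum = %g)\n",
XX             ((t1.tv_sec - t0.tv_sec) * 1e9 +
XX             (t1.tv_nsec - t0.tv_nsec)) / N, sum);
XX         return 0;
XX }

Printing sum keeps the compiler from eliding the loop. Dropping the
feedback term (accumulating sin(a[i]) directly) turns this into the
throughput variant, where the out-of-order machinery overlaps many
calls.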