Date:      Sat, 29 Apr 2017 12:38:29 -0700
From:      Steve Kargl <sgk@troutmask.apl.washington.edu>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject:   Re: Implementation of half-cycle trignometric functions
Message-ID:  <20170429193829.GA41964@troutmask.apl.washington.edu>
In-Reply-To: <20170430042756.A862@besplex.bde.org>
References:  <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> <20170429070036.A4005@besplex.bde.org> <20170428233552.GA34580@troutmask.apl.washington.edu> <20170429005924.GA37947@troutmask.apl.washington.edu> <20170429151457.F809@besplex.bde.org> <20170429181022.GA41420@troutmask.apl.washington.edu> <20170430042756.A862@besplex.bde.org>

On Sun, Apr 30, 2017 at 05:09:26AM +1000, Bruce Evans wrote:
> On Sat, 29 Apr 2017, Steve Kargl wrote:
> 
> > On Sat, Apr 29, 2017 at 05:54:21PM +1000, Bruce Evans wrote:
> >> On Fri, 28 Apr 2017, Steve Kargl wrote:
> > ...
> >>> 	GET_FLOAT_WORD(ix, p);
> >>> 	SET_FLOAT_WORD(phi, (ix >> 14) << 14);
> >>>
> >>> 	GET_FLOAT_WORD(ix, x2);
> >>> 	SET_FLOAT_WORD(x2hi, (ix >> 14) << 14);
> >>
> >> I expect that these GET/SET's are the slowest part.  They are quite fast
> >> in float prec, but in double prec on old i386 CPUs compilers generate bad
> >> code which can have penalties of 20 cycles per GET/SET.
> >>
> >> Why the strange reduction?  The double shift is just a manual optimization
> >> or pessimization (usually the latter) for clearing low bits.  Here it is
> >> used to clear 14 low bits instead of the usual 12.  This is normally
> >> written using just a mask of 0xffff0000, unless you want a different
> >> number of bits in the hi terms for technical reasons.  Double precision
> >> can benefit more from asymmetric splitting of terms since 53 is not
> >> divisible by 2; 1 hi term must have less than 26.5 bits and the other term
> >> can hold an extra bit.
> >
> > Because I didn't think about using a mask.  :-)
> >
> > It's easy to change 14 to 13 or 11 or ..., while I would
> > need to write out zeros and one to come up with 0xffff8000,
> > etc.
> 
> Here are some examples of more delicate splittings from the uncommitted
> clog*().  They are usually faster than GET/SET, but slower than converting
> to lower precision as is often possible for double precision and ld128
> only.  clog*() can't use the casting method since it needs to split in the
> middle, and doesn't use GET/SET since it is slow.  It uses methods that
> only work on args that are not too large or too small, and uses a GET
> earlier to classify the arg size.

I didn't know about these other splitting methods.  Thanks for
pointing them out to me.  

I updated my k_sinpif.c to use the standard masking with 0xffff0000.
It has no effect on the timing on a Core2 Duo.  It did, however, affect
the max ULP.  With exhaustive testing in [0x1p-14,0.25] I now have

         MAX ULP: 0.68287528
    Total tested: 100663296
0.6 < ULP <= 0.7: 5607

The older version, with the shifts by 14 bits, gives

         MAX ULP: 0.73345101
    Total tested: 100663296
0.7 < ULP <= 0.8: 45
0.6 < ULP <= 0.7: 11977

The value of 14 is a holdover from an earlier version.
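
To make the two ways of clearing low bits concrete, here is a minimal
sketch, not the code from k_sinpif.c.  The helpers float_bits() and
bits_float() are hypothetical stand-ins for GET_FLOAT_WORD and
SET_FLOAT_WORD, written with memcpy so the block compiles on its own,
and the word is held in a uint32_t so that the shifts are well defined.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for GET_FLOAT_WORD: fetch the bits of a float. */
static uint32_t
float_bits(float x)
{
	uint32_t u;

	memcpy(&u, &x, sizeof(u));
	return (u);
}

/* Hypothetical stand-in for SET_FLOAT_WORD: build a float from bits. */
static float
bits_float(uint32_t u)
{
	float x;

	memcpy(&x, &u, sizeof(x));
	return (x);
}

/* Clear the low n bits with a double shift, as in the older version. */
static float
split_hi_shift(float x, unsigned n)
{

	return (bits_float((float_bits(x) >> n) << n));
}

/* Clear low bits with a mask, e.g. 0xffff0000 for the low 16 bits. */
static float
split_hi_mask(float x, uint32_t mask)
{

	return (bits_float(float_bits(x) & mask));
}
```

With these, split_hi_shift(x, 14) gives the same result as
split_hi_mask(x, 0xffffc000), while the 0xffff0000 mask clears the low
16 bits.  In either case lo = x - hi should recover the bits that were
cleared, so that hi + lo == x.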

Getting back to the use of float_t and double_t: if one is willing
to accept the performance penalty, these work well.  Changing the
types to float_t in k_cospif.c, I find a slowdown for cospif,
but I also find

         MAX ULP: 0.64679509
    Total tested: 1048576000
0.6 < ULP <= 0.7: 31598

with exhaustive testing in [0,0.25].
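
As an illustration of what changing the types to float_t means, here is
a minimal sketch under stated assumptions.  It is not the code from
k_cospif.c: the function name cospi_kernel_sketch() is hypothetical, and
the coefficients are plain Taylor coefficients of cos(pi*x) rather than
a minimax set.  The point is only that the intermediates are declared
as float_t, so they are carried in whatever type the platform uses to
evaluate float expressions, and the result is rounded to float once at
the end.

```c
#include <math.h>	/* float_t */

/*
 * Hypothetical kernel for cos(pi*x) on a small interval around 0.
 * Coefficients are Taylor terms (-1)^k pi^(2k) / (2k)!, an assumption
 * made for illustration only.
 */
static float
cospi_kernel_sketch(float x)
{
	static const float_t
	    c1 = -4.934802200544679,	/* -pi^2 / 2 */
	    c2 =  4.058712126416768,	/*  pi^4 / 24 */
	    c3 = -1.3352627688545893,	/* -pi^6 / 720 */
	    c4 =  0.2353306303588935;	/*  pi^8 / 40320 */
	float_t r, z;

	/* Intermediates stay in float_t; no rounding to float here. */
	z = (float_t)x * x;
	r = 1 + z * (c1 + z * (c2 + z * (c3 + z * c4)));

	/* Single rounding to float on return. */
	return ((float)r);
}
```

On a platform where float_t is float this is ordinary float arithmetic;
where float_t is wider, the extra precision is kept until the final
cast.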

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow


