Date:      Sun, 4 Nov 2007 16:26:03 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Steve Kargl <sgk@troutmask.apl.washington.edu>
Cc:        freebsd-standards@freebsd.org
Subject:   Re: long double broken on i386?
Message-ID:  <20071104142220.G15933@delplex.bde.org>
In-Reply-To: <20071103213000.GA92030@troutmask.apl.washington.edu>
References:  <20070928152227.GA39233@troutmask.apl.washington.edu> <20071001173736.U1985@besplex.bde.org> <20071002001154.GA3782@troutmask.apl.washington.edu> <20071002172317.GA95181@VARK.MIT.EDU> <20071002173237.GA12586@troutmask.apl.washington.edu> <20071003103519.X14175@delplex.bde.org> <20071010204249.GA7446@troutmask.apl.washington.edu> <20071103144151.U671@besplex.bde.org> <20071103213000.GA92030@troutmask.apl.washington.edu>

On Sat, 3 Nov 2007, Steve Kargl wrote:

> On Sat, Nov 03, 2007 at 03:53:19PM +1100, Bruce Evans wrote:
>> On Wed, 10 Oct 2007, Steve Kargl wrote:
>>> I've given up on i386.  All of my attempts to make my C implementations
>>> work on both amd64 and i386 have failed.  The code when run on i386
>>> results in ULP's that exceed 10^5.  Yes, that is 1E5.  Results for amd64
>>> are reported below.  As far as assembly, someone else will need to write
>>> those routines because I do not know assembly.
>>
>> Bad errors are expressed in GULPs (Giga-ULPs) :-).  Maybe you are hitting
>> the compiler bugs that break double-precision arithmetic on i386's when
>> the rounding precision is changed to long double.  The rounding precision
>> ...
> I did some limited testing of the output from k_rem_pio2.c
> against argument reduction using MPFR.  I found no long
> double values that gave results which did not agree.  I'm fairly
> certain that the GULPs are due to improper polynomial approximations
> for the 53-bit precision long doubles on i386.

Rounding errors in polynomial coefficients can't possibly have that
much effect.  An error of 1 53-bit ULP in a coefficient is only 2^11
63-bit ULPs.  Since the first 2 coefficients of cos() and the first
coefficient of sin() are exact, the relative errors are further reduced
by a factor of x^4/4! for cos(x) and x^3/3! for sin(x).

Actual testing of messed-up Taylor coefficients using pari-gp gives the
following error bounds:

  17     ULPs   cos(x)   up to x^18/18! term with 1/18! rounded to 53 bits
   0.06  ULPs   ...                                                64 bits
   0.07  ULPs   ...                                     not rounded
105     ULPs   sin(x)/x up to x^18/19! term with 1/19! rounded to 53 bits
   0.06  ULPs   ...                                                64 bits
   0.003 ULPs   ...                                     not rounded

All arithmetic was in ~1000-bit precision.  The interval was [-Pi/4, Pi/4].

The errors are much larger than I said they should be above, but still
much less than 1 GULP.

Not rounding for cos(x) really does give a larger max error than rounding
to 64 bits.  My Remez polynomial constructor rounds the coefficients during
its search for the best coefficients.  It is possible to compensate to some
extent for unfortunate rounding of early coefficients by tweaking higher
coefficients, but this is not very important and currently causes stability
problems in the search.

With 1 less term, errors in the Taylor approximation dominate errors
in (53-bit) coefficients, and with 1 more term, errors in the 64-bit
coefficients dominate errors in the Taylor approximation.

>>>>> PS: There is also the minor possibility of giving bde either
>>>>> a heartache or death via laughter when he finally sees what I
>>>>> have done. :)
>>>>
>>>> I would be unhappy with any versions that aren't either:
>>>> - as identical as possible in algorithm and style to the double precision
>>>>  versions, in the same way that the float versions are as identical as
>>>>  possible, or
>>>> - much better in algorithm and style and efficiency and accuracy than the
>>>>  double precision versions.  I don't want to maintain different versions.
>>>
>>> I'll go with your second option; see comments below and attached
>>> code.  BTW, if I add
>>
>> So I'm unhappy that the first point hasn't been addressed at all.  This
>> gives quality < the current float and double precision versions.
>
> You're moving the goal.  You gave me an "either/or" option.  I took
> the "or", so the first point is irrelevant.  Well, your observation
> on quality is subjective because I'd claim my (almost) KNF commented
> code is much better than the fdlibm code.

The second option is "MUCH better in ALL of algorithm, style, efficiency
and accuracy".  That is impossible.  Change the emphasis to MUCH better
in efficiency OR accuracy OR (simplicity without loss of the others).
It is difficult to be both more efficient and more accurate, and you will
need a better algorithm to be either more efficient or more accurate.
MUCH better is not very subjective for efficiency or accuracy.  It means
not rewriting everything to gain only 10% efficiency.  For accuracy, it
means:
- all errors < 2-3 ULPs in cases where errors were sometimes > hundreds/
   thousands/millions/billions of ULPs.  Note that this is still not
   implemented for the i387 hardware trig functions -- the hardware gives
   errors of many GULPs near multiples of Pi/2, and the software doesn't
   do anything to avoid the bad cases since avoiding the bad cases without
   losing efficiency is not easy.
- all errors < 1 ULP in cases where errors were sometimes > 1 ULP
- almost perfect rounding in cases where errors were always < 1 ULP
   (the max error will be about 0.51 ULPs instead of up to 0.99 ULPs)
- perfect rounding in cases where there was almost perfect rounding
   (the max error will be 0.499... ULPs instead of 0.51 ULPs).

>> The argument reduction seems to begin correctly, but then the extra
>> precision which is carefully provided by k_rem_pio2.c is thrown away,
>> unlike in the float and double versions which pass it to the __kernel
>> functions.  I would expect a loss of up to a full ULP from this, giving
>> a maximum error of 1.5 ULPs (0.5 from the __kernel and 1 from the arg
>> reduction).
>
> I had __kernel_cosl and __kernel_sinl that retained both a high
> and low portion of the reduced arguments, but in the range tested
> I did not find any advantage to doing the extra precision arithmetic.
> These may have been discarded during my i386 GULP experience, so
> I revisited the issue on amd64.

Test near multiples of Pi/2.

>>> My timings on the basic routines using gprof are:
>>>
>>> %   cumulative   self              self     total
>>> time   seconds   seconds    calls  ms/call  ms/call  name
>>> 0.6     804.42     5.67 40000000     0.00     0.00  sinl [18]
>>> 0.1     934.41     0.74 19998751     0.00     0.00  __kernel_sinl [95]
>>> 0.1     935.82     0.66 20001249     0.00     0.00  __kernel_cosl [96]
>>>
>>> where the system contains a dual-cpu opteron tyan motherboard running
>>> at 2.2 GHz.
>>
>> That seems to be about 175 nS/call.  Not too bad.  (My current times on
>> a 2.2 GHz A64*1 in i386 mode on small args ([-2Pi,2Pi]) are 70 nS for
>> the C double version, 43 nS for the asm double version, and 15 nS for
>> the C float version).
>
> Well, I've been more concerned about correctness over speed.  I'd
> need to read the gprof output again, but I recall that argument
> reduction was the major time consumer.

Yes, even for small and medium quadrant numbers.  k_rem_pio2.c optimizes
for these but is still slow.  My float trig functions optimize better for
very small quadrant numbers, but even with double precision the corresponding
optimizations don't work so well since they take much more code than the
float case.

>> % Index: msun/src/k_cosl.c
>> % ===================================================================
>> % RCS file: msun/src/k_cosl.c
>> % diff -N msun/src/k_cosl.c
>> % --- /dev/null	1 Jan 1970 00:00:00 -0000
>> % +++ msun/src/k_cosl.c	9 Oct 2007 21:52:32 -0000
>> % @@ -0,0 +1,62 @@
>> % +/*
>> % + * Compute cos(x) on the interval [-1:1] with a polynomial approximation.
>>
>> Interval is larger than necessary.  This might result in unnecessary
>> coefficients.  10 of them seem to be too many for 64 bits of mantissa
>> but not enough for 113 bits (sparc64).
>
> Yes, I know.  My original Chebyshev interpolation code assumed a function

I found that 10 is right with Taylor coefficients.  Chebyshev and Remez
tend to need 1 fewer.

>> % Index: msun/src/k_tanl.c
>> % ===================================================================
>> % RCS file: msun/src/k_tanl.c
>> % diff -N msun/src/k_tanl.c
>> % ...
>> % +#define C33  (long double)  0x1.8f939ed1883f595p-32L
>>
>> This has far too many terms for a 64-bit mantissa, but not enough for a
>> 113-bit mantissa.  IIRC, the former requires about 25 terms.
>
> This is partly due to the range being larger than required.

Ah.  tan() will be adversely affected by the larger interval much more than
cos() and sin().

I found that terms up to about the x^35 one are needed for a Remez
approximation in 64 bits, and up to x^61 (?) in 113 bits.

>> Much more extra care is needed for an accuracy of < 1 ULP for tan*() than
>> for sin*() and cos*().
>
> Well, I did code both steps 3 and 4 from k_tan.c, but I saw no
> statistical improvement in my testing.

The problems are mainly on corner cases.

>> % +long double
>> % +cosl(long double x)
>> % +{
>> % +	union IEEEl2bits z;
>>
>> I want to avoid using this union...
>>
>> % +	z.e = x;
>> % +	z.bits.sign = 0;
>>
>> ... since, while code like this is easy to write, it hard-codes the
>> assumption that things are laid out in bit-fields.  Access macros
>> are more flexible and don't require so much compiler intelligence
>> to optimize away the memory accesses that are logically required for
>> bit-fields.  gcc does a mediocre job with this optimization,
>> especially for 80-bit long doubles on amd64 and i386 since the bits
>> don't fit evenly in registers.
>
> math_private.h doesn't have GET_* or SET_* macros for use with
> 64-bit long double (not to mention 128-bit long double), and
> the macros in math_private.h use similar unions.  AFAICT, all
> supported archs have an _fpmath.h header.

Adding efficient versions of them is not easy, but temporary inefficient
versions can easily use the bit-fields in union IEEEl2bits instead of
direct accesses to these bit-fields.  I have local versions for a
couple of the macros only.  I needed to add a 64-bit mantissa field
in _fpmath.h for efficiency on amd64.  Unfortunately, the mantissa
field must be split since it has more than 64 bits on some machines,
so code and/or macros have to deal with manh and manl or have separate
macro args for the split.

Another problem with long doubles is that unsupported formats may be
passed.  These should be treated as NaNs or Infs but require extra
care for classifying.

>> I see parametrizing the bit-field sizes and other magic numbers (not
>> quoted) for various formats of long doubles as the main difficulty
>> here (trig functions are easy except for this and k_rem_pio2.c).
>
> Which explains why GET_* and SET_* don't exist.  Although these
> might be present in bdeBSD. :)

GET_* and SET_* are in glibc, in the same style as the float and double
versions.  glibc handles parametrization mainly by having zillions of
different versions.

>> Note that the fdlibm __kernel code for trig functions is generic except for:
>>
>> - my optimizations in the float case
>> - the magic number for the threshold near Pi/4 for tan()
>> - the number and values of terms in the polynomial, and the scheduling
>>   (best scheduling depends on the number of terms, and we don't have this
>>   automated).
>>
>> Thus it is easier to clone the correct code than to rewrite it incorrectly.
>
> It may be easier to clone fdlibm, but I'm not sure I'd call it correct
> code due to
>  - magic numbers
>  - questionable source formatting (it's some of the worst code I've read)
>  - lack of comments

- magic numbers are needed for efficiency.
- fdlibm has its own style, which is not too bad when you get used to it.
   For better/worse style in numeric code, look at pari or djb*.  I haven't
   looked so much at these, so they look worse :-).
- fdlibm has more and better comments than most code.

Bruce


