From owner-freebsd-numerics@FreeBSD.ORG  Fri Mar 13 03:58:33 2015
Return-Path: <owner-freebsd-numerics@FreeBSD.ORG>
Delivered-To: freebsd-numerics@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 5E2F7A3A
 for <freebsd-numerics@freebsd.org>; Fri, 13 Mar 2015 03:58:33 +0000 (UTC)
Received: from mail-ie0-x233.google.com (mail-ie0-x233.google.com
 [IPv6:2607:f8b0:4001:c03::233])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 23457F2
 for <freebsd-numerics@freebsd.org>; Fri, 13 Mar 2015 03:58:30 +0000 (UTC)
Received: by iegc3 with SMTP id c3so81887854ieg.3
 for <freebsd-numerics@freebsd.org>; Thu, 12 Mar 2015 20:58:29 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113;
 h=mime-version:from:date:message-id:subject:to:content-type;
 bh=wPCQFP2mAsAka5YWCdyecJXJN2M2UnRqKSVa+k/reAw=;
 b=GG43+ofDjD8MQ2zdp/f0EeTrYTfDYGqKIDxE7wt4fXo9ubbsP3FaLpRyye2m5yMb03
 rua1L4j1cuX+qnCn8mDTOLFI4HwAUPfCn083JoHnr9wt4RmBwryB/Iy5GiWImEmeeHVT
 vGogDS8ltaqU8E9XD3CZIRo2w/DSj7QOKxMikCr78LRHn57yD2oFKns6HZnQxslgxXbh
 V63GZrOouBTdpTJh3Kz3j72Em5iVrE/qWf4MgifVXVHjCj1Hc2grqy1v8wYooKD8xUYS
 2m0DmSG8fFVvmUZrgHZ+1qsQxfYCE4abicLlEx/MSzZJmuZFwP9nXgvwak6w/DlGjDtv
 RcyQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:mime-version:from:date:message-id:subject:to
 :content-type;
 bh=wPCQFP2mAsAka5YWCdyecJXJN2M2UnRqKSVa+k/reAw=;
 b=hBew13VzRm/RByR+Fg7b5Yb1VZEDNDeLxmaAJ11pe6+6TG0nIxaSBsjPLJzcIzjiTv
 F5lKXSIzV53Kle0Dd8aEWdkpTaEF2Zaz+EY6rfvBOvmlJPfL4Xs6I9N0R9R88iqQ5QZB
 mQf298164qNRgGSN+TuxWirx9aDg4AhkyfdmjXPp3q3ZQmHiHWoM5GIxF58qsPyWZlh7
 +5dgbBX3R+Fobs2CigCEmHu1BxXFuPz0I0e875p39TbceMIjpImeqjMrWXP2Yn/f2xRc
 lADYAr+CyxYv+payOp5h92w2o4g/Do63jUILSAlHbSMbTprqrH8SU8Seu3HqUHLO4t5H
 NCxA==
X-Gm-Message-State: ALoCoQkn67M/HqMe5/epwLFwUEgs+rakBTNiG2yyxkIs4ODIzFnrue7PZufqXwmD45coUYo6C4WK
X-Received: by 10.107.135.212 with SMTP id r81mr56011877ioi.38.1426219108157; 
 Thu, 12 Mar 2015 20:58:28 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.36.25.129 with HTTP; Thu, 12 Mar 2015 20:58:06 -0700 (PDT)
From: enh <enh@google.com>
Date: Thu, 12 Mar 2015 20:58:06 -0700
Message-ID: <CAJgzZor=EgG2+E3L6MW-6DD7geUy=Ensa7G1B=viQBXLZKciLw@mail.gmail.com>
Subject: libm functions sensitive to current rounding mode
To: freebsd-numerics@freebsd.org
Content-Type: text/plain; charset=UTF-8
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 13 Mar 2015 03:58:33 -0000

various SoC vendors have been sending us (Android) assembler
implementations of various libm functions, and we noticed when looking
at one change that it was slower on certain inputs than the FreeBSD C
implementation we were already using. the response was that the cost
was coping with other rounding modes, and they gave this example:

#include <math.h>
#include <stdio.h>
#include <fenv.h>

int main(){
    double x = -0xb.6365e22ee46fp4;
    fesetround(FE_UPWARD);
    printf("sin(%f) = %.16e\n", x, sin(x));
    fesetround(FE_DOWNWARD);
    printf("sin(%f) = %.16e\n", x, sin(x));

    x = 0x1.921fb54442d16p0;
    fesetround(FE_UPWARD);
    printf("sin(%f) = %.16e\n", x, sin(x));
    fesetround(FE_DOWNWARD);
    printf("sin(%f) = %.16e\n", x, sin(x));
    return 0;
}

see https://android-review.googlesource.com/112700 for the original
change and related conversation.

glibc 2.19:

sin(-182.212373) = -2.4759225463534308e-18
sin(-182.212374) = -2.4759225463534309e-18
sin(1.570797) = 1.0000000000000000e+00
sin(1.570796) = 1.0000000000000000e+00

(glibc's fixed bugs here in the past. see
https://sourceware.org/bugzilla/show_bug.cgi?id=3976 for example.)

the FreeBSD libm:

sin(-182.212374) = -2.4759225463534304e-18
sin(-182.212374) = 2.2575413760606011e-11
sin(1.570796) = 1.0000000000000000e+00
sin(1.570796) = 1.0000000002522276e+00

it looks like e_sqrtl.c, s_fmal.c, s_fma.c, and s_fmaf.c save/restore
the rounding mode but nothing else does.

let me know if you'd like me to send trivial patches for any of the
affected functions. (do all the libm functions need this?)

-- 
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.

From owner-freebsd-numerics@FreeBSD.ORG  Fri Mar 13 05:34:41 2015
Return-Path: <owner-freebsd-numerics@FreeBSD.ORG>
Delivered-To: freebsd-numerics@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id A6D574A2
 for <freebsd-numerics@freebsd.org>; Fri, 13 Mar 2015 05:34:41 +0000 (UTC)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 88DF1C0D
 for <freebsd-numerics@freebsd.org>; Fri, 13 Mar 2015 05:34:41 +0000 (UTC)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.14.9/8.14.9) with ESMTP id t2D5Ho8A063320
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Thu, 12 Mar 2015 22:17:50 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.14.9/8.14.9/Submit) id t2D5HocW063319;
 Thu, 12 Mar 2015 22:17:50 -0700 (PDT) (envelope-from sgk)
Date: Thu, 12 Mar 2015 22:17:50 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: enh <enh@google.com>
Subject: Re: libm functions sensitive to current rounding mode
Message-ID: <20150313051749.GA63174@troutmask.apl.washington.edu>
References: <CAJgzZor=EgG2+E3L6MW-6DD7geUy=Ensa7G1B=viQBXLZKciLw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAJgzZor=EgG2+E3L6MW-6DD7geUy=Ensa7G1B=viQBXLZKciLw@mail.gmail.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Cc: freebsd-numerics@freebsd.org
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 13 Mar 2015 05:34:41 -0000

On Thu, Mar 12, 2015 at 08:58:06PM -0700, enh via freebsd-numerics wrote:
> various SoC vendors have been sending us (Android) assembler
> implementations of various libm functions, and we noticed when looking
> at one change that it was slower on certain inputs than the FreeBSD C
> implementation we were already using. the response was that the cost
> was coping with other rounding modes, and they gave this example:
> 
> #include <math.h>
> #include <stdio.h>
> #include <fenv.h>
> 
> int main(){
>     double x = -0xb.6365e22ee46fp4;
>     fesetround(FE_UPWARD);
>     printf("sin(%f) = %.16e\n", x, sin(x));
>     fesetround(FE_DOWNWARD);
>     printf("sin(%f) = %.16e\n", x, sin(x));
> 
>     x = 0x1.921fb54442d16p0;
>     fesetround(FE_UPWARD);
>     printf("sin(%f) = %.16e\n", x, sin(x));
>     fesetround(FE_DOWNWARD);
>     printf("sin(%f) = %.16e\n", x, sin(x));
>     return 0;
> }
> 
> see https://android-review.googlesource.com/112700 for the original
> change and related conversation.
> 
> glibc 2.19:
> 
> sin(-182.212373) = -2.4759225463534308e-18
> sin(-182.212374) = -2.4759225463534309e-18
> sin(1.570797) = 1.0000000000000000e+00
> sin(1.570796) = 1.0000000000000000e+00
> 
> (glibc's fixed bugs here in the past. see
> https://sourceware.org/bugzilla/show_bug.cgi?id=3976 for example.)
> 
> the FreeBSD libm:
> 
> sin(-182.212374) = -2.4759225463534304e-18
> sin(-182.212374) = 2.2575413760606011e-11
> sin(1.570796) = 1.0000000000000000e+00
> sin(1.570796) = 1.0000000002522276e+00
> 
> it looks like e_sqrtl.c, s_fmal.c, s_fma.c, and s_fmaf.c save/restore
> the rounding mode but nothing else does.
> 
> let me know if you'd like me to send trivial patches for any of the
> affected functions. (do all the libm functions need this?)
> 

http://wiki.freebsd.org/Numerics

documents the progress on implementing missing libm functions.
It's somewhat out of date as only powl and tgammal as required
by C99/11 are missing for the long double functions.  I don't
know the status of complex.h functions.  I don't do wiki, so
haven't even tried to update the page.

At the bottom of the page, you'll see a list of future directions.
Dealing with rounding modes is on the list.  As of now, FreeBSD
libm should work with round-to-nearest; other modes probably
need to be tested.

e_sqrtl.c was implemented to handle the rounding mode for 2 reasons:
(1) IEEE-754 mandates that square root is correctly rounded in 
all rounding modes; and (2), fdlibm's src/e_sqrt.c (might be
wrong filename) has a long comment that explains an algorithm that
can be used to implement software sqrt() in all rounding modes.
That's what I implemented and then Bruce Evans and David Schultz
fixed what I did.

I suspect all other functions, which are implemented in software,
need to be tested and patches developed.  If you have trivial
patches, the best thing to do is submit a bug report via FreeBSD's
bugzilla so that the patch doesn't get lost.  Someday after I get
powl and tgammal written, and I become independently wealthy, I may 
have time to work on libm.

BTW, n1785.pdf is adding a number new functions to the C library
(e.g., sinpi(x), etc).  I suspect I would work on those before
dealing with rounding modes; as all of my personal codes are
expecting round-to-nearest.

-- 
Steve

From owner-freebsd-numerics@FreeBSD.ORG  Fri Mar 13 11:14:05 2015
Return-Path: <owner-freebsd-numerics@FreeBSD.ORG>
Delivered-To: freebsd-numerics@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 660A47F9
 for <freebsd-numerics@freebsd.org>; Fri, 13 Mar 2015 11:14:05 +0000 (UTC)
Received: from mail106.syd.optusnet.com.au (mail106.syd.optusnet.com.au
 [211.29.132.42]) by mx1.freebsd.org (Postfix) with ESMTP id 2665A88A
 for <freebsd-numerics@freebsd.org>; Fri, 13 Mar 2015 11:14:04 +0000 (UTC)
Received: from c211-30-166-197.carlnfd1.nsw.optusnet.com.au
 (c211-30-166-197.carlnfd1.nsw.optusnet.com.au [211.30.166.197])
 by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 55FD13C6020;
 Fri, 13 Mar 2015 21:46:17 +1100 (AEDT)
Date: Fri, 13 Mar 2015 21:46:14 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: enh <enh@google.com>
Subject: Re: libm functions sensitive to current rounding mode
In-Reply-To: <CAJgzZor=EgG2+E3L6MW-6DD7geUy=Ensa7G1B=viQBXLZKciLw@mail.gmail.com>
Message-ID: <20150313192625.B885@besplex.bde.org>
References: <CAJgzZor=EgG2+E3L6MW-6DD7geUy=Ensa7G1B=viQBXLZKciLw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.1 cv=ZuzUdbLG c=1 sm=1 tr=0
 a=KA6XNC2GZCFrdESI5ZmdjQ==:117 a=PO7r1zJSAAAA:8 a=kj9zAlcOel0A:10
 a=JzwRw_2MAAAA:8 a=Oh2cFVv5AAAA:8 a=CCpqsmhAAAAA:8 a=UWV0wBGDR1em69PW5csA:9
 a=CjuIK1q_8ugA:10
Cc: freebsd-numerics@freebsd.org
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 13 Mar 2015 11:14:05 -0000

On Thu, 12 Mar 2015, enh via freebsd-numerics wrote:

> various SoC vendors have been sending us (Android) assembler
> implementations of various libm functions, and we noticed when looking
> at one change that it was slower on certain inputs than the FreeBSD C
> implementation we were already using. the response was that the cost
> was coping with other rounding modes, and they gave this example:

This seems to be due to bad fixes.  Supporting rounding in the current
rounding mode is only very useful if the rounding is perfect, but perfect
rounding is difficult for complicated functions.  Switching the rounding
mode is incredibly slow on x86 (it takes as long or longer than sin()),
and doing it usually gives worse results.  The result _should_ depend on
the rounding mode.  It is specified to do so for sqrt* and probably for
many other simple functions.  This has

FreeBSD has only been tested to do perfect rounding in all rounding modes
for the following functions:

     ceil*, fabs*, floor*, logb*, rint*, round*, significan*, sqrt*, trunc*
     ceil*, fabs*, floor*, logb*, rint*, round*, significan*, sqrt*, trunc*

Implementing this for the software sqrt* makes sqrt* unusably slow, but
most arches use hardware sqrt*.

> #include <math.h>
> #include <stdio.h>
> #include <fenv.h>
>
> int main(){
>    double x = -0xb.6365e22ee46fp4;
>    fesetround(FE_UPWARD);
>    printf("sin(%f) = %.16e\n", x, sin(x));
>    fesetround(FE_DOWNWARD);
>    printf("sin(%f) = %.16e\n", x, sin(x));
>
>    x = 0x1.921fb54442d16p0;
>    fesetround(FE_UPWARD);
>    printf("sin(%f) = %.16e\n", x, sin(x));
>    fesetround(FE_DOWNWARD);
>    printf("sin(%f) = %.16e\n", x, sin(x));
>    return 0;
> }
>
> see https://android-review.googlesource.com/112700 for the original
> change and related conversation.

I couldn't read this with my old browsers.

> glibc 2.19:
>
> sin(-182.212373) = -2.4759225463534308e-18
> sin(-182.212374) = -2.4759225463534309e-18
> sin(1.570797) = 1.0000000000000000e+00
> sin(1.570796) = 1.0000000000000000e+00

sin() is a good-bad example.  If you "optimize" it by writing it in asm,
then you get enormous errors (multiple tera-ulps) even near small even
multiples of Pi/2.  The 1-ulp difference made by the rounding mode is
in the noise.

Actual results for i387 sin() on i386, with a middle result for FE_TONEAREST
added:

    sin(-182.212373) = -2.7105054312137606e-18
    sin(-182.212374) = -2.7105054312137611e-18
    sin(-182.212374) = -2.7105054312137611e-18
    sin(1.570797) = 1.0000000000000000e+00
    sin(1.570796) = 1.0000000000000000e+00
    sin(1.570796) = 9.9999999999999988e-01

Results should be printed in %a format so that the bits can be seen.

i387 values:

    sin(-182.212373) = -0x1.8ffffffffffffp-59
    sin(-182.212374) = -0x1.9p-59
    sin(-182.212374) = -0x1.9p-59
    sin(1.570797) = 0x1p+0
    sin(1.570796) = 0x1p+0
    sin(1.570796) = 0x1.fffffffffffffp-1

We see that the rounding mode indeed makes a 1-bit difference.

software values on i386 (almost the same as below):

    sin(-182.212373) = -2.4759225463534304e-18
    sin(-182.212374) = -2.4759225463534308e-18
    sin(-182.212374) = 2.2575413760606011e-11
    sin(1.570797) = 1.0000000000000000e+00
    sin(1.570796) = 1.0000000000000000e+00
    sin(1.570796) = 1.0000000002522273e+00
    sin(-182.212373) = -0x1.6d61b58c99c42p-59
    sin(-182.212374) = -0x1.6d61b58c99c43p-59
    sin(-182.212374) = 0x1.8d26ap-36
    sin(1.570797) = 0x1p+0
    sin(1.570796) = 0x1p+0
    sin(1.570796) = 0x1.000000011553bp+0

We see much more from this:
- somehow TONEAREST gives almost the same results as UPWARDS.  It appears
   to work correctly at -182.212373 in the FreeBSD software case only
- enormous errors at -182.212373 from the i387, as predicted (the software
   result is believed to be correct, while the i387 result shows evidence
   of canceling all except the top 5 bits).  All bits except the top bit
   are wrong, so the error is about 2**52 = 4 Pulps (1 Pulp = 1 peta-ulp).
   DOWNWARDS makes a 1 ulp adjustment to this.
- glibc apparently uses software, since it is correct in all cases, except
   it apparently forces a common rounding for UPWARDS and NEAREST and thus
   is certain to be off by 1 ulp for 1 of them.
- the 1.57079x are was obviously chosen to be as close as possible to Pi/2
   (it should be printed in %a format too, so that we can see its bits
   (these must be more accurate than shown by the decimal format) and see
   that the bits don't depend on the rounding mode).  Since sin() has a
   maximum at this value, and the infinitely-precise result at exactly Pi/2
   is exactly 1, the approximate internal result must be slightly less than
   1, and UPWARDS and TONEAREST should round it to 1, while DOWNWARDS should
   round it to 1 ulp less than 1.  Hardware sin does this.  glibc sin
   apparently does extra work to round in the wrong direction...
- ... but FreeBSD software sin has an enormous error for DOWNWARDS on
   1.570796 -- it rounds in the wrong direction by 0x11553 ulps :-(.
   About 1 Mulp (1 Mulp = 1 mega-ulp).


> (glibc's fixed bugs here in the past. see
> https://sourceware.org/bugzilla/show_bug.cgi?id=3976 for example.)

This is interesting.  Vincent L certainly knows what he is doing.  As
stated in the thread, the the library can do anything it wants in
non-default rounding modes provided it documents this.  It apparently
uses the bad fix of changing the rounding mode to FE_TONEAREST.  This
is slow and guarantees an error of 1 ulp in most cases, so to avoid
much larger errors in hopefully most cases.

Long ago, I all of the most common functions for such errors,
exhaustively for float functions, and didn't find any of more than an
ulp or 2.  I might have only done this on i386.  i386 protects against
such errors in float precision by evaluating intermediate results in
higher precision.  In non-FreeBSD, the default rounding precision is
not double, so i386 should also protect agains such errors in double
precision.  Special cases like multiples of Pi/2 are too hard to find
without special tests except in float precision.

> the FreeBSD libm:
>
> sin(-182.212374) = -2.4759225463534304e-18
> sin(-182.212374) = 2.2575413760606011e-11
> sin(1.570796) = 1.0000000000000000e+00
> sin(1.570796) = 1.0000000002522276e+00

Verified above.

> it looks like e_sqrtl.c, s_fmal.c, s_fma.c, and s_fmaf.c save/restore
> the rounding mode but nothing else does.

This makes these functions incredibly slow.

> let me know if you'd like me to send trivial patches for any of the
> affected functions. (do all the libm functions need this?)

This cannot be fixed using trivial patches.  Almost all transcendental
functions are probably affected.

Actually, it is possible to fix many long double functions without
making them more than twice as slow in the usual case where the rounding
mode is not the default.  The ENTERI() macros does a conditional mode
switch to change the rounding precision from 53 bits to 64 bits, so that
long doubles work properly.  It avoids the switch when it would be null.
   (This reminds me that modern x86 does the same in hardware for at least
    switching the rounding mode, so unconditional switching of the rounding
    mode might not be so slow when the switch is null.

    Also, fpset*() is pessimized in space and time to avoid traps, while
    feset*() never bothered with this.  To switch the rounding mode, you
    should use fesetround(), but fesetprec() doesn't exist so ENTERI() has
    to use fpsetprec().)
You could try adding unconditional rounding mode switches to ENTERI() to
test the long double case.  However, testing long doubles is harder.

exp() is mentioned in the glibc thread.  It has some details of interest
in FreeBSD.  On i386 only, it written in "optimized" asm that is broken
as follows:

X 	/*
X 	 * Extended precision is needed to reduce the maximum error from
X 	 * hundreds of ulps to less than 1 ulp.  Switch to it if necessary.
X 	 * We may as well set the rounding mode to to-nearest and mask traps
X 	 * if we switch.
X 	 */

I thought that this rounding mode switch was a good idea when I wrote this
20+ years ago.  Now I know better, and haven't used this version for 3 years.
The software version never did this.  I now use the software version on
i386 as well as on amd64, after optimizing it a bit so that it is a little
more accurate than this asm (hardware i387) version.  The software version
has been faster for many years on x86.

The precision mode switch here is part of what makes it slow.  glibc
probably assumes that the default rounding precision is 64 bits, so that
it doesn't need this mode switch.  It turns out that 64 bits is just
enough to reduce the error in double precision to below 1 ulp.  expl()
cannot be written like this in asm, since there is no mode switch to
get more than long double precision.  Not switching would give the same
large errors in long double precision that this mode switch fixes for
double precision.

The rounding mode switch accidentally "fixes" any algorithm instability
in i386 exp() in the same way as in glibc.  But I think the extra precision
gives enough stablity.

Bruce

From owner-freebsd-numerics@FreeBSD.ORG  Fri Mar 13 12:16:57 2015
Return-Path: <owner-freebsd-numerics@FreeBSD.ORG>
Delivered-To: freebsd-numerics@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 5EFBF5BE
 for <freebsd-numerics@freebsd.org>; Fri, 13 Mar 2015 12:16:57 +0000 (UTC)
Received: from mail109.syd.optusnet.com.au (mail109.syd.optusnet.com.au
 [211.29.132.80]) by mx1.freebsd.org (Postfix) with ESMTP id E4B5DECC
 for <freebsd-numerics@freebsd.org>; Fri, 13 Mar 2015 12:16:56 +0000 (UTC)
Received: from c211-30-166-197.carlnfd1.nsw.optusnet.com.au
 (c211-30-166-197.carlnfd1.nsw.optusnet.com.au [211.30.166.197])
 by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 30DF2D6372E;
 Fri, 13 Mar 2015 23:16:42 +1100 (AEDT)
Date: Fri, 13 Mar 2015 23:16:41 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
Subject: Re: libm functions sensitive to current rounding mode
In-Reply-To: <20150313192625.B885@besplex.bde.org>
Message-ID: <20150313224213.W2994@besplex.bde.org>
References: <CAJgzZor=EgG2+E3L6MW-6DD7geUy=Ensa7G1B=viQBXLZKciLw@mail.gmail.com>
 <20150313192625.B885@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.1 cv=Za4kaKlA c=1 sm=1 tr=0
 a=KA6XNC2GZCFrdESI5ZmdjQ==:117 a=PO7r1zJSAAAA:8 a=kj9zAlcOel0A:10
 a=JzwRw_2MAAAA:8 a=CCpqsmhAAAAA:8 a=04n_wW2SkaRwE1C0LGgA:9
 a=CjuIK1q_8ugA:10
Cc: enh <enh@google.com>, freebsd-numerics@freebsd.org
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 13 Mar 2015 12:16:57 -0000

On Fri, 13 Mar 2015, Bruce Evans wrote:

> On Thu, 12 Mar 2015, enh via freebsd-numerics wrote:
> ...
> - ... but FreeBSD software sin has an enormous error for DOWNWARDS on
>  1.570796 -- it rounds in the wrong direction by 0x11553 ulps :-(.
>  About 1 Mulp (1 Mulp = 1 mega-ulp).
>
>
>> (glibc's fixed bugs here in the past. see
>> https://sourceware.org/bugzilla/show_bug.cgi?id=3976 for example.)
>
> This is interesting.  Vincent L certainly knows what he is doing.  As
> stated in the thread, the the library can do anything it wants in
> non-default rounding modes provided it documents this.  It apparently
> uses the bad fix of changing the rounding mode to FE_TONEAREST.  This
> is slow and guarantees an error of 1 ulp in most cases, so to avoid
> much larger errors in hopefully most cases.
>
> Long ago, I all of the most common functions for such errors,
> exhaustively for float functions, and didn't find any of more than an
> ulp or 2.  I might have only done this on i386.  i386 protects against
> such errors in float precision by evaluating intermediate results in
> higher precision.  In non-FreeBSD, the default rounding precision is
> not double, so i386 should also protect agains such errors in double
> precision.  Special cases like multiples of Pi/2 are too hard to find
> without special tests except in float precision.

A quick rerun of the test finds problems in much the same areas as the
glibc thread.  I must have put off fixing this since most functions work
OK and there are larger bugs to fix (like the multi-Pulp errors given
by using the i387).

Test of most float functions relative to double, on i386, testing 1/256
of possible values.

X acos:  max_er = 0x129b24c8 0.5814, avg_er = 0.170, #>=1:0.5 = 0:2724
X acos:  max_er = 0x228da264 1.0797, avg_er = 0.299, #>=1:0.5 = 1334:7453322
X acos:  max_er = 0x2266570c 1.0750, avg_er = 0.197, #>=1:0.5 = 1324:869716
X acos:  max_er = 0x228da264 1.0797, avg_er = 0.299, #>=1:0.5 = 1334:7453322

The 4 cases are FP_RN, FP_RZ, RP_RP, FP_RM (ieeefp has better names).  The
errors are given in bits, then in ulps.  With perfect rounding, max_er
would be at most 0x0fffffff (0.5- ulps) for FP_RN and 0x1fffffff (1.0- ulps)
for the other rounding modes.  This is acheived mainly by sqrtf.  Less
than 0.5 ulps more than that is OK.  A few functions (e.g., cos) try to
get less than 0.01 or 0.1 ulps more, and some functions should be exact.

X acosh: max_er = 0x2bf776b9 1.3740, avg_er = 0.063, #>=1:0.5 = 62:29183
X acosh: max_er = 0x517afdff 2.5462, avg_er = 0.128, #>=1:0.5 = 57310:2149989
X acosh: max_er = 0x619b49f6 3.0503, avg_er = 0.129, #>=1:0.5 = 60042:2153013
X acosh: max_er = 0x517afdff 2.5462, avg_er = 0.128, #>=1:0.5 = 57310:2149989
X asin:  max_er = 0x1359931a 0.6047, avg_er = 0.012, #>=1:0.5 = 0:12924
X asin:  max_er = 0x22938df5 1.0805, avg_er = 0.022, #>=1:0.5 = 3592:367188
X asin:  max_er = 0x2459bbfe 1.1360, avg_er = 0.024, #>=1:0.5 = 6373:393217
X asin:  max_er = 0x23dc2d74 1.1206, avg_er = 0.023, #>=1:0.5 = 6325:393217
X asinh: max_er = 0x2f6f84c3 1.4824, avg_er = 0.141, #>=1:0.5 = 2644:244920
X asinh: max_er = 0x5d1b55c0 2.9095, avg_er = 0.301, #>=1:0.5 = 464816:4941040
X asinh: max_er = 0x5ee3fd1a 2.9654, avg_er = 0.301, #>=1:0.5 = 464616:4938660
X asinh: max_er = 0x5d1b55c0 2.9095, avg_er = 0.301, #>=1:0.5 = 464816:4941040
X atan:  max_er = 0x1089602a 0.5168, avg_er = 0.185, #>=1:0.5 = 0:2956
X atan:  max_er = 0x208b8696 1.0170, avg_er = 0.325, #>=1:0.5 = 936:7876354
X atan:  max_er = 0x208b8696 1.0171, avg_er = 0.275, #>=1:0.5 = 1439:4587521
X atan:  max_er = 0x208b8696 1.0170, avg_er = 0.274, #>=1:0.5 = 1439:4587521
X atanh: max_er = 0x2e56025d 1.4480, avg_er = 0.016, #>=1:0.5 = 2584:180474
X atanh: max_er = 0x5e75b4c4 2.9518, avg_er = 0.045, #>=1:0.5 = 359898:667576
X atanh: max_er = 0x5e325a52 2.9437, avg_er = 0.032, #>=1:0.5 = 180535:437158
X atanh: max_er = 0x5e75b4c4 2.9518, avg_er = 0.031, #>=1:0.5 = 182422:434622
X cbrt:  max_er =  0xffffec5 0.5000, avg_er = 0.250, #>=1:0.5 = 0:0
X cbrt:  max_er = 0x1fffffff 0.9999, avg_er = 0.497, #>=1:0.5 = 1724:8326990
X cbrt:  max_er = 0x1fffffff 1.0000, avg_er = 0.499, #>=1:0.5 = 1636:8355750
X cbrt:  max_er = 0x1fffffff 0.9999, avg_er = 0.498, #>=1:0.5 = 1380:8355494
X ceil:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X ceil:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X ceil:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X ceil:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0

The above errors are a bit larger than for the default FP_RN, but not too
bad.

X cos:   max_er = 0x10074700 0.5009, avg_er = 0.138, #>=1:0.5 = 0:2512
X cos:   max_er = 0x1bb65457418000 14529186.7267, avg_er = 43.479, #>=1:0.5 = 458712:4761870
X cos:   max_er = 0x1bb65477410000 14529187.7267, avg_er = 43.508, #>=1:0.5 = 525970:4867436
X cos:   max_er = 0x1bb65457418000 14529186.7267, avg_er = 43.486, #>=1:0.5 = 488650:4793876

The algorithm breaks down for cosf.  This is with my i386 system not using
the i387 for sin like -current still does (neither uses it for sinf).
Otherwise these errors of a mere 14 Mulps would be masked by errors of 17
Gulps.  That is 17 Gulps of error in 24-bit precision from the reference
function.  Multiply by 2**29 for the error in double precision.

X cosh:  max_er = 0x104af206 0.5091, avg_er = 0.019, #>=1:0.5 = 0:2180
X cosh:  max_er = 0x380000001fffffff 7516192768.9999, avg_er = 809.671, #>=1:0.5 = 8088592:8619926
X cosh:  max_er = 0x1d9252a8f 14.7858, avg_er = 0.516, #>=1:0.5 = 6750688:8175486
X cosh:  max_er = 0x380000001fffffff 7516192768.9999, avg_er = 809.671, #>=1:0.5 = 8088592:8619926

The algorithm breaks down for cosh, presumably due to its breakdown for
exp.  This is with my i386 system not using the i387 for exp like
-current still does (neither uses it for expf).  The glibc thread says
that the 32-bit exp worked but the 64-bit exp didn't work.  This would
be explained by algorithmic errors in the non-i387 version only, and
glibc using the i387 in 32-bit mode only for the same bad/historical
reasons as FreeBSD.

X erf:   max_er = 0x14faf158 0.6556, avg_er = 0.126, #>=1:0.5 = 0:109523
X erf:   max_er = 0x4c3e1ab7 2.3825, avg_er = 0.260, #>=1:0.5 = 153763:4375088
X erf:   max_er = 0x4041d2d3 2.0081, avg_er = 0.253, #>=1:0.5 = 67182:4227700
X erf:   max_er = 0x4a70f434 2.3262, avg_er = 0.252, #>=1:0.5 = 78494:4227994
X exp:   max_er = 0x1047a00a 0.5087, avg_er = 0.033, #>=1:0.5 = 0:2032
X exp:   max_er = 0x3800000020000000 7516192769.0000, avg_er = 1343.574, #>=1:0.5 = 7259405:8747670
X exp:   max_er = 0x39a138016 28.8149, avg_er = 1.162, #>=1:0.5 = 7256189:8776222
X exp:   max_er = 0x3800000020000000 7516192769.0000, avg_er = 1343.574, #>=1:0.5 = 7259405:8747670
X exp2:  max_er = 0x1006a936 0.5008, avg_er = 0.033, #>=1:0.5 = 0:581
X exp2:  max_er = 0x380000001fffffff 7516192768.9999, avg_er = 1056.147, #>=1:0.5 = 6412147:8351665
X exp2:  max_er = 0x20669100 1.0126, avg_er = 0.500, #>=1:0.5 = 6429089:8371868
X exp2:  max_er = 0x380000001fffffff 7516192768.9999, avg_er = 1056.147, #>=1:0.5 = 6412147:8351665
X expm1: max_er = 0x105e998d 0.5115, avg_er = 0.031, #>=1:0.5 = 0:508
X expm1: max_er = 0x380000001fffffff 7516192768.9999, avg_er = 1342.318, #>=1:0.5 = 3985247:4976979
X expm1: max_er = 0x231f0351 1.0976, avg_er = 0.061, #>=1:0.5 = 1888:1014283
X expm1: max_er = 0x380000001fffffff 7516192768.9999, avg_er = 1342.317, #>=1:0.5 = 3986102:4957995

All the functions that are exp*f or use exp*f have problems, except exp*f
in RP mode.  coshf has lesser problems in RP mode.  Its errors are then
large but not enormous.  Probably a small error in expf turns into a larger
one after addition.  (My coshf uses an expf kernel like coshl uses an
expl kernel in -current.)

X fabs:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X fabs:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X fabs:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X fabs:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X floor: max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X floor: max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X floor: max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X floor: max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X j0:    max_er = 0x2f51e6f4f1c 6056.9511, avg_er = 0.390, #>=1:0.5 = 2401634:5079654
X j0:    max_er = 0x109227b28a731a 8687933.5794, avg_er = 17.443, #>=1:0.5 = 6602082:9050730
X j0:    max_er = 0x80108bbf4ba8d7e8 17188543994.3644, avg_er = 29.127, #>=1:0.5 = 5773002:7488366
X j0:    max_er = 0x109227b28a731a 8687933.5794, avg_er = 17.492, #>=1:0.5 = 6521176:9046266
X j1:    max_er = 0x1b08a6a7544 3460.3255, avg_er = 0.395, #>=1:0.5 = 2415186:5077432
X j1:    max_er = 0x800eac41530e07cc 17187561994.5954, avg_er = 28.930, #>=1:0.5 = 4387856:7844188
X j1:    max_er = 0x64e89d1202af4 3306574.5352, avg_er = 16.048, #>=1:0.5 = 6010996:7994846
X j1:    max_er = 0x800eac41530e07ca 17187561994.5954, avg_er = 29.129, #>=1:0.5 = 5858116:8158408
X lgamma:max_er = 0x1d957c2a8000 60587.8802, avg_er = 0.232, #>=1:0.5 = 140016:899042
X lgamma:max_er = 0x65ee2cf0b75af4 53440871.5223, avg_er = 1682.939, #>=1:0.5 = 1398802:7250814
X lgamma:max_er = 0xd6b9b6e60000 439757.7156, avg_er = 0.476, #>=1:0.5 = 920256:6777199
X lgamma:max_er = 0x65ee2cf0b75af4 53440871.5223, avg_er = 1682.914, #>=1:0.5 = 1097301:6940713

Besel and gamma functions have large errors even in RN.  These get worse
in other rounding modes.

X log:   max_er =  0xfffffe6 0.5000, avg_er = 0.124, #>=1:0.5 = 0:0
X log:   max_er = 0x1fffffdd 0.9999, avg_er = 0.249, #>=1:0.5 = 0:4178615
X log:   max_er = 0x1fffffdd 1.0000, avg_er = 0.249, #>=1:0.5 = 0:4177376
X log:   max_er = 0x1fffff83 0.9999, avg_er = 0.249, #>=1:0.5 = 0:4178462
X log10: max_er =  0xfffff6d 0.5000, avg_er = 0.125, #>=1:0.5 = 0:0
X log10: max_er = 0x1fffffff 0.9999, avg_er = 0.249, #>=1:0.5 = 6:4177256
X log10: max_er = 0x1fffffff 1.0000, avg_er = 0.250, #>=1:0.5 = 6:4177785
X log10: max_er = 0x1fffffff 0.9999, avg_er = 0.249, #>=1:0.5 = 6:4178059
X log1p: max_er = 0x10288a73 0.5049, avg_er = 0.088, #>=1:0.5 = 0:604
X log1p: max_er = 0x2028aaef 1.0049, avg_er = 0.174, #>=1:0.5 = 341:2883620
X log1p: max_er = 0x2025c234 1.0047, avg_er = 0.175, #>=1:0.5 = 251:2896319
X log1p: max_er = 0x2028aaef 1.0049, avg_er = 0.173, #>=1:0.5 = 300:2870804
X log2:  max_er = 0x100128e0 0.5001, avg_er = 0.125, #>=1:0.5 = 0:17
X log2:  max_er = 0x20002e79 1.0000, avg_er = 0.249, #>=1:0.5 = 5:4177477
X log2:  max_er = 0x20005cf1 1.0001, avg_er = 0.250, #>=1:0.5 = 10:4187131
X log2:  max_er = 0x20011f0c 1.0001, avg_er = 0.248, #>=1:0.5 = 1:4168439

log and log10 still use the i387, for the test and reference function, so
are deceptively consistent/accurate.  The others use software that calls
the hardware log.  Apparently the software has no algorithmic problems.

X logb:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X logb:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X logb:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X logb:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X rint:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X rint:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X rint:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X rint:  max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X round: max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X round: max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X round: max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X round: max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X significand:max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X significand:max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X significand:max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X significand:max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X sin:   max_er = 0x10073f96 0.5009, avg_er = 0.137, #>=1:0.5 = 0:2436
X sin:   max_er = 0x117c97fcf3c000 9168063.9047, avg_er = 39.254, #>=1:0.5 = 457300:4795627
X sin:   max_er = 0x117c981cf34000 9168064.9047, avg_er = 39.281, #>=1:0.5 = 524127:4853666
X sin:   max_er = 0x117c97fcf3c000 9168063.9047, avg_er = 39.260, #>=1:0.5 = 486966:4806289
X sinh:  max_er = 0x10431a04 0.5082, avg_er = 0.019, #>=1:0.5 = 0:1734
X sinh:  max_er = 0x380000001fffffff 7516192768.9999, avg_er = 809.674, #>=1:0.5 = 8089378:8598420
X sinh:  max_er = 0x380000001fffffff 7516192769.0000, avg_er = 1428.864, #>=1:0.5 = 4108592:4646473
X sinh:  max_er = 0x380000001fffffff 7516192768.9999, avg_er = 1428.893, #>=1:0.5 = 4123299:4654026
X sqrt:  max_er =  0xffffc47 0.5000, avg_er = 0.124, #>=1:0.5 = 0:0
X sqrt:  max_er = 0x1ffffcca 0.9999, avg_er = 0.247, #>=1:0.5 = 0:4147423
X sqrt:  max_er = 0x1fffd9d7 1.0000, avg_er = 0.250, #>=1:0.5 = 0:4192033
X sqrt:  max_er = 0x1ffffcca 0.9999, avg_er = 0.247, #>=1:0.5 = 0:4147423
X tan:   max_er = 0x1997ff78 0.7998, avg_er = 0.150, #>=1:0.5 = 0:1188178
X tan:   max_er = 0xe0f2b5b350a325 117937581.6035, avg_er = 1715.855, #>=1:0.5 = 1378705:5016746
X tan:   max_er = 0xe0f2b5b35198ca 117937581.6038, avg_er = 1715.881, #>=1:0.5 = 1385644:5030977
X tan:   max_er = 0xe0f2b5b35479ad 117937581.6040, avg_er = 1715.880, #>=1:0.5 = 1385644:5030977
X tanh:  max_er = 0x11710b6d 0.5450, avg_er = 0.016, #>=1:0.5 = 0:5702
X tanh:  max_er = 0x349ef98c7 26.3104, avg_er = 0.981, #>=1:0.5 = 14807030:16230146
X tanh:  max_er = 0x367cb0c6c 27.2436, avg_er = 0.272, #>=1:0.5 = 3378729:4287978
X tanh:  max_er = 0x369ef98c7 27.3104, avg_er = 0.516, #>=1:0.5 = 7436992:8380189
X trunc: max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X trunc: max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X trunc: max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X trunc: max_er =          0 0.0000, avg_er = 0.000, #>=1:0.5 = 0:0
X y0:    max_er = 0x830232be000 16769.0991, avg_er = 0.315, #>=1:0.5 = 1356271:4560040
X y0:    max_er = 0x1a77ca8c125e1f 13876820.3772, avg_er = 10.160, #>=1:0.5 = 5678376:7876031
X y0:    max_er = 0x800c3393437ab82d 17186266266.1088, avg_er = 1038.278, #>=1:0.5 = 4674471:7301111
X y0:    max_er = 0x1a77cacc125e17 13876822.3772, avg_er = 9.955, #>=1:0.5 = 3116737:4399584
X y1:    max_er = 0x859fa2773502b 4378577.2328, avg_er = 1.307, #>=1:0.5 = 1472099:4660542
X y1:    max_er = 0xc45f308dc9c883 102955396.4308, avg_er = 1577.846, #>=1:0.5 = 4247855:7239448
X y1:    max_er = 0xc45f308dc9c883 102955396.4309, avg_er = 1572.738, #>=1:0.5 = 4781943:7380419
X y1:    max_er = 0x19ddad63a68667 13561195.1140, avg_er = 14.622, #>=1:0.5 = 2746030:3875201

The other trig, hyperbolic and Bessel functions are consistently broken.

pow() is probably broken like exp().  My quick test didn't reach it.  It is
mentioned in the glibc thread.  The glibc fixes for exp seem to be more
subtle than clobbering the rounding mode.

Bruce

From owner-freebsd-numerics@FreeBSD.ORG  Fri Mar 13 13:44:00 2015
Return-Path: <owner-freebsd-numerics@FreeBSD.ORG>
Delivered-To: freebsd-numerics@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 0446ABE4
 for <freebsd-numerics@freebsd.org>; Fri, 13 Mar 2015 13:44:00 +0000 (UTC)
Received: from mail110.syd.optusnet.com.au (mail110.syd.optusnet.com.au
 [211.29.132.97]) by mx1.freebsd.org (Postfix) with ESMTP id 603A9AD9
 for <freebsd-numerics@freebsd.org>; Fri, 13 Mar 2015 13:43:59 +0000 (UTC)
Received: from c211-30-166-197.carlnfd1.nsw.optusnet.com.au
 (c211-30-166-197.carlnfd1.nsw.optusnet.com.au [211.30.166.197])
 by mail110.syd.optusnet.com.au (Postfix) with ESMTPS id 1972C7819C2;
 Sat, 14 Mar 2015 00:43:48 +1100 (AEDT)
Date: Sat, 14 Mar 2015 00:43:47 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
Subject: Re: libm functions sensitive to current rounding mode
In-Reply-To: <20150313224213.W2994@besplex.bde.org>
Message-ID: <20150313234029.N6922@besplex.bde.org>
References: <CAJgzZor=EgG2+E3L6MW-6DD7geUy=Ensa7G1B=viQBXLZKciLw@mail.gmail.com>
 <20150313192625.B885@besplex.bde.org> <20150313224213.W2994@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.1 cv=Za4kaKlA c=1 sm=1 tr=0
 a=KA6XNC2GZCFrdESI5ZmdjQ==:117 a=PO7r1zJSAAAA:8 a=kj9zAlcOel0A:10
 a=JzwRw_2MAAAA:8 a=pijS6xPEOuVcagq0fuIA:9 a=gQ6FDVexZt2O46Ez:21
 a=Q-miSVZZwoN2xv3S:21 a=CjuIK1q_8ugA:10
Cc: enh <enh@google.com>, freebsd-numerics@freebsd.org
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 13 Mar 2015 13:44:00 -0000

On Fri, 13 Mar 2015, Bruce Evans wrote:

> On Fri, 13 Mar 2015, Bruce Evans wrote:
>
>> On Thu, 12 Mar 2015, enh via freebsd-numerics wrote:
>> ...
>> - ... but FreeBSD software sin has an enormous error for DOWNWARDS on
>>  1.570796 -- it rounds in the wrong direction by 0x11553 ulps :-(.
>>  About 1 Mulp (1 Mulp = 1 mega-ulp).
>> ...
> A quick rerun of the test finds problems in much the same areas as the
> glibc thread.  I must have put off fixing this since most functions work
> OK and there are larger bugs to fix (like the multi-Pulp errors given
> by using the i387).
> X exp:   max_er = 0x1047a00a 0.5087, avg_er = 0.033, #>=1:0.5 = 0:2032
> X exp:   max_er = 0x3800000020000000 7516192769.0000, avg_er = 1343.574, 
> #>=1:0.5 = 7259405:8747670
> X exp:   max_er = 0x39a138016 28.8149, avg_er = 1.162, #>=1:0.5 = 
> 7256189:8776222
> X exp:   max_er = 0x3800000020000000 7516192769.0000, avg_er = 1343.574, 
> #>=1:0.5 = 7259405:8747670
> X exp2:  max_er = 0x1006a936 0.5008, avg_er = 0.033, #>=1:0.5 = 0:581
> X exp2:  max_er = 0x380000001fffffff 7516192768.9999, avg_er = 1056.147, 
> #>=1:0.5 = 6412147:8351665
> X exp2:  max_er = 0x20669100 1.0126, avg_er = 0.500, #>=1:0.5 = 
> 6429089:8371868
> X exp2:  max_er = 0x380000001fffffff 7516192768.9999, avg_er = 1056.147, 
> #>=1:0.5 = 6412147:8351665
> X expm1: max_er = 0x105e998d 0.5115, avg_er = 0.031, #>=1:0.5 = 0:508
> X expm1: max_er = 0x380000001fffffff 7516192768.9999, avg_er = 1342.318, 
> #>=1:0.5 = 3985247:4976979
> X expm1: max_er = 0x231f0351 1.0976, avg_er = 0.061, #>=1:0.5 = 1888:1014283
> X expm1: max_er = 0x380000001fffffff 7516192768.9999, avg_er = 1342.317, 
> #>=1:0.5 = 3986102:4957995
>
> All the functions that are exp*f or use exp*f have problems, except exp*f
> in RP mode.  coshf has lesser problems in RP mode.  Its errors are then
> large but not enormous.  Probably a small error in expf turns into a larger
> one after addition.  (My coshf uses an expf kernel like coshl uses an
> expl kernel in -current.)

Bah, I just remember that exp and trig functions intentionally depend
on round-to-nearest mode, since other modes aren't really supported
and efficiency is gained by assuming this mode.  The optimizations
started in the range reduction for exp2*.  I didn't like exp2* being
so much faster than more important functions, and copied some of its
methods to trig and plain exp.  Poorly documented copies of the
optimization replicated.  My version now uses inline functions in
math_private.h for this.  There is still a problem choosing the
algorithm.  To work, calculations like double x; ... x + 0x1p52 - 0x1p52
need not only round-to-nearest, but also no extra precision.  With
extra precision, a different adjustment must be used.  But if the caller
can change the precision, this adjustment varies at runtime.  On i386,
it assumed that the precision is normally 53 bits, and ENTERI() does the
necessary switch to 64-bit precision which matches the variation of
the adjustment to 0x1p63.

Here is my current centralized version:

X /*
X  * The rnint() family rounds to the nearest integer for a restricted range
X  * range of args (up to about 2**MANT_DIG).  We assume that the current
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
X  * rounding mode is FE_TONEAREST so that this can be done efficiently.
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
X  * Extra precision causes more problems in practice, and we only centralize
X  * this here to reduce those problems, and have not solved the efficiency
X  * problems.  The exp2() family uses a more delicate version of this that
X  * requires extracting bits from the intermediate value, so it is not
X  * centralized here and should copy any solution of the efficiency problems.
X  */
X 
X static inline double
X rnint(__double_t x)
X {
X 	/*
X 	 * This casts to double to kill any extra precision.  This depends
X 	 * on the cast being applied to a double_t to avoid compiler bugs
X 	 * (this is a cleaner version of STRICT_ASSIGN()).  This is
X 	 * inefficient if there actually is extra precision, but is hard
X 	 * to improve on.  We use double_t in the API to minimise conversions
X 	 * for just calling here.  Note that we cannot easily change the
X 	 * magic number to the one that works directly with double_t, since
X 	 * the rounding precision is variable at runtime on x86 so the
X 	 * magic number would need to be variable.  Assuming that the
X 	 * rounding precision is always the default is too fragile.  This
X 	 * and many other complications will move when the default is
X 	 * changed to FP_PE.
X 	 */
X 	return ((double)(x + 0x1.8p52) - 0x1.8p52);
X }
X 
X static inline float
X rnintf(__float_t x)
X {
X 	/*
X 	 * As for rnint(), except we could just call that to handle the
X 	 * extra precision case, usually without losing efficiency.
X 	 */
X 	return ((float)(x + 0x1.8p23F) - 0x1.8p23F);

Switching the rounding mode to FP_RN here fixed the large error in expf().
I don't actually use this, since the cast to float is slow.  I actually
use

X }
X 
X #ifdef LDBL_MANT_DIG
X /*
X  * The complications for extra precision are smaller for rnintl() since it
X  * can safely assume that the rounding precision has been increased from
X  * its default to FP_PE on x86.  We don't exploit that here to get small
X  * optimizations from limiting the rangle to double.  We just need it for
X  * the magic number to work with long doubles.  ld128 callers should use
X  * rnint() instead of this if possible.  ld80 callers should prefer
X  * rnintl() since for amd64 this avoids swapping the register set, while
X  * for i386 it makes no difference (assuming FP_PE), and for other arches
X  * it makes little difference.
X  */
X static inline long double
X rnintl(long double x)
X {
X 	return (x + __CONCAT(0x1.8p, LDBL_MANT_DIG) / 2 -
X 	    __CONCAT(0x1.8p, LDBL_MANT_DIG) / 2);
X }
X #endif /* LDBL_MANT_DIG */
X 
X /*
X  * irint() and irint64() give the same result as casting to their integer
X  * return type provided their arg is a floating point integer.  They can
X  * sometimes be more efficient because no rounding is required.
X  */
X #if (defined(amd64) || defined(__i386__)) && defined(__GNUCLIKE_ASM)
X #define	irint(x)						\
X     (sizeof(x) == sizeof(float) &&				\
X     sizeof(__float_t) == sizeof(long double) ? irintf(x) :	\
X     sizeof(x) == sizeof(double) &&				\
X     sizeof(__double_t) == sizeof(long double) ? irintd(x) :	\
X     sizeof(x) == sizeof(long double) ? irintl(x) : (int)(x))
X #else
X #define	irint(x)	((int)(x))
X #endif
X 
X #define	i64rint(x)	((int64_t)(x))	/* only needed for ld128 so not opt. */
X 
X #if defined(__i386__) && defined(__GNUCLIKE_ASM)
X static __inline int
X irintf(float x)
X {
X 	int n;
X 
X 	asm("fistl %0" : "=m" (n) : "t" (x));
X 	return (n);
X }
X 
X static __inline int
X irintd(double x)
X {
X 	int n;
X 
X 	asm("fistl %0" : "=m" (n) : "t" (x));
X 	return (n);
X }
X #endif
X 
X #if (defined(__amd64__) || defined(__i386__)) && defined(__GNUCLIKE_ASM)
X static __inline int
X irintl(long double x)
X {
X 	int n;
X 
X 	asm("fistl %0" : "=m" (n) : "t" (x));
X 	return (n);
X }
X #endif

amd64 doesn't need irintf() or irintd() since it has SSE instructions that
work better than fistl since they don't depend on the rounding mode and
don't require using a different register set, and compilers use these
automatically so the above optimizations are not useful.  fistl uses the
current rounding mode, but the rounding mode doesn't matter so much for
irint*() because its arg is already an integer for its intended uses.

The above is used for range reduction in trig functions too.  See
e_rem_pio2*.[ch].  Old versions of these used the following very slow
code so they worked in all rounding modes:  first, a diff to use the
above:

X Index: e_rem_pio2.c
X ===================================================================
X RCS file: /home/ncvs/src/lib/msun/src/e_rem_pio2.c,v
X retrieving revision 1.25
X diff -u -2 -r1.25 e_rem_pio2.c
X --- e_rem_pio2.c	17 Nov 2012 01:50:07 -0000	1.25
X +++ e_rem_pio2.c	11 Nov 2013 09:17:08 -0000
X @@ -128,12 +128,6 @@
X  	if(ix<0x413921fb) {	/* |x| ~< 2^20*(pi/2), medium size */
X  medium:
X -	    /* Use a specialized rint() to get fn.  Assume round-to-nearest. */
X -	    STRICT_ASSIGN(double,fn,x*invpio2+0x1.8p52);
X -	    fn = fn-0x1.8p52;
X -#ifdef HAVE_EFFICIENT_IRINT
X +	    fn = rnint((double_t)x*invpio2);
X  	    n  = irint(fn);
X -#else
X -	    n  = (int32_t)fn;
X -#endif
X  	    r  = x-fn*pio2_1;
X  	    w  = fn*pio2_1t;	/* 1st round good to 85 bit */

(The STRICT_ASSIGN() here is very slow on i386.  Using double_t does it
it in essentially the same way.  Note that rnint() returns double_t, as
is best for better-written functions.  The slow conversion occurs on
assignment to 'double fn' here.  fn and most other variables here should
have type double_t to avoid conversions.)

X Index: e_rem_pio2.c
X ===================================================================
X RCS file: /home/ncvs/src/lib/msun/src/e_rem_pio2.c,v
X retrieving revision 1.1
X diff -u -2 -r1.1 e_rem_pio2.c
X --- e_rem_pio2.c	19 Aug 1994 09:39:43 -0000	1.1
X +++ e_rem_pio2.c	11 Nov 2013 09:17:08 -0000
X @@ -98,15 +62,75 @@
X  	GET_HIGH_WORD(hx,x);		/* high word of x */
X  	ix = hx&0x7fffffff;
X +#if 0 /* Must be handled in caller. */
X  	if(ix<=0x3fe921fb)   /* |x| ~<= pi/4 , no need for reduction */
X  	    {y[0] = x; y[1] = 0; return 0;}
X -	if(ix<=0x413921fb) { /* |x| ~<= 2^19*(pi/2), medium size */
X -	    t  = fabs(x);
X -	    n  = (int32_t) (t*invpio2+half);
X -	    fn = (double)n;
X -	    r  = t-fn*pio2_1;
X -	    w  = fn*pio2_1t;	/* 1st round good to 85 bit */
X ...
X +	    fn = rnint((double_t)x*invpio2);
X +	    n  = irint(fn);
X +	    r  = x-fn*pio2_1;
X +	    w  = fn*pio2_1t;	/* 1st round good to 85 bit */
X ...

The old method is:
- work with t = fabs(x) so that we know which way the rounding goes.
   Elswhere, working with |x| often requires another slow step to
   adjust things based on the sign of x.
- add half.  This is part of manual rounding to nearest
- assign to 'int n'.  The rounding of this is specified to be towards 0.
   This tends to be the slowest step.  It may involve changing the register
   step.  On i387 and early SSE, there is only an instruction that uses the
   current rounding mode, which is usually FP_RN and not known to the
   compiler, so the compiler must switch the rounding mode.

Inefficiencies here cost at least 20 cycles (throughput) on modern x86.
This is a lot when you are trying to make the whole function run in 50
cycles.

The algorithm depends critically on round-to-nearest.  Another algorithm
is to extract the bits of x*invpio2 and adjust them in some way.  The
equivalent of rounding to nearest is to add a bias before right-shifting
the bits to get n.  Then assign n to fn.  This tends to be slower because 
the FPU has to wait for the integer ALU to calculate n.  The above produces
fn as soon as possible.  The integer ALU has to wait.  This wait can be
a bottleneck, but it is usually not as slow as making the FPU wait.

exp has a chance of rounding results in the correct direction if it just
does arg reduction in that direction, since it is monotic.  At least for
FP_RP and FP_RM.  Not for FP_RZ -- rounding of the result towards zero
requires rounding of the arg towards minus infinity.

Trig functions have no similar chances, since they are not monotonic.
Arg reduction almost always gives an arg that is too large or too small.
Then the result may be too large or too small, respectively, depending
on the slope of the function and the rounding mode.  I don't plan to
fix this.  Just keep the error down to 1.5 ulps.

Bruce