Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 26 Feb 2019 23:48:11 -0800
From:      Steve Kargl <sgk@troutmask.apl.washington.edu>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-numerics@freebsd.org
Subject:   Re: Update ENTERI() macro
Message-ID:  <20190227074811.GA75972@troutmask.apl.washington.edu>
In-Reply-To: <20190227145002.P907@besplex.bde.org>
References:  <20190226191825.GA68479@troutmask.apl.washington.edu> <20190227145002.P907@besplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, Feb 27, 2019 at 05:05:15PM +1100, Bruce Evans wrote:
> On Tue, 26 Feb 2019, Steve Kargl wrote:
> 
> > I have submitted
> >
> > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236063
> >
> > Update the ENTERI() macro in math_private.h to take a parameter.
> > This parameter is used by __typeof() to set the type for an
> > intermediate variable that holds the return value of a function
> > on i386 FreeBSD.  The patch updates the consumers of ENTERI().
> 
> I don't like this.  It churns and complicates all the simple cases
> that only need ENTERI().  It bogotifies the existence of ENTERIT(),
> whose reason for existence is to avoid this.  (Actually, it removes
> ENTERIT().)  __typeof() is unnecessarily unportable.  The implementation
> uses literal typesin thousands of places so gains little or less than
> nothing by not spelling types literally for ENTERIT().

Okay.  The other option is an ENTERC() and RETURNC() as
we need to toggle FP_PE for long double complex functions.
I suppose I could follow the one example currently in the
tree that use 

	ENTERIT(long double complex)

I find it somewhat odd that we have

	ENTERI() /* Implicit declaration of __retval to long double. */

but must use directly ENTERIT(long double complex).

> > ...
> > --- /data/kargl/freebsd/src/lib/msun/src/math_private.h	2018-10-31 16:15:21.809877000 -0700
> > +++ ./src/math_private.h	2019-02-25 16:54:26.897247000 -0800
> > @@ -335,15 +348,14 @@
> >
> > /* Support switching the mode to FP_PE if necessary. */
> > #if defined(__i386__) && !defined(NO_FPSETPREC)
> > -#define	ENTERI() ENTERIT(long double)
> > -#define	ENTERIT(returntype)			\
> > -	returntype __retval;			\
> > +#define	ENTERI(a)				\
> > +	__typeof(a) __retval;			\
> > 	fp_prec_t __oprec;			\
> > 						\
> > 	if ((__oprec = fpgetprec()) != FP_PE)	\
> > 		fpsetprec(FP_PE)
> > -#define	RETURNI(x) do {				\
> > -	__retval = (x);				\
> > +#define	RETURNI(a) do {				\
> > +	__retval = (a);				\
> > 	if (__oprec != FP_PE)			\
> > 		fpsetprec(__oprec);		\
> > 	RETURNF(__retval);			\
> > @@ -359,9 +371,8 @@
> > 	return;			\
> > } while (0)
> > #else
> > -#define	ENTERI()
> > -#define	ENTERIT(x)
> > -#define	RETURNI(x)	RETURNF(x)
> > +#define	ENTERI(a)
> > +#define	RETURNI(a)	RETURNF(a)
> > #define	ENTERV()
> > #define	RETURNV()	return
> > #endif
> 
> This also changes RETURNI(), by unimproving its parameter name.  'x' for
> ENTERI() wasn't a very good name for a type, but is good for a variable.
> 'x' for RETURNI() is slightly worse than 'r', but better than 'a'

The renaming is for consistency.  I can use 'r'.

> I had forgotten that ENTERIT() creates a local variable, and don't like this,
> but this is its main reason for existence.  Now I see a better way,
> especially if if we accept unportabilities like __typeof():
> 
> ENTERIT() doesn't need to declare local variable with nearly function scope,
> since the variable is only needed for returning.  Only the precision variable
> needs to have nearly function scope.  The return variable can be created in
> block scope in RETURNI().  Something like:
> 
> #define	RETURNI(r) do {				\
>  	__typeof(r) __retval;			\
>  						\
>  	__retval = (r);				\
>  	if ((__oprec = fpgetprec()) != FP_PE)	\
>  		fpsetprec(FP_PE)		\
>  	RETURNF(__retval);			\
> } while (0)
> 
> You already added ENTERV() and RETURNV().  That seemed reasonable, but now
> I don't like it.  With the above change (and of course also remove the
> variable from ENTERI()), ENTERV() becomes the same as ENTERI().  RETURNV()
> still needs to be different.
> 
> Another bug in ENTERV() is its name.  It is missing the 'I' to indicate that
> it is a complication for only i386.

You probably told at some point in the distant past what the
'I' meant, but that is long forgotten.  I'm fine with you
changing these names.  ENTERIV() and RETURNIV() are needed
for sincosl(), which is a void function.

> RETURNV() has the same naming problem, and more.  It can complement
> ENTERI() perfectly since it doesn't need to return.  It should be named
> LEAVEI(), and the return should be done by the caller.
> 
> But I now see 3 more problems.  The return in RETURNI() is not direct,
> but goes through the macro RETURNF(x).  In the committed version, this
> is a default that just returns x, but in my version it returns
> hackdouble_t(x) or hackfloat_t(x) in some cases (no cases are needed
> for long doubles, so there is no interaction with ENTERI()/LEAVEI(),
> and I only do this in a few simple cases not including any with
> complex types).

I'm fine with making ENTERI() only toggle precision, and adding
a LEAVEI() to reset precision.  RETURNI(r) would then be

#define RETURNI(r)	\
do {			\
   LEAVEI();		\
   return (r);		\
} while (0) 

ENTERV and RETURNV can then go away as ENTERV is simply
ENTERI and RETURNV is now spelled LEAVEI.

> In the hack*_t() cases, x has type double_t or float_t, and we want
> to avoid C99's requirement to destroy efficiency and accuracy on return
> by removing any extra precision in x.  (C99 only requires this
> destruction for careful code that uses double_t or float_t in the
> function.  Then since the return type is different from the internal
> type, destruction is required.  C99 also requires removing extra
> precision on assignment or casts, but most compilers with most CFLAGS
> don't do this since it is pessimal.  Careful code uses double_t and
> float_t internally so as to not depend on compiler bugs for efficiency
> and to avoid unintended loss of accuracy on assignment ands casts for
> compilers without bugs.  C11 adds further destruction for sloppy code
> that uses the return type internally.)  hackdouble_t(x) and
> hackfloat_t(x) change from type xxx_t to type xxx using a type pun so
> that the destruction is not required in the C99 case.
> 
> The 3 problems are:
> - the type pun interacts in a complicated way with changing the precision.
>    This problem is null since type type pun is not needed for long doubles
>    (unless there is a type with more precision than long double, but there
>    is no such type for any supported arch, and C99 is missing support for
>    it).  If this were a problem, then we would have to be carefull about
>    the order of the type pun and the precision change.  Do the type pun
>    first.
> 
> - also, be careful to apply any __typeof() to an arg with the correct
>    type.  Normally do the type pun first, and use the return type for
>    the local variable.  However, for printing, the arg should be the
>    type that has extra precision, and __typeof() on that doesn't give
>    the return type in general.
> 
>    In simple cases, The modified RETURNF() expands to something like
>      'return (hackfloat_t(hi + lo));',
>    where hi + lo is evalated in extra precision using float_t if possible,
>    hackfloat_t puns the resulting float_t value to plain float without
>    any destruction, and we finally return the value.
> 
>    In more complicated cases, the punned returned value is stored in a
>    variable for printing using RETURNP().  I forget if the type of this
>    variable is the return type or the arg type (printing in the return
>    type shows what callers will normally see but printing in the arg
>    type gives more details).  The caller uses RETURNP*(), and RETURNP*()
>    uses RETURNF() and RETURNF() might do the type pun.  My code is not
>    careful enough to handle all cases.  For the common hi + lo case,
>    RETURN2P[I]() is used to print hi and lo separately, so the extra
>    precision in hi + lo can be checked even on arches where hi + lo is
>    not evaluated in extra precision.
> 
> - finally, it is necessary to go through a macro to reach the printing.
>    Not many functions do this.  But most long double functions get this
>    at no extra cost (except complexity in the macros) by having to go
>    though ENTERI()/RETURNI().  Currently, the caller must decide, but
>    it is a trivaial edit to change RETURNI() to RETURNP() and a simple
>    edit to change RETURNI(hi + lo) to RETURN2P(hi, lo).  There is also
>    SUM2P() to handle more complicated cases of hi + lo.
> 
> The return value is not passed to RETURNV(), so it is harder to get
> automatic printing at no extra cost.  In s_sincosf.c and s_sincos.c,
> the macros are just not used, the same as for s_sinf.c, etc.  The
> float case double on all arches instead of double_t.  On i386, this
> reduces to a slightly pessimized use of double_t, and on arches with
> no extra precision it sacrifices a little efficiency to get more
> accuracy.  In s_sincosl.c, the values are calculated and stored in
> local variables, and then RETURNV() returns void.
> 
> This reminds me of a reason why I don't like sincos*().  Its API
> requires destruction of efficiency and accuracy by returning the values
> indirectly.  On i386 with not very old CPUs, this costs about 8 cycles per
> long double value.  Float and double values cost about half as much.  On
> amd64, the long double case is the same and the float and double cases
> are faster.

Not sure your efficiency claim holds.  I've seen significant improves 
in cexp and cexpf where sin[f]() and cos[f]() are replaced by 
sincos[f].  On my core2 running i386 freebsd, I see 0.1779 usecs/call
for cexpf with sinf and cosf and 0.12522 usecs/call for sincosf.
Yes, that's a 29.6% improvement.  For cexp the numbers are 0.2697
usecs/call for sin and cos and 0.20586 for sincos (ie, 23.7% improvement).
This is for z = x + I y with x and y in the non-exceptable case.

> sinf() and cosf() on small args take only 15-20 cycles (thoughput) on
> amd64 with not very old CPUs, so 2-8 extra cycles for the 2 indirect
> return values is a lot.  sincosf() still ends up being slightly faster
> than separate sinf()/cosf().

Seems to be much faster when used in other functions.

-- 
Steve



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20190227074811.GA75972>