From owner-freebsd-numerics@freebsd.org  Wed Feb 27 16:19:11 2019
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id BE29A1521000
 for <freebsd-numerics@mailman.ysv.freebsd.org>;
 Wed, 27 Feb 2019 16:19:11 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 server-signature RSA-PSS (4096 bits)
 client-signature RSA-PSS (2048 bits) client-digest SHA256)
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id B8D9B84AFF
 for <freebsd-numerics@freebsd.org>; Wed, 27 Feb 2019 16:19:09 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id x1RGJ7J6077887
 (version=TLSv1.3 cipher=TLS_AES_256_GCM_SHA384 bits=256 verify=NO);
 Wed, 27 Feb 2019 08:19:07 -0800 (PST)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id x1RGJ6kv077886;
 Wed, 27 Feb 2019 08:19:06 -0800 (PST) (envelope-from sgk)
Date: Wed, 27 Feb 2019 08:19:06 -0800
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-numerics@freebsd.org
Subject: Re: Update ENTERI() macro
Message-ID: <20190227161906.GA77785@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20190226191825.GA68479@troutmask.apl.washington.edu>
 <20190227145002.P907@besplex.bde.org>
 <20190227074811.GA75972@troutmask.apl.washington.edu>
 <20190227201214.V1823@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20190227201214.V1823@besplex.bde.org>
User-Agent: Mutt/1.11.2 (2019-01-07)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 27 Feb 2019 16:19:12 -0000

On Wed, Feb 27, 2019 at 09:15:52PM +1100, Bruce Evans wrote:
> On Tue, 26 Feb 2019, Steve Kargl wrote:
> 
> > On Wed, Feb 27, 2019 at 05:05:15PM +1100, Bruce Evans wrote:
> >> On Tue, 26 Feb 2019, Steve Kargl wrote:
> >* ...
> >>> Update the ENTERI() macro in math_private.h to take a parameter.
> >> ...
> >> I don't like this.  It churns and complicates all the simple cases
> >> that only need ENTERI().  It bogotifies the existence of ENTERIT(),
> > ...
> > Okay.  The other option is an ENTERC() and RETURNC() as
> > we need to toggle FP_PE for long double complex functions.
> > I suppose I could follow the one example currently in the
> > tree that uses
> >
> > 	ENTERIT(long double complex)
> >
> > I find it somewhat odd that we have
> >
> > 	ENTERI() /* Implicit declaration of __retval to long double. */
> >
> > but must use directly ENTERIT(long double complex).
> 
> ENTERI() hard-codes the long double for simplicity.  Remember, it is only
> needed for long double precision on i386.  But I forgot about long double
> complex types, and didn't dream about indirect long double types in sincosl().

That simplicity does not work for long double complex.  We will
need either ENTERIC as in

#define ENTERIC() ENTERIT(long double complex)

or a direct use of ENTERIT, as you have done in s_clogl.c.

> 
> > ...
> >>> -#define	RETURNI(x)	RETURNF(x)
> >>> +#define	ENTERI(a)
> >>> +#define	RETURNI(a)	RETURNF(a)
> >>> #define	ENTERV()
> >>> #define	RETURNV()	return
> >>> #endif
> >>
> >> This also changes RETURNI(), by unimproving its parameter name.  'x' for
> >> ENTERI() wasn't a very good name for a type, but is good for a variable.
> >> 'x' for RETURNI() is slightly worse than 'r', but better than 'a'
> >
> > The renaming is for consistency.  I can use 'r'.
> 
> 'r' is not quite right either, since the arg can be and is often an
> expression.  'a' is good for 'arg'.
> 
> >> ...
> >> But I now see 3 more problems.  The return in RETURNI() is not direct,
> >> but goes through the macro RETURNF(x).  In the committed version, this
> >> is a default that just returns x, but in my version it returns
> >> hackdouble_t(x) or hackfloat_t(x) in some cases (no cases are needed
> >> for long doubles, so there is no interaction with ENTERI()/LEAVEI(),
> >> and I only do this in a few simple cases not including any with
> >> complex types).
> >
> > I'm fine with making ENTERI() only toggle precision, and adding
> > a LEAVEI() to reset precision.  RETURNI(r) would then be
> >
> > #define RETURNI(r)	\
> > do {		\
> >   LEAVEI();		\
> >   return (r);	\
> > } while (0)
> 
> No, r may be an expression, so it must be evaluated before LEAVEI().  This
> is the reason for existence of the variable to hold the result.

So, we'll need one RETURNI for long double and one for long double complex.
Or, we give RETURNI a second parameter, which is the input parameter of
the function, to hold the result:

#define RETURNI(x, r)	\
do {			\
   x = (r);		\
   LEAVEI();		\
   return (x);		\
} while (0)
 
This will cause a lot of churn.

So, it seems that ENTERIC is the way forward.
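
As a rough sketch of the ordering issue (not the math_private.h code): the
precision control is mocked with a plain variable, and cur_prec, seen_prec,
probe() and demo_twice() are names made up for illustration.  ENTERIC() is
ENTERIT() with the complex type, and RETURNI() evaluates its argument into
the result variable before the precision is restored.

```c
#include <assert.h>
#include <complex.h>

/* Mock precision control; stands in for fpgetprec()/fpsetprec(). */
enum mock_prec { MOCK_PD, MOCK_PE };
static enum mock_prec cur_prec = MOCK_PD;
static enum mock_prec seen_prec;	/* precision seen by the return expr */

/* Declare the result variable, save the old precision, switch to PE. */
#define	ENTERIT(returntype)			\
	returntype retval_;			\
	enum mock_prec oprec_ = cur_prec;	\
	cur_prec = MOCK_PE

#define	ENTERI()	ENTERIT(long double)
#define	ENTERIC()	ENTERIT(long double complex)

/* Evaluate (a) first, then restore the precision, then return the copy. */
#define	RETURNI(a) do {				\
	retval_ = (a);				\
	cur_prec = oprec_;			\
	return (retval_);			\
} while (0)

/* Records the precision in effect while the return expression runs. */
static long double complex
probe(long double complex z)
{
	seen_prec = cur_prec;
	return (z);
}

static long double complex
demo_twice(long double complex z)
{
	ENTERIC();
	assert(cur_prec == MOCK_PE);
	RETURNI(probe(z) * 2);	/* expression, not a plain variable */
}
```

After a call to demo_twice(), seen_prec shows that the expression ran with
the extended setting, and cur_prec is back at its previous value.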

> >> [... about complications for the general case]
> 
> >> This reminds me of a reason why I don't like sincos*().  Its API
> >> requires destruction of efficiency and accuracy by returning the values
> >> indirectly.  On i386 with not very old CPUs, this costs about 8 cycles per
> >> long double value.  Float and double values cost about half as much.  On
> >> amd64, the long double case is the same and the float and double cases
> >> are faster.
> >
> > Not sure your efficiency claim holds.  I've seen significant improvements
> > in cexp and cexpf where sin[f]() and cos[f]() are replaced by
> > sincos[f].  On my core2 running i386 freebsd, I see 0.1779 usecs/call
> > for cexpf with sinf and cosf and 0.12522 usecs/call for sincosf.
> > Yes, that's a 29.6% improvement.  For cexp the numbers are 0.2697
> > usecs/call for sin and cos and 0.20586 for sincos (ie, 23.7% improvement).
> > This is for z = x + I y with x and y in the non-exceptional case.
> 
> Combined sin and cos probably does work better outside of benchmarks for
> sin and cos alone, since it does less work so leaves more resources for
> the more useful things.

Exactly!  I have a significant amount of Fortran code that does

   z = cmplx(cos(x), sin(x))

In modern C this is 'z = CMPLX(cos(x), sin(x))'.  GCC with optimization
enabled will convert this to z = cexp(cmplx(0,x)), where it expects cexp
to optimize this to sincos().  GCC on FreeBSD will not do this optimization
because FreeBSD's libm is not C99 compliant.

> >> sinf() and cosf() on small args take only 15-20 cycles (throughput) on
> >> amd64 with not very old CPUs, so 2-8 extra cycles for the 2 indirect
> >> return values is a lot.  sincosf() still ends up being slightly faster
> >> than separate sinf()/cosf().
> >
> > Seems to be much faster when used in other functions.
> 
> It's hard to be much faster than 15-20 cycles.  The latency is more like
> 50 cycles, with 3 sinf()'s or cosf()'s running in parallel.
> 
> sincos() is far from the best possible optimization for repeated calls on
> the same or nearby args.  If sin() and cos() cached the arg reduction, then
> separate sin() and cos() on the same arg would run about as fast as sincos(),
> and repeated sin()'s on the same arg would run much faster than now.
> Caching the arg reduction may also be good when the arg changes slightly.
> However, caching is slower if the args are not close.  Even a 1-entry cache
> takes a long time to look up relative to the 15-20 cycles taken by sinf()
> and cosf().  Caching is complicated by signal handlers and threads.  Perhaps
> the right API is one where the caller asks for caching and provides the cache storage.
> Then sincos() could be:
> 
>  	...
>  	_dh_init(x, &dh);		/* prefill 1-entry cache dh */
>  	s = _sin_cache(x, &dh, 1);	/* cache hit unless x is NaN */
>  					/* cache misses update dh */
>  	c = _cos_cache(x, &dh, 1);	/* cache hit unless x is NaN */
>  	...
> 
> and with everything inlined this is little different from the current
> sincos() except for NaNs.  NaNs can be cache hits too if you compare
> them as bits, but the comparison should probably be x == dhp->dh_x
> for a 1-entry cache, so as not to have to extract the bits of x.

When I worked on sincos() I tried a few variations.  This included
the simplest implementation:

void
sincos(double x, double *s, double *c)
{
  *c = cos(x);
  *s = sin(x);
}

I tried argument reduction with kernels.

void
sincos(double x, double *s, double *c)
{
  /* Inline argument reduction of x sets a. */
  *c = k_cos(a);
  *s = k_sin(a);
}

And finally, the version that was committed, where k_cos and k_sin
were manually inlined and rearranged to reduce redundant computations.

I never thought about a caching mechanism.  It seems to be more
complicated than it may be worth.
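
For reference, a toy sketch of the 1-entry cache idea, under loud
assumptions: struct dh_cache, dh_fill(), sin_cached(), cos_cached(),
sincos_cached() and dh_misses are made-up names, the reduction is naive
(only sensible for modest |x|), and the kernels are plain Taylor
polynomials rather than the libm kernels.  The caller supplies the cache
storage, and the lookup compares x == dh->dh_x, so a NaN always misses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 1-entry cache holding the argument reduction of dh_x. */
struct dh_cache {
	double	dh_x;	/* argument that was reduced */
	double	dh_r;	/* remainder, roughly |dh_r| <= pi/4 */
	int	dh_n;	/* quadrant 0..3 of dh_x */
};

static int dh_misses;	/* instrumentation only: counts refills */

/* Naive reduction; a real libm uses a far more careful one. */
static void
dh_fill(double x, struct dh_cache *dh)
{
	static const double pio2 = 1.57079632679489661923;
	double t = x / pio2;
	long k;

	dh->dh_x = x;
	if (!(x > -1e6 && x < 1e6)) {	/* NaN, Inf, or outside sketch range */
		dh->dh_n = 0;
		dh->dh_r = (x - x) / (x - x);	/* NaN */
		return;
	}
	k = (long)(t < 0 ? t - 0.5 : t + 0.5);	/* round to nearest */
	dh->dh_n = (int)(k & 3);
	dh->dh_r = x - (double)k * pio2;
}

/* Taylor polynomials standing in for the kernels, |r| <= pi/4. */
static double
k_sin(double r)
{
	double z = r * r;

	return (r * (1 + z * (-1.0 / 6 + z * (1.0 / 120 + z * (-1.0 / 5040 +
	    z * (1.0 / 362880 + z * (-1.0 / 39916800)))))));
}

static double
k_cos(double r)
{
	double z = r * r;

	return (1 + z * (-1.0 / 2 + z * (1.0 / 24 + z * (-1.0 / 720 +
	    z * (1.0 / 40320 + z * (-1.0 / 3628800 +
	    z * (1.0 / 479001600)))))));
}

/* Refill the cache on a miss; NaN never compares equal, so it misses. */
static void
dh_lookup(double x, struct dh_cache *dh)
{
	assert(dh != NULL);
	if (x != dh->dh_x) {
		dh_fill(x, dh);
		dh_misses++;
	}
}

static double
sin_cached(double x, struct dh_cache *dh)
{
	dh_lookup(x, dh);
	switch (dh->dh_n) {
	case 0:  return (k_sin(dh->dh_r));
	case 1:  return (k_cos(dh->dh_r));
	case 2:  return (-k_sin(dh->dh_r));
	default: return (-k_cos(dh->dh_r));
	}
}

static double
cos_cached(double x, struct dh_cache *dh)
{
	dh_lookup(x, dh);
	switch (dh->dh_n) {
	case 0:  return (k_cos(dh->dh_r));
	case 1:  return (-k_sin(dh->dh_r));
	case 2:  return (-k_cos(dh->dh_r));
	default: return (k_sin(dh->dh_r));
	}
}

static void
sincos_cached(double x, double *s, double *c)
{
	struct dh_cache dh;

	dh_fill(x, &dh);		/* prefill the 1-entry cache */
	*s = sin_cached(x, &dh);	/* cache hit unless x is NaN */
	*c = cos_cached(x, &dh);	/* cache hit unless x is NaN */
}
```

Even in this toy form the bookkeeping is noticeable next to the plain
two-kernel version, and it ignores the signal-handler and thread issues
entirely.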

-- 
Steve