Date: Wed, 27 Feb 2019 08:19:06 -0800
From: Steve Kargl
To: Bruce Evans
Cc: freebsd-numerics@freebsd.org
Subject: Re: Update ENTERI() macro
Message-ID: <20190227161906.GA77785@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu

On Wed, Feb 27, 2019 at 09:15:52PM +1100, Bruce Evans wrote:
> On Tue, 26 Feb 2019, Steve Kargl wrote:
>
> > On Wed, Feb 27, 2019 at 05:05:15PM +1100, Bruce Evans wrote:
> >> On Tue, 26 Feb 2019, Steve Kargl wrote:
> >* ...
> >>> Update the ENTERI() macro in math_private.h to take a parameter.
> >> ...
> >> I don't like this. It churns and complicates all the simple cases
> >> that only need ENTERI(). It bogotifies the existence of ENTERIT(),
> > ...
> > Okay. The other option is an ENTERC() and RETURNC() as
> > we need to toggle FP_PE for long double complex functions.
> > I suppose I could follow the one example currently in the
> > tree that uses
> >
> >    ENTERIT(long double complex)
> >
> > I find it somewhat odd that we have
> >
> >    ENTERI()   /* Implicit declaration of __retval to long double. */
> >
> > but must use directly ENTERIT(long double complex).
>
> ENTERI() hard-codes the long double for simplicity. Remember, it is only
> needed for long double precision on i386. But I forgot about long double
> complex types, and didn't dream about indirect long double types in
> sincosl().

That simplicity does not work for long double complex. We will need
either ENTERIC as in

#define ENTERIC() ENTERIT(long double complex)

or a direct use of ENTERIT as you have done in s_clogl.c

>
> > ...
> >>> -#define RETURNI(x) RETURNF(x)
> >>> +#define ENTERI(a)
> >>> +#define RETURNI(a) RETURNF(a)
> >>>  #define ENTERV()
> >>>  #define RETURNV() return
> >>>  #endif
> >>
> >> This also changes RETURNI(), by unimproving its parameter name. 'x' for
> >> ENTERI() wasn't a very good name for a type, but is good for a variable.
> >> 'x' for RETURNI() is slightly worse than 'r', but better than 'a'
> >
> > The renaming is for consistency. I can use 'r'.
>
> 'r' is not quite right either, since the arg can be and is often an
> expression. 'a' is good for 'arg'.
>
> >> ...
> >> But I now see 3 more problems. The return in RETURNI() is not direct,
> >> but goes through the macro RETURNF(x). In the committed version, this
> >> is a default that just returns x, but in my version it returns
> >> hackdouble_t(x) or hackfloat_t(x) in some cases (no cases are needed
> >> for long doubles, so there is no interaction with ENTERI()/LEAVEI(),
> >> and I only do this in a few simple cases not including any with
> >> complex types).
> >
> > I'm fine with making ENTERI() only toggle precision, and adding
> > a LEAVEI() to reset precision. RETURNI(r) would then be
> >
> > #define RETURNI(r)  \
> > do {                \
> >     LEAVEI();       \
> >     return (r);     \
> > } while (0)
>
> No, r may be an expression, so it must be evaluated before LEAVEI(). This
> is the reason for existence of the variable to hold the result.

So, we'll need RETURNI for long double and one for long double complex.
Or, we give RETURNI a second parameter, which is the input parameter of
the function

#define RETURNI(x, r)   \
do {                    \
    x = (r);            \
    LEAVEI();           \
    return (x);         \
} while (0)

This will cause a lot of churn. So, it seems that ENTERIC is the way
forward.

> >> [... about complications for the general case]
>
> >> This reminds me of a reason why I don't like sincos*(). Its API
> >> requires destruction of efficiency and accuracy by returning the values
> >> indirectly. On i386 with not very old CPUs, this costs about 8 cycles per
> >> long double value. Float and double values cost about half as much. On
> >> amd64, the long double case is the same and the float and double cases
> >> are faster.
> >
> > Not sure your efficiency claim holds. I've seen significant improvements
> > in cexp and cexpf where sin[f]() and cos[f]() are replaced by
> > sincos[f]. On my core2 running i386 freebsd, I see 0.1779 usecs/call
> > for cexpf with sinf and cosf and 0.12522 usecs/call for sincosf.
> > Yes, that's a 29.6% improvement. For cexp the numbers are 0.2697
> > usecs/call for sin and cos and 0.20586 for sincos (ie, 23.7% improvement).
> > This is for z = x + I y with x and y in the non-exceptional case.
>
> Combined sin and cos probably does work better outside of benchmarks for
> sin and cos alone, since it does less work so leaves more resources for
> the more useful things.

Exactly! I have a significant amount of Fortran code that does

   z = cmplx(cos(x), sin(x))

In modern C this is 'z = CMPLX(cos(x), sin(x))'. GCC with optimization
enabled will convert this to

   z = cexp(cmplx(0,x))

where it expects cexp to optimize this to sincos(). GCC on FreeBSD
will not do this optimization because FreeBSD's libm is not C99
compliant.

> >> sinf() and cosf() on small args take only 15-20 cycles (throughput) on
> >> amd64 with not very old CPUs, so 2-8 extra cycles for the 2 indirect
> >> return values is a lot. sincosf() still ends up being slightly faster
> >> than separate sinf()/cosf().
> >
> > Seems to be much faster when used in other functions.
>
> It's hard to be much faster than 15-20 cycles. The latency is more like
> 50 cycles, with 3 sinf()'s or cosf()'s running in parallel.
>
> sincos() is far from the best possible optimization for repeated calls on
> the same or nearby args. If sin() and cos() cached the arg reduction, then
> separate sin() and cos() on the same arg would run about as fast as sincos(),
> and repeated sin()'s on the same arg would run much faster than now.
> Caching the arg reduction may also be good when the arg changes slightly.
> However, caching is slower if the args are not close. Even a 1-entry cache
> takes a long time to look up relative to the 15-20 cycles taken by sinf()
> and cosf(). Caching is complicated by signal handlers and threads. Perhaps
> the right API is one that has to ask for caching and provides the cache
> storage. Then sincos() could be:
>
>     ...
>     _dh_init(x, &dh);           /* prefill 1-entry cache dh */
>     s = _sin_cache(x, &dh, 1);  /* cache hit unless x is NaN */
>                                 /* cache misses update dh */
>     c = _cos_cache(x, &dh, 1);  /* cache hit unless x is NaN */
>     ...
>
> and with everything inlined this is little different from the current
> sincos() except for NaNs. NaNs can be cache hits too if you compare
> them as bits, but the comparison should probably be x == dhp->dh_x
> for a 1-entry cache, so as not to have to extract the bits of x.

When I worked on sincos() I tried a few variations. This included
the simplest implementation:

void
sincos(double x, double *s, double *c)
{
    *c = cos(x);
    *s = sin(x);
}

I tried argument reduction with kernels.

void
sincos(double x, double *s, double *c)
{
    /* inline argument reduction of x, done to set a */
    *c = k_cos(a);
    *s = k_sin(a);
}

And finally, the version that was committed, where k_cos and k_sin
were manually inlined and re-arranged to reduce redundant computations.

I never thought about a caching mechanism. It seems more
complicated than it may be worth.

-- 
Steve