Date: Thu, 27 Jun 2013 12:34:59 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: enh <enh@google.com>
Cc: freebsd-numerics@FreeBSD.org
Subject: Re: sincos?
Message-ID: <20130627112404.T1215@besplex.bde.org>
In-Reply-To: <CAJgzZopTzfYXecu7zRKhVNEEBOCtz8Z2qK8ka74c5LKZxC8mEw@mail.gmail.com>
References: <CAJgzZopTzfYXecu7zRKhVNEEBOCtz8Z2qK8ka74c5LKZxC8mEw@mail.gmail.com>


On Wed, 26 Jun 2013, enh wrote:

> i'm a recent lurker on this list; i've inherited Android's C library, and
> among other things i'm trying to track FreeBSD's lib/msun much more closely
> than we have traditionally.

We haven't bothered with it because there are more important optimizations
to do first.

> i was just reminded of the existence of a change submitted to us (Android)
> a while back that adds a sincos/sincosf implementation cobbled together
> from your s_sin.c/s_sinf.c and s_cos.c/s_cosf.c implementations:
> https://android-review.googlesource.com/#/c/47585/<https://android-review.googlesource.com/#/c/47585/1>

I couldn't read it due to a javascript problem.

> the submitter (Intel) rightly points out that at the moment GCC carefully
> optimizes paired sin/cos calls into a sincos call which we deoptimize back
> into separate sin/cos calls. i personally don't want to take on maintenance
> of this, but i would be happy to include you guys' sincos implementation if
> you had one. is there a reason you don't have one? what's the clang story
> with this optimization (it's my understanding you're moving away from GCC
> in favor of clang)?

A quick check of current speeds shows that separate sin/cos calls are
fairly efficient on corei7. They get pipelined and run in parallel, and
you can only avoid parameter passing and arg reduction overheads by using
a single call.

% #include <math.h>
%
% #define FREQ 2010168339 /* sysctl -n machdep.tsc_freq */
%
% int
% main(void)
% {
% 	volatile double c, s, x;
% 	int i;
%
% #if 0
% 	/* 106 cycles on Athlon64 (i386): */
% 	/* 102 cycles on corei7 (amd64): */
% 	for (i = 0; i < FREQ / 10; i++)
% 		asm("fld1; fsincos; fstp %st(0); fstp %st(0)");

The i387 sincos instruction is very slow (just like all i387 instructions
except addition and multiplication). FreeBSD still uses the slow sin and
cos instructions on i386 (except in my version), to get their slowness and
huge inaccuracy. Fixing this is more important.
Note that the above is not a full sincos implementation, and isn't a C
implementation. It is missing support for large args, and cheats by not
passing args or using the results or accessing memory. However, for the
test arg of 1, the arg is not large so no special arg reduction is needed.
Also the test arg is not very near a multiple of pi/2, so the i387 accuracy
is more than good enough for double precision (it is good enough for long
double precision).

% #else
% 	/* 255 cycles on Athlon64 (i386) :-(: */
% 	/* 74 cycles on corei7 (amd64): */
% 	for (i = 0; i < FREQ / 10; i++) {
% 		x = 1;
% 		c = cos(x);
% 		s = sin(x);
% 	}

The library implementation is complete, and the test does full parameter
passing. However, the arg reduction is trivial for the test arg, so this
tests a case where repeating the arg reduction for sin and cos doesn't take
very long. For medium-sized args, the library sin and cos are about twice
as slow and the i387 is broken near multiples of pi/2. For huge args, the
library sin and cos are 10-20 times slower and the i387 is broken for all
args. It is in the unimportant huge-arg case that combining sin and cos is
most beneficial. Most of the 10-20 times slowness factor is for the arg
reduction, so doing it only once would make sincos twice as fast as
sin+cos.

On corei7, the library implementation easily beats the i387 for all args
between -2*Pi and 2*Pi. The library does special optimizations for this
range. The i387 is also faster for a smaller range (between -Pi/4 and Pi/4
IIRC). More careful tests than the above give the following times on
corei7: cos: 28 cycles; sin: 24 cycles. So the combined time of 74 cycles
is not very good.

The slowness of the library implementation on Athlon64 (i386) is strange.
More careful tests than the above give the following times: cos: 51 cycles;
sin: 68 cycles for args between -2*Pi and 2*Pi.
I forgot that although my libm doesn't use i387 sin or cos, it is not
optimized for Athlon64 (it is optimized for i386 and tuned for athlon-xp).
The more careful tests optimize it using -march=athlon*. I thought that
the Athlon64-specific optimizations (using SSE for some things) were only
important for medium-sized args.

After changing sin and cos to cosf and sinf, the test runs at the expected
speed (72 cycles on Athlon64 (i386, with library not using Athlon64
features) and 43 cycles on corei7 (amd64)). Optimizing double precision on
i386/Athlon64 is more important. On newer CPUs, double precision doesn't
have the extra penalties relative to float precision that it has on
Athlon64, at least when the library is optimized for the newer CPU, so i386
libm runs at about the same speed as amd64 libm in all precisions.

Note that i387 sincos delivers long double precision for some args, while
library sinl and cosl deliver long double precision for all args, but are
quite slow. In particular, the library isn't optimized for args between
-2*Pi and 2*Pi, but only for args between -Pi/4 and Pi/4. The test arg of
1 is outside of the smaller range. After modifying the test program to use
long doubles, it takes 591 cycles for cos and sin on Athlon64 (i386) :-(.
Optimizing this is more important.

Oops, I forgot to change the default rounding precision to 64 bits. 591
cycles is with every call to cos and every call to sin switching the
rounding precision back and forth. After fixing this, the test program
only takes 472 cycles. This is still larger than expected. In more
careful tests, cosl takes 122 cycles and sinl takes 180 cycles for args in
the range -2*Pi to 2*Pi. My cosl has minor optimizations that aren't in
the committed version. 472 is interestingly more than 122+180.

On corei7, the penalties for long doubles relative to doubles are smaller,
so cosl takes only 60 cycles and sinl only 56, and the test program only
178. The rounding mode doesn't need switching on amd64.
178 is still interestingly more than 60+56. Apparently, overheads outside
of the functions are larger than the time taken by each function. Probably
this is only apparent and the overheads are really for the separate calls
messing up each other's scheduling. Combining the calls can give better
scheduling, but optimizations related to scheduling are hard to get right.

% #endif
% 	return (0);
% }

Bruce

Want to link to this message? Use this URL: <http://docs.FreeBSD.org/cgi/mid.cgi?20130627112404.T1215>