Date:      Thu, 27 Jun 2013 12:34:59 +1000 (EST)
From:      Bruce Evans <>
To:        enh <>
Subject:   Re: sincos?
Message-ID:  <>
In-Reply-To: <>
References:  <>

On Wed, 26 Jun 2013, enh wrote:

> i'm a recent lurker on this list; i've inherited Android's C library, and
> among other things i'm trying to track FreeBSD's lib/msun much more closely
> than we have traditionally.

We haven't bothered with it because there are more important optimizations
to do first.

> i was just reminded of the existence of a change submitted to us (Android)
> a while back that adds a sincos/sincosf implementation cobbled together
> from your s_sin.c/s_sinf.c and s_cos.c/s_cosf.c implementations:

I couldn't read it due to a javascript problem.

> the submitter (Intel) rightly points out that at the moment GCC carefully
> optimizes paired sin/cos calls into a sincos call which we deoptimize back
> into separate sin/cos calls. i personally don't want to take on maintenance
> of this, but i would be happy to include you guys' sincos implementation if
> you had one. is there a reason you don't have one? what's the clang story
> with this optimization (it's my understanding you're moving away from GCC
> in favor of clang)?

A quick check of current speeds shows that separate sin/cos calls are fairly
efficient on corei7.  They get pipelined and run in parallel, and you can
only avoid parameter-passing and arg-reduction overheads by using a single
sincos call:

% #include <math.h>
% #define	FREQ	2010168339	/* sysctl -n machdep.tsc_freq */
% int
% main(void)
% {
% 	volatile double c, s, x;
% 	int i;
% #if 0
% 	/* 106 cycles on Athlon64 (i386): */
% 	/* 102 cycles on corei7 (amd64): */
% 	for (i = 0; i < FREQ / 10; i++)
% 		asm("fld1; fsincos; fstp %st(0); fstp %st(0)");

The i387 sincos instruction is very slow (just like all i387 instructions
except addition and multiplication).  FreeBSD still uses the slow sin and
cos instructions on i386 (except in my version), to get their slowness
and huge inaccuracy.  Fixing this is more important.

Note that the above is not a full sincos implementation, and isn't a C
implementation.  It is missing support for large args, and cheats by not
passing args or using the results or accessing memory.  However,
for the test arg of 1, the arg is not large so no special arg
reduction is needed.  Also the test arg is not very near a multiple of
pi/2, so the i387 accuracy is more than good enough for double precision
(it is good enough for long double precision).

% #else
% 	/* 255 cycles on Athlon64 (i386) :-(: */
% 	/* 74 cycles on corei7 (amd64): */
% 	for (i = 0; i < FREQ / 10; i++) {
% 		x = 1;
% 		c = cos(x);
% 		s = sin(x);
% 	}

The library implementation is complete, and the test does full parameter
passing.  However, the arg reduction is trivial for the test arg, so this
tests a case where repeating the arg reduction for sin and cos doesn't
take very long.  For medium-sized args, the library sin and cos are about
twice as slow, and the i387 is broken near multiples of pi/2.  For huge args,
the library sin and cos are 10-20 times slower and the i387 is broken for
all args.  It is in the unimportant huge-arg case that combining sin and
cos is most beneficial.  Most of the 10-20 times slowness factor is for
the arg reduction, so doing it only once would make sincos twice as
fast as sin+cos.

On corei7, the library implementation easily beats the i387 for all args
between -2*Pi and 2*Pi.  The library does special optimizations for this
range.  The i387 is also faster for a smaller range (between -Pi/4 and
Pi/4 IIRC).  More careful tests than the above give the following times
on corei7: cos: 28 cycles; sin: 24 cycles.  So the combined time of
74 cycles is not very good.

The slowness of the library implementation on Athlon64 (i386) is strange.
More careful tests than the above give the following times: cos: 51
cycles; sin: 68 cycles for args between -2*Pi and 2*Pi.  I forgot that
although my libm doesn't use i387 sin or cos, it is not optimized for
Athlon64 (it is optimized for i386 and tuned for athlon-xp).  The more
careful tests optimize it using -march=athlon*.  I thought that the
Athlon64-specific optimizations (using SSE for some things) were only
important for medium-sized args.  After changing cos and sin to cosf
and sinf, the test runs at the expected speed (72 cycles on Athlon64
(i386, with library not using Athlon64 features) and 43 cycles on corei7
(amd64)).  Optimizing double precision on i386/Athlon64 is more important.
On newer CPUs, double precision doesn't have the extra penalties relative
to float precision that it has on Athlon64, at least when the library is
optimized for the newer CPU, so i386 libm runs at about the same speed
as amd64 libm in all precisions.

Note that i387 sincos delivers long double precision for some args, while
library sinl and cosl deliver long double precision for all args, but are
quite slow.  In particular, the library isn't optimized for args between
-2*Pi and 2*Pi, but only for args between -Pi/4 and Pi/4.  The test arg
of 1 is outside of the smaller range.  After modifying the test program
to use long doubles, it takes 591 cycles for cos and sin on Athlon64 (i386)
:-(.  Optimizing this is more important.

Oops, I forgot to change the default rounding precision to 64 bits.
591 cycles is with every call to cos and every call to sin switching
the rounding precision back and forth.  After fixing this, the test
program only takes 472 cycles.  This is still larger than expected.
In more careful tests, cosl takes 122 cycles and sinl takes 180 cycles
for args in the range -2*Pi to 2*Pi.  My cosl has minor optimizations
that aren't in the committed version.  472 is interestingly more than
122+180.

On corei7, the penalties for long doubles relative to doubles are
smaller, so cosl takes only 60 cycles and sinl only 56, and the test
program only 178.  The rounding mode doesn't need switching on amd64.
178 is still interestingly more than 60+56.  Apparently, overheads
outside of the functions are larger than the time taken by each
function.  Probably this is only apparent and the overheads are really
for the separate calls messing up each other's scheduling.  Combining
the calls can give better scheduling, but optimizations related to
scheduling are hard to get right.

% #endif
% 	return (0);
% }

