Date:      Fri, 18 Dec 2015 13:15:21 -0600
From:      Eric van Gyzen <vangyzen@FreeBSD.org>
To:        David Chisnall <theraven@FreeBSD.org>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-10@freebsd.org
Subject:   Re: svn commit: r290014 - in stable/10: lib/libthr/arch/amd64 lib/libthr/arch/i386 libexec/rtld-elf/amd64 libexec/rtld-elf/i386 share/mk
Message-ID:  <56745B49.6050903@FreeBSD.org>
In-Reply-To: <71109998-711D-4ECA-9B44-5A7B1F8705F3@FreeBSD.org>
References:  <201510261621.t9QGLuL2028872@repo.freebsd.org> <71109998-711D-4ECA-9B44-5A7B1F8705F3@FreeBSD.org>

David,

I apologize for the slow reply.  Your message went to my "stable" box,
which I read less often.

On 11/14/2015 12:30, David Chisnall wrote:
> On 26 Oct 2015, at 16:21, Eric van Gyzen <vangyzen@FreeBSD.org>
> wrote:
>> 
>> One counter-argument to this change is that most applications
>> already use SIMD, and the number of applications and amount of SIMD
>> usage are only increasing.
> 
> Note that SSE and SIMD are not the same thing.  The x86-64 ABI uses
> SSE registers for floating point arguments, so even a purely scalar
> application that uses floating point will end up faulting in the SSE
> state.

I'm aware.  Using the term "SIMD" was an admittedly weak attempt to be
platform agnostic.

> I believe that the no-sse option for clang is ABI-preserving, so will
> not actually disable all SSE unless you also specify -msoft-float.

I'm afraid that's not the case:

$ cat square.c
double square(double x) { return (x*x); }

$ clang -mno-sse -c square.c
fatal error: error in backend: SSE register return with SSE disabled
clang: error: clang frontend command failed with exit code 70 (use -v to
see invocation)
FreeBSD clang version 3.7.0 (tags/RELEASE_370/final 246257) 20150906
Target: x86_64-unknown-freebsd11.0
[snip]

Shall I file the bug report, as it suggests?

> I don’t think that libthr uses floating point anywhere, but libc does
> and you only need to call one function that takes a floating point
> argument in between context switches to lose this gain on x86-64.
> With this change, we’re making the compiler emit less efficient code,
> on the assumption that nothing will touch the fpu in the quantum
> before the next context switch.  I’d really like to see the set of
> applications that you benchmarked the change with on x86-64 to reach
> the conclusion that this is a net win overall.
>
> Or, to put it another way: How many applications are multithreaded
> but don’t use any floating point code?

If I showed you the applications that I care about the most, I would
risk losing my job.  When we updated from FreeBSD 9 to 10, we measured a
significant loss in performance.  This was due to multiple factors, one
of which was that clang started using SSE widely.  We were not yet using
that version of clang for our own code, so most of the performance loss
was due to the usage of SSE in libthr.  Using -mno-sse restored the lost
performance.  It's possible that we lost performance due to SSE in other
libraries; I haven't pursued this.

These applications only use floating-point in some rare corners of
management code, not in any performance-sensitive paths.  They also
don't use libc very much.

On a recent head, I used this script

    https://people.freebsd.org/~vangyzen/thr_sse/thr_sse_file_line.sh

to generate this list

    https://people.freebsd.org/~vangyzen/thr_sse/thr_sse_file_line.txt

of line numbers in libthr that use SSE.  I manually reviewed those to
write this list:

    https://people.freebsd.org/~vangyzen/thr_sse/thr_sse_uses.txt

The vast majority of these simply aren't interesting, because they would
not be called in a performance-sensitive code path, or the code that
uses SSE pales in comparison to the weight of the surrounding code.

The only one that I find truly interesting is mutex_unlock_common(),
which uses SSE to set two pointers to NULL in its rather lightweight
"fast path".  So, I wrote this nanobenchmark

    https://people.freebsd.org/~vangyzen/thr_sse/movups/

to measure the effect of using SSE in such a way.  I ran it on five
machines and got these results:

    https://people.freebsd.org/~vangyzen/thr_sse/movups/summary.txt

As you can see, most of them show no significant difference.  One
machine, however, showed a 16.7% improvement with SSE.  I find this
fascinating, and I honestly can't explain it.  As always, I welcome
feedback.
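
For readers who do not follow the links, here is a minimal sketch of the kind of code pattern at issue. It is not the nanobenchmark linked above, and struct link, clear_link() and clear_many() are hypothetical names, not libthr's. The assumption is that a compiler with SSE enabled may merge the two adjacent NULL stores into one 16-byte store (such as movups), while -mno-sse forces two 8-byte integer stores.

```c
#include <stddef.h>

/* Hypothetical stand-in for a pair of adjacent queue pointers;
 * this is NOT libthr's actual structure layout. */
struct link {
	struct link *next;
	struct link *prev;
};

/* Two adjacent pointer stores.  With SSE enabled, the compiler may
 * combine them into a single 16-byte store of a zeroed SSE register;
 * with -mno-sse it has to use general-purpose registers instead. */
void
clear_link(struct link *l)
{
	l->next = NULL;
	l->prev = NULL;
}

/* Repeat the pattern so it can be timed.  Returns the number of
 * iterations actually performed. */
unsigned long
clear_many(struct link *l, unsigned long iterations)
{
	unsigned long i;

	for (i = 0; i < iterations; i++) {
		l->next = l;
		l->prev = l;
		clear_link(l);
		/* Compiler barrier (GCC/clang syntax) so the stores are
		 * not optimized away between iterations. */
		__asm__ __volatile__("" : : "r"(l) : "memory");
	}
	return (i);
}
```

Compiling this once with and once without -mno-sse and comparing the output of "cc -O2 -S" is one way to see which store instructions the compiler actually chose.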

I then wrote this /slightly/ more realistic microbenchmark

    https://people.freebsd.org/~vangyzen/thr_sse/mutex_bench/

which uses pthread_mutex_unlock and therefore mutex_unlock_common.  I
ran it on /that/ machine.  I got these results:

    https://people.freebsd.org/~vangyzen/thr_sse/mutex_bench/summary.txt

When libthr was compiled without SSE, throughput improved by 7.25%.
Performance of a real-world application improved by 3-5%.
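
As a rough illustration of the shape such a microbenchmark can take (again, not the code linked above), the sketch below times an uncontended lock/unlock loop. bench_mutex() is a hypothetical name. The comparison would come from running the same program against a libthr built with and without SSE, not from anything in the program itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Lock and unlock an uncontended mutex `iterations` times.
 * Returns the elapsed time in nanoseconds, or -1 on error. */
long long
bench_mutex(unsigned long iterations)
{
	pthread_mutex_t m;
	struct timespec start, end;
	unsigned long i;

	if (pthread_mutex_init(&m, NULL) != 0)
		return (-1);
	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0) {
		pthread_mutex_destroy(&m);
		return (-1);
	}
	for (i = 0; i < iterations; i++) {
		/* Checking the return values adds a little overhead,
		 * which is acceptable for a sketch. */
		if (pthread_mutex_lock(&m) != 0 ||
		    pthread_mutex_unlock(&m) != 0) {
			pthread_mutex_destroy(&m);
			return (-1);
		}
	}
	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0) {
		pthread_mutex_destroy(&m);
		return (-1);
	}
	pthread_mutex_destroy(&m);
	return ((end.tv_sec - start.tv_sec) * 1000000000LL +
	    (end.tv_nsec - start.tv_nsec));
}
```

A single-threaded loop like this only exercises the uncontended path. Any cost from saving and restoring SSE state depends on context switches, so a realistic measurement would likely need several threads and a loaded machine.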

I honestly don't like the change any more than you do.  I committed it
just because it helped us measurably, it might help others, and I doubt
it hurts anybody.  If that last point is disproven, I'll be happy to
revert it.  Now, I look forward to a lively discussion.  :)

Eric
