Date:      Fri, 27 Mar 2015 13:49:03 -0700
From:      Rui Paulo <rpaulo@me.com>
To:        Eric van Gyzen <vangyzen@FreeBSD.org>
Cc:        current@FreeBSD.org
Subject:   Re: SSE in libthr
Message-ID:  <3A96AAEC-9C1C-444E-9A73-3CD2AED33116@me.com>
In-Reply-To: <5515AED9.8040408@FreeBSD.org>
References:  <5515AED9.8040408@FreeBSD.org>

On Mar 27, 2015, at 12:26, Eric van Gyzen <vangyzen@FreeBSD.org> wrote:
>
> In a nutshell:
>
> Clang emits SSE instructions on amd64 in the common path of
> pthread_mutex_unlock.  This reduces performance by a non-trivial amount.  I'd
> like to disable SSE in libthr.
>
> In more detail:
>
> In libthr/thread/thr_mutex.c, we find the following:
>
> 	#define MUTEX_INIT_LINK(m)              do {            \
> 	        (m)->m_qe.tqe_prev = NULL;                      \
> 	        (m)->m_qe.tqe_next = NULL;                      \
> 	} while (0)
>
> In 9.1, clang 3.1 emits two ordinary mov instructions:
>
> 	movq   $0x0,0x8(%rax)
> 	movq   $0x0,(%rax)
>
> Since 10.0 and clang 3.3, clang emits these SSE instructions:
>
> 	xorps  %xmm0,%xmm0
> 	movups %xmm0,(%rax)
>
> Although these look harmless enough, using the FPU can reduce performance by
> incurring extra overhead due to context-switching the FPU state.
>
> As I mentioned, this code is used in the common path of pthread_mutex_unlock.  I
> have a simple test program that creates four threads, all contending for a
> single mutex, and measures the total number of lock acquisitions over several
> seconds.  When libthr is built with SSE, as is current, I get around 53 million
> locks in 5 seconds.  Without SSE, I get around 60 million (13% more).  DTrace
> shows around 790,000 calls to fpudna versus 10 calls.  There could be other
> factors involved, but I presume that the FPU context switches account for most
> of the change in performance.
>
> Even when I add some SSE usage in the application--incidentally, these same
> instructions--building libthr without SSE improves performance from 53.5 million
> to 55.8 million (4.3%).
>
> In the real-world application where I first noticed this, performance improves
> by 3-5%.
>
> I would appreciate your thoughts and feedback.  The proposed patch is below.
>
> Eric
>
> Index: base/head/lib/libthr/arch/amd64/Makefile.inc
> ===================================================================
> --- base/head/lib/libthr/arch/amd64/Makefile.inc	(revision 280703)
> +++ base/head/lib/libthr/arch/amd64/Makefile.inc	(working copy)
> @@ -1,3 +1,8 @@
> #$FreeBSD$
>
> SRCS+=	_umtx_op_err.S
> +
> +# Using SSE incurs extra overhead per context switch,
> +# which measurably impacts performance when the application
> +# does not otherwise use FP/SSE.
> +CFLAGS+=-mno-sse

Good catch!
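
To check what a given compiler does with the two adjacent pointer stores, a
standalone file along these lines can be compiled to assembly with and
without -mno-sse (for example "cc -O2 -S" versus "cc -O2 -mno-sse -S").
This is only a sketch: struct link, struct fake_mutex and init_link() are
made-up stand-ins, not libthr's real definitions, and the output will depend
on compiler version and optimization level.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the TAILQ_ENTRY link embedded in a mutex:
 * two adjacent pointers, 16 bytes in total on amd64.
 */
struct link {
	struct link	*tqe_next;
	struct link	**tqe_prev;
};

/* Hypothetical mutex-like structure; only the link field matters here. */
struct fake_mutex {
	long		pad;
	struct link	m_qe;
};

/* Same shape as the macro quoted from thr_mutex.c. */
#define MUTEX_INIT_LINK(m)		do {		\
	(m)->m_qe.tqe_prev = NULL;			\
	(m)->m_qe.tqe_next = NULL;			\
} while (0)

/* Non-static so the compiler has to emit the two stores in the object file. */
void
init_link(struct fake_mutex *m)
{
	MUTEX_INIT_LINK(m);
}
```

Comparing the body of init_link() in the two assembly listings shows whether
the stores are merged into a single SSE store or kept as two scalar moves.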

Regarding your patch, I think we should disable even more, if possible.
How about:

CFLAGS+=        -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3
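
For anyone who wants to reproduce the measurement, a minimal sketch of such a
test program might look like the following.  This is not Eric's actual
program: NTHREADS, worker() and run_benchmark() are names made up here, and
the counting scheme is an assumption based on his description (four threads,
one mutex, count lock acquisitions over a few seconds).

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

#define NTHREADS	4	/* assumption: four contending threads */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool stop;
static unsigned long counts[NTHREADS];	/* one counter per thread */

/* Lock and unlock the shared mutex until told to stop, counting acquisitions. */
static void *
worker(void *arg)
{
	unsigned long *count = arg;

	while (!atomic_load(&stop)) {
		pthread_mutex_lock(&lock);
		(*count)++;
		pthread_mutex_unlock(&lock);
	}
	return (NULL);
}

/* Run the contention test for `seconds`; return the total lock acquisitions. */
unsigned long
run_benchmark(unsigned int seconds)
{
	pthread_t threads[NTHREADS];
	unsigned long total = 0;
	int i, started = 0;

	atomic_store(&stop, false);
	for (i = 0; i < NTHREADS; i++) {
		counts[i] = 0;
		if (pthread_create(&threads[i], NULL, worker, &counts[i]) != 0)
			break;		/* run with however many threads started */
		started++;
	}
	sleep(seconds);
	atomic_store(&stop, true);
	for (i = 0; i < started; i++) {
		pthread_join(threads[i], NULL);
		total += counts[i];	/* safe to read after the join */
	}
	return (total);
}
```

Calling run_benchmark(5) from main() and printing the result, once against
each libthr build, gives a figure to compare between the two.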

--
Rui Paulo
