Date:      Fri, 27 Mar 2015 23:44:52 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Rui Paulo <rpaulo@me.com>
Cc:        Eric van Gyzen <vangyzen@FreeBSD.org>, current@FreeBSD.org
Subject:   Re: SSE in libthr
Message-ID:  <20150327214452.GR2379@kib.kiev.ua>
In-Reply-To: <3A96AAEC-9C1C-444E-9A73-3CD2AED33116@me.com>
References:  <5515AED9.8040408@FreeBSD.org> <3A96AAEC-9C1C-444E-9A73-3CD2AED33116@me.com>

On Fri, Mar 27, 2015 at 01:49:03PM -0700, Rui Paulo wrote:
> On Mar 27, 2015, at 12:26, Eric van Gyzen <vangyzen@FreeBSD.org> wrote:
> > 
> > In a nutshell:
> > 
> > Clang emits SSE instructions on amd64 in the common path of
> > pthread_mutex_unlock.  This reduces performance by a non-trivial amount.  I'd
> > like to disable SSE in libthr.
> > 
> > In more detail:
> > 
> > In libthr/thread/thr_mutex.c, we find the following:
> > 
> > 	#define MUTEX_INIT_LINK(m)              do {            \
> > 	        (m)->m_qe.tqe_prev = NULL;                      \
> > 	        (m)->m_qe.tqe_next = NULL;                      \
> > 	} while (0)
> > 
> > In 9.1, clang 3.1 emits two ordinary mov instructions:
> > 
> > 	movq   $0x0,0x8(%rax)
> > 	movq   $0x0,(%rax)
> > 
> > Since 10.0 and clang 3.3, clang emits these SSE instructions:
> > 
> > 	xorps  %xmm0,%xmm0
> > 	movups %xmm0,(%rax)
> > 
> > Although these look harmless enough, using the FPU can reduce performance by
> > incurring extra overhead due to context-switching the FPU state.
> > 
> > As I mentioned, this code is used in the common path of pthread_mutex_unlock.  I
> > have a simple test program that creates four threads, all contending for a
> > single mutex, and measures the total number of lock acquisitions over several
> > seconds.  When libthr is built with SSE, as is current, I get around 53 million
> > locks in 5 seconds.  Without SSE, I get around 60 million (13% more).  DTrace
> > shows around 790,000 calls to fpudna versus 10 calls.  There could be other
> > factors involved, but I presume that the FPU context switches account for most
> > of the change in performance.
> > 
> > Even when I add some SSE usage in the application--incidentally, these same
> > instructions--building libthr without SSE improves performance from 53.5 million
> > to 55.8 million (4.3%).
> > 
> > In the real-world application where I first noticed this, performance improves
> > by 3-5%.
> > 
> > I would appreciate your thoughts and feedback.  The proposed patch is below.
> > 
> > Eric
> > 
> > 
> > 
> > Index: base/head/lib/libthr/arch/amd64/Makefile.inc
> > ===================================================================
> > --- base/head/lib/libthr/arch/amd64/Makefile.inc	(revision 280703)
> > +++ base/head/lib/libthr/arch/amd64/Makefile.inc	(working copy)
> > @@ -1,3 +1,8 @@
> > #$FreeBSD$
> > 
> > SRCS+=	_umtx_op_err.S
> > +
> > +# Using SSE incurs extra overhead per context switch,
> > +# which measurably impacts performance when the application
> > +# does not otherwise use FP/SSE.
> > +CFLAGS+=-mno-sse
> 
> Good catch!
> 
> Regarding your patch, I think we should disable even more, if possible.  How about:
> 
> CFLAGS+=        -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3

I think so.

Also, this should be done for libc as well, both on i386 and amd64.
I am not sure whether compiler-rt should be included in the set as well.


