Date:      Sat, 28 Mar 2015 21:54:08 +0800
From:      Julian Elischer <julian@freebsd.org>
To:        freebsd-current@freebsd.org
Subject:   Re: SSE in libthr
Message-ID:  <5516B280.6060002@freebsd.org>
In-Reply-To: <20150327214452.GR2379@kib.kiev.ua>
References:  <5515AED9.8040408@FreeBSD.org> <3A96AAEC-9C1C-444E-9A73-3CD2AED33116@me.com> <20150327214452.GR2379@kib.kiev.ua>

On 3/28/15 5:44 AM, Konstantin Belousov wrote:
> On Fri, Mar 27, 2015 at 01:49:03PM -0700, Rui Paulo wrote:
>> On Mar 27, 2015, at 12:26, Eric van Gyzen <vangyzen@FreeBSD.org> wrote:
>>> In a nutshell:
>>>
>>> Clang emits SSE instructions on amd64 in the common path of
>>> pthread_mutex_unlock.  This reduces performance by a non-trivial amount.  I'd
>>> like to disable SSE in libthr.
>>>
>>> In more detail:
>>>
>>> In libthr/thread/thr_mutex.c, we find the following:
>>>
>>> 	#define MUTEX_INIT_LINK(m)              do {            \
>>> 	        (m)->m_qe.tqe_prev = NULL;                      \
>>> 	        (m)->m_qe.tqe_next = NULL;                      \
>>> 	} while (0)
>>>
>>> In 9.1, clang 3.1 emits two ordinary mov instructions:
>>>
>>> 	movq   $0x0,0x8(%rax)
>>> 	movq   $0x0,(%rax)
>>>
>>> Since 10.0 and clang 3.3, clang emits these SSE instructions:
>>>
>>> 	xorps  %xmm0,%xmm0
>>> 	movups %xmm0,(%rax)
>>>
>>> Although these look harmless enough, using the FPU can reduce performance by
>>> incurring extra overhead due to context-switching the FPU state.
>>>
>>> As I mentioned, this code is used in the common path of pthread_mutex_unlock.  I
>>> have a simple test program that creates four threads, all contending for a
>>> single mutex, and measures the total number of lock acquisitions over several
>>> seconds.  When libthr is built with SSE, as is current, I get around 53 million
>>> locks in 5 seconds.  Without SSE, I get around 60 million (13% more).  DTrace
>>> shows around 790,000 calls to fpudna versus 10 calls.  There could be other
>>> factors involved, but I presume that the FPU context switches account for most
>>> of the change in performance.
>>>
>>> Even when I add some SSE usage in the application--incidentally, these same
>>> instructions--building libthr without SSE improves performance from 53.5 million
>>> to 55.8 million (4.3%).
>>>
>>> In the real-world application where I first noticed this, performance improves
>>> by 3-5%.
>>>
>>> I would appreciate your thoughts and feedback.  The proposed patch is below.
>>>
>>> Eric
>>>
>>>
>>>
>>> Index: base/head/lib/libthr/arch/amd64/Makefile.inc
>>> ===================================================================
>>> --- base/head/lib/libthr/arch/amd64/Makefile.inc	(revision 280703)
>>> +++ base/head/lib/libthr/arch/amd64/Makefile.inc	(working copy)
>>> @@ -1,3 +1,8 @@
>>> #$FreeBSD$
>>>
>>> SRCS+=	_umtx_op_err.S
>>> +
>>> +# Using SSE incurs extra overhead per context switch,
>>> +# which measurably impacts performance when the application
>>> +# does not otherwise use FP/SSE.
>>> +CFLAGS+=-mno-sse
>> Good catch!
>>
>> Regarding your patch, I think we should disable even more, if possible.  How about:
>>
>> CFLAGS+=        -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3
> I think so.
>
> Also, this should be done for libc as well, both on i386 and amd64.
> I am not sure, should compiler-rt be included into the set ?
The point is that clang will do this anywhere it can, because it
weighs only the speed of the instructions themselves and not side
effects such as the extra FPU state handling on a context switch.
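
For anyone who wants to reproduce the comparison, here is a minimal
sketch of the kind of benchmark Eric describes: several threads
contending for one mutex, counting acquisitions over a fixed interval.
This is not his actual test program; the names run_mutex_bench,
bench_worker and MAX_THREADS are made up for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

#define MAX_THREADS 64          /* hypothetical upper bound for this sketch */

static pthread_mutex_t bench_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool bench_stop;

/* Each worker locks and unlocks the shared mutex until told to stop. */
static void *
bench_worker(void *arg)
{
	unsigned long *result = arg;
	unsigned long n = 0;    /* local counter avoids false sharing */

	while (!atomic_load_explicit(&bench_stop, memory_order_relaxed)) {
		pthread_mutex_lock(&bench_mutex);
		pthread_mutex_unlock(&bench_mutex);
		n++;
	}
	*result = n;
	return (NULL);
}

/*
 * Run nthreads workers for the given number of seconds and return the
 * total number of lock acquisitions.  Returns 0 on a bad thread count.
 */
unsigned long
run_mutex_bench(int nthreads, unsigned int seconds)
{
	pthread_t tids[MAX_THREADS];
	unsigned long counts[MAX_THREADS] = { 0 };
	unsigned long total = 0;
	int i, started = 0;

	if (nthreads < 1 || nthreads > MAX_THREADS)
		return (0);

	atomic_store(&bench_stop, false);
	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&tids[i], NULL, bench_worker,
		    &counts[i]) != 0)
			break;
		started++;
	}

	sleep(seconds);
	atomic_store(&bench_stop, true);

	/* pthread_join orders each worker's final store before our read. */
	for (i = 0; i < started; i++) {
		pthread_join(tids[i], NULL);
		total += counts[i];
	}
	return (total);
}
```

Calling run_mutex_bench(4, 5) from a small main() and printing the
result corresponds to the four-thread, five-second setup in the
original report.  Build with -pthread and run it against libthr built
with and without -mno-sse; absolute numbers will differ by machine.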

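To check what a given compiler does with the two stores in
MUTEX_INIT_LINK, a small stand-alone file is enough.  The sketch below
assumes the usual TAILQ_ENTRY layout (next pointer first, then the
pointer to the previous element's next pointer); struct link_entry and
init_link are illustrative names, not libthr code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the TAILQ_ENTRY embedded in the mutex. */
struct link_entry {
	struct link_entry  *tqe_next;   /* offset 0 on amd64 */
	struct link_entry **tqe_prev;   /* offset 8 on amd64 */
};

/* Same two assignments as MUTEX_INIT_LINK, in the same order. */
void
init_link(struct link_entry *e)
{
	e->tqe_prev = NULL;
	e->tqe_next = NULL;
}
```

Compiling this with "cc -O2 -S" and again with "cc -O2 -mno-sse -S"
and comparing the assembly for init_link should show whether the two
adjacent pointer stores are merged into a single 16-byte SSE store.
Whether that happens depends on the compiler version and target.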



