Date: Sat, 28 Mar 2015 14:25:21 +0900
From: Tomoaki AOKI <junchoon@dec.sakura.ne.jp>
To: freebsd-current@freebsd.org
Subject: Re: SSE in libthr
Message-ID: <20150328142521.a249be6d9c9f04ae09cc2d5f@dec.sakura.ne.jp>
In-Reply-To: <5515AED9.8040408@FreeBSD.org>
References: <5515AED9.8040408@FreeBSD.org>
Possibly related information: I recently tried to build world/kernel (head,
r280410, amd64) with a CPUTYPE setting in make.conf.  The real CPU is a
Sandy Bridge (corei7-avx).  Running in a VirtualBox VM, installworld fails
with CPUTYPE?=corei7-avx, while with CPUTYPE?=corei7 everything goes OK.
(Rebooting after installkernel and etcupdate -p goes OK, but rebooting
after the failed installworld causes even /bin/sh to fail to start, though
the kernel starts OK.)

Yes, this would be a problem (or limitation) of VirtualBox and NOT of
FreeBSD, as a memstick image built from /usr/obj with CPUTYPE?=corei7-avx
runs OK on real hardware.  This should mean clang 3.6.0 generates some AVX
instructions for userland, and VirtualBox doesn't like them.

On Fri, 27 Mar 2015 15:26:17 -0400
Eric van Gyzen <vangyzen@freebsd.org> wrote:

> In a nutshell:
>
> Clang emits SSE instructions on amd64 in the common path of
> pthread_mutex_unlock.  This reduces performance by a non-trivial
> amount.  I'd like to disable SSE in libthr.
>
> In more detail:
>
> In libthr/thread/thr_mutex.c, we find the following:
>
> #define MUTEX_INIT_LINK(m)		do {		\
> 	(m)->m_qe.tqe_prev = NULL;		\
> 	(m)->m_qe.tqe_next = NULL;		\
> } while (0)
>
> In 9.1, clang 3.1 emits two ordinary mov instructions:
>
> 	movq   $0x0,0x8(%rax)
> 	movq   $0x0,(%rax)
>
> Since 10.0 and clang 3.3, clang emits these SSE instructions:
>
> 	xorps  %xmm0,%xmm0
> 	movups %xmm0,(%rax)
>
> Although these look harmless enough, using the FPU can reduce
> performance by incurring extra overhead due to context-switching the
> FPU state.
>
> As I mentioned, this code is used in the common path of
> pthread_mutex_unlock.  I have a simple test program that creates four
> threads, all contending for a single mutex, and measures the total
> number of lock acquisitions over several seconds.  When libthr is
> built with SSE, as is current, I get around 53 million locks in 5
> seconds.  Without SSE, I get around 60 million (13% more).  DTrace
> shows around 790,000 calls to fpudna versus 10 calls.
> There could be other factors involved, but I presume that the FPU
> context switches account for most of the change in performance.
>
> Even when I add some SSE usage in the application--incidentally, these
> same instructions--building libthr without SSE improves performance
> from 53.5 million to 55.8 million (4.3%).
>
> In the real-world application where I first noticed this, performance
> improves by 3-5%.
>
> I would appreciate your thoughts and feedback.  The proposed patch is
> below.
>
> Eric
>
>
> Index: base/head/lib/libthr/arch/amd64/Makefile.inc
> ===================================================================
> --- base/head/lib/libthr/arch/amd64/Makefile.inc	(revision 280703)
> +++ base/head/lib/libthr/arch/amd64/Makefile.inc	(working copy)
> @@ -1,3 +1,8 @@
>  #$FreeBSD$
>
>  SRCS+=	_umtx_op_err.S
> +
> +# Using SSE incurs extra overhead per context switch,
> +# which measurably impacts performance when the application
> +# does not otherwise use FP/SSE.
> +CFLAGS+=-mno-sse
>
> _______________________________________________
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"


-- 
Tomoaki AOKI    junchoon@dec.sakura.ne.jp