Date: Sat, 28 Mar 2015 14:25:21 +0900
From: Tomoaki AOKI <junchoon@dec.sakura.ne.jp>
To: freebsd-current@freebsd.org
Subject: Re: SSE in libthr
Message-ID: <20150328142521.a249be6d9c9f04ae09cc2d5f@dec.sakura.ne.jp>
In-Reply-To: <5515AED9.8040408@FreeBSD.org>
References: <5515AED9.8040408@FreeBSD.org>
Possibly related information: I recently tried to build world/kernel (head,
r280410, amd64) with a CPUTYPE setting in make.conf.  The real CPU is a
Sandy Bridge (corei7-avx).  Running in a VirtualBox VM, installworld fails
with CPUTYPE?=corei7-avx, while with CPUTYPE?=corei7 everything goes OK.
(Rebooting after installkernel and etcupdate -p goes OK, but rebooting
after the failed installworld causes even /bin/sh to fail to start, though
the kernel starts OK.)

Yes, this would be a problem (or limitation) of VirtualBox and NOT of
FreeBSD, as a memstick image built from /usr/obj with CPUTYPE?=corei7-avx
runs OK on real hardware.  This should mean clang 3.6.0 generates some AVX
instructions for userland, and VirtualBox doesn't like them.

On Fri, 27 Mar 2015 15:26:17 -0400
Eric van Gyzen <vangyzen@freebsd.org> wrote:

> In a nutshell:
>
> Clang emits SSE instructions on amd64 in the common path of
> pthread_mutex_unlock.  This reduces performance by a non-trivial
> amount.  I'd like to disable SSE in libthr.
>
> In more detail:
>
> In libthr/thread/thr_mutex.c, we find the following:
>
> #define MUTEX_INIT_LINK(m)		do {		\
> 	(m)->m_qe.tqe_prev = NULL;		\
> 	(m)->m_qe.tqe_next = NULL;		\
> } while (0)
>
> In 9.1, clang 3.1 emits two ordinary mov instructions:
>
> 	movq   $0x0,0x8(%rax)
> 	movq   $0x0,(%rax)
>
> Since 10.0 and clang 3.3, clang emits these SSE instructions:
>
> 	xorps  %xmm0,%xmm0
> 	movups %xmm0,(%rax)
>
> Although these look harmless enough, using the FPU can reduce
> performance by incurring extra overhead due to context-switching the
> FPU state.
>
> As I mentioned, this code is used in the common path of
> pthread_mutex_unlock.  I have a simple test program that creates four
> threads, all contending for a single mutex, and measures the total
> number of lock acquisitions over several seconds.  When libthr is
> built with SSE, as is current, I get around 53 million locks in 5
> seconds.  Without SSE, I get around 60 million (13% more).  DTrace
> shows around 790,000 calls to fpudna versus 10 calls.
> There could be other factors involved, but I presume that the FPU
> context switches account for most of the change in performance.
>
> Even when I add some SSE usage in the application--incidentally, these
> same instructions--building libthr without SSE improves performance
> from 53.5 million to 55.8 million (4.3%).
>
> In the real-world application where I first noticed this, performance
> improves by 3-5%.
>
> I would appreciate your thoughts and feedback.  The proposed patch is
> below.
>
> Eric
>
>
> Index: base/head/lib/libthr/arch/amd64/Makefile.inc
> ===================================================================
> --- base/head/lib/libthr/arch/amd64/Makefile.inc	(revision 280703)
> +++ base/head/lib/libthr/arch/amd64/Makefile.inc	(working copy)
> @@ -1,3 +1,8 @@
>  #$FreeBSD$
>
>  SRCS+=	_umtx_op_err.S
> +
> +# Using SSE incurs extra overhead per context switch,
> +# which measurably impacts performance when the application
> +# does not otherwise use FP/SSE.
> +CFLAGS+=-mno-sse
>
> _______________________________________________
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"


-- 
Tomoaki AOKI    junchoon@dec.sakura.ne.jp