From: Eric van Gyzen
Date: Fri, 27 Mar 2015 15:26:17 -0400
To: current@FreeBSD.org
Subject: SSE in libthr
Message-ID: <5515AED9.8040408@FreeBSD.org>
List-Id: Discussions about the use of FreeBSD-current

In a nutshell:

Clang emits SSE instructions on amd64 in the common path of
pthread_mutex_unlock.  This reduces performance by a non-trivial
amount.  I'd like to disable SSE in libthr.

In more detail:

In libthr/thread/thr_mutex.c, we find the following:

    #define MUTEX_INIT_LINK(m) do {         \
            (m)->m_qe.tqe_prev = NULL;      \
            (m)->m_qe.tqe_next = NULL;      \
    } while (0)

In 9.1, clang 3.1 emits two ordinary mov instructions:

    movq   $0x0,0x8(%rax)
    movq   $0x0,(%rax)

Since 10.0 and clang 3.3, clang emits these SSE instructions:

    xorps  %xmm0,%xmm0
    movups %xmm0,(%rax)

Although these look harmless enough, using the FPU can reduce
performance by incurring extra overhead due to context-switching the
FPU state.

As I mentioned, this code is used in the common path of
pthread_mutex_unlock.  I have a simple test program that creates four
threads, all contending for a single mutex, and measures the total
number of lock acquisitions over several seconds.  When libthr is built
with SSE, as is current, I get around 53 million locks in 5 seconds.
Without SSE, I get around 60 million (13% more).  DTrace shows around
790,000 calls to fpudna versus 10 calls.  There could be other factors
involved, but I presume that the FPU context switches account for most
of the change in performance.

Even when I add some SSE usage in the application--incidentally, these
same instructions--building libthr without SSE improves performance
from 53.5 million to 55.8 million (4.3%).

In the real-world application where I first noticed this, performance
improves by 3-5%.

I would appreciate your thoughts and feedback.  The proposed patch is
below.

Eric

Index: base/head/lib/libthr/arch/amd64/Makefile.inc
===================================================================
--- base/head/lib/libthr/arch/amd64/Makefile.inc	(revision 280703)
+++ base/head/lib/libthr/arch/amd64/Makefile.inc	(working copy)
@@ -1,3 +1,8 @@
 #$FreeBSD$
 
 SRCS+=	_umtx_op_err.S
+
+# Using SSE incurs extra overhead per context switch,
+# which measurably impacts performance when the application
+# does not otherwise use FP/SSE.
+CFLAGS+=-mno-sse
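
For anyone who wants to reproduce the measurement, the test described
above (several threads contending for one mutex, counting total lock
acquisitions over a fixed interval) might look roughly like the sketch
below.  This is not the original test program, which was not posted.
The helper name run_contention_test, the MAX_THREADS cap, and the use of
a C11 atomic stop flag are all assumptions made for illustration; only
the four-thread, five-second configuration comes from the message.

```c
/* Sketch of a mutex contention benchmark; NOT the original test program. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

#define MAX_THREADS 64		/* arbitrary cap, chosen for this sketch */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool stop;	/* set once the measurement interval ends */

/* Each worker locks and unlocks the shared mutex until told to stop. */
static void *
worker(void *arg)
{
	unsigned long n = 0;

	while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
		pthread_mutex_lock(&lock);
		n++;
		pthread_mutex_unlock(&lock);
	}
	*(unsigned long *)arg = n;	/* report this thread's acquisitions */
	return (NULL);
}

/*
 * Run nthreads workers for the given number of seconds and return the
 * total number of lock acquisitions.  (Hypothetical helper name.)
 */
unsigned long
run_contention_test(int nthreads, unsigned int seconds)
{
	pthread_t tids[MAX_THREADS];
	unsigned long counts[MAX_THREADS] = { 0 };
	unsigned long total = 0;
	int i, started = 0;

	if (nthreads > MAX_THREADS)
		nthreads = MAX_THREADS;
	atomic_store(&stop, false);

	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&tids[i], NULL, worker, &counts[i]) != 0)
			break;		/* keep whatever threads did start */
		started++;
	}

	sleep(seconds);			/* measurement interval */
	atomic_store(&stop, true);

	for (i = 0; i < started; i++) {
		pthread_join(tids[i], NULL);
		total += counts[i];
	}
	return (total);
}
```

Calling run_contention_test(4, 5) from main() and printing the result
corresponds to the four-thread, five-second configuration above.  Build
with something like "cc -O2 -pthread", then compare the totals with
libthr built with and without -mno-sse.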