From owner-freebsd-current@FreeBSD.ORG  Fri Mar 27 19:27:27 2015
Message-ID: <5515AED9.8040408@FreeBSD.org>
Date: Fri, 27 Mar 2015 15:26:17 -0400
From: Eric van Gyzen <vangyzen@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: current@FreeBSD.org
Subject: SSE in libthr
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

In a nutshell:

Clang emits SSE instructions on amd64 in the common path of
pthread_mutex_unlock.  This reduces performance measurably: about 13% in a
contended micro-benchmark and 3-5% in a real-world application.  I'd like to
disable SSE in libthr.

In more detail:

In libthr/thread/thr_mutex.c, we find the following:

	#define MUTEX_INIT_LINK(m)              do {            \
	        (m)->m_qe.tqe_prev = NULL;                      \
	        (m)->m_qe.tqe_next = NULL;                      \
	} while (0)

In 9.1, clang 3.1 emits two ordinary mov instructions:

	movq   $0x0,0x8(%rax)
	movq   $0x0,(%rax)

Since 10.0 and clang 3.3, clang emits these SSE instructions:

	xorps  %xmm0,%xmm0
	movups %xmm0,(%rax)

Although these look harmless enough, they touch the FPU/SSE register state,
which the kernel must then save and restore across context switches.  That
extra work can reduce performance.
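
The code generation described above can be checked in isolation.  The
following is a minimal sketch, not libthr source: struct demo_mutex, struct
demo_qe, and demo_init_link() are hypothetical stand-ins that only mimic the
two adjacent TAILQ link pointers the macro clears.  Whether a given compiler
merges the two stores into one SSE store depends on its version and
optimization level.

```c
#include <assert.h>
#include <stddef.h>

struct demo_mutex;

/* Hypothetical stand-in for the TAILQ_ENTRY link: two adjacent pointers. */
struct demo_qe {
	struct demo_mutex	 *tqe_next;	/* offset 0 in the entry */
	struct demo_mutex	**tqe_prev;	/* offset 8 on amd64 */
};

/* Hypothetical stand-in for the mutex; only the link field matters here. */
struct demo_mutex {
	struct demo_qe	m_qe;
};

/* Same macro as in libthr/thread/thr_mutex.c. */
#define MUTEX_INIT_LINK(m)              do {            \
	(m)->m_qe.tqe_prev = NULL;                      \
	(m)->m_qe.tqe_next = NULL;                      \
} while (0)

/* Clears both link pointers; this is the function to disassemble. */
void
demo_init_link(struct demo_mutex *m)
{
	MUTEX_INIT_LINK(m);
}
```

Compiling this file with something like "cc -O2 -c" and then again with
-mno-sse added, and disassembling both objects, should show whether the
xorps/movups pair or the two movq stores are emitted.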

As I mentioned, this code is used in the common path of pthread_mutex_unlock.  I
have a simple test program that creates four threads, all contending for a
single mutex, and measures the total number of lock acquisitions over several
seconds.  When libthr is built with SSE, as it is currently, I get around 53
million locks in 5 seconds.  Without SSE, I get around 60 million (13% more).
DTrace shows around 790,000 calls to fpudna (the kernel's handler for the FPU
"device not available" trap) with SSE, versus 10 calls without.  There could be
other factors involved, but I presume that the FPU context switches account for
most of the change in performance.
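
The test program itself is not included in this message.  A minimal sketch of
a benchmark of the kind described, assuming nothing beyond POSIX threads and
C11 atomics, might look like the following.  The names bench_run(),
bench_worker(), and BENCH_MAX_THREADS are hypothetical, and the numbers it
produces will not match the ones quoted above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define BENCH_MAX_THREADS	64	/* hypothetical bound for this sketch */

static pthread_mutex_t	bench_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool	bench_stop;	/* set once the interval has elapsed */
static uint64_t		bench_count;	/* protected by bench_mutex */

/* Each worker locks and unlocks the shared mutex until told to stop. */
static void *
bench_worker(void *arg)
{
	(void)arg;
	while (!atomic_load(&bench_stop)) {
		pthread_mutex_lock(&bench_mutex);
		bench_count++;
		pthread_mutex_unlock(&bench_mutex);
	}
	return (NULL);
}

/*
 * Run nthreads workers for roughly the given number of seconds and
 * return the total number of lock acquisitions.
 */
uint64_t
bench_run(int nthreads, unsigned int seconds)
{
	pthread_t threads[BENCH_MAX_THREADS];
	int error, i;

	assert(nthreads > 0 && nthreads <= BENCH_MAX_THREADS);
	bench_count = 0;
	atomic_store(&bench_stop, false);

	for (i = 0; i < nthreads; i++) {
		error = pthread_create(&threads[i], NULL, bench_worker, NULL);
		assert(error == 0);
	}
	sleep(seconds);
	atomic_store(&bench_stop, true);
	for (i = 0; i < nthreads; i++) {
		error = pthread_join(threads[i], NULL);
		assert(error == 0);
	}
	/* All workers have been joined, so reading the counter is safe. */
	return (bench_count);
}
```

Calling bench_run(4, 5) from a small main() and printing the result, once
against libthr built with SSE and once with -mno-sse, would approximate the
comparison described above.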

Even when I add some SSE usage in the application--incidentally, these same
instructions--building libthr without SSE improves performance from 53.5 million
to 55.8 million (4.3%).

In the real-world application where I first noticed this, performance improves
by 3-5%.

I would appreciate your thoughts and feedback.  The proposed patch is below.

Eric


Index: base/head/lib/libthr/arch/amd64/Makefile.inc
===================================================================
--- base/head/lib/libthr/arch/amd64/Makefile.inc	(revision 280703)
+++ base/head/lib/libthr/arch/amd64/Makefile.inc	(working copy)
@@ -1,3 +1,8 @@
 #$FreeBSD$

 SRCS+=	_umtx_op_err.S
+
+# Using SSE incurs extra overhead per context switch,
+# which measurably impacts performance when the application
+# does not otherwise use FP/SSE.
+CFLAGS+=-mno-sse