Subject: Re: [PATCH] microoptimize by trying to avoid locking a locked mutex
From: Hans Petter Selasky <hps@selasky.org>
To: Mateusz Guzik, Ian Lepore, John Baldwin, Adrian Chadd, freebsd-current, Konstantin Belousov
Date: Fri, 6 Nov 2015 15:24:20 +0100

On 11/06/15 01:55, Mateusz Guzik wrote:
> On Thu, Nov 05, 2015 at 04:35:22PM -0700, Ian Lepore wrote:
>> On Thu, 2015-11-05 at 14:19 -0800, John Baldwin wrote:
>>> On Thursday, November 05, 2015 01:45:19 PM Adrian Chadd wrote:
>>>> On 5 November 2015 at 11:26, Mateusz Guzik wrote:
>>>>> On Thu, Nov 05, 2015 at 11:04:13AM -0800, John Baldwin wrote:
>>>>>> On Thursday, November 05, 2015 04:26:28 PM Konstantin Belousov wrote:
>>>>>>> On Thu, Nov 05, 2015 at 12:32:18AM +0100, Mateusz Guzik wrote:
>>>>>>>> mtx_lock will unconditionally try to grab the lock and, if that
>>>>>>>> fails, will call __mtx_lock_sleep, which will immediately try to do
>>>>>>>> the same atomic op again.
>>>>>>>>
>>>>>>>> So, the obvious microoptimization is to check the state in
>>>>>>>> __mtx_lock_sleep and avoid the operation if the lock is not free.
>>>>>>>>
>>>>>>>> This gives me a ~40% speedup in a microbenchmark of 40 find
>>>>>>>> processes traversing tmpfs and contending on the mount mtx (only
>>>>>>>> used as an easy benchmark; I have WIP patches to get rid of it).
>>>>>>>>
>>>>>>>> The second part of the patch is optional and just checks the state
>>>>>>>> of the lock prior to doing any atomic operations, but it gives a
>>>>>>>> very modest speedup when applied on top of the __mtx_lock_sleep
>>>>>>>> change. As such, I'm not going to defend this part.
>>>>>>> Shouldn't the same consideration be applied to all spinning loops,
>>>>>>> i.e. also to the spin/thread mutexes, and to the spinning parts of
>>>>>>> sx and lockmgr?
>>>>>>
>>>>>> I agree. I think both changes are good and worth doing in our other
>>>>>> primitives.
>>>>>>
>>>>> I glanced over e.g. rw_rlock and it did not have the issue; now that I
>>>>> see _sx_xlock_hard, it could indeed use fixing.
>>>>>
>>>>> Expect a patch in a few hours for all the primitives I find. I'll
>>>>> stress test the kernel, but it is unlikely I'll do microbenchmarks for
>>>>> the remaining primitives.
>>>>
>>>> Is this stuff you're proposing still valid for non-x86 platforms?
>>>
>>> Yes. It just does a read before trying the atomic compare-and-swap, and
>>> falls through to the hard case as if the atomic op had failed whenever
>>> the result of the read would make the compare fail.
>>>
>> The atomic ops include barriers; the new pre-read of the variable
>> doesn't. Will that cause problems, especially for code inside a loop
>> where the compiler may decide to shuffle things around?
>>
> The lock value is volatile, so the compiler should not screw us up. I
> removed the store to the variable due to a different concern. In the worst
> case we would have slept instead of spinning. (see below)
>
>> I suspect the performance gain will be biggest on the platforms where
>> atomic ops are expensive (I gather from various code comments that's the
>> case on x86).
>>
> I don't know about other architectures; on x86, atomic op performance on
> contended cachelines deteriorates drastically.
>
> =======================================================
>
> For the most part the patch below makes two simple kinds of changes:
> 1. in the macros for the lock/unlock primitives, the lock value is fetched
>    and compared against the relevant _UNLOCKED macro prior to the atomic op;
> 2. in the functions:
>
> before:
> 	while (!_mtx_obtain_lock(m, tid)) {
> 		v = m->mtx_lock;
> 		if (v != MTX_UNOWNED) {
> 			....
> 		}
> 		.....
> 	}
>
> after:
> 	for (;;) {
> 		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
> 			break;
> 		v = m->mtx_lock;
> 		if (v != MTX_UNOWNED) {
> 			....
> 		}
> 		.....
> 	}
>
> The original patch preloaded the 'v' variable. If the value was MTX_UNOWNED
> and _mtx_obtain_lock failed, we just lost the race. In order to spin
> waiting for the other thread to release the lock, we have to re-read the
> variable. Thus, for simplicity, it is no longer stored in 'v'. Note this is
> contrary to kib's earlier suggestion, which would put such threads directly
> to sleep. I don't have proper measurements to have any strong opinion here;
> I went the aforementioned route to minimise changes.
>
> Note this is a trivial patch to improve the situation a little bit; I have
> no interest in trying to polish these primitives at this time.
>
> For overall results, on a machine with 32GB of RAM and the following:
> CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (2800.06-MHz K8-class CPU)
> FreeBSD/SMP: 2 package(s) x 10 core(s) x 2 SMT threads
>
> this reduced a "make -j 40 kernel" in a 10.2 jail from ~2:15 to ~2:05.
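
For readers less familiar with the idiom: what the patch adds is the classic
test-and-test-and-set pattern, where a plain read filters out the
obviously-owned case so the expensive atomic read-modify-write is only
attempted when it has a chance of succeeding. Below is a rough user-space
sketch of the same idea using C11 atomics; the toy_lock/toy_acquire names are
made up purely for illustration, this is not the kernel code.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define	TOY_UNOWNED	((uintptr_t)0)

struct toy_lock {
	_Atomic uintptr_t owner;	/* TOY_UNOWNED or the owning thread's tid */
};

/* Analogue of _mtx_obtain_lock(): a single CAS with acquire semantics. */
static bool
toy_try_acquire(struct toy_lock *lk, uintptr_t tid)
{
	uintptr_t expected = TOY_UNOWNED;

	return (atomic_compare_exchange_strong_explicit(&lk->owner,
	    &expected, tid, memory_order_acquire, memory_order_relaxed));
}

static void
toy_acquire(struct toy_lock *lk, uintptr_t tid)
{

	/* Fast path: attempt the CAS only when the lock looks free. */
	if (atomic_load_explicit(&lk->owner, memory_order_relaxed) ==
	    TOY_UNOWNED && toy_try_acquire(lk, tid))
		return;

	/*
	 * Slow path: spin on plain loads and retry the CAS only when the
	 * owner field reads as unowned again.
	 */
	for (;;) {
		if (atomic_load_explicit(&lk->owner, memory_order_relaxed) ==
		    TOY_UNOWNED && toy_try_acquire(lk, tid))
			return;
		/* A real implementation would back off or block here. */
	}
}

The relaxed load may of course return a stale value, but that is harmless
here: a stale "unowned" just costs one failed CAS, which is what happens
today anyway, and a stale "owned" sends us to the retry loop, which re-reads
the value before trying again.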
>
> =======================================================
>
> diff --git a/sys/kern/kern_lock.c b/sys/kern/kern_lock.c
> index aa67180..198e93e 100644
> --- a/sys/kern/kern_lock.c
> +++ b/sys/kern/kern_lock.c
> @@ -787,8 +787,10 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
>  			break;
>  		}
>
> -		while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED,
> -		    tid)) {
> +		for (;;) {
> +			if (lk->lk_lock == LK_UNLOCKED &&
> +			    atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
> +				break;
>  #ifdef HWPMC_HOOKS
>  			PMC_SOFT_CALL( , , lock, failed);
>  #endif
> @@ -1124,7 +1126,11 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
>  			    __func__, iwmesg, file, line);
>  		}
>
> -		while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid)) {
> +		for (;;) {
> +			if (lk->lk_lock == LK_UNLOCKED &&
> +			    atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
> +				break;
> +
>  #ifdef HWPMC_HOOKS
>  			PMC_SOFT_CALL( , , lock, failed);
>  #endif
> diff --git a/sys/kern/kern_mutex.c b/sys/kern/kern_mutex.c
> index bec8f6b..51e82a3 100644
> --- a/sys/kern/kern_mutex.c
> +++ b/sys/kern/kern_mutex.c
> @@ -419,7 +419,9 @@ __mtx_lock_sleep(volatile uintptr_t *c, uintptr_t tid, int opts,
>  	all_time -= lockstat_nsecs(&m->lock_object);
>  #endif
>
> -	while (!_mtx_obtain_lock(m, tid)) {
> +	for (;;) {
> +		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
> +			break;
>  #ifdef KDTRACE_HOOKS
>  		spin_cnt++;
>  #endif
> @@ -602,8 +604,9 @@ _mtx_lock_spin_cookie(volatile uintptr_t *c, uintptr_t tid, int opts,
>  #ifdef KDTRACE_HOOKS
>  	spin_time -= lockstat_nsecs(&m->lock_object);
>  #endif
> -	while (!_mtx_obtain_lock(m, tid)) {
> -
> +	for (;;) {
> +		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
> +			break;
>  		/* Give interrupts a chance while we spin. */
>  		spinlock_exit();
>  		while (m->mtx_lock != MTX_UNOWNED) {
> @@ -675,7 +678,9 @@ retry:
>  			    m->lock_object.lo_name, file, line));
>  		WITNESS_CHECKORDER(&m->lock_object,
>  		    opts | LOP_NEWORDER | LOP_EXCLUSIVE, file, line, NULL);
> -		while (!_mtx_obtain_lock(m, tid)) {
> +		for (;;) {
> +			if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
> +				break;
>  			if (m->mtx_lock == tid) {
>  				m->mtx_recurse++;
>  				break;
> diff --git a/sys/kern/kern_rwlock.c b/sys/kern/kern_rwlock.c
> index 6541724..6a904d2 100644
> --- a/sys/kern/kern_rwlock.c
> +++ b/sys/kern/kern_rwlock.c
> @@ -771,7 +771,9 @@ __rw_wlock_hard(volatile uintptr_t *c, uintptr_t tid, const char *file,
>  	all_time -= lockstat_nsecs(&rw->lock_object);
>  	state = rw->rw_lock;
>  #endif
> -	while (!_rw_write_lock(rw, tid)) {
> +	for (;;) {
> +		if (rw->rw_lock == RW_UNLOCKED && _rw_write_lock(rw, tid))
> +			break;
>  #ifdef KDTRACE_HOOKS
>  		spin_cnt++;
>  #endif
> diff --git a/sys/kern/kern_sx.c b/sys/kern/kern_sx.c
> index 96e117b..2a81c04 100644
> --- a/sys/kern/kern_sx.c
> +++ b/sys/kern/kern_sx.c
> @@ -544,7 +544,10 @@ _sx_xlock_hard(struct sx *sx, uintptr_t tid, int opts, const char *file,
>  	all_time -= lockstat_nsecs(&sx->lock_object);
>  	state = sx->sx_lock;
>  #endif
> -	while (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid)) {
> +	for (;;) {
> +		if (sx->sx_lock == SX_LOCK_UNLOCKED &&
> +		    atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
> +			break;
>  #ifdef KDTRACE_HOOKS
>  		spin_cnt++;
>  #endif
> diff --git a/sys/sys/mutex.h b/sys/sys/mutex.h
> index a9ec072..0443922 100644
> --- a/sys/sys/mutex.h
> +++ b/sys/sys/mutex.h
> @@ -185,7 +185,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
>  #define __mtx_lock(mp, tid, opts, file, line) do {			\
>  	uintptr_t _tid = (uintptr_t)(tid);				\
>  									\
> -	if (!_mtx_obtain_lock((mp), _tid))				\
> +	if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid)))\
>  		_mtx_lock_sleep((mp), _tid, (opts), (file), (line));	\
>  	else								\
>  		LOCKSTAT_PROFILE_OBTAIN_LOCK_SUCCESS(adaptive__acquire,	\
> @@ -203,7 +203,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
>  	uintptr_t _tid = (uintptr_t)(tid);				\
>  									\
>  	spinlock_enter();						\
> -	if (!_mtx_obtain_lock((mp), _tid)) {				\
> +	if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid))) {\
>  		if ((mp)->mtx_lock == _tid)				\
>  			(mp)->mtx_recurse++;				\
>  		else							\
> @@ -232,7 +232,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
>  									\
>  	if ((mp)->mtx_recurse == 0)					\
>  		LOCKSTAT_PROFILE_RELEASE_LOCK(adaptive__release, mp);	\
> -	if (!_mtx_release_lock((mp), _tid))				\
> +	if ((mp)->mtx_lock != _tid || !_mtx_release_lock((mp), _tid))	\
>  		_mtx_unlock_sleep((mp), (opts), (file), (line));	\
>  } while (0)
>
> diff --git a/sys/sys/rwlock.h b/sys/sys/rwlock.h
> index f8947c5..6b4f505 100644
> --- a/sys/sys/rwlock.h
> +++ b/sys/sys/rwlock.h
> @@ -96,7 +96,7 @@
>  #define __rw_wlock(rw, tid, file, line) do {				\
>  	uintptr_t _tid = (uintptr_t)(tid);				\
>  									\
> -	if (!_rw_write_lock((rw), _tid))				\
> +	if ((rw)->rw_lock != RW_UNLOCKED || !_rw_write_lock((rw), _tid))\
>  		_rw_wlock_hard((rw), _tid, (file), (line));		\
>  	else								\
>  		LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(rw__acquire, rw,	\
> @@ -112,7 +112,7 @@
>  	else {								\
>  		LOCKSTAT_PROFILE_RELEASE_RWLOCK(rw__release, rw,	\
>  		    LOCKSTAT_WRITER);					\
> -		if (!_rw_write_unlock((rw), _tid))			\
> +		if ((rw)->rw_lock != _tid || !_rw_write_unlock((rw), _tid))\
>  			_rw_wunlock_hard((rw), _tid, (file), (line));	\
>  	}								\
>  } while (0)
> diff --git a/sys/sys/sx.h b/sys/sys/sx.h
> index 96a664f..144cab5 100644
> --- a/sys/sys/sx.h
> +++ b/sys/sys/sx.h
> @@ -150,7 +150,8 @@ __sx_xlock(struct sx *sx, struct thread *td, int opts, const char *file,
>  	uintptr_t tid = (uintptr_t)td;
>  	int error = 0;
>
> -	if (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
> +	if (sx->sx_lock != SX_LOCK_UNLOCKED ||
> +	    !atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
>  		error = _sx_xlock_hard(sx, tid, opts, file, line);
>  	else
>  		LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(sx__acquire, sx,
> @@ -168,7 +169,8 @@ __sx_xunlock(struct sx *sx, struct thread *td, const char *file, int line)
>  	if (sx->sx_recurse == 0)
>  		LOCKSTAT_PROFILE_RELEASE_RWLOCK(sx__release, sx,
>  		    LOCKSTAT_WRITER);
> -	if (!atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
> +	if (sx->sx_lock != tid ||
> +	    !atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
>  		_sx_xunlock_hard(sx, tid, file, line);
>  }
>

Just gave this patch a spin on my RPI2 and it seems to work fine.

--HPS