Subject: Re: [PATCH] microoptimize by trying to avoid locking a locked mutex
From: Hans Petter Selasky <hps@selasky.org>
To: Mateusz Guzik, Ian Lepore, John Baldwin, Adrian Chadd, freebsd-current, Konstantin Belousov
Date: Fri, 6 Nov 2015 15:24:20 +0100

On 11/06/15 01:55, Mateusz Guzik wrote:
> On Thu, Nov 05, 2015 at 04:35:22PM -0700, Ian Lepore wrote:
>> On Thu, 2015-11-05 at 14:19 -0800, John Baldwin wrote:
>>> On Thursday, November 05, 2015 01:45:19 PM Adrian Chadd wrote:
>>>> On 5 November 2015 at 11:26, Mateusz Guzik wrote:
>>>>> On Thu, Nov 05, 2015 at 11:04:13AM -0800, John Baldwin wrote:
>>>>>> On Thursday, November 05, 2015 04:26:28 PM Konstantin Belousov wrote:
>>>>>>> On Thu, Nov 05, 2015 at 12:32:18AM +0100, Mateusz Guzik wrote:
>>>>>>>> mtx_lock will unconditionally try to grab the lock and, if that
>>>>>>>> fails, will call __mtx_lock_sleep, which will immediately try to do
>>>>>>>> the same atomic op again.
>>>>>>>>
>>>>>>>> So, the obvious microoptimization is to check the state in
>>>>>>>> __mtx_lock_sleep and avoid the operation if the lock is not free.
>>>>>>>>
>>>>>>>> This gives me a ~40% speedup in a microbenchmark of 40 find
>>>>>>>> processes traversing tmpfs and contending on the mount mtx (only
>>>>>>>> used as an easy benchmark; I have WIP patches to get rid of it).
>>>>>>>>
>>>>>>>> The second part of the patch is optional and just checks the state
>>>>>>>> of the lock prior to doing any atomic operations, but it gives a
>>>>>>>> very modest speedup when applied on top of the __mtx_lock_sleep
>>>>>>>> change. As such, I'm not going to defend this part.
>>>>>>> Shouldn't the same consideration be applied to all spinning loops,
>>>>>>> i.e. also to the spin/thread mutexes, and to the spinning parts of
>>>>>>> sx and lockmgr?
>>>>>>
>>>>>> I agree. I think both changes are good and worth doing in our other
>>>>>> primitives.
>>>>>>
>>>>> I glanced over e.g. rw_rlock and it did not have the issue; now that I
>>>>> see _sx_xlock_hard, it could indeed use fixing.
>>>>>
>>>>> Expect a patch in a few hours for all the primitives I find. I'll
>>>>> stress test the kernel, but it is unlikely I'll do microbenchmarks for
>>>>> the remaining primitives.
>>>>
>>>> Is this stuff you're proposing still valid for non-x86 platforms?
>>>
>>> Yes. It just does a read before trying the atomic compare-and-swap, and
>>> falls through to the hard case as if the atomic op had failed whenever
>>> the result of the read would make the compare fail.
>>>
>> The atomic ops include barriers; the new pre-read of the variable
>> doesn't. Will that cause problems, especially for code inside a loop
>> where the compiler may decide to shuffle things around?
>>
> The lock value is volatile, so the compiler should not screw us up. I
> removed the store to the variable due to a different concern. In the worst
> case we would have slept instead of spinning. (see below)
>
>> I suspect the performance gain will be biggest on the platforms where
>> atomic ops are expensive (I gather from various code comments that's the
>> case on x86).
>>
> I don't know about other architectures; on x86, atomic op performance on
> contended cachelines deteriorates drastically.
>
> =======================================================
>
> For the most part the patch below makes two simple kinds of changes:
> 1. in the macros for the lock/unlock primitives, the lock value is fetched
>    and compared against the relevant _UNLOCKED macro prior to the atomic op;
> 2. in the functions:
>
> before:
> 	while (!_mtx_obtain_lock(m, tid)) {
> 		v = m->mtx_lock;
> 		if (v != MTX_UNOWNED) {
> 			....
> 		}
> 		.....
> 	}
>
> after:
> 	for (;;) {
> 		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
> 			break;
> 		v = m->mtx_lock;
> 		if (v != MTX_UNOWNED) {
> 			....
> 		}
> 		.....
> 	}
>
> The original patch preloaded the 'v' variable. If the value was MTX_UNOWNED
> and _mtx_obtain_lock failed, we just lost the race. In order to spin
> waiting for the other thread to release the lock, we have to re-read the
> variable. Thus, for simplicity, it is no longer stored in 'v'. Note this is
> contrary to kib's earlier suggestion, which would put such threads directly
> to sleep. I don't have proper measurements to have any strong opinion here;
> I went the aforementioned route to minimise changes.
>
> Note this is a trivial patch to improve the situation a little bit; I have
> no interest in trying to polish these primitives at this time.
>
> For overall results, on a machine with 32GB of RAM and the following:
> CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (2800.06-MHz K8-class CPU)
> FreeBSD/SMP: 2 package(s) x 10 core(s) x 2 SMT threads
>
> this reduced a "make -j 40 kernel" in a 10.2 jail from ~2:15 to ~2:05.
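
For readers less familiar with the idiom: what the patch adds is the classic
test-and-test-and-set pattern, where a plain read filters out the
obviously-owned case so the expensive atomic read-modify-write is only
attempted when it has a chance of succeeding. Below is a rough user-space
sketch of the same idea using C11 atomics; the toy_lock/toy_acquire names are
made up purely for illustration, this is not the kernel code.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define	TOY_UNOWNED	((uintptr_t)0)

struct toy_lock {
	_Atomic uintptr_t owner;	/* TOY_UNOWNED or the owning thread's tid */
};

/* Analogue of _mtx_obtain_lock(): a single CAS with acquire semantics. */
static bool
toy_try_acquire(struct toy_lock *lk, uintptr_t tid)
{
	uintptr_t expected = TOY_UNOWNED;

	return (atomic_compare_exchange_strong_explicit(&lk->owner,
	    &expected, tid, memory_order_acquire, memory_order_relaxed));
}

static void
toy_acquire(struct toy_lock *lk, uintptr_t tid)
{

	/* Fast path: attempt the CAS only when the lock looks free. */
	if (atomic_load_explicit(&lk->owner, memory_order_relaxed) ==
	    TOY_UNOWNED && toy_try_acquire(lk, tid))
		return;

	/*
	 * Slow path: spin on plain loads and retry the CAS only when the
	 * owner field reads as unowned again.
	 */
	for (;;) {
		if (atomic_load_explicit(&lk->owner, memory_order_relaxed) ==
		    TOY_UNOWNED && toy_try_acquire(lk, tid))
			return;
		/* A real implementation would back off or block here. */
	}
}

The relaxed load may of course return a stale value, but that is harmless
here: a stale "unowned" just costs one failed CAS, which is what happens
today anyway, and a stale "owned" sends us to the retry loop, which re-reads
the value before trying again.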
>
> =======================================================
>
> diff --git a/sys/kern/kern_lock.c b/sys/kern/kern_lock.c
> index aa67180..198e93e 100644
> --- a/sys/kern/kern_lock.c
> +++ b/sys/kern/kern_lock.c
> @@ -787,8 +787,10 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
>  			break;
>  		}
>
> -		while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED,
> -		    tid)) {
> +		for (;;) {
> +			if (lk->lk_lock == LK_UNLOCKED &&
> +			    atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
> +				break;
>  #ifdef HWPMC_HOOKS
>  			PMC_SOFT_CALL( , , lock, failed);
>  #endif
> @@ -1124,7 +1126,11 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
>  			    __func__, iwmesg, file, line);
>  		}
>
> -		while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid)) {
> +		for (;;) {
> +			if (lk->lk_lock == LK_UNLOCKED &&
> +			    atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
> +				break;
> +
>  #ifdef HWPMC_HOOKS
>  			PMC_SOFT_CALL( , , lock, failed);
>  #endif
> diff --git a/sys/kern/kern_mutex.c b/sys/kern/kern_mutex.c
> index bec8f6b..51e82a3 100644
> --- a/sys/kern/kern_mutex.c
> +++ b/sys/kern/kern_mutex.c
> @@ -419,7 +419,9 @@ __mtx_lock_sleep(volatile uintptr_t *c, uintptr_t tid, int opts,
>  	all_time -= lockstat_nsecs(&m->lock_object);
>  #endif
>
> -	while (!_mtx_obtain_lock(m, tid)) {
> +	for (;;) {
> +		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
> +			break;
>  #ifdef KDTRACE_HOOKS
>  		spin_cnt++;
>  #endif
> @@ -602,8 +604,9 @@ _mtx_lock_spin_cookie(volatile uintptr_t *c, uintptr_t tid, int opts,
>  #ifdef KDTRACE_HOOKS
>  	spin_time -= lockstat_nsecs(&m->lock_object);
>  #endif
> -	while (!_mtx_obtain_lock(m, tid)) {
> -
> +	for (;;) {
> +		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
> +			break;
>  		/* Give interrupts a chance while we spin. */
>  		spinlock_exit();
>  		while (m->mtx_lock != MTX_UNOWNED) {
> @@ -675,7 +678,9 @@ retry:
>  			    m->lock_object.lo_name, file, line));
>  		WITNESS_CHECKORDER(&m->lock_object,
>  		    opts | LOP_NEWORDER | LOP_EXCLUSIVE, file, line, NULL);
> -		while (!_mtx_obtain_lock(m, tid)) {
> +		for (;;) {
> +			if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
> +				break;
>  			if (m->mtx_lock == tid) {
>  				m->mtx_recurse++;
>  				break;
> diff --git a/sys/kern/kern_rwlock.c b/sys/kern/kern_rwlock.c
> index 6541724..6a904d2 100644
> --- a/sys/kern/kern_rwlock.c
> +++ b/sys/kern/kern_rwlock.c
> @@ -771,7 +771,9 @@ __rw_wlock_hard(volatile uintptr_t *c, uintptr_t tid, const char *file,
>  	all_time -= lockstat_nsecs(&rw->lock_object);
>  	state = rw->rw_lock;
>  #endif
> -	while (!_rw_write_lock(rw, tid)) {
> +	for (;;) {
> +		if (rw->rw_lock == RW_UNLOCKED && _rw_write_lock(rw, tid))
> +			break;
>  #ifdef KDTRACE_HOOKS
>  		spin_cnt++;
>  #endif
> diff --git a/sys/kern/kern_sx.c b/sys/kern/kern_sx.c
> index 96e117b..2a81c04 100644
> --- a/sys/kern/kern_sx.c
> +++ b/sys/kern/kern_sx.c
> @@ -544,7 +544,10 @@ _sx_xlock_hard(struct sx *sx, uintptr_t tid, int opts, const char *file,
>  	all_time -= lockstat_nsecs(&sx->lock_object);
>  	state = sx->sx_lock;
>  #endif
> -	while (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid)) {
> +	for (;;) {
> +		if (sx->sx_lock == SX_LOCK_UNLOCKED &&
> +		    atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
> +			break;
>  #ifdef KDTRACE_HOOKS
>  		spin_cnt++;
>  #endif
> diff --git a/sys/sys/mutex.h b/sys/sys/mutex.h
> index a9ec072..0443922 100644
> --- a/sys/sys/mutex.h
> +++ b/sys/sys/mutex.h
> @@ -185,7 +185,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
>  #define __mtx_lock(mp, tid, opts, file, line) do {			\
>  	uintptr_t _tid = (uintptr_t)(tid);				\
>  									\
> -	if (!_mtx_obtain_lock((mp), _tid))				\
> +	if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid)))\
>  		_mtx_lock_sleep((mp), _tid, (opts), (file), (line));	\
>  	else								\
>  		LOCKSTAT_PROFILE_OBTAIN_LOCK_SUCCESS(adaptive__acquire,	\
> @@ -203,7 +203,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
>  	uintptr_t _tid = (uintptr_t)(tid);				\
>  									\
>  	spinlock_enter();						\
> -	if (!_mtx_obtain_lock((mp), _tid)) {				\
> +	if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid))) {\
>  		if ((mp)->mtx_lock == _tid)				\
>  			(mp)->mtx_recurse++;				\
>  		else							\
> @@ -232,7 +232,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
>  									\
>  	if ((mp)->mtx_recurse == 0)					\
>  		LOCKSTAT_PROFILE_RELEASE_LOCK(adaptive__release, mp);	\
> -	if (!_mtx_release_lock((mp), _tid))				\
> +	if ((mp)->mtx_lock != _tid || !_mtx_release_lock((mp), _tid))	\
>  		_mtx_unlock_sleep((mp), (opts), (file), (line));	\
>  } while (0)
>
> diff --git a/sys/sys/rwlock.h b/sys/sys/rwlock.h
> index f8947c5..6b4f505 100644
> --- a/sys/sys/rwlock.h
> +++ b/sys/sys/rwlock.h
> @@ -96,7 +96,7 @@
>  #define __rw_wlock(rw, tid, file, line) do {				\
>  	uintptr_t _tid = (uintptr_t)(tid);				\
>  									\
> -	if (!_rw_write_lock((rw), _tid))				\
> +	if ((rw)->rw_lock != RW_UNLOCKED || !_rw_write_lock((rw), _tid))\
>  		_rw_wlock_hard((rw), _tid, (file), (line));		\
>  	else								\
>  		LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(rw__acquire, rw,	\
> @@ -112,7 +112,7 @@
>  	else {								\
>  		LOCKSTAT_PROFILE_RELEASE_RWLOCK(rw__release, rw,	\
>  		    LOCKSTAT_WRITER);					\
> -		if (!_rw_write_unlock((rw), _tid))			\
> +		if ((rw)->rw_lock != _tid || !_rw_write_unlock((rw), _tid))\
>  			_rw_wunlock_hard((rw), _tid, (file), (line));	\
>  	}								\
>  } while (0)
> diff --git a/sys/sys/sx.h b/sys/sys/sx.h
> index 96a664f..144cab5 100644
> --- a/sys/sys/sx.h
> +++ b/sys/sys/sx.h
> @@ -150,7 +150,8 @@ __sx_xlock(struct sx *sx, struct thread *td, int opts, const char *file,
>  	uintptr_t tid = (uintptr_t)td;
>  	int error = 0;
>
> -	if (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
> +	if (sx->sx_lock != SX_LOCK_UNLOCKED ||
> +	    !atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
>  		error = _sx_xlock_hard(sx, tid, opts, file, line);
>  	else
>  		LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(sx__acquire, sx,
> @@ -168,7 +169,8 @@ __sx_xunlock(struct sx *sx, struct thread *td, const char *file, int line)
>  	if (sx->sx_recurse == 0)
>  		LOCKSTAT_PROFILE_RELEASE_RWLOCK(sx__release, sx,
>  		    LOCKSTAT_WRITER);
> -	if (!atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
> +	if (sx->sx_lock != tid ||
> +	    !atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
>  		_sx_xunlock_hard(sx, tid, file, line);
>  }
>

Just gave this patch a spin on my RPI2 and it seems to work fine.

--HPS