From owner-freebsd-smp@FreeBSD.ORG Sun Sep 23 02:57:50 2007
From: "Attilio Rao"
Sender: asmrookie@gmail.com
To: freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Date: Sun, 23 Sep 2007 04:32:06 +0200
Message-ID: <3bbf2fe10709221932i386f65b9h6f47ab4bee08c528@mail.gmail.com>
Subject: rwlocks: poor performance with adaptive spinning

Recently several people have reported starvation problems with rwlocks.
In particular, users who tried rwlocks in large SMP environments (16+
CPUs) found them rather prone to poor performance and to starvation of
waiters.

Inspecting the code, something strange about adaptive spinning popped
up: for rwlocks, the adaptive spinning stubs sit too deep in the
decision loop.  Their placement lets the thread that is about to spin
adaptively first set the respective (read or write) waiters flag, which
means the owner of the lock will take the hard path when releasing and
will perform a full wakeup even though the waiter queues may turn out
to be empty.  This is a big penalty for adaptive spinning and can make
it completely useless.

In addition to this, adaptive spinning only runs in the turnstile
spinlock path, which is not ideal.
This patch ports the approach already used for adaptive spinning in sx
locks to rwlocks:
http://users.gufi.org/~rookie/works/patches/kern_rwlock.diff

For sx locks big benefits are unlikely, because they are held for long
periods, but for rwlocks the situation is rather different.
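
To make the placement concrete, this is roughly the shape of the change
(a simplified sketch of the approach, not the literal diff; the names
follow the kernel's style but the details are approximate, and error
handling and statistics are omitted):

    /*
     * Spin while the owner is running *before* touching the turnstile
     * or setting a waiters flag, so that a briefly held lock never
     * forces the owner through the hard wakeup path.
     */
    tid = (uintptr_t)curthread;
    while (!atomic_cmpset_acq_ptr(&rw->rw_lock, RW_UNLOCKED, tid)) {
            owner = rw_owner(rw);
            if (owner != NULL && TD_IS_RUNNING(owner)) {
                    /* Owner is on a CPU: just spin and retry. */
                    while (rw_owner(rw) == owner && TD_IS_RUNNING(owner))
                            cpu_spinwait();
                    continue;
            }
            /*
             * Owner is not running (or the lock is read-locked): fall
             * back to the turnstile path, which sets the waiters flag
             * and may block.
             */
            ...
    }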
I would like people to benchmark this patch (maybe in private
environments?), as I am not able to do so on short notice.

Adaptive spinning in rwlocks can be improved further with other tricks
(adding a backoff counter, for example, or spinning while the lock is
held in read mode too), but first we should be sure we are starting
from a solid base.

Thanks,
Attilio

--
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-smp@FreeBSD.ORG Mon Sep 24 16:29:34 2007
From: John Baldwin
To: "Attilio Rao"
Date: Mon, 24 Sep 2007 11:52:41 -0400
Message-Id: <200709241152.41660.jhb@freebsd.org>
Cc: freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Saturday 22 September 2007 10:32:06 pm Attilio Rao wrote:
> Recently several people have reported starvation problems with rwlocks.
> In particular, users who tried rwlocks in large SMP environments (16+
> CPUs) found them rather prone to poor performance and to starvation of
> waiters.
>
> Inspecting the code, something strange about adaptive spinning popped
> up: for rwlocks, the adaptive spinning stubs sit too deep in the
> decision loop.  Their placement lets the thread that is about to spin
> adaptively first set the respective (read or write) waiters flag, which
> means the owner of the lock will take the hard path when releasing and
> will perform a full wakeup even though the waiter queues may turn out
> to be empty.  This is a big penalty for adaptive spinning and can make
> it completely useless.
> In addition to this, adaptive spinning only runs in the turnstile
> spinlock path, which is not ideal.
> This patch ports the approach already used for adaptive spinning in sx
> locks to rwlocks:
> http://users.gufi.org/~rookie/works/patches/kern_rwlock.diff
>
> For sx locks big benefits are unlikely, because they are held for long
> periods, but for rwlocks the situation is rather different.
> I would like people to benchmark this patch (maybe in private
> environments?), as I am not able to do so on short notice.
>
> Adaptive spinning in rwlocks can be improved further with other tricks
> (adding a backoff counter, for example, or spinning while the lock is
> held in read mode too), but first we should be sure we are starting
> from a solid base.

I did this for mutexes and rwlocks over a year ago and Kris found it was
slower in benchmarks.  www.freebsd.org/~jhb/patches/lock_adapt.patch is
the last thing I sent kris@ to test (it only has the mutex changes).
This might be more optimal post-thread_lock, since thread_lock seems to
have heavily pessimized adaptive spinning: it now enqueues the thread
and then dequeues it again before doing the adaptive spin.  I liked the
approach originally because it simplifies the code a lot.  A separate
issue is that writers don't spin at all if a reader holds the lock, and
I think one thing to test for that would be an adaptive spin with a
static timeout.

--
John Baldwin

From owner-freebsd-smp@FreeBSD.ORG Mon Sep 24 20:54:27 2007
From: Jeff Roberson
To: John Baldwin
Date: Mon, 24 Sep 2007 13:57:06 -0700 (PDT)
Message-ID: <20070924135554.F547@10.0.0.1>
Cc: Attilio Rao, freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Mon, 24 Sep 2007, John Baldwin wrote:

> On Saturday 22 September 2007 10:32:06 pm Attilio Rao wrote:
>> [...]
> I did this for mutexes and rwlocks over a year ago and Kris found it
> was slower in benchmarks.  www.freebsd.org/~jhb/patches/lock_adapt.patch
> is the last thing I sent kris@ to test (it only has the mutex changes).
> This might be more optimal post-thread_lock, since thread_lock seems to
> have heavily pessimized adaptive spinning: it now enqueues the thread
> and then dequeues it again before doing the adaptive spin.  I liked the
> approach originally because it simplifies the code a lot.  A separate
> issue is that writers don't spin at all if a reader holds the lock, and
> I think one thing to test for that would be an adaptive spin with a
> static timeout.

We don't enqueue the thread until the same place; we just acquire an
extra spinlock.  The thread is not enqueued until turnstile_wait(), as
before.
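
Schematically, the hard path now looks like this (the turnstile calls
are the real API names; the conditions are pseudocode):

    ts = turnstile_trywait(&rw->lock_object); /* tc_lock + ts_lock */
    if (the lock state changed under us) {
            turnstile_cancel(ts);   /* drops the locks; nothing queued */
            continue;
    }
    ...
    turnstile_wait(ts, owner, queue); /* only here is the thread queued */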
Jeff

From owner-freebsd-smp@FreeBSD.ORG Mon Sep 24 21:23:22 2007
From: John Baldwin
To: Jeff Roberson
Date: Mon, 24 Sep 2007 17:22:28 -0400
Message-Id: <200709241722.28670.jhb@freebsd.org>
Cc: Attilio Rao, freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Monday 24 September 2007 04:57:06 pm Jeff Roberson wrote:
> On Mon, 24 Sep 2007, John Baldwin wrote:
> > [...]
>
> We don't enqueue the thread until the same place; we just acquire an
> extra spinlock.  The thread is not enqueued until turnstile_wait(), as
> before.

Oh.  That's what I get for assuming what trywait() and cancel() do based
on their names.  It is still more overhead than before, though, so
simplifying adaptive spinning might still be a win now as opposed to
before.
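
(On the static timeout I mentioned earlier: what I have in mind is just
a bounded spin when a reader holds the lock, something like the sketch
below; the predicate and the tunable are made-up names.)

    /*
     * Hypothetical bounded spin for a read-locked rwlock: there is no
     * single owner to watch, so spin for a fixed number of iterations.
     * rw_held_in_read() and SPIN_MAX are stand-ins.
     */
    for (count = 0; count < SPIN_MAX; count++) {
            if (!rw_held_in_read(rw))
                    break;          /* readers drained; retry the CAS */
            cpu_spinwait();
    }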
--
John Baldwin

From owner-freebsd-smp@FreeBSD.ORG Tue Sep 25 17:17:50 2007
From: John Baldwin
To: Jeff Roberson
Date: Tue, 25 Sep 2007 13:14:42 -0400
Message-Id: <200709251314.42707.jhb@freebsd.org>
Cc: Attilio Rao, freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Monday 24 September 2007 04:57:06 pm Jeff Roberson wrote:
> On Mon, 24 Sep 2007, John Baldwin wrote:
> > [...]
>
> We don't enqueue the thread until the same place; we just acquire an
> extra spinlock.  The thread is not enqueued until turnstile_wait(), as
> before.

So I looked at this some more, and now I don't understand why the
trywait() and cancel() changes were done.  Some background:

When the initial turnstile code was split out of the mutex code, the
main loop in mtx_lock_sleep() looked like this:

    while (!_obtain_lock) {
            ts = turnstile_lookup();        /* (A) locked ts chain and did
                                               O(n) lookup to find the
                                               existing ts */

            if (lock unlocked) {                    (B)
                    turnstile_release();
                    continue;
            }

            if (failed to set contested flag) {     (C)
                    turnstile_release();
                    continue;
            }

            if (owner is running) {                 (D)
                    turnstile_release();
                    ... spin ...
                    continue;
            }

            turnstile_wait(ts, ...);                (E)
    }

So one thing I noticed after a while is that every iteration of this
loop did the O(n) lookup to find the turnstile, even though we only used
it in (E).  The lookup was wasted overhead for the (B), (C), and (D)
cases.  Hence this commit a while later:

  jhb         2004-10-12 18:36:20 UTC

  Modified files:
    sys/kern             kern_condvar.c kern_mutex.c kern_synch.c
                         subr_sleepqueue.c subr_turnstile.c
    sys/sys              sleepqueue.h turnstile.h
  Log:
  Refine the turnstile and sleep queue interfaces just a bit:
  - Add a new _lock() call to each API that locks the associated chain lock
    for a lock_object pointer or wait channel.  The _lookup() functions now
    require that the chain lock be locked via _lock() when they are called.
  - Change sleepq_add(), turnstile_wait() and turnstile_claim() to lookup
    the associated queue structure internally via _lookup() rather than
    accepting a pointer from the caller.  For turnstiles, this means that
    the actual lookup of the turnstile in the hash table is only done when
    the thread actually blocks rather than being done on each loop iteration
    in _mtx_lock_sleep().  For sleep queues, this means that sleepq_lookup()
    is no longer used outside of the sleep queue code except to implement an
    assertion in cv_destroy().
  - Change sleepq_broadcast() and sleepq_signal() to require that the chain
    lock is already required.
    For condition variables, this lets the
    cv_broadcast() and cv_signal() functions lock the sleep queue chain lock
    while testing the waiters count.  This means that the waiters count
    internal to condition variables is no longer protected by the interlock
    mutex and cv_broadcast() and cv_signal() now no longer require that the
    interlock be held when they are called.  This lets consumers of condition
    variables drop the lock before waking other threads which can result in
    fewer context switches.

  MFC after:    1 month

  Revision  Changes    Path
  1.52      +16 -15    src/sys/kern/kern_condvar.c
  1.151     +4 -5      src/sys/kern/kern_mutex.c
  1.263     +4 -3      src/sys/kern/kern_synch.c
  1.13      +31 -14    src/sys/kern/subr_sleepqueue.c
  1.150     +34 -12    src/sys/kern/subr_turnstile.c
  1.5       +11 -12    src/sys/sys/sleepqueue.h
  1.5       +17 -16    src/sys/sys/turnstile.h

After that commit the mutex loop looked like this:

    while (!_obtain_lock) {
            turnstile_lock();               /* (A) locked ts chain only */

            if (lock unlocked) {                    (B)
                    turnstile_release();
                    continue;
            }

            if (failed to set contested flag) {     (C)
                    turnstile_release();
                    continue;
            }

            if (owner is running) {                 (D)
                    turnstile_release();
                    ... spin ...
                    continue;
            }

            turnstile_wait(ts, ...);        /* (E) Does O(n) lookup */
    }

That is, after the above commit we only did the O(n) lookup if we needed
the actual turnstile object in (E).  This decreased the overhead for the
(B), (C), and (D) cases.

The changes to add trywait() and cancel() basically seem to have
reverted the above commit by going back to doing the O(n) lookup in (A),
thus re-adding the overhead to the (B), (C), and (D) cases.  In
addition, trywait() / cancel() actually maintain some additional state
and acquire another lock, so (B), (C), and (D) now have even more
overhead than in the first cut of the turnstile code, including having
to back out some of the changes in cancel() rather than doing just a
single spinlock unlock.

This is actually why I had assumed you enqueued the turnstile in
trywait(): I thought you had dropped the chain lock in trywait() (which
would require enqueuing the thread in trywait(), but would also greatly
complicate cancel()).  Otherwise, the current trywait() / cancel()
changes seem like a step backwards to me.

Do you really think it is necessary to do the O(n) lookup even when we
don't need it (B, C, D) rather than just doing it in turnstile_wait()
(E)?
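
For comparison, my reading of the current loop with trywait() and
cancel(), sketched in the same style (from memory, not a paste of the
real code):

    while (!_obtain_lock) {
            ts = turnstile_trywait();       /* (A) locks ts chain, does the
                                               O(n) lookup, and locks the
                                               turnstile spin lock */

            if (lock unlocked) {                    (B)
                    turnstile_cancel(ts);
                    continue;
            }

            if (failed to set contested flag) {     (C)
                    turnstile_cancel(ts);
                    continue;
            }

            if (owner is running) {                 (D)
                    turnstile_cancel(ts);
                    ... spin ...
                    continue;
            }

            turnstile_wait(ts, ...);                (E)
    }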
--
John Baldwin

From owner-freebsd-smp@FreeBSD.ORG Tue Sep 25 23:00:54 2007
From: Jeff Roberson
To: John Baldwin
Date: Tue, 25 Sep 2007 16:03:41 -0700 (PDT)
Message-ID: <20070925155504.W556@10.0.0.1>
Cc: Attilio Rao, freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Tue, 25 Sep 2007, John Baldwin wrote:
> So I looked at this some more, and now I don't understand why the
> trywait() and cancel() changes were done.  Some background:

I wanted the turnstile chain lock to protect only the association
between the turnstile and the lock.  This way it further reduces the
contention domain for the hard lock/unlock cases.  We sometimes see a
lot of contention on the turnstile chain that may be due to hash
collisions.  If this reduced contention doesn't help, I think it would
be possible to revert this change.  I originally had more of the code
running without the turnstile chain lock held, but while trying to
reduce diffs to understand the problem better, it eventually grew back
to cover almost all of the hard paths.

> When the initial turnstile code was split out of the mutex code, the
> main loop in mtx_lock_sleep() looked like this:
>
>     while (!_obtain_lock) {
>             ts = turnstile_lookup();        /* (A) locked ts chain and did
>                                                O(n) lookup to find the
>                                                existing ts */

I'm not sure why you're calling this O(n).  It's O(n) in the number of
hash collisions, but it's a hash lookup.  We should consider using a
real hash algorithm here if the distribution is bad.

Thanks,
Jeff
From owner-freebsd-smp@FreeBSD.ORG Wed Sep 26 06:59:40 2007
From: "Attilio Rao"
Sender: asmrookie@gmail.com
To: "Jeff Roberson"
Date: Wed, 26 Sep 2007 08:59:37 +0200
Message-ID: <3bbf2fe10709252359h7812ff7bsce9d393f80b65c06@mail.gmail.com>
Cc: freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

2007/9/26, Jeff Roberson:
> I wanted the turnstile chain lock to protect only the association
> between the turnstile and the lock.  This way it further reduces the
> contention domain for the hard lock/unlock cases.  We sometimes see a
> lot of contention on the turnstile chain that may be due to hash
> collisions.  If this reduced contention doesn't help, I think it would
> be possible to revert this change.

Well, the change jhb is suggesting won't reduce tc_lock contention for
sure, but it will avoid the turnstile lookup/locking when it isn't
necessary.
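
Schematically (turnstile_trywait(), turnstile_cancel() and the chain
lock calls are the real API names, the conditions are pseudocode, and I
am using the current names for what the older code called
turnstile_lock()/turnstile_release()):

    /*
     * Current code: the chain scan and the ts_lock acquisition happen
     * up front, even if we end up bailing out.
     */
    ts = turnstile_trywait(&m->lock_object); /* tc_lock + scan + ts_lock */
    if (cannot set the waiters flag) {
            turnstile_cancel(ts);           /* scan and ts_lock wasted */
            continue;
    }
    turnstile_wait(ts, ...);

    /*
     * Suggested: take only tc_lock up front and defer the scan.
     */
    turnstile_chain_lock(&m->lock_object);  /* tc_lock only */
    if (cannot set the waiters flag) {
            turnstile_chain_unlock(&m->lock_object);
            continue;
    }
    turnstile_wait(...);                    /* scan only when blocking */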
Mainly, we need a spinlock held in order to protect the mtx_lock bit
field (as well as rw_lock) when setting the waiters flags.
While in the past that lock was only tc_lock, currently mtx_lock is
protected by both tc_lock and ts_lock, which doesn't seem strictly
necessary.
If, for example, you lock tc_lock, then scan the chain and lock ts_lock
(basically what turnstile_trywait() does), and later you find that the
lock has been released, or that its state has changed so that you cannot
set a waiters flag, you did the ts_lock acquisition and the hash-table
scan even though they weren't needed.

> I'm not sure why you're calling this O(n).  It's O(n) in the number of
> hash collisions, but it's a hash lookup.  We should consider using a
> real hash algorithm here if the distribution is bad.

Well, the hash lookup has O(n) complexity in the worst case: O(1) to
reach the bucket times O(n) to scan its list, which degenerates to the
complexity of a plain linked list.

However, some time ago I observed empirically that, starting from the
address of an allocated 'struct thread', the highest entropy in the
word comes after a right shift of 10 bits (on ia32) or 12 bits (on
amd64).  Currently, turnstiles and sleepqueues use only an 8-bit right
shift.

Thanks,
Attilio

--
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-smp@FreeBSD.ORG Wed Sep 26 15:52:12 2007
From: Jeff Roberson
To: Attilio Rao
Date: Wed, 26 Sep 2007 08:54:57 -0700 (PDT)
Message-ID: <20070926084705.T556@10.0.0.1>
Cc: freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Wed, 26 Sep 2007, Attilio Rao wrote:
> Well, the change jhb is suggesting won't reduce tc_lock contention for
> sure, but it will avoid the turnstile lookup/locking when it isn't
> necessary.
> If, for example, you lock tc_lock, then scan the chain and lock ts_lock
> (basically what turnstile_trywait() does), and later you find that the
> lock has been released, or that its state has changed so that you
> cannot set a waiters flag, you did the ts_lock acquisition and the
> hash-table scan even though they weren't needed.

Yes, I understand; the question is whether the decreased complexity is a
bigger win than the reduced contention.  The contention is only reduced
if there are collisions in the table.  So probably jhb is right and we
should revert this part of the change.  I need to make sure there was
not some other reason for it; I had convinced myself at the time that it
was necessary, but I can't recall why.

> Well, the hash lookup has O(n) complexity in the worst case: O(1) to
> reach the bucket times O(n) to scan its list, which degenerates to the
> complexity of a plain linked list.

Sure, I just don't think it's common to call a hash O(n).

> However, some time ago I observed empirically that, starting from the
> address of an allocated 'struct thread', the highest entropy in the
> word comes after a right shift of 10 bits (on ia32) or 12 bits (on
> amd64).  Currently, turnstiles and sleepqueues use only an 8-bit right
> shift.

I haven't studied the hash distribution in turnstiles.  I know there is
some debugging code in there for it, so jhb must have; perhaps he can
say how it does.
From owner-freebsd-smp@FreeBSD.ORG Wed Sep 26 18:33:10 2007
From: John Baldwin
To: Jeff Roberson
Date: Wed, 26 Sep 2007 13:40:07 -0400
Message-Id: <200709261340.08344.jhb@freebsd.org>
Cc: Attilio Rao, freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Wednesday 26 September 2007 11:54:57 am Jeff Roberson wrote:
> Yes, I understand; the question is whether the decreased complexity is
> a bigger win than the reduced contention.  The contention is only
> reduced if there are collisions in the table.  So probably jhb is right
> and we should revert this part of the change.  I need to make sure
> there was not some other reason for it; I had convinced myself at the
> time that it was necessary, but I can't recall why.

Actually, since you still hold the tc_lock the entire time (trywait()
doesn't drop it), I don't think the trywait() / cancel() changes affect
spin lock contention at all.  I think it's more just a matter of
overhead.

If you do want to drop the tc_lock after the lookup in trywait(), then I
think the complexity you would need to add (queueing the thread in
trywait() to avoid a race with concurrent contesters on the "real" lock,
and then backing all that out in cancel()) would be very hairy.

> Sure, I just don't think it's common to call a hash O(n).

Well, the lookup of the chain in the bucket is what I call O(n).
Getting to the bucket (which you always have to do to lock the turnstile
chain) is O(1).  The change I'm suggesting would only defer walking the
chain to turnstile_wait(), not finding the index of the bucket.  It
would also defer grabbing the turnstile spin lock, which probably dwarfs
the cost of the chain walk in practice.

> > However, some time ago I observed empirically that, starting from
> > the address of an allocated 'struct thread', the highest entropy in
> > the word comes after a right shift of 10 bits (on ia32) or 12 bits
> > (on amd64).
>
> I haven't studied the hash distribution in turnstiles.  I know there
> is some debugging code in there for it, so jhb must have; perhaps he
> can say how it does.

I don't think I ever saw it go over 10.  Note that the comments about
'struct thread' aren't quite relevant, as sleepqueues and turnstiles
don't hash on thread pointers but on wait channels and lock pointers.
That said, I only used the values they use based on what tsleep() in 4.x
used for its hash for the 4.x sleep queues.
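
For reference, the hash is just a shift-and-mask on the lock address;
from memory, the macros in subr_turnstile.c look roughly like this:

    #define TC_TABLESIZE    128             /* must be a power of two */
    #define TC_MASK         (TC_TABLESIZE - 1)
    #define TC_SHIFT        8
    #define TC_HASH(lock)   (((uintptr_t)(lock) >> TC_SHIFT) & TC_MASK)
    #define TC_LOOKUP(lock) &turnstile_chains[TC_HASH(lock)]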
--
John Baldwin

From owner-freebsd-smp@FreeBSD.ORG Wed Sep 26 18:51:09 2007
From: Jeff Roberson
To: John Baldwin
Date: Wed, 26 Sep 2007 11:53:54 -0700 (PDT)
Message-ID: <20070926114943.X556@10.0.0.1>
Cc: Attilio Rao, freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Wed, 26 Sep 2007, John Baldwin wrote:
> Actually, since you still hold the tc_lock the entire time (trywait()
> doesn't drop it), I don't think the trywait() / cancel() changes affect
> spin lock contention at all.  I think it's more just a matter of
> overhead.
>
> If you do want to drop the tc_lock after the lookup in trywait(), then
> I think the complexity you would need to add (queueing the thread in
> trywait() to avoid a race with concurrent contesters on the "real"
> lock, and then backing all that out in cancel()) would be very hairy.

Yes, it's not for concurrency in lock, but in unlock.  If the chain gets
as deep as 10, there may be some practical value in reducing contention
on that chain.  We should look at this more before we rush into
anything.  Right now unlock holds the chain lock for almost the entire
duration, but that's not really necessary.  I think more experimentation
is warranted, especially since I did not actually measure a slowdown in
heavily contended workloads.  If we can reduce the number of hash
collisions significantly, then your approach is definitely a win;
however, it may be less so if we do not.

Thanks,
Jeff