From owner-freebsd-smp@FreeBSD.ORG Sun Sep 23 02:57:50 2007
From: "Attilio Rao"
Sender: asmrookie@gmail.com
To: freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Date: Sun, 23 Sep 2007 04:32:06 +0200
Message-ID: <3bbf2fe10709221932i386f65b9h6f47ab4bee08c528@mail.gmail.com>
Subject: rwlocks: poor performance with adaptive spinning

Recently several people have reported starvation problems with rwlocks.
In particular, users who tried rwlocks in large SMP environments (16+
CPUs) found them rather prone to poor performance and to starvation of
waiters.

Inspecting the code, something strange about adaptive spinning popped
up: for rwlocks, the adaptive spinning stubs sit too deep in the
decision loop.  Their placement lets the thread that is about to spin
adaptively first set the respective (read or write) waiters flag, which
means the owner of the lock will take the hard path when releasing and
will perform a full wakeup even though the waiter queues may turn out
to be empty.  This is a big penalty for adaptive spinning and can make
it completely useless.

In addition to this, adaptive spinning only runs in the turnstile
spinlock path, which is not ideal.
This patch ports the approach already used for adaptive spinning in sx
locks to rwlocks:
http://users.gufi.org/~rookie/works/patches/kern_rwlock.diff

For sx locks big benefits are unlikely, because they are held for long
periods, but for rwlocks the situation is rather different.
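
To make the placement concrete, this is roughly the shape of the change
(a simplified sketch of the approach, not the literal diff; the names
follow the kernel's style but the details are approximate, and error
handling and statistics are omitted):

    /*
     * Spin while the owner is running *before* touching the turnstile
     * or setting a waiters flag, so that a briefly held lock never
     * forces the owner through the hard wakeup path.
     */
    tid = (uintptr_t)curthread;
    while (!atomic_cmpset_acq_ptr(&rw->rw_lock, RW_UNLOCKED, tid)) {
            owner = rw_owner(rw);
            if (owner != NULL && TD_IS_RUNNING(owner)) {
                    /* Owner is on a CPU: just spin and retry. */
                    while (rw_owner(rw) == owner && TD_IS_RUNNING(owner))
                            cpu_spinwait();
                    continue;
            }
            /*
             * Owner is not running (or the lock is read-locked): fall
             * back to the turnstile path, which sets the waiters flag
             * and may block.
             */
            ...
    }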
I would like people to benchmark this patch (maybe in private
environments?), as I am not able to do so on short notice.

Adaptive spinning in rwlocks can be improved further with other tricks
(adding a backoff counter, for example, or spinning while the lock is
held in read mode too), but first we should be sure we are starting
from a solid base.

Thanks,
Attilio

--
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-smp@FreeBSD.ORG Mon Sep 24 16:29:34 2007
From: John Baldwin
To: "Attilio Rao"
Date: Mon, 24 Sep 2007 11:52:41 -0400
Message-Id: <200709241152.41660.jhb@freebsd.org>
Cc: freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Saturday 22 September 2007 10:32:06 pm Attilio Rao wrote:
> Recently several people have reported starvation problems with rwlocks.
> In particular, users who tried rwlocks in large SMP environments (16+
> CPUs) found them rather prone to poor performance and to starvation of
> waiters.
>
> Inspecting the code, something strange about adaptive spinning popped
> up: for rwlocks, the adaptive spinning stubs sit too deep in the
> decision loop.  Their placement lets the thread that is about to spin
> adaptively first set the respective (read or write) waiters flag, which
> means the owner of the lock will take the hard path when releasing and
> will perform a full wakeup even though the waiter queues may turn out
> to be empty.  This is a big penalty for adaptive spinning and can make
> it completely useless.
> In addition to this, adaptive spinning only runs in the turnstile
> spinlock path, which is not ideal.
> This patch ports the approach already used for adaptive spinning in sx
> locks to rwlocks:
> http://users.gufi.org/~rookie/works/patches/kern_rwlock.diff
>
> For sx locks big benefits are unlikely, because they are held for long
> periods, but for rwlocks the situation is rather different.
> I would like people to benchmark this patch (maybe in private
> environments?), as I am not able to do so on short notice.
>
> Adaptive spinning in rwlocks can be improved further with other tricks
> (adding a backoff counter, for example, or spinning while the lock is
> held in read mode too), but first we should be sure we are starting
> from a solid base.

I did this for mutexes and rwlocks over a year ago and Kris found it was
slower in benchmarks.  www.freebsd.org/~jhb/patches/lock_adapt.patch is
the last thing I sent kris@ to test (it only has the mutex changes).
This might be more optimal post-thread_lock, since thread_lock seems to
have heavily pessimized adaptive spinning: it now enqueues the thread
and then dequeues it again before doing the adaptive spin.  I liked the
approach originally because it simplifies the code a lot.  A separate
issue is that writers don't spin at all if a reader holds the lock, and
I think one thing to test for that would be an adaptive spin with a
static timeout.

--
John Baldwin

From owner-freebsd-smp@FreeBSD.ORG Mon Sep 24 20:54:27 2007
From: Jeff Roberson
To: John Baldwin
Date: Mon, 24 Sep 2007 13:57:06 -0700 (PDT)
Message-ID: <20070924135554.F547@10.0.0.1>
Cc: Attilio Rao, freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Mon, 24 Sep 2007, John Baldwin wrote:

> On Saturday 22 September 2007 10:32:06 pm Attilio Rao wrote:
>> [...]
> I did this for mutexes and rwlocks over a year ago and Kris found it
> was slower in benchmarks.  www.freebsd.org/~jhb/patches/lock_adapt.patch
> is the last thing I sent kris@ to test (it only has the mutex changes).
> This might be more optimal post-thread_lock, since thread_lock seems to
> have heavily pessimized adaptive spinning: it now enqueues the thread
> and then dequeues it again before doing the adaptive spin.  I liked the
> approach originally because it simplifies the code a lot.  A separate
> issue is that writers don't spin at all if a reader holds the lock, and
> I think one thing to test for that would be an adaptive spin with a
> static timeout.

We don't enqueue the thread until the same place; we just acquire an
extra spinlock.  The thread is not enqueued until turnstile_wait(), as
before.
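
Schematically, the hard path now looks like this (the turnstile calls
are the real API names; the conditions are pseudocode):

    ts = turnstile_trywait(&rw->lock_object); /* tc_lock + ts_lock */
    if (the lock state changed under us) {
            turnstile_cancel(ts);   /* drops the locks; nothing queued */
            continue;
    }
    ...
    turnstile_wait(ts, owner, queue); /* only here is the thread queued */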
Jeff

From owner-freebsd-smp@FreeBSD.ORG Mon Sep 24 21:23:22 2007
From: John Baldwin
To: Jeff Roberson
Date: Mon, 24 Sep 2007 17:22:28 -0400
Message-Id: <200709241722.28670.jhb@freebsd.org>
Cc: Attilio Rao, freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Monday 24 September 2007 04:57:06 pm Jeff Roberson wrote:
> On Mon, 24 Sep 2007, John Baldwin wrote:
> > [...]
>
> We don't enqueue the thread until the same place; we just acquire an
> extra spinlock.  The thread is not enqueued until turnstile_wait(), as
> before.

Oh.  That's what I get for assuming what trywait() and cancel() do based
on their names.  It is still more overhead than before, though, so
simplifying adaptive spinning might still be a win now as opposed to
before.
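
(On the static timeout I mentioned earlier: what I have in mind is just
a bounded spin when a reader holds the lock, something like the sketch
below; the predicate and the tunable are made-up names.)

    /*
     * Hypothetical bounded spin for a read-locked rwlock: there is no
     * single owner to watch, so spin for a fixed number of iterations.
     * rw_held_in_read() and SPIN_MAX are stand-ins.
     */
    for (count = 0; count < SPIN_MAX; count++) {
            if (!rw_held_in_read(rw))
                    break;          /* readers drained; retry the CAS */
            cpu_spinwait();
    }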
--
John Baldwin

From owner-freebsd-smp@FreeBSD.ORG Tue Sep 25 17:17:50 2007
From: John Baldwin
To: Jeff Roberson
Date: Tue, 25 Sep 2007 13:14:42 -0400
Message-Id: <200709251314.42707.jhb@freebsd.org>
Cc: Attilio Rao, freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Monday 24 September 2007 04:57:06 pm Jeff Roberson wrote:
> On Mon, 24 Sep 2007, John Baldwin wrote:
> > [...]
>
> We don't enqueue the thread until the same place; we just acquire an
> extra spinlock.  The thread is not enqueued until turnstile_wait(), as
> before.

So I looked at this some more, and now I don't understand why the
trywait() and cancel() changes were done.  Some background:

When the initial turnstile code was split out of the mutex code, the
main loop in mtx_lock_sleep() looked like this:

    while (!_obtain_lock) {
            ts = turnstile_lookup();        /* (A) locked ts chain and did
                                               O(n) lookup to find the
                                               existing ts */

            if (lock unlocked) {                    (B)
                    turnstile_release();
                    continue;
            }

            if (failed to set contested flag) {     (C)
                    turnstile_release();
                    continue;
            }

            if (owner is running) {                 (D)
                    turnstile_release();
                    ... spin ...
                    continue;
            }

            turnstile_wait(ts, ...);                (E)
    }

So one thing I noticed after a while is that every iteration of this
loop did the O(n) lookup to find the turnstile, even though we only used
it in (E).  The lookup was wasted overhead for the (B), (C), and (D)
cases.  Hence this commit a while later:

  jhb         2004-10-12 18:36:20 UTC

  Modified files:
    sys/kern             kern_condvar.c kern_mutex.c kern_synch.c
                         subr_sleepqueue.c subr_turnstile.c
    sys/sys              sleepqueue.h turnstile.h
  Log:
  Refine the turnstile and sleep queue interfaces just a bit:
  - Add a new _lock() call to each API that locks the associated chain lock
    for a lock_object pointer or wait channel.  The _lookup() functions now
    require that the chain lock be locked via _lock() when they are called.
  - Change sleepq_add(), turnstile_wait() and turnstile_claim() to lookup
    the associated queue structure internally via _lookup() rather than
    accepting a pointer from the caller.  For turnstiles, this means that
    the actual lookup of the turnstile in the hash table is only done when
    the thread actually blocks rather than being done on each loop iteration
    in _mtx_lock_sleep().  For sleep queues, this means that sleepq_lookup()
    is no longer used outside of the sleep queue code except to implement an
    assertion in cv_destroy().
  - Change sleepq_broadcast() and sleepq_signal() to require that the chain
    lock is already required.
    For condition variables, this lets the
    cv_broadcast() and cv_signal() functions lock the sleep queue chain lock
    while testing the waiters count.  This means that the waiters count
    internal to condition variables is no longer protected by the interlock
    mutex and cv_broadcast() and cv_signal() now no longer require that the
    interlock be held when they are called.  This lets consumers of condition
    variables drop the lock before waking other threads which can result in
    fewer context switches.

  MFC after:    1 month

  Revision  Changes    Path
  1.52      +16 -15    src/sys/kern/kern_condvar.c
  1.151     +4 -5      src/sys/kern/kern_mutex.c
  1.263     +4 -3      src/sys/kern/kern_synch.c
  1.13      +31 -14    src/sys/kern/subr_sleepqueue.c
  1.150     +34 -12    src/sys/kern/subr_turnstile.c
  1.5       +11 -12    src/sys/sys/sleepqueue.h
  1.5       +17 -16    src/sys/sys/turnstile.h

After that commit the mutex loop looked like this:

    while (!_obtain_lock) {
            turnstile_lock();               /* (A) locked ts chain only */

            if (lock unlocked) {                    (B)
                    turnstile_release();
                    continue;
            }

            if (failed to set contested flag) {     (C)
                    turnstile_release();
                    continue;
            }

            if (owner is running) {                 (D)
                    turnstile_release();
                    ... spin ...
                    continue;
            }

            turnstile_wait(ts, ...);        /* (E) Does O(n) lookup */
    }

That is, after the above commit we only did the O(n) lookup if we needed
the actual turnstile object in (E).  This decreased the overhead for the
(B), (C), and (D) cases.

The changes to add trywait() and cancel() basically seem to have
reverted the above commit by going back to doing the O(n) lookup in (A),
thus re-adding the overhead to the (B), (C), and (D) cases.  In
addition, trywait() / cancel() actually maintain some additional state
and acquire another lock, so (B), (C), and (D) now have even more
overhead than in the first cut of the turnstile code, including having
to back out some of the changes in cancel() rather than doing just a
single spinlock unlock.

This is actually why I had assumed you enqueued the turnstile in
trywait(): I thought you had dropped the chain lock in trywait() (which
would require enqueuing the thread in trywait(), but would also greatly
complicate cancel()).  Otherwise, the current trywait() / cancel()
changes seem like a step backwards to me.

Do you really think it is necessary to do the O(n) lookup even when we
don't need it (B, C, D) rather than just doing it in turnstile_wait()
(E)?
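
For comparison, my reading of the current loop with trywait() and
cancel(), sketched in the same style (from memory, not a paste of the
real code):

    while (!_obtain_lock) {
            ts = turnstile_trywait();       /* (A) locks ts chain, does the
                                               O(n) lookup, and locks the
                                               turnstile spin lock */

            if (lock unlocked) {                    (B)
                    turnstile_cancel(ts);
                    continue;
            }

            if (failed to set contested flag) {     (C)
                    turnstile_cancel(ts);
                    continue;
            }

            if (owner is running) {                 (D)
                    turnstile_cancel(ts);
                    ... spin ...
                    continue;
            }

            turnstile_wait(ts, ...);                (E)
    }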
--
John Baldwin

From owner-freebsd-smp@FreeBSD.ORG Tue Sep 25 23:00:54 2007
From: Jeff Roberson
To: John Baldwin
Date: Tue, 25 Sep 2007 16:03:41 -0700 (PDT)
Message-ID: <20070925155504.W556@10.0.0.1>
Cc: Attilio Rao, freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Tue, 25 Sep 2007, John Baldwin wrote:
> So I looked at this some more, and now I don't understand why the
> trywait() and cancel() changes were done.  Some background:

I wanted the turnstile chain lock to protect only the association
between the turnstile and the lock.  This way it further reduces the
contention domain for the hard lock/unlock cases.  We sometimes see a
lot of contention on the turnstile chain that may be due to hash
collisions.  If this reduced contention doesn't help, I think it would
be possible to revert this change.  I originally had more of the code
running without the turnstile chain lock held, but while trying to
reduce diffs to understand the problem better, it eventually grew back
to cover almost all of the hard paths.

> When the initial turnstile code was split out of the mutex code, the
> main loop in mtx_lock_sleep() looked like this:
>
>     while (!_obtain_lock) {
>             ts = turnstile_lookup();        /* (A) locked ts chain and did
>                                                O(n) lookup to find the
>                                                existing ts */

I'm not sure why you're calling this O(n).  It's O(n) in the number of
hash collisions, but it's a hash lookup.  We should consider using a
real hash algorithm here if the distribution is bad.

Thanks,
Jeff
From owner-freebsd-smp@FreeBSD.ORG Wed Sep 26 06:59:40 2007
From: "Attilio Rao"
Sender: asmrookie@gmail.com
To: "Jeff Roberson"
Date: Wed, 26 Sep 2007 08:59:37 +0200
Message-ID: <3bbf2fe10709252359h7812ff7bsce9d393f80b65c06@mail.gmail.com>
Cc: freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

2007/9/26, Jeff Roberson:
> I wanted the turnstile chain lock to protect only the association
> between the turnstile and the lock.  This way it further reduces the
> contention domain for the hard lock/unlock cases.  We sometimes see a
> lot of contention on the turnstile chain that may be due to hash
> collisions.  If this reduced contention doesn't help, I think it would
> be possible to revert this change.

Well, the change jhb is suggesting won't reduce tc_lock contention for
sure, but it will avoid the turnstile lookup/locking when it isn't
necessary.
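
Schematically (turnstile_trywait(), turnstile_cancel() and the chain
lock calls are the real API names, the conditions are pseudocode, and I
am using the current names for what the older code called
turnstile_lock()/turnstile_release()):

    /*
     * Current code: the chain scan and the ts_lock acquisition happen
     * up front, even if we end up bailing out.
     */
    ts = turnstile_trywait(&m->lock_object); /* tc_lock + scan + ts_lock */
    if (cannot set the waiters flag) {
            turnstile_cancel(ts);           /* scan and ts_lock wasted */
            continue;
    }
    turnstile_wait(ts, ...);

    /*
     * Suggested: take only tc_lock up front and defer the scan.
     */
    turnstile_chain_lock(&m->lock_object);  /* tc_lock only */
    if (cannot set the waiters flag) {
            turnstile_chain_unlock(&m->lock_object);
            continue;
    }
    turnstile_wait(...);                    /* scan only when blocking */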
Mainly, we need a spinlock held in order to protect the mtx_lock bit
field (as well as rw_lock) when setting the waiters flags.
While in the past that lock was only tc_lock, currently mtx_lock is
protected by both tc_lock and ts_lock, which doesn't seem strictly
necessary.
If, for example, you lock tc_lock, then scan the chain and lock ts_lock
(basically what turnstile_trywait() does), and later you find that the
lock has been released, or that its state has changed so that you cannot
set a waiters flag, you did the ts_lock acquisition and the hash-table
scan even though they weren't needed.

> I'm not sure why you're calling this O(n).  It's O(n) in the number of
> hash collisions, but it's a hash lookup.  We should consider using a
> real hash algorithm here if the distribution is bad.

Well, the hash lookup has O(n) complexity in the worst case: O(1) to
reach the bucket times O(n) to scan its list, which degenerates to the
complexity of a plain linked list.

However, some time ago I observed empirically that, starting from the
address of an allocated 'struct thread', the highest entropy in the
word comes after a right shift of 10 bits (on ia32) or 12 bits (on
amd64).  Currently, turnstiles and sleepqueues use only an 8-bit right
shift.

Thanks,
Attilio

--
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-smp@FreeBSD.ORG Wed Sep 26 15:52:12 2007
From: Jeff Roberson
To: Attilio Rao
Date: Wed, 26 Sep 2007 08:54:57 -0700 (PDT)
Message-ID: <20070926084705.T556@10.0.0.1>
Cc: freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Wed, 26 Sep 2007, Attilio Rao wrote:
> Well, the change jhb is suggesting won't reduce tc_lock contention for
> sure, but it will avoid the turnstile lookup/locking when it isn't
> necessary.
> If, for example, you lock tc_lock, then scan the chain and lock ts_lock
> (basically what turnstile_trywait() does), and later you find that the
> lock has been released, or that its state has changed so that you
> cannot set a waiters flag, you did the ts_lock acquisition and the
> hash-table scan even though they weren't needed.

Yes, I understand; the question is whether the decreased complexity is a
bigger win than the reduced contention.  The contention is only reduced
if there are collisions in the table.  So probably jhb is right and we
should revert this part of the change.  I need to make sure there was
not some other reason for it; I had convinced myself at the time that it
was necessary, but I can't recall why.

> Well, the hash lookup has O(n) complexity in the worst case: O(1) to
> reach the bucket times O(n) to scan its list, which degenerates to the
> complexity of a plain linked list.

Sure, I just don't think it's common to call a hash O(n).

> However, some time ago I observed empirically that, starting from the
> address of an allocated 'struct thread', the highest entropy in the
> word comes after a right shift of 10 bits (on ia32) or 12 bits (on
> amd64).  Currently, turnstiles and sleepqueues use only an 8-bit right
> shift.

I haven't studied the hash distribution in turnstiles.  I know there is
some debugging code in there for it, so jhb must have; perhaps he can
say how it does.
From owner-freebsd-smp@FreeBSD.ORG Wed Sep 26 18:33:10 2007
From: John Baldwin
To: Jeff Roberson
Date: Wed, 26 Sep 2007 13:40:07 -0400
Message-Id: <200709261340.08344.jhb@freebsd.org>
Cc: Attilio Rao, freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Wednesday 26 September 2007 11:54:57 am Jeff Roberson wrote:
> Yes, I understand; the question is whether the decreased complexity is
> a bigger win than the reduced contention.  The contention is only
> reduced if there are collisions in the table.  So probably jhb is right
> and we should revert this part of the change.  I need to make sure
> there was not some other reason for it; I had convinced myself at the
> time that it was necessary, but I can't recall why.

Actually, since you still hold the tc_lock the entire time (trywait()
doesn't drop it), I don't think the trywait() / cancel() changes affect
spin lock contention at all.  I think it's more just a matter of
overhead.

If you do want to drop the tc_lock after the lookup in trywait(), then I
think the complexity you would need to add (queueing the thread in
trywait() to avoid a race with concurrent contesters on the "real" lock,
and then backing all that out in cancel()) would be very hairy.

> Sure, I just don't think it's common to call a hash O(n).

Well, the lookup of the chain in the bucket is what I call O(n).
Getting to the bucket (which you always have to do to lock the turnstile
chain) is O(1).  The change I'm suggesting would only defer walking the
chain to turnstile_wait(), not finding the index of the bucket.  It
would also defer grabbing the turnstile spin lock, which probably dwarfs
the cost of the chain walk in practice.

> > However, some time ago I observed empirically that, starting from
> > the address of an allocated 'struct thread', the highest entropy in
> > the word comes after a right shift of 10 bits (on ia32) or 12 bits
> > (on amd64).
>
> I haven't studied the hash distribution in turnstiles.  I know there
> is some debugging code in there for it, so jhb must have; perhaps he
> can say how it does.

I don't think I ever saw it go over 10.  Note that the comments about
'struct thread' aren't quite relevant, as sleepqueues and turnstiles
don't hash on thread pointers but on wait channels and lock pointers.
That said, I only used the values they use based on what tsleep() in 4.x
used for its hash for the 4.x sleep queues.
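
For reference, the hash is just a shift-and-mask on the lock address;
from memory, the macros in subr_turnstile.c look roughly like this:

    #define TC_TABLESIZE    128             /* must be a power of two */
    #define TC_MASK         (TC_TABLESIZE - 1)
    #define TC_SHIFT        8
    #define TC_HASH(lock)   (((uintptr_t)(lock) >> TC_SHIFT) & TC_MASK)
    #define TC_LOOKUP(lock) &turnstile_chains[TC_HASH(lock)]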
--
John Baldwin

From owner-freebsd-smp@FreeBSD.ORG Wed Sep 26 18:51:09 2007
From: Jeff Roberson
To: John Baldwin
Date: Wed, 26 Sep 2007 11:53:54 -0700 (PDT)
Message-ID: <20070926114943.X556@10.0.0.1>
Cc: Attilio Rao, freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning

On Wed, 26 Sep 2007, John Baldwin wrote:
> Actually, since you still hold the tc_lock the entire time (trywait()
> doesn't drop it), I don't think the trywait() / cancel() changes affect
> spin lock contention at all.  I think it's more just a matter of
> overhead.
>
> If you do want to drop the tc_lock after the lookup in trywait(), then
> I think the complexity you would need to add (queueing the thread in
> trywait() to avoid a race with concurrent contesters on the "real"
> lock, and then backing all that out in cancel()) would be very hairy.

Yes, it's not for concurrency in lock, but in unlock.  If the chain gets
as deep as 10, there may be some practical value in reducing contention
on that chain.  We should look at this more before we rush into
anything.  Right now unlock holds the chain lock for almost the entire
duration, but that's not really necessary.  I think more experimentation
is warranted, especially since I did not actually measure a slowdown in
heavily contended workloads.  If we can reduce the number of hash
collisions significantly, then your approach is definitely a win;
however, it may be less so if we do not.

Thanks,
Jeff