From: John Baldwin
To: Max Laier
Cc: FreeBSD current, Peter Grehan, Andriy Gapon, neel@freebsd.org
Date: Mon, 16 May 2011 14:21:27 -0400
Subject: Re: proposed smp_rendezvous change
Message-Id: <201105161421.27665.jhb@freebsd.org>
In-Reply-To: <201105161346.34134.max@love2party.net>

On Monday, May 16, 2011 1:46:33 pm Max Laier wrote:
> I'd like to propose that we move forward with fixing the race, before
> redesigning the API - please.
>
> We agree that there is a problem with the exit rendezvous.  Let's fix
> that.  We agree that the generation approach proposed by NetAPP is the
> right way to go?  Can we check it in, then?  Again, I'd like some comments
> in the code if at all possible.

How about this:

Index: subr_smp.c
===================================================================
--- subr_smp.c        (revision 221939)
+++ subr_smp.c        (working copy)
@@ -111,6 +111,7 @@ static void (*volatile smp_rv_action_func)(void *a
 static void (*volatile smp_rv_teardown_func)(void *arg);
 static void *volatile smp_rv_func_arg;
 static volatile int smp_rv_waiters[3];
+static volatile int smp_rv_generation;
 
 /*
  * Shared mutex to restrict busywaits between smp_rendezvous() and
@@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
 void
 smp_rendezvous_action(void)
 {
-        void *local_func_arg = smp_rv_func_arg;
-        void (*local_setup_func)(void*) = smp_rv_setup_func;
-        void (*local_action_func)(void*) = smp_rv_action_func;
-        void (*local_teardown_func)(void*) = smp_rv_teardown_func;
+        void *local_func_arg;
+        void (*local_setup_func)(void*);
+        void (*local_action_func)(void*);
+        void (*local_teardown_func)(void*);
+        int generation;
 
         /* Ensure we have up-to-date values. */
         atomic_add_acq_int(&smp_rv_waiters[0], 1);
         while (smp_rv_waiters[0] < smp_rv_ncpus)
                 cpu_spinwait();
 
-        /* setup function */
+        /* Fetch rendezvous parameters after acquire barrier. */
+        local_func_arg = smp_rv_func_arg;
+        local_setup_func = smp_rv_setup_func;
+        local_action_func = smp_rv_action_func;
+        local_teardown_func = smp_rv_teardown_func;
+        generation = smp_rv_generation;
+
+        /*
+         * If requested, run a setup function before the main action
+         * function.  Ensure all CPUs have completed the setup
+         * function before moving on to the action function.
+         */
         if (local_setup_func != smp_no_rendevous_barrier) {
                 if (smp_rv_setup_func != NULL)
                         smp_rv_setup_func(smp_rv_func_arg);
-
-                /* spin on entry rendezvous */
                 atomic_add_int(&smp_rv_waiters[1], 1);
                 while (smp_rv_waiters[1] < smp_rv_ncpus)
                         cpu_spinwait();
         }
 
-        /* action function */
         if (local_action_func != NULL)
                 local_action_func(local_func_arg);
 
-        /* spin on exit rendezvous */
+        /*
+         * Signal that the main action has been completed.  If a
+         * full exit rendezvous is requested, then all CPUs will
+         * wait here until all CPUs have finished the main action.
+         *
+         * Note that the write by the last CPU to finish the action
+         * may become visible to different CPUs at different times.
+         * As a result, the CPU that initiated the rendezvous may
+         * exit the rendezvous and drop the lock allowing another
+         * rendezvous to be initiated on the same CPU or a different
+         * CPU.  In that case the exit sentinel may be cleared before
+         * all CPUs have noticed causing those CPUs to hang forever.
+         * Workaround this by using a generation count to notice when
+         * this race occurs and to exit the rendezvous in that case.
+         */
         atomic_add_int(&smp_rv_waiters[2], 1);
         if (local_teardown_func == smp_no_rendevous_barrier)
                 return;
-        while (smp_rv_waiters[2] < smp_rv_ncpus)
+        while (smp_rv_waiters[2] < smp_rv_ncpus &&
+            generation == smp_rv_generation)
                 cpu_spinwait();
 
-        /* teardown function */
         if (local_teardown_func != NULL)
                 local_teardown_func(local_func_arg);
 }
@@ -374,10 +398,11 @@ smp_rendezvous_cpus(cpumask_t map,
         if (ncpus == 0)
                 panic("ncpus is 0 with map=0x%x", map);
 
-        /* obtain rendezvous lock */
         mtx_lock_spin(&smp_ipi_mtx);
 
-        /* set static function pointers */
+        atomic_add_acq_int(&smp_rv_generation, 1);
+
+        /* Pass rendezvous parameters via global variables. */
         smp_rv_ncpus = ncpus;
         smp_rv_setup_func = setup_func;
         smp_rv_action_func = action_func;
@@ -387,18 +412,25 @@ smp_rendezvous_cpus(cpumask_t map,
         smp_rv_waiters[2] = 0;
         atomic_store_rel_int(&smp_rv_waiters[0], 0);
 
-        /* signal other processors, which will enter the IPI with interrupts off */
+        /*
+         * Signal other processors, which will enter the IPI with
+         * interrupts off.
+         */
         ipi_selected(map & ~(1 << curcpu), IPI_RENDEZVOUS);
 
         /* Check if the current CPU is in the map */
         if ((map & (1 << curcpu)) != 0)
                 smp_rendezvous_action();
 
+        /*
+         * If the caller did not request an exit barrier to be enforced
+         * on each CPU, ensure that this CPU waits for all the other
+         * CPUs to finish the rendezvous.
+         */
         if (teardown_func == smp_no_rendevous_barrier)
                 while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
                         cpu_spinwait();
 
-        /* release lock */
         mtx_unlock_spin(&smp_ipi_mtx);
 }

> A second commit should take care of the entry sync - which may or may not
> be required on some architectures.  If we decide that we don't need it, we
> should remove it.  Otherwise, we must move all the assignments from global
> smp_rv_* to the stack in smp_rendezvous_action to after the sync.
> Otherwise we should just kill it.  In any case, it would be nice to clean
> this up.

The patch also fixes this in the cautious fashion (not removing the early
rendezvous).
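For context, here is a minimal sketch of how a caller drives this interface;
the count_action() function and rendezvous_hits counter are made-up examples,
not part of the patch:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/smp.h>
#include <machine/atomic.h>

static int rendezvous_hits;            /* illustrative counter only */

static void
count_action(void *arg)
{

        /* Runs on every CPU in the rendezvous, in IPI context. */
        atomic_add_int(arg, 1);
}

static void
example_rendezvous(void)
{

        /*
         * Passing NULL (rather than smp_no_rendevous_barrier) for the
         * setup and teardown functions means every CPU still takes the
         * entry and exit barriers; the exit spin is the loop that the
         * generation count above keeps from hanging forever when a new
         * rendezvous is started immediately afterwards.
         */
        smp_rendezvous(NULL, count_action, NULL, &rendezvous_hits);
}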
> After that, I have one more bugfix for rmlock(9) - one of the three
> consumers of this API (that I am aware of).  As it uses a spinlock inside
> its IPI action, we have to mark smp_rendezvous_action a critical section,
> otherwise the spin unlock might yield ... with the expected consequences.

Yes, we need to fix that.  Humm, it doesn't preempt when you do a
critical_exit() though?  Or do you use a hand-rolled critical exit that
doesn't do a deferred preemption?

Actually, I'm curious how the spin unlock inside the IPI could yield the
CPU.  Oh, is rmlock doing a wakeup inside the IPI handler?  I guess that is
ok as long as the critical_exit() just defers the preemption to the end of
the IPI handler.

> It is arguable if you should be allowed to use spinlocks in the rendezvous
> at all, but that's beside the point.

I'm less convinced this is bad since we don't use NMIs for these, so we know
the interrupted thread doesn't have any spin locks in the scheduler, etc.

-- 
John Baldwin
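A rough sketch of the critical-section idea discussed above, applied to the
action call in smp_rendezvous_action(); this is only an illustration of the
approach, not a patch from this thread, and whether critical_exit()'s
deferred preemption is acceptable here is exactly the open question above:

        /*
         * Sketch: run the action handler inside a critical section so
         * that releasing a spin lock (or doing a wakeup) from the
         * handler cannot preempt the IPI handler on the spot; any
         * pending preemption is deferred until critical_exit().
         */
        critical_enter();
        if (local_action_func != NULL)
                local_action_func(local_func_arg);
        critical_exit();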