From: Attilio Rao <attilio@FreeBSD.org>
Date: Wed, 29 Dec 2010 18:17:36 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r216805 - head/sys/kern
Message-Id: <201012291817.oBTIHa0H009085@svn.freebsd.org>

Author: attilio
Date: Wed Dec 29 18:17:36 2010
New Revision: 216805
URL: http://svn.freebsd.org/changeset/base/216805

Log:
  Fix several callout migration races:

  - Problem1:
    Hypothesis: thread1 is doing a callout_reset_on() from within its
    callout handler, willing to implicitly or explicitly migrate the
    callout.  thread2 is draining the callout.

    Thesis:
    * thread1 calls callout_lock() and locks the old callout cpu
    * thread1 performs the checks in the first path of callout_reset_on()
    * thread1 hits this piece of code:

          /*
           * If the lock must migrate we have to check the state again as
           * we can't hold both the new and old locks simultaneously.
           */
          if (c->c_cpu != cpu) {
                  c->c_cpu = cpu;
                  CC_UNLOCK(cc);
                  goto retry;
          }

      which means it will drop the lock and 'retry'
    * thread2 calls callout_lock() and locks the new callout cpu.
      thread1 spins on the new lock and does not keep going for the
      moment.
    * thread2 checks that the callout is not pending (as the callout is
      currently running) and that it is not cc->cc_curr (because cc now
      refers to the new callout cpu while the callout is running on the
      old callout cpu), thus it thinks it is done and returns.
    * thread1 now acquires the lock and adds the callout to the new
      callout cpu queue.

    That is an obvious race: callout_stop() falsely reports the callout
    as stopped or, worse, callout_drain() falsely returns while the
    callout is still in use.
  - Solution1:
    Fixing this problem would require, in general, locking both callout
    cpus at once while switching the c_cpu field, while also avoiding
    cyclic deadlocks between the callout cpu locks.
    The concept of CPUBLOCK is then introduced (working more or less like
    the blocked_lock used by the thread_lock() function), meaning: "in
    callout_lock(), spin until c->c_cpu is different from CPUBLOCK".
    That way the "original" callout cpu, referred to in the above
    mentioned code snippet, will remain blocked until the lock handover
    is over and the critical path will remain covered.
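
  For illustration only (this is not the committed code), the CPUBLOCK
  handover can be pictured with a small user-space model: pthread mutexes
  stand in for the per-CPU callout spinlocks and a C11 atomic stands in
  for c->c_cpu.  All names below (BLOCKED, owner_cpu, per_cpu_lock,
  lock_owner, migrate) are invented for the sketch:

  /*
   * Illustration only: a user-space model of the CPUBLOCK handover.
   * BLOCKED plays the role of CPUBLOCK, owner_cpu the role of c->c_cpu
   * and per_cpu_lock[] the role of the per-CPU callout spinlocks.
   */
  #include <pthread.h>
  #include <sched.h>
  #include <stdatomic.h>
  #include <stdio.h>

  #define NCPU    2
  #define BLOCKED NCPU                    /* sentinel, like CPUBLOCK == MAXCPU */

  static pthread_mutex_t per_cpu_lock[NCPU] = {
      PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
  };
  static _Atomic int owner_cpu = 0;       /* which per-CPU lock covers the callout */

  /* Analogue of callout_lock(): spin until the owner is stable, then lock it. */
  static int
  lock_owner(void)
  {
      int cpu;

      for (;;) {
          cpu = atomic_load(&owner_cpu);
          if (cpu == BLOCKED) {           /* migration in progress */
              sched_yield();              /* cpu_spinwait() stand-in */
              continue;
          }
          pthread_mutex_lock(&per_cpu_lock[cpu]);
          if (cpu == atomic_load(&owner_cpu))
              return (cpu);               /* still the owner: done */
          pthread_mutex_unlock(&per_cpu_lock[cpu]);
      }
  }

  /* Analogue of the migration path in callout_reset_on(). */
  static void
  migrate(int from, int to)
  {
      pthread_mutex_lock(&per_cpu_lock[from]);
      atomic_store(&owner_cpu, BLOCKED);  /* block new lockers */
      pthread_mutex_unlock(&per_cpu_lock[from]);
      pthread_mutex_lock(&per_cpu_lock[to]);
      atomic_store(&owner_cpu, to);       /* publish the new owner */
      pthread_mutex_unlock(&per_cpu_lock[to]);
  }

  static void *
  migrator(void *arg)
  {
      (void)arg;
      migrate(0, 1);
      return (NULL);
  }

  int
  main(void)
  {
      pthread_t td;
      int cpu;

      pthread_create(&td, NULL, migrator, NULL);
      cpu = lock_owner();                 /* may spin while owner_cpu == BLOCKED */
      printf("callout home is now cpu %d\n", cpu);
      pthread_mutex_unlock(&per_cpu_lock[cpu]);
      pthread_join(td, NULL);
      return (0);
  }

  A locker that observes BLOCKED never proceeds against a stale owner,
  which is what keeps thread2 in the Problem1 interleaving from declaring
  the callout idle too early.
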
  - Problem2:
    Having the callout currently executing on a specific callout cpu
    while simultaneously pending on another callout cpu (as can happen
    with the current code) breaks, at least, the assumption that
    callout_drain() returns only once the callout can no longer be
    referenced.
  - Solution2:
    Callout migration is deferred if the callout is currently under
    execution.
    The best place to do that is in softclock(), and new members are
    added to the callout cpu structure in order to record that a pending
    migration has been requested.  That is necessary because the callout
    cannot be trusted (assumed not freed) 100% of the time after the
    execution of the callout handler.
    In the "deferred migration" case, CPUBLOCK prevents the callout from
    being freed, stopping any possible callout_stop() and callout_drain()
    activity until the migration is actually performed.
  - Problem3:
    There is a further race in callout_drain().
    In order to avoid a race between the sleepqueue lock and the callout
    cpu spinlock, in _callout_stop_safe() the callout cpu lock is
    dropped, the sleepqueue lock is acquired and a new callout cpu lookup
    is performed.  Note that the channel used for locking the sleepqueue
    is obtained from the "current" callout cpu (&cc->cc_waiting).
    If the callout migrated in the meantime, callout_drain() ends up
    using the wrong wchan for the sleepqueue (the locked one is the old
    one, while the new one is not really locked), leading to a lock leak
    and racy access to the sleepqueue.
  - Solution3:
    It is enough to check whether a migration happened between acquiring
    the sleepqueue lock and the new callout cpu lock and, if so, unwind
    all of them and try again.

  These problems can lead to deadly races on moderate (4-way) SMP
  environments, resulting in easy panics or deadlocks.
  The reporter's 24-way machine could easily panic, with a completely
  normal workload, almost daily.
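
  As a stand-alone picture of the Solution2 bookkeeping (illustrative
  only; all toy_* names are invented, while the real code keeps the
  request in the cc_migration_* members and applies it at the end of
  softclock() once cc_curr is cleared):

  /*
   * Stand-alone toy of the "deferred migration" bookkeeping.  The names
   * are invented for the sketch; the real code stores the request in
   * cc_migration_{cpu,ticks,func,arg} and applies it after the handler
   * has returned.
   */
  #include <stdio.h>

  #define TOY_MAXCPU   4
  #define TOY_CPUBLOCK TOY_MAXCPU   /* sentinel, like CPUBLOCK == MAXCPU */

  struct toy_callout {
      int  cpu;                     /* home CPU, or TOY_CPUBLOCK while blocked */
      int  ticks;
      void (*func)(void *);
      void *arg;
  };

  struct toy_cc {                   /* per-CPU dispatcher state */
      struct toy_callout *curr;     /* handler currently running, if any */
      int  mig_cpu;                 /* TOY_CPUBLOCK == no migration pending */
      int  mig_ticks;
      void (*mig_func)(void *);
      void *mig_arg;
  };

  /* Rough analogue of the migration branch in callout_reset_on(). */
  static void
  toy_reset_on(struct toy_cc *cc, struct toy_callout *c, int ticks,
      void (*func)(void *), void *arg, int cpu)
  {
      if (c->cpu != cpu && cc->curr == c) {
          /* Handler still running: block the callout and defer. */
          c->cpu = TOY_CPUBLOCK;
          cc->mig_cpu = cpu;
          cc->mig_ticks = ticks;
          cc->mig_func = func;
          cc->mig_arg = arg;
          return;
      }
      c->cpu = cpu;
      c->ticks = ticks;
      c->func = func;
      c->arg = arg;
  }

  /* Rough analogue of the softclock() tail run after the handler returns. */
  static void
  toy_softclock_tail(struct toy_cc *cc, struct toy_callout *c)
  {
      if (cc->mig_cpu != TOY_CPUBLOCK) {
          c->cpu = cc->mig_cpu;
          c->ticks = cc->mig_ticks;
          c->func = cc->mig_func;
          c->arg = cc->mig_arg;
          cc->mig_cpu = TOY_CPUBLOCK;
      }
      cc->curr = NULL;
  }

  static void
  toy_handler(void *arg)
  {
      (void)arg;
  }

  int
  main(void)
  {
      struct toy_callout c = { .cpu = 0 };
      struct toy_cc cc = { .curr = &c, .mig_cpu = TOY_CPUBLOCK };

      /* A handler rescheduling itself on CPU 1 while still running. */
      toy_reset_on(&cc, &c, 5, toy_handler, NULL, 1);
      printf("deferred: c.cpu=%d target=%d\n", c.cpu, cc.mig_cpu);
      toy_softclock_tail(&cc, &c);
      printf("applied:  c.cpu=%d ticks=%d\n", c.cpu, c.ticks);
      return (0);
  }

  Running it prints the deferred request first and the applied one after
  the simulated handler completion, mirroring the ordering enforced in
  softclock().
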
  gianni@ kindly wrote the following proof-of-concept, which can panic a
  FreeBSD machine in less than one hour on a smaller SMP box:
  http://www.freebsd.org/~attilio/callout/test.c

  Reported by:            Nicholas Esborn, DesertNet
  In collaboration with:  gianni, pho, Nicholas Esborn
  Reviewed by:            jhb
  MFC after:              1 week (*)

  (*) Usually I would aim for a larger MFC timeout, but I really want
      this in before 8.2-RELEASE, so re@ accepted a shorter timeout as a
      special case for this patch.

Modified:
  head/sys/kern/kern_timeout.c

Modified: head/sys/kern/kern_timeout.c
==============================================================================
--- head/sys/kern/kern_timeout.c	Wed Dec 29 17:12:05 2010	(r216804)
+++ head/sys/kern/kern_timeout.c	Wed Dec 29 18:17:36 2010	(r216805)
@@ -56,6 +56,10 @@ __FBSDID("$FreeBSD$");
 #include
 #include
 
+#ifdef SMP
+#include <machine/cpu.h>
+#endif
+
 SDT_PROVIDER_DEFINE(callout_execute);
 SDT_PROBE_DEFINE(callout_execute, kernel, , callout_start, callout-start);
 SDT_PROBE_ARGTYPE(callout_execute, kernel, , callout_start, 0,
@@ -107,11 +111,15 @@ struct callout_cpu {
 	struct callout		*cc_next;
 	struct callout		*cc_curr;
 	void			*cc_cookie;
+	void			(*cc_migration_func)(void *);
+	void			*cc_migration_arg;
 	int			cc_ticks;
 	int			cc_softticks;
 	int			cc_cancel;
 	int			cc_waiting;
 	int			cc_firsttick;
+	int			cc_migration_cpu;
+	int			cc_migration_ticks;
 };
 
 #ifdef SMP
@@ -123,8 +131,10 @@ struct callout_cpu cc_cpu;
 #define	CC_CPU(cpu)	&cc_cpu
 #define	CC_SELF()	&cc_cpu
 #endif
+#define	CPUBLOCK	MAXCPU
 #define	CC_LOCK(cc)	mtx_lock_spin(&(cc)->cc_lock)
 #define	CC_UNLOCK(cc)	mtx_unlock_spin(&(cc)->cc_lock)
+#define	CC_LOCK_ASSERT(cc)	mtx_assert(&(cc)->cc_lock, MA_OWNED)
 
 static int timeout_cpu;
 void (*callout_new_inserted)(int cpu, int ticks) = NULL;
@@ -188,6 +198,7 @@ callout_cpu_init(struct callout_cpu *cc)
 	for (i = 0; i < callwheelsize; i++) {
 		TAILQ_INIT(&cc->cc_callwheel[i]);
 	}
+	cc->cc_migration_cpu = CPUBLOCK;
 	if (cc->cc_callout == NULL)
 		return;
 	for (i = 0; i < ncallout; i++) {
@@ -311,6 +322,13 @@ callout_lock(struct callout *c)
 
 	for (;;) {
 		cpu = c->c_cpu;
+#ifdef SMP
+		if (cpu == CPUBLOCK) {
+			while (c->c_cpu == CPUBLOCK)
+				cpu_spinwait();
+			continue;
+		}
+#endif
 		cc = CC_CPU(cpu);
 		CC_LOCK(cc);
 		if (cpu == c->c_cpu)
@@ -320,6 +338,29 @@ callout_lock(struct callout *c)
 	return (cc);
 }
 
+static void
+callout_cc_add(struct callout *c, struct callout_cpu *cc, int to_ticks,
+    void (*func)(void *), void *arg, int cpu)
+{
+
+	CC_LOCK_ASSERT(cc);
+
+	if (to_ticks <= 0)
+		to_ticks = 1;
+	c->c_arg = arg;
+	c->c_flags |= (CALLOUT_ACTIVE | CALLOUT_PENDING);
+	c->c_func = func;
+	c->c_time = ticks + to_ticks;
+	TAILQ_INSERT_TAIL(&cc->cc_callwheel[c->c_time & callwheelmask],
+	    c, c_links.tqe);
+	if ((c->c_time - cc->cc_firsttick) < 0 &&
+	    callout_new_inserted != NULL) {
+		cc->cc_firsttick = c->c_time;
+		(*callout_new_inserted)(cpu,
+		    to_ticks + (ticks - cc->cc_ticks));
+	}
+}
+
 /*
  * The callout mechanism is based on the work of Adam M. Costello and
  * George Varghese, published in a technical report entitled "Redesigning
 
@@ -390,11 +431,14 @@ softclock(void *arg)
 					steps = 0;
 				}
 			} else {
+				struct callout_cpu *new_cc;
 				void (*c_func)(void *);
-				void *c_arg;
+				void (*new_func)(void *);
+				void *c_arg, *new_arg;
 				struct lock_class *class;
 				struct lock_object *c_lock;
 				int c_flags, sharedlock;
+				int new_cpu, new_ticks;
 
 				cc->cc_next = TAILQ_NEXT(c, c_links.tqe);
 				TAILQ_REMOVE(bucket, c, c_links.tqe);
@@ -496,7 +540,40 @@ softclock(void *arg)
 					    c_links.sle);
 				}
 				cc->cc_curr = NULL;
-				if (cc->cc_waiting) {
+
+				/*
+				 * If the callout was scheduled for
+				 * migration just perform it now.
+				 */
+				if (cc->cc_migration_cpu != CPUBLOCK) {
+
+					/*
+					 * There must not be any waiting
+					 * thread now because the callout
+					 * has a blocked CPU.
+					 * Also, the callout must not be
+					 * freed, but that is not easy to
+					 * assert.
+					 */
+					MPASS(cc->cc_waiting == 0);
+					new_cpu = cc->cc_migration_cpu;
+					new_ticks = cc->cc_migration_ticks;
+					new_func = cc->cc_migration_func;
+					new_arg = cc->cc_migration_arg;
+					cc->cc_migration_cpu = CPUBLOCK;
+					cc->cc_migration_ticks = 0;
+					cc->cc_migration_func = NULL;
+					cc->cc_migration_arg = NULL;
+					CC_UNLOCK(cc);
+					new_cc = CC_CPU(new_cpu);
+					CC_LOCK(new_cc);
+					MPASS(c->c_cpu == CPUBLOCK);
+					c->c_cpu = new_cpu;
+					callout_cc_add(c, new_cc, new_ticks,
+					    new_func, new_arg, new_cpu);
+					CC_UNLOCK(new_cc);
+					CC_LOCK(cc);
+				} else if (cc->cc_waiting) {
 					/*
 					 * There is someone waiting
 					 * for the callout to complete.
@@ -617,7 +694,6 @@ callout_reset_on(struct callout *c, int
 	 */
 	if (c->c_flags & CALLOUT_LOCAL_ALLOC)
 		cpu = c->c_cpu;
-retry:
 	cc = callout_lock(c);
 	if (cc->cc_curr == c) {
 		/*
@@ -649,31 +725,34 @@ retry:
 		cancelled = 1;
 		c->c_flags &= ~(CALLOUT_ACTIVE | CALLOUT_PENDING);
 	}
+#ifdef SMP
 	/*
-	 * If the lock must migrate we have to check the state again as
-	 * we can't hold both the new and old locks simultaneously.
+	 * If the lock must migrate we have to block the callout locking
+	 * until migration is completed.
+	 * If the callout is currently running, just defer the migration
+	 * to a more appropriate moment.
 	 */
 	if (c->c_cpu != cpu) {
-		c->c_cpu = cpu;
+		c->c_cpu = CPUBLOCK;
+		if (cc->cc_curr == c) {
+			cc->cc_migration_cpu = cpu;
+			cc->cc_migration_ticks = to_ticks;
+			cc->cc_migration_func = ftn;
+			cc->cc_migration_arg = arg;
+			CTR5(KTR_CALLOUT,
+			    "migration of %p func %p arg %p in %d to %u deferred",
+			    c, c->c_func, c->c_arg, to_ticks, cpu);
+			CC_UNLOCK(cc);
+			return (cancelled);
+		}
 		CC_UNLOCK(cc);
-		goto retry;
+		cc = CC_CPU(cpu);
+		CC_LOCK(cc);
+		c->c_cpu = cpu;
 	}
+#endif
 
-	if (to_ticks <= 0)
-		to_ticks = 1;
-
-	c->c_arg = arg;
-	c->c_flags |= (CALLOUT_ACTIVE | CALLOUT_PENDING);
-	c->c_func = ftn;
-	c->c_time = ticks + to_ticks;
-	TAILQ_INSERT_TAIL(&cc->cc_callwheel[c->c_time & callwheelmask],
-	    c, c_links.tqe);
-	if ((c->c_time - cc->cc_firsttick) < 0 &&
-	    callout_new_inserted != NULL) {
-		cc->cc_firsttick = c->c_time;
-		(*callout_new_inserted)(cpu,
-		    to_ticks + (ticks - cc->cc_ticks));
-	}
+	callout_cc_add(c, cc, to_ticks, ftn, arg, cpu);
 	CTR5(KTR_CALLOUT, "%sscheduled %p func %p arg %p in %d",
 	    cancelled ? "re" : "", c, c->c_func, c->c_arg, to_ticks);
 	CC_UNLOCK(cc);
 
@@ -701,7 +780,7 @@ _callout_stop_safe(c, safe)
 	struct callout *c;
 	int	safe;
 {
-	struct callout_cpu *cc;
+	struct callout_cpu *cc, *old_cc;
 	struct lock_class *class;
 	int use_lock, sq_locked;
 
@@ -721,8 +800,23 @@ _callout_stop_safe(c, safe)
 		use_lock = 0;
 
 	sq_locked = 0;
+	old_cc = NULL;
again:
 	cc = callout_lock(c);
+
+	/*
+	 * If the callout was migrating while the callout cpu lock was
+	 * dropped, just drop the sleepqueue lock and check the states
+	 * again.
+	 */
+	if (sq_locked != 0 && cc != old_cc) {
+		CC_UNLOCK(cc);
+		sleepq_release(&old_cc->cc_waiting);
+		sq_locked = 0;
+		old_cc = NULL;
+		goto again;
+	}
+
 	/*
 	 * If the callout isn't pending, it's not on the queue, so
 	 * don't attempt to remove it from the queue.  We can try to
@@ -774,6 +868,7 @@ again:
 				CC_UNLOCK(cc);
 				sleepq_lock(&cc->cc_waiting);
 				sq_locked = 1;
+				old_cc = cc;
 				goto again;
 			}
 			cc->cc_waiting = 1;
@@ -784,6 +879,7 @@ again:
 			    SLEEPQ_SLEEP, 0);
 			sleepq_wait(&cc->cc_waiting, 0);
 			sq_locked = 0;
+			old_cc = NULL;
 
 			/* Reacquire locks previously released. */
 			PICKUP_GIANT();