From owner-freebsd-current@FreeBSD.ORG Mon May 16 20:30:48 2011
From: Max Laier <max@love2party.net>
To: John Baldwin
Date: Mon, 16 May 2011 16:30:44 -0400
User-Agent: KMail/1.13.6 (FreeBSD/9.0-CURRENT; KDE/4.6.1; amd64; ; )
References: <4DCD357D.6000109@FreeBSD.org> <201105161346.34134.max@love2party.net> <201105161421.27665.jhb@freebsd.org>
In-Reply-To: <201105161421.27665.jhb@freebsd.org>
Message-Id: <201105161630.44577.max@love2party.net>
Cc: FreeBSD current, Peter Grehan, Andriy Gapon, neel@freebsd.org
Subject: Re: proposed smp_rendezvous change
List-Id: Discussions about the use of FreeBSD-current

On Monday 16 May 2011 14:21:27 John Baldwin wrote:
> On Monday, May 16, 2011 1:46:33 pm Max Laier wrote:
> > I'd like to propose that we move forward with fixing the race, before
> > redesigning the API - please.
> > 
> > We agree that there is a problem with the exit rendezvous.  Let's fix
> > that.  We agree that the generation approach proposed by NetApp is the
> > right way to go?  Can we check it in, then?  Again, I'd like some
> > comments in the code if at all possible.
> 
> How about this:
> 
> Index: subr_smp.c
> ===================================================================
> --- subr_smp.c	(revision 221939)
> +++ subr_smp.c	(working copy)
> @@ -111,6 +111,7 @@ static void (*volatile smp_rv_action_func)(void *a
>  static void (*volatile smp_rv_teardown_func)(void *arg);
>  static void *volatile smp_rv_func_arg;
>  static volatile int smp_rv_waiters[3];
> +static volatile int smp_rv_generation;
>  
>  /*
>   * Shared mutex to restrict busywaits between smp_rendezvous() and
> @@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
>  void
>  smp_rendezvous_action(void)
>  {
> -	void* local_func_arg = smp_rv_func_arg;
> -	void (*local_setup_func)(void*) = smp_rv_setup_func;
> -	void (*local_action_func)(void*) = smp_rv_action_func;
> -	void (*local_teardown_func)(void*) = smp_rv_teardown_func;
> +	void *local_func_arg;
> +	void (*local_setup_func)(void*);
> +	void (*local_action_func)(void*);
> +	void (*local_teardown_func)(void*);
> +	int generation;
>  
>  	/* Ensure we have up-to-date values. */
>  	atomic_add_acq_int(&smp_rv_waiters[0], 1);
>  	while (smp_rv_waiters[0] < smp_rv_ncpus)
>  		cpu_spinwait();
>  
> -	/* setup function */
> +	/* Fetch rendezvous parameters after acquire barrier. */
> +	local_func_arg = smp_rv_func_arg;
> +	local_setup_func = smp_rv_setup_func;
> +	local_action_func = smp_rv_action_func;
> +	local_teardown_func = smp_rv_teardown_func;
> +	generation = smp_rv_generation;
> +
> +	/*
> +	 * If requested, run a setup function before the main action
> +	 * function.  Ensure all CPUs have completed the setup
> +	 * function before moving on to the action function.
> +	 */
>  	if (local_setup_func != smp_no_rendevous_barrier) {
>  		if (smp_rv_setup_func != NULL)
>  			smp_rv_setup_func(smp_rv_func_arg);
> -
> -		/* spin on entry rendezvous */
>  		atomic_add_int(&smp_rv_waiters[1], 1);
>  		while (smp_rv_waiters[1] < smp_rv_ncpus)
>  			cpu_spinwait();
>  	}
>  
> -	/* action function */
>  	if (local_action_func != NULL)
>  		local_action_func(local_func_arg);
>  
> -	/* spin on exit rendezvous */
> +	/*
> +	 * Signal that the main action has been completed.  If a
> +	 * full exit rendezvous is requested, then all CPUs will
> +	 * wait here until all CPUs have finished the main action.
> +	 *
> +	 * Note that the write by the last CPU to finish the action
> +	 * may become visible to different CPUs at different times.
> +	 * As a result, the CPU that initiated the rendezvous may
> +	 * exit the rendezvous and drop the lock allowing another
> +	 * rendezvous to be initiated on the same CPU or a different
> +	 * CPU.  In that case the exit sentinel may be cleared before
> +	 * all CPUs have noticed causing those CPUs to hang forever.
> +	 * Workaround this by using a generation count to notice when
> +	 * this race occurs and to exit the rendezvous in that case.
> +	 */
> +	MPASS(generation == smp_rv_generation);
>  	atomic_add_int(&smp_rv_waiters[2], 1);
>  	if (local_teardown_func == smp_no_rendevous_barrier)
>  		return;
> -	while (smp_rv_waiters[2] < smp_rv_ncpus)
> +	while (smp_rv_waiters[2] < smp_rv_ncpus &&
> +	    generation == smp_rv_generation)
>  		cpu_spinwait();
>  
> -	/* teardown function */
>  	if (local_teardown_func != NULL)
>  		local_teardown_func(local_func_arg);
>  }
> @@ -374,10 +398,11 @@ smp_rendezvous_cpus(cpumask_t map,
>  	if (ncpus == 0)
>  		panic("ncpus is 0 with map=0x%x", map);
>  
> -	/* obtain rendezvous lock */
>  	mtx_lock_spin(&smp_ipi_mtx);
>  
> -	/* set static function pointers */
> +	atomic_add_acq_int(&smp_rv_generation, 1);
> +
> +	/* Pass rendezvous parameters via global variables. */
>  	smp_rv_ncpus = ncpus;
>  	smp_rv_setup_func = setup_func;
>  	smp_rv_action_func = action_func;
> @@ -387,18 +412,25 @@ smp_rendezvous_cpus(cpumask_t map,
>  	smp_rv_waiters[2] = 0;
>  	atomic_store_rel_int(&smp_rv_waiters[0], 0);
>  
> -	/* signal other processors, which will enter the IPI with interrupts off */
> +	/*
> +	 * Signal other processors, which will enter the IPI with
> +	 * interrupts off.
> +	 */
>  	ipi_selected(map & ~(1 << curcpu), IPI_RENDEZVOUS);
>  
>  	/* Check if the current CPU is in the map */
>  	if ((map & (1 << curcpu)) != 0)
>  		smp_rendezvous_action();
>  
> +	/*
> +	 * If the caller did not request an exit barrier to be enforced
> +	 * on each CPU, ensure that this CPU waits for all the other
> +	 * CPUs to finish the rendezvous.
> +	 */
>  	if (teardown_func == smp_no_rendevous_barrier)
>  		while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
>  			cpu_spinwait();
>  
> -	/* release lock */
>  	mtx_unlock_spin(&smp_ipi_mtx);
>  }
> 
> > A second commit should take care of the entry sync - which may or may not
> > be required on some architectures.  If we decide that we don't need it,
> > we should remove it.
> > Otherwise, we must move all the assignments from global smp_rv_* to the
> > stack in smp_rendezvous_action to after the sync.  Otherwise we should
> > just kill it.  In any case, it would be nice to clean this up.
> 
> The patch also fixes this in the cautious fashion (not removing the early
> rendezvous).

Good idea for now.  We can always remove the sync later, and the assignment
at declaration is a style(9) violation anyway.

> > After that, I have one more bugfix for rmlock(9) - one of the three
> > consumers of this API (that I am aware of).  As it uses a spinlock
> > inside its IPI action, we have to mark smp_rendezvous_action as a
> > critical section, otherwise the spin unlock might yield ... with the
> > expected consequences.
> 
> Yes, we need to fix that.  Hmm, it doesn't preempt when you do a
> critical_exit() though?  Or do you use a hand-rolled critical exit that
> doesn't do a deferred preemption?

Right now I just did a manual td_critnest++/--, but I guess ...

> Actually, I'm curious how the spin unlock inside the IPI could yield the
> CPU.  Oh, is rmlock doing a wakeup inside the IPI handler?  I guess that
> is ok as long as the critical_exit() just defers the preemption to the
> end of the IPI handler.

... the earliest point where it is safe to preempt is after doing the
atomic_add_int(&smp_rv_waiters[2], 1); so that we can start other IPIs
again.  However, since we don't accept new IPIs until we signal EOI in the
MD code (on amd64), this might still not be a good place to do the yield?!?

The spin unlock boils down to a critical_exit(), and unless we did a
critical_enter() at some point during the rendezvous setup, we will yield()
if we owepreempt.  I'm not quite sure how that can happen, but it seems
like there is a path that allows the scheduler to set it from a foreign CPU.

> > It is arguable if you should be allowed to use spinlocks in the
> > rendezvous at all, but that's beside the point.
> 
> I'm less convinced this is bad since we don't use NMIs for these, so we
> know the interrupted thread doesn't have any spin locks in the scheduler,
> etc.

Agreed.  I'll send a patch as soon as this one's in, but it really is a
simple matter of adding critical_enter/exit (or the manual version thereof)
to the right places - see the rough sketch below.

Thanks,
  Max
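P.S.: Purely for illustration - an untested sketch (not a final patch) of
where I would put the critical section, written against the
smp_rendezvous_action() from the patch above.  The real change may well
place the critical_exit() elsewhere because of the EOI question:

--- subr_smp.c	(with the patch above applied)
+++ subr_smp.c	(sketch only, untested)
@@ smp_rendezvous_action @@
 	void (*local_teardown_func)(void*);
 	int generation;
 
+	/*
+	 * Enter a critical section so that a spin unlock performed by
+	 * the action function (e.g. rmlock's IPI action) cannot yield
+	 * this CPU while other CPUs are still spinning on the
+	 * rendezvous sentinels.
+	 */
+	critical_enter();
+
 	/* Ensure we have up-to-date values. */
 	atomic_add_acq_int(&smp_rv_waiters[0], 1);
@@ smp_rendezvous_action @@
 	if (local_teardown_func != NULL)
 		local_teardown_func(local_func_arg);
+
+	/*
+	 * The exit sentinel has already been bumped, so no other CPU is
+	 * waiting on this one; any preemption deferred by the critical
+	 * section happens here, at the end of the rendezvous action.
+	 */
+	critical_exit();
 }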