From owner-cvs-src@FreeBSD.ORG Fri Jun 15 21:10:14 2007
From: John Baldwin <jhb@freebsd.org>
To: Bruce Evans
Date: Fri, 15 Jun 2007 17:09:58 -0400
Cc: src-committers@freebsd.org, Kip Macy, cvs-all@freebsd.org, Attilio Rao,
    cvs-src@freebsd.org, Kostik Belousov, Jeff Roberson
Subject: Re: cvs commit: src/sys/kern kern_mutex.c
Message-Id: <200706151709.59898.jhb@freebsd.org>
In-Reply-To: <20070616054050.Q2037@besplex.bde.org>
References: <200706051420.l55EKEih018925@repoman.freebsd.org>
    <200706111154.31357.jhb@freebsd.org> <20070616054050.Q2037@besplex.bde.org>

On Friday 15 June 2007 04:46:30 pm Bruce Evans wrote:
> On Mon, 11 Jun 2007, John Baldwin wrote:
>
> > On Friday 08 June 2007 10:20:19 pm Bruce Evans wrote:
> >> On Fri, 8 Jun 2007, John Baldwin wrote:
> >>
> >>> On Friday 08 June 2007 10:50:15 am Bruce Evans wrote:
> >>>> On Thu, 7 Jun 2007, John Baldwin wrote:
>
> ["this" is in vm_page_zero_idle() where the vm mutex is dropped]
>
> >> I think this is as good a place to preempt as any.  In fact, the code
> >> can be simplified by preempting here and not in the main loop.  The
> >> ...
> >
> > Preemption happens when a thread is made runnable, i.e. usually when an
> > interrupt comes in, but also when releasing locks or doing
> > wakeup/cv_broadcast, etc.  The only thing the idle thread does other than
> > interrupts is release the lock.
>
> I'm not sure what you mean in the last sentence.  This is a special idle
> thread, and it doesn't do interrupts.

Those are the only times it will preempt; that is what I was trying to say.

> I think I want to batch up switches a little in general, and only a
> little batching occurs here.  pmap_zero_page_idle() takes about 1uS on
> my main test system (Turion X2 with relatively slow DDR2 memory which
> can nevertheless be zeroed at 4 or 5GB/S).  An extra 1uS of interrupt
> latency here and there won't make much difference.  It's less than the
> extra latency for 1 ATPIC access if not using the APIC.
>
> Also, for the makeworld benchmark, the interrupt handler runs for about
> 2% of the time, and pagezero runs for about 1% of the time.  These
> threads just don't run long enough to have much contention.

In that case the critical sections should "batch" things for the PREEMPTION
case.
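
To spell out what I mean by "batch": a preemption that arrives while
td_critnest > 0 should only set td_owepreempt, and the switch should happen
once in critical_exit().  Roughly like the following standalone sketch (a
toy model, not the real kern_switch.c code; mi_switch_stub() and
maybe_preempt() here are made-up stand-ins):

/*
 * Toy model of critical-section batching -- illustration only.  The
 * field names mirror the kernel's, but the structures and stubs are
 * invented for this sketch.
 */
#include <stdio.h>

struct thread {
	int td_critnest;	/* critical section nesting level */
	int td_owepreempt;	/* a preemption was deferred */
};

static struct thread thread0;
static struct thread *curthread = &thread0;

/* Stand-in for mi_switch(SW_INVOL, NULL). */
static void
mi_switch_stub(void)
{
	printf("context switch\n");
}

static void
critical_enter(void)
{
	curthread->td_critnest++;
}

static void
critical_exit(void)
{
	struct thread *td = curthread;

	td->td_critnest--;
	/* Leaving the outermost section: pay for any deferred preemption. */
	if (td->td_critnest == 0 && td->td_owepreempt) {
		td->td_owepreempt = 0;
		mi_switch_stub();
	}
}

/* Called when a higher-priority thread has been made runnable here. */
static void
maybe_preempt(void)
{
	struct thread *td = curthread;

	if (td->td_critnest > 0)
		td->td_owepreempt = 1;	/* batch it until critical_exit() */
	else
		mi_switch_stub();	/* preempt immediately */
}

int
main(void)
{
	critical_enter();
	maybe_preempt();	/* deferred: only sets td_owepreempt */
	maybe_preempt();	/* a second wakeup batches into the same switch */
	critical_exit();	/* the single deferred switch happens here */
	return (0);
}

Two wakeups inside one section collapse into a single switch at exit; that
is the batching the critical sections in the patch should get for free in
the PREEMPTION case.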

> > As to why preemption doesn't work for SMP, a thread only knows to preempt
> > if it makes a higher priority thread runnable.  This happens in
> > mtx_unlock when we wake up a thread waiting on the lock, in wakeup, or
> > when an interrupt thread is scheduled (the interrupted thread "sees" the
> > ithread being scheduled).  If another thread on another CPU makes a
> > thread runnable, the thread on the first CPU has no idea unless the
> > second CPU explicitly sends a message (i.e. IPI) to the first CPU asking
> > it to yield instead.
>
> I believe SCHED_ULE does the IPI.

If you add 'options IPI_PREEMPTION', I think the IPI is enabled in 4BSD.

> > Specifically, suppose you have threads A, B, and C with priorities
> > A < B < C.  Suppose A is running on CPU 0, and C is running on CPU 1.
> > If C does a wakeup that awakens B, C isn't going to preempt to B because
> > C is more important (assume > means more important, even though priority
> > values are opposite that, which is annoying).  If A had awakened B it
> > would have preempted though, so in theory C should look at the other
> > CPUs, notice that A < B, and send an IPI to CPU 0 to ask it to preempt
> > from A to B.  One thing is that IPI_PREEMPT should set td_owepreempt if
> > the target A is in a critical section; I haven't checked to see if we do
> > that currently.
>
> Would it be worth checking a preemption flag in mtx_unlock()?  This
> would bloat the macro a bit.  However, we already have the check and the
> bloat for spinlocks, at least on i386's, by checking in critical_exit()
> via spinlock_exit().

All archs check the flag in spinlock_exit().  mtx_unlock() will preempt in
turnstile_unpend() if we wake up a higher priority thread, so there is no
need to check a flag there.  If we get an interrupt while holding the lock
we will preempt immediately (the patch changes this by deferring that
preemption to the critical_exit()).

> Using critical sections should have the side effect of getting the flag
> checked in critical_exit().  This doesn't seem to work (for SCHED_4BSD),
> so there seems to be a problem setting the flag.

The flag should work fine, but keep in mind the multiple-CPU case I outlined
above.  That doesn't set the flag unless you have the IPI in place.
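
In sketch form, the check C would need to make after waking B looks
something like this (a userland toy, not what sched_4bsd.c or sched_ule.c
actually does; cpu_pri[] and ipi_preempt_stub() are invented for the
example, and note the kernel's numeric priorities are inverted, so a lower
value is more important):

/*
 * Sketch of the cross-CPU preemption decision -- illustration only.
 */
#include <stdio.h>

#define	NCPU	2

/* Priority of the thread currently running on each CPU. */
static int cpu_pri[NCPU] = { 200 /* A on CPU 0 */, 100 /* C on CPU 1 */ };

/* Stand-in for sending an IPI_PREEMPT to a CPU. */
static void
ipi_preempt_stub(int cpu)
{
	printf("IPI_PREEMPT -> CPU %d\n", cpu);
}

/*
 * Called on 'self' after it made a thread with priority 'newpri'
 * runnable but declined to preempt itself: find a CPU running
 * something less important than the new thread and kick it.
 */
static void
kick_other_cpu(int newpri, int self)
{
	int cpu;

	for (cpu = 0; cpu < NCPU; cpu++) {
		if (cpu == self)
			continue;
		if (cpu_pri[cpu] > newpri) {	/* running something worse */
			ipi_preempt_stub(cpu);
			return;
		}
	}
}

int
main(void)
{
	/*
	 * C (pri 100) on CPU 1 wakes B (pri 150): C keeps running, but
	 * CPU 0 is running A (pri 200), so CPU 0 should preempt to B.
	 */
	kick_other_cpu(150, 1);
	return (0);
}

On the receiving side the handler would either switch right away or just
set td_owepreempt if the target is inside a critical section, per the above.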

> >> Next I will try moving the PREEMPTION code to where the vm mutex is
> >> dropped naturally.  I will try the following order:
> >
> > I like this idea a lot.
>
> > Actually, I would keep the sched_runnable(), etc. as #ifndef PREEMPTION,
> > the critical_exit() already does that check (modulo concerns above).
> > Also, I
>
> Testing shows critical_exit() doesn't seem to be doing the preemption.  On
> UP, depending on PREEMPTION makes little difference, but on 2-way SMP with
> no voluntary yielding, pagezero is too active.  The critical sections don't
> seem to be doing much.

You probably need more details (KTR is good) to see exactly when threads are
becoming runnable (and on which CPUs) and when the kernel is preempting, to
see what is going on and where the context switches come from.  KTR_SCHED +
schedgraph.py may prove useful.

> > originally wanted to not hold the critical section while zeroing the page
> > to not impede interrupts during that operation.  I was actually trying to
> > just avoid preempting while holding the lock.  However, given my comments
> > about how this harms interrupt latency, maybe this is a bad idea and we
> > should just let priority propagation handle that for us.  Moving the
> > context switch is probably a good idea though.
>
> The 1 uS extra latency on my main test machine wouldn't matter, but go back
> to a 486 with 10MB/S memory and the latency would be a problem -- the 1uS
> becomes 400uS, which is a lot even for a 486.
>
> > the reason I wanted to avoid preempting while holding the lock is you can
> > get this case:
> >
> > zeroidle -> some ithread -> some top-half thread in kernel which needs the
> > vm page queue mtx -> zeroidle (with top-half thread's priority; until
> > mtx_unlock) -> top-half thread in kernel -> zeroidle
> >
> > which can be many context switches.  By not switching while holding the
> > lock, one can reduce this to:
> >
> > zeroidle -> some ithread -> some top-half thread -> zeroidle
> >
> > but at the cost of adding latency to "some ithread" and "some top-half
> > thread"
>
> Maybe preemption should be inhibited a bit when any mutex is held.

That would make mutexes spinlocks that block interrupts.  It would sort of
defeat the point of having mutexes that aren't spinlocks.

> Here is my current version.  I got tired of using a dynamic enable for
> the PREEMPTION ifdef code and removed all the conditionals after the
> most recent test showed that the voluntary switch is still needed.
>
> % Index: vm_zeroidle.c
> % ===================================================================
> % RCS file: /home/ncvs/src/sys/vm/vm_zeroidle.c,v
> % retrieving revision 1.47
> % diff -u -2 -r1.47 vm_zeroidle.c
> % --- vm_zeroidle.c	5 Jun 2007 00:00:57 -0000	1.47
> % +++ vm_zeroidle.c	15 Jun 2007 19:30:13 -0000
> % @@ -111,5 +111,13 @@
> %  	mtx_unlock(&vm_page_queue_free_mtx);
> %  	pmap_zero_page_idle(m);
> % +	if (sched_runnable()) {
> % +		thread_lock(curthread);
> % +		critical_exit();
> % +		mi_switch(SW_VOL, NULL);
> % +		thread_unlock(curthread);
> % +	} else
> % +		critical_exit();
> %  	mtx_lock(&vm_page_queue_free_mtx);
> % +	critical_enter();
> %  	m->flags |= PG_ZERO;
> %  	vm_pageq_enqueue(PQ_FREE + m->pc, m);
> % @@ -141,18 +149,14 @@
> %
> %  	mtx_lock(&vm_page_queue_free_mtx);
> % +	critical_enter();
> %  	for (;;) {
> %  		if (vm_page_zero_check()) {
> %  			vm_page_zero_idle();
> % -#ifndef PREEMPTION
> % -			if (sched_runnable()) {
> % -				thread_lock(curthread);
> % -				mi_switch(SW_VOL, NULL);
> % -				thread_unlock(curthread);
> % -			}
> % -#endif
> %  		} else {
> %  			wakeup_needed = TRUE;
> % +			critical_exit();
> %  			msleep(&zero_state, &vm_page_queue_free_mtx, 0,
> %  			    "pgzero", hz * 300);
> % +			critical_enter();
> %  		}
> %  	}
>
> Bruce

--
John Baldwin