From owner-cvs-src@FreeBSD.ORG Mon Jun 11 18:22:33 2007
From: John Baldwin
To: Bruce Evans
Cc: src-committers@freebsd.org, Kip Macy, cvs-all@freebsd.org,
    Attilio Rao, cvs-src@freebsd.org, Kostik Belousov, Jeff Roberson
Date: Mon, 11 Jun 2007 11:54:30 -0400
Subject: Re: cvs commit: src/sys/kern kern_mutex.c
Message-Id: <200706111154.31357.jhb@freebsd.org>
In-Reply-To: <20070609112753.C15075@besplex.bde.org>
References: <200706051420.l55EKEih018925@repoman.freebsd.org>
    <200706081210.24393.jhb@freebsd.org> <20070609112753.C15075@besplex.bde.org>
List-Id: CVS commit messages for the src tree

On Friday 08 June 2007 10:20:19 pm Bruce Evans wrote:
> On Fri, 8 Jun 2007, John Baldwin wrote:
>
> > On Friday 08 June 2007 10:50:15 am Bruce Evans wrote:
> >> On Thu, 7 Jun 2007, John Baldwin wrote:
>
> >>> Index: vm_zeroidle.c
> >>> ===================================================================
> >>> RCS file: /usr/cvs/src/sys/vm/vm_zeroidle.c,v
> >>> retrieving revision 1.45
> >>> diff -u -r1.45 vm_zeroidle.c
> >>> --- vm_zeroidle.c	18 May 2007 07:10:50 -0000	1.45
> >>> +++ vm_zeroidle.c	7 Jun 2007 14:56:02 -0000
> >>> @@ -147,8 +147,10 @@
> >>>  #ifndef PREEMPTION
> >>>  	if (sched_runnable()) {
> >>>  		mtx_lock_spin(&sched_lock);
> >>> +		mtx_unlock(&vm_page_queue_free_mtx);
> >>>  		mi_switch(SW_VOL, NULL);
> >>>  		mtx_unlock_spin(&sched_lock);
> >>> +		mtx_lock(&vm_page_queue_free_mtx);
> >>>  	}
> >>>  #endif
> >>>  	} else {
> >>
> >> The sched_locks are now of course thread_locks.  I put the vm unlock
> >> before the thread lock since the above order seems to risk a LOR.  That
> >> may have been a mistake -- we would prefer not to be switched after
> >> deciding to do it ourself.
> >
> > Actually, the order is on purpose so that we don't waste time doing a
> > preemption when we're about to switch anyway.
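To expand on what I meant there: once the spin lock is held we are
effectively in a critical section, so the mtx_unlock() cannot trigger a
preemption of its own; the only switch is the mi_switch() we were about to
do anyway.  A rough sketch of the two orderings (not the committed code,
just annotated to show where the extra switch can sneak in):

	/* Ordering in the patch above: take the spin lock first. */
	mtx_lock_spin(&sched_lock);		/* implies a critical section */
	mtx_unlock(&vm_page_queue_free_mtx);	/* cannot preempt us here */
	mi_switch(SW_VOL, NULL);		/* the one switch we wanted */
	mtx_unlock_spin(&sched_lock);
	mtx_lock(&vm_page_queue_free_mtx);

	/* Reversed ordering: the unlock can wake a waiter and preempt. */
	mtx_unlock(&vm_page_queue_free_mtx);	/* may switch right here... */
	mtx_lock_spin(&sched_lock);
	mi_switch(SW_VOL, NULL);		/* ...and then we switch again */
	mtx_unlock_spin(&sched_lock);
	mtx_lock(&vm_page_queue_free_mtx);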
>
> OK, changed back (not tested with it changed back).
>
> What about after returning from mi_switch()?  The spinlock must be dropped
> before acquiring the sleep lock, but that wastes some time if preemption
> occurs.  Here we expect to have more work to do and it seems best to
> ensure doing at least one page of work per context switch if possible.

Well, when we drop the spin lock we might enable a pending interrupt, in
which case we really should go handle that right away.  If I have to starve
something, I'd rather starve this process than ithreads.

> >>> Index: vm_zeroidle.c
> >>> ===================================================================
> >>> RCS file: /usr/cvs/src/sys/vm/vm_zeroidle.c,v
> >>> retrieving revision 1.45
> >>> diff -u -r1.45 vm_zeroidle.c
> >>> --- vm_zeroidle.c	18 May 2007 07:10:50 -0000	1.45
> >>> +++ vm_zeroidle.c	7 Jun 2007 14:58:39 -0000
> >>> @@ -110,8 +110,10 @@
> >>>  	if (m != NULL && (m->flags & PG_ZERO) == 0) {
> >>>  		vm_pageq_remove_nowakeup(m);
> >>>  		mtx_unlock(&vm_page_queue_free_mtx);
> >>> +		critical_exit();
> >>>  		pmap_zero_page_idle(m);
> >>>  		mtx_lock(&vm_page_queue_free_mtx);
> >>> +		critical_enter();
> >>>  		m->flags |= PG_ZERO;
> >>>  		vm_pageq_enqueue(PQ_FREE + m->pc, m);
> >>>  		++vm_page_zero_count;
> >>
> >> Next I will try this.  I put the critical_exit() before the vm unlock.
> >> mtx_unlock() should be allowed to switch if it wants.  However, we
> >> would prefer to keep context switches disabled in the above -- just drop
> >> the lock so that other CPUs can proceed.
> >
> > Again, the order here is on purpose to make sure we don't preempt while
> > holding the lock.
>
> critical_enter(9) says that critical regions should not be interlocked
> with other synchronization primitives like this.  It makes an exception
> for spin locks, but vm_page_queue_free_mtx is a sleep lock.

Mostly that is to prevent foot shooting.  It can be done if one is careful.
Critical sections can be thought of as basically deferring preemptions.

> I think this is as good a place to preempt as any.  In fact, the code
> can be simplified by preempting here and not in the main loop.  The
> vm mutex is already dropped here, so extra code isn't needed to drop
> in the main loop.  In the !PREEMPTION case, check sched_runnable()
> here.  In the PREEMPTION case, can you explain why involuntary preemption
> never worked right with SMP for SCHED_4BSD?  I thought that it was
> because SCHED_4BSD doesn't do enough IPIs, so it doesn't preempt idle
> threads running on other CPUs, but shouldn't preemption occur when the
> vm mutex is unlocked in the above, if sched_runnable() is true and
> there is something better to run than pagezero?

Preemption happens when a thread is made runnable, i.e. usually when an
interrupt comes in, but also when releasing locks or doing
wakeup/cv_broadcast, etc.  The only thing the idle thread does other than
handle interrupts is release the lock.

One side effect of using a critical section here is that we are potentially
adding interrupt latency, so this may be a rather bad thing. :-/  Probably
you are seeing the number of context switches go down because we are
"batching" up switches.  Say you get two interrupts during an iteration of
the main loop for this process; previously you'd get 4 context switches out
of that: zeroidle -> first ithread -> zeroidle, and then later zeroidle ->
second ithread -> zeroidle.  Now you block those switches until the critical
exit, at which point both interrupt threads are pending, so you do:
zeroidle -> first ithread -> second ithread -> zeroidle, i.e. 3 context
switches.  However, the cost is increased interrupt latency and that may
explain why the "real" time went up.
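To be concrete about what "deferring" and "batching" mean here,
critical_exit() ends up doing roughly the following; this is a from-memory
sketch of the td_owepreempt handling, not the literal kern_switch.c code:

	/*
	 * Sketch only: if someone wanted to preempt us while we were in a
	 * critical section, td_owepreempt was set instead of switching;
	 * the switch happens here once the nesting count reaches zero.
	 */
	if (td->td_critnest == 1) {
		td->td_critnest = 0;
		if (td->td_owepreempt) {
			td->td_critnest = 1;	/* block recursion */
			thread_lock(td);
			td->td_critnest--;
			mi_switch(SW_INVOL, NULL);	/* flags from memory */
			thread_unlock(td);
		}
	} else
		td->td_critnest--;

So both ithread wakeups just leave a preemption owed, and all of the
switching happens when the nesting count finally drops back to zero.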
Lots of context switches is fairly lame, but if the box is idle, then I'm
less worried about wasting time in a context switch (well, I'm concerned
because it means we can't put the CPU to sleep since we are executing
something still, but extra overhead in an idle thread doing a background
task is not nearly as worrisome as overhead in "important" work like
ithreads).

As to why preemption doesn't work for SMP, a thread only knows to preempt
if it makes a higher priority thread runnable.  This happens in mtx_unlock
when we wakeup a thread waiting on the lock, in wakeup, or when an
interrupt thread is scheduled (the interrupted thread "sees" the ithread
being scheduled).  If a thread on another CPU makes a thread runnable, the
thread on the first CPU has no idea unless the second CPU explicitly sends
a message (i.e. IPI) to the first CPU asking it to yield.

Specifically, suppose you have threads A, B, and C with priorities
A < B < C.  Suppose A is running on CPU 0, and C is running on CPU 1.  If C
does a wakeup that awakens B, C isn't going to preempt to B because C is
more important (assume > means more important, even though priority values
are opposite that, which is annoying).  If A had awakened B it would have
preempted, though, so in theory C should look at the other CPUs, notice
that A < B, and send an IPI to CPU 0 to ask it to preempt from A to B.  One
thing is that IPI_PREEMPT should set td_owepreempt if the target A is in a
critical section; I haven't checked to see if we do that currently.

> >> I added the necessary critical_exit/enter() calls here.
> >
> > They aren't needed in the !PREEMPTION case since the context switch is
> > explicit.  The critical sections are only needed for the PREEMPTION case
> > really.
>
> They are needed to pass the KASSERT(td_critnest == 1 || ...) in
> mi_switch().

Ah, right.

> PREEMPTION ifdefs on the critical sections would be messy.

Agreed.

> Next I will try moving the PREEMPTION code to where the vm mutex is
> dropped naturally.  I will try the following order:

I like this idea a lot.

>
> 	// Stay in critical section.
> 	drop vm mutex		// though critical_enter(9) says this is wrong
> 	zero the page		// intentionally in critical section
> #ifdef PREEMPTION(?)		// even more needed now that the critical
> 				// section prevents a switch on the mutex drop
> 	if (sched_runnable()) {
> 		thread_lock(...)
> 		critical_exit()	// if using critical sections with !PRE*
> 		mi_switch(...)
> 		thread_unlock(...)
> 	} else
> #endif
> 		critical_exit()

Actually, I would keep the sched_runnable(), etc. as #ifndef PREEMPTION;
in the PREEMPTION case the critical_exit() already does that check (modulo
my concerns above).  Also, I originally wanted to not hold the critical
section while zeroing the page, so as not to impede interrupts during that
operation.  I was actually trying to just avoid preempting while holding
the lock.  However, given my comments above about how this harms interrupt
latency, maybe this is a bad idea and we should just let priority
propagation handle that for us.  Moving the context switch is probably a
good idea, though.
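Concretely, I am thinking of something like the following (an untested
sketch using the current vm_zeroidle.c names, spelled with thread_lock()
per the sched_lock -> thread_lock change; whether the critical section
should also cover the zeroing is the open question above):

		vm_pageq_remove_nowakeup(m);
		mtx_unlock(&vm_page_queue_free_mtx);
		critical_exit();
		pmap_zero_page_idle(m);
#ifndef PREEMPTION
		/*
		 * With PREEMPTION, the critical_exit() above already
		 * switches if we owe a preemption; only the !PREEMPTION
		 * case needs an explicit check, and the vm mutex is
		 * already dropped here, so there is nothing extra to
		 * unlock before switching.
		 */
		if (sched_runnable()) {
			thread_lock(curthread);
			mi_switch(SW_VOL, NULL);
			thread_unlock(curthread);
		}
#endif
		mtx_lock(&vm_page_queue_free_mtx);
		critical_enter();
		m->flags |= PG_ZERO;
		vm_pageq_enqueue(PQ_FREE + m->pc, m);
		++vm_page_zero_count;

That keeps the switch at the point where the vm mutex is naturally dropped,
which is the simplification you suggested, just with the #ifdef sense
flipped.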
The reason I wanted to avoid preempting while holding the lock is that you
can get this case:

zeroidle -> some ithread -> some top-half thread in kernel which needs the
    vm page queue mtx -> zeroidle (with top-half thread's priority; until
    mtx_unlock) -> top-half thread in kernel -> zeroidle

which can be many context switches.  By not switching while holding the
lock, one can reduce this to:

zeroidle -> some ithread -> some top-half thread -> zeroidle

but at the cost of adding latency to "some ithread" and "some top-half
thread".

-- 
John Baldwin