From owner-cvs-src@FreeBSD.ORG Mon Jun 11 18:22:33 2007
From: John Baldwin
To: Bruce Evans
Cc: src-committers@freebsd.org, Kip Macy, cvs-all@freebsd.org,
    Attilio Rao, cvs-src@freebsd.org, Kostik Belousov, Jeff Roberson
Date: Mon, 11 Jun 2007 11:54:30 -0400
Subject: Re: cvs commit: src/sys/kern kern_mutex.c
Message-Id: <200706111154.31357.jhb@freebsd.org>
In-Reply-To: <20070609112753.C15075@besplex.bde.org>
References: <200706051420.l55EKEih018925@repoman.freebsd.org>
    <200706081210.24393.jhb@freebsd.org> <20070609112753.C15075@besplex.bde.org>
List-Id: CVS commit messages for the src tree

On Friday 08 June 2007 10:20:19 pm Bruce Evans wrote:
> On Fri, 8 Jun 2007, John Baldwin wrote:
>
> > On Friday 08 June 2007 10:50:15 am Bruce Evans wrote:
> >> On Thu, 7 Jun 2007, John Baldwin wrote:
>
> >>> Index: vm_zeroidle.c
> >>> ===================================================================
> >>> RCS file: /usr/cvs/src/sys/vm/vm_zeroidle.c,v
> >>> retrieving revision 1.45
> >>> diff -u -r1.45 vm_zeroidle.c
> >>> --- vm_zeroidle.c	18 May 2007 07:10:50 -0000	1.45
> >>> +++ vm_zeroidle.c	7 Jun 2007 14:56:02 -0000
> >>> @@ -147,8 +147,10 @@
> >>>  #ifndef PREEMPTION
> >>>  	if (sched_runnable()) {
> >>>  		mtx_lock_spin(&sched_lock);
> >>> +		mtx_unlock(&vm_page_queue_free_mtx);
> >>>  		mi_switch(SW_VOL, NULL);
> >>>  		mtx_unlock_spin(&sched_lock);
> >>> +		mtx_lock(&vm_page_queue_free_mtx);
> >>>  	}
> >>>  #endif
> >>>  	} else {
> >>
> >> The sched_locks are now of course thread_locks.  I put the vm unlock
> >> before the thread lock since the above order seems to risk a LOR.  That
> >> may have been a mistake -- we would prefer not to be switched after
> >> deciding to do it ourself.
> >
> > Actually, the order is on purpose so that we don't waste time doing a
> > preemption when we're about to switch anyway.
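To expand on what I meant there: once the spin lock is held we are
effectively in a critical section, so the mtx_unlock() cannot trigger a
preemption of its own; the only switch is the mi_switch() we were about to
do anyway.  A rough sketch of the two orderings (not the committed code,
just annotated to show where the extra switch can sneak in):

	/* Ordering in the patch above: take the spin lock first. */
	mtx_lock_spin(&sched_lock);		/* implies a critical section */
	mtx_unlock(&vm_page_queue_free_mtx);	/* cannot preempt us here */
	mi_switch(SW_VOL, NULL);		/* the one switch we wanted */
	mtx_unlock_spin(&sched_lock);
	mtx_lock(&vm_page_queue_free_mtx);

	/* Reversed ordering: the unlock can wake a waiter and preempt. */
	mtx_unlock(&vm_page_queue_free_mtx);	/* may switch right here... */
	mtx_lock_spin(&sched_lock);
	mi_switch(SW_VOL, NULL);		/* ...and then we switch again */
	mtx_unlock_spin(&sched_lock);
	mtx_lock(&vm_page_queue_free_mtx);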
>
> OK, changed back (not tested with it changed back).
>
> What about after returning from mi_switch()?  The spinlock must be dropped
> before acquiring the sleep lock, but that wastes some time if preemption
> occurs.  Here we expect to have more work to do and it seems best to
> ensure doing at least one page of work per context switch if possible.

Well, when we drop the spin lock we might enable a pending interrupt, in
which case we really should go handle that right away.  If I have to starve
something, I'd rather starve this process than ithreads.

> >>> Index: vm_zeroidle.c
> >>> ===================================================================
> >>> RCS file: /usr/cvs/src/sys/vm/vm_zeroidle.c,v
> >>> retrieving revision 1.45
> >>> diff -u -r1.45 vm_zeroidle.c
> >>> --- vm_zeroidle.c	18 May 2007 07:10:50 -0000	1.45
> >>> +++ vm_zeroidle.c	7 Jun 2007 14:58:39 -0000
> >>> @@ -110,8 +110,10 @@
> >>>  	if (m != NULL && (m->flags & PG_ZERO) == 0) {
> >>>  		vm_pageq_remove_nowakeup(m);
> >>>  		mtx_unlock(&vm_page_queue_free_mtx);
> >>> +		critical_exit();
> >>>  		pmap_zero_page_idle(m);
> >>>  		mtx_lock(&vm_page_queue_free_mtx);
> >>> +		critical_enter();
> >>>  		m->flags |= PG_ZERO;
> >>>  		vm_pageq_enqueue(PQ_FREE + m->pc, m);
> >>>  		++vm_page_zero_count;
> >>
> >> Next I will try this.  I put the critical_exit() before the vm unlock.
> >> mtx_unlock() should be allowed to switch if it wants.  However, we
> >> would prefer to keep context switches disabled in the above -- just drop
> >> the lock so that other CPUs can proceed.
> >
> > Again, the order here is on purpose to make sure we don't preempt while
> > holding the lock.
>
> critical_enter(9) says that critical regions should not be interlocked
> with other synchronization primitives like this.  It makes an exception
> for spin locks, but vm_page_queue_free_mtx is a sleep lock.

Mostly that is to prevent foot shooting.  It can be done if one is careful.
Critical sections can be thought of as basically deferring preemptions.

> I think this is as good a place to preempt as any.  In fact, the code
> can be simplified by preempting here and not in the main loop.  The
> vm mutex is already dropped here, so extra code isn't needed to drop
> in the main loop.  In the !PREEMPTION case, check sched_runnable()
> here.  In the PREEMPTION case, can you explain why involuntary preemption
> never worked right with SMP for SCHED_4BSD?  I thought that it was
> because SCHED_4BSD doesn't do enough IPIs, so it doesn't preempt idle
> threads running on other CPUs, but shouldn't preemption occur when the
> vm mutex is unlocked in the above, if sched_runnable() is true and
> there is something better to run than pagezero?

Preemption happens when a thread is made runnable, i.e. usually when an
interrupt comes in, but also when releasing locks or doing
wakeup/cv_broadcast, etc.  The only thing the idle thread does other than
handle interrupts is release the lock.

One side effect of using a critical section here is that we are potentially
adding interrupt latency, so this may be a rather bad thing. :-/  Probably
you are seeing the number of context switches go down because we are
"batching" up switches.  Say you get two interrupts during an iteration of
the main loop for this process; previously you'd get 4 context switches out
of that: zeroidle -> first ithread -> zeroidle, and then later zeroidle ->
second ithread -> zeroidle.  Now you block those switches until the critical
exit, at which point both interrupt threads are pending, so you do:
zeroidle -> first ithread -> second ithread -> zeroidle, i.e. 3 context
switches.  However, the cost is increased interrupt latency and that may
explain why the "real" time went up.
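To be concrete about what "deferring" and "batching" mean here,
critical_exit() ends up doing roughly the following; this is a from-memory
sketch of the td_owepreempt handling, not the literal kern_switch.c code:

	/*
	 * Sketch only: if someone wanted to preempt us while we were in a
	 * critical section, td_owepreempt was set instead of switching;
	 * the switch happens here once the nesting count reaches zero.
	 */
	if (td->td_critnest == 1) {
		td->td_critnest = 0;
		if (td->td_owepreempt) {
			td->td_critnest = 1;	/* block recursion */
			thread_lock(td);
			td->td_critnest--;
			mi_switch(SW_INVOL, NULL);	/* flags from memory */
			thread_unlock(td);
		}
	} else
		td->td_critnest--;

So both ithread wakeups just leave a preemption owed, and all of the
switching happens when the nesting count finally drops back to zero.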
Lots of context switches is fairly lame, but if the box is idle, then I'm
less worried about wasting time in a context switch (well, I'm concerned
because it means we can't put the CPU to sleep since we are executing
something still, but extra overhead in an idle thread doing a background
task is not nearly as worrisome as overhead in "important" work like
ithreads).

As to why preemption doesn't work for SMP, a thread only knows to preempt
if it makes a higher priority thread runnable.  This happens in mtx_unlock
when we wakeup a thread waiting on the lock, in wakeup, or when an
interrupt thread is scheduled (the interrupted thread "sees" the ithread
being scheduled).  If a thread on another CPU makes a thread runnable, the
thread on the first CPU has no idea unless the second CPU explicitly sends
a message (i.e. IPI) to the first CPU asking it to yield.

Specifically, suppose you have threads A, B, and C with priorities
A < B < C.  Suppose A is running on CPU 0, and C is running on CPU 1.  If C
does a wakeup that awakens B, C isn't going to preempt to B because C is
more important (assume > means more important, even though priority values
are opposite that, which is annoying).  If A had awakened B it would have
preempted, though, so in theory C should look at the other CPUs, notice
that A < B, and send an IPI to CPU 0 to ask it to preempt from A to B.  One
thing is that IPI_PREEMPT should set td_owepreempt if the target A is in a
critical section; I haven't checked to see if we do that currently.

> >> I added the necessary critical_exit/enter() calls here.
> >
> > They aren't needed in the !PREEMPTION case since the context switch is
> > explicit.  The critical sections are only needed for the PREEMPTION case
> > really.
>
> They are needed to pass the KASSERT(td_critnest == 1 || ...) in
> mi_switch().

Ah, right.

> PREEMPTION ifdefs on the critical sections would be messy.

Agreed.

> Next I will try moving the PREEMPTION code to where the vm mutex is
> dropped naturally.  I will try the following order:

I like this idea a lot.

>
> 	// Stay in critical section.
> 	drop vm mutex		// though critical_enter(9) says this is wrong
> 	zero the page		// intentionally in critical section
> #ifdef PREEMPTION(?)		// even more needed now that the critical
> 				// section prevents a switch on the mutex drop
> 	if (sched_runnable()) {
> 		thread_lock(...)
> 		critical_exit()	// if using critical sections with !PRE*
> 		mi_switch(...)
> 		thread_unlock(...)
> 	} else
> #endif
> 		critical_exit()

Actually, I would keep the sched_runnable(), etc. as #ifndef PREEMPTION;
in the PREEMPTION case the critical_exit() already does that check (modulo
my concerns above).  Also, I originally wanted to not hold the critical
section while zeroing the page, so as not to impede interrupts during that
operation.  I was actually trying to just avoid preempting while holding
the lock.  However, given my comments above about how this harms interrupt
latency, maybe this is a bad idea and we should just let priority
propagation handle that for us.  Moving the context switch is probably a
good idea, though.
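Concretely, I am thinking of something like the following (an untested
sketch using the current vm_zeroidle.c names, spelled with thread_lock()
per the sched_lock -> thread_lock change; whether the critical section
should also cover the zeroing is the open question above):

		vm_pageq_remove_nowakeup(m);
		mtx_unlock(&vm_page_queue_free_mtx);
		critical_exit();
		pmap_zero_page_idle(m);
#ifndef PREEMPTION
		/*
		 * With PREEMPTION, the critical_exit() above already
		 * switches if we owe a preemption; only the !PREEMPTION
		 * case needs an explicit check, and the vm mutex is
		 * already dropped here, so there is nothing extra to
		 * unlock before switching.
		 */
		if (sched_runnable()) {
			thread_lock(curthread);
			mi_switch(SW_VOL, NULL);
			thread_unlock(curthread);
		}
#endif
		mtx_lock(&vm_page_queue_free_mtx);
		critical_enter();
		m->flags |= PG_ZERO;
		vm_pageq_enqueue(PQ_FREE + m->pc, m);
		++vm_page_zero_count;

That keeps the switch at the point where the vm mutex is naturally dropped,
which is the simplification you suggested, just with the #ifdef sense
flipped.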
The reason I wanted to avoid preempting while holding the lock is that you
can get this case:

zeroidle -> some ithread -> some top-half thread in kernel which needs the
    vm page queue mtx -> zeroidle (with top-half thread's priority; until
    mtx_unlock) -> top-half thread in kernel -> zeroidle

which can be many context switches.  By not switching while holding the
lock, one can reduce this to:

zeroidle -> some ithread -> some top-half thread -> zeroidle

but at the cost of adding latency to "some ithread" and "some top-half
thread".

-- 
John Baldwin