From owner-cvs-src@FreeBSD.ORG  Fri Jun 15 20:46:45 2007
Return-Path: <owner-cvs-src@FreeBSD.ORG>
X-Original-To: cvs-src@FreeBSD.org
Delivered-To: cvs-src@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 9F9ED16A469;
	Fri, 15 Jun 2007 20:46:45 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au
	[211.29.132.184])
	by mx1.freebsd.org (Postfix) with ESMTP id 16A7413C4AE;
	Fri, 15 Jun 2007 20:46:44 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c220-239-235-248.carlnfd3.nsw.optusnet.com.au
	[220.239.235.248])
	by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	l5FKkUHX023606
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sat, 16 Jun 2007 06:46:32 +1000
Date: Sat, 16 Jun 2007 06:46:30 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: John Baldwin <jhb@FreeBSD.org>
In-Reply-To: <200706111154.31357.jhb@freebsd.org>
Message-ID: <20070616054050.Q2037@besplex.bde.org>
References: <200706051420.l55EKEih018925@repoman.freebsd.org>
	<200706081210.24393.jhb@freebsd.org>
	<20070609112753.C15075@besplex.bde.org>
	<200706111154.31357.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: src-committers@FreeBSD.org, Kip Macy <kip.macy@gmail.com>,
	cvs-all@FreeBSD.org, Attilio Rao <attilio@FreeBSD.org>,
	Bruce Evans <brde@optusnet.com.au>, cvs-src@FreeBSD.org,
	Kostik Belousov <kostikbel@gmail.com>,
	Jeff Roberson <jroberson@chesapeake.net>
Subject: Re: cvs commit: src/sys/kern kern_mutex.c
X-BeenThere: cvs-src@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: CVS commit messages for the src tree <cvs-src.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/cvs-src>,
	<mailto:cvs-src-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/cvs-src>
List-Post: <mailto:cvs-src@freebsd.org>
List-Help: <mailto:cvs-src-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/cvs-src>,
	<mailto:cvs-src-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Jun 2007 20:46:45 -0000

On Mon, 11 Jun 2007, John Baldwin wrote:

> On Friday 08 June 2007 10:20:19 pm Bruce Evans wrote:
>> On Fri, 8 Jun 2007, John Baldwin wrote:
>>
>>> On Friday 08 June 2007 10:50:15 am Bruce Evans wrote:
>>>> On Thu, 7 Jun 2007, John Baldwin wrote:

["this" is in vm_page_zero_idle() where the vm mutex is dropped]

>> I think this is as good a place to preempt as any.  In fact, the code
>> can be simplified by preempting here and not in the main loop.  The
>> ...
>
> Preemption happens when a thread is made runnable, i.e. usually when an
> interrupt comes in, but also when releasing locks or doing
> wakeup/cv_broadcast, etc.  The only thing the idle thread does other than
> interrupts is release the lock.

I'm not sure what you mean in the last sentence.  This is a special idle
thread, and it doesn't do interrupts.

> One side effect of using a critical section
> here is that we are potentially adding interrupt latency, so this may be a
> rather bad thing. :-/  Probably you are seeing the number of context switches
> go down because we are "batching" up switches.  Say you get two interrupts
> during an iteration of the main loop for this process, previously you'd get 4
> context switches out of that: zeroidle -> first ithread -> zeroidle, and then
> later zeroidle -> second ithread -> zeroidle.  Now you block those switches
> until the critical exit, at which point both interrupt threads are pending,
> so you do: zeroidle -> first ithread -> second ithread -> zeroidle, i.e. 3
> context switches.  However, the cost is increased interrupt latency and that
> may explain why the "real" time went up.  Lots of context switches is fairly
> lame, but if the box is idle, then I'm less worried about wasting time in a
> context switch (well, I'm concerned because it means we can't put the CPU to
> sleep since we are executing something still, but extra overhead in an idle
> thread doing a background task is not near as worrisome as overhead
> in "important" work like ithreads).

I think I want to batch up switches a little in general, and and only
a little batching occurs here.  pmap_zero_page_idle() takes about 1uS
on my main test system (Turion X2 with relatively slow DDR2 memory
which can neverthless be zeroed at 4 or 5GB/S).  An extra 1uS of
interrupt latency here and there won't make much difference.  It's
less than the extra latency for 1 ATPIC access if not using the APIC.
Also, for the makeworld benchmark, the interrupt handler runs for about
2% of the time, and pagezero runs for about 1% of the time.  These
threads just don't run long enough to have much contention.

I think the main cause of extra context switches that I saw in some
misconfigurations is switching while holding the vm mutex.  The thread
switched to is quite likely to block on this mutex and switch back.

> As to why preemption doesn't work for SMP, a thread only knows to preempt if
> it makes a higher priority thread runnable.  This happens in mtx_unlock when
> we wakeup a thread waiting on the lock, in wakeup, or when an interrupt
> thread is scheduled (the interrupted thread "sees" the ithread being
> scheduled).  If another thread on another CPU makes a thread runnable, the
> thread on the first CPU has no idea unless the second CPU explicitly sends a
> message (i.e. IPI) to the first CPU asking it to yield instead.

I believe SCHED_ULE does the IPI.

> Specifically, suppose you have threads A, B, C and with priorities A < B < C.
> Suppose A is running on CPU 0, and C is running on CPU 1.  If C does a wakeup
> that awakens B, C isn't going to preempt to B because C is more important
> (assume > means more important, even though priority values are opposite
> that, which is annoying).  If A had awakened B it would have preempted
> though, so in theory C should look at the other CPUs, notice that A < B, and
> send an IPI to CPU 0 to ask it to preempt from A to B.  One thing is that
> IPI_PREEMPT should set td_owepreempt if the target A is in a critical
> section, I haven't checked to see if we do that currently.

Would it be worth checking a preemption flag in mtx_unlock()?  This
would bloat the macro a bit.  However, we already have the check and the
bloat for spinlocks, at least on i386's, by checking in critical_exit()
via spinlock_exit().

Using critical sections should have the side effect of getting the flag
checked in critical_exit().  This doesn't seem to work (for SCHED_4BSD),
so there seems to be a problem setting the flag.

>> Next I will try moving the PREEMPTION code to where the vm mutex is dropped
>> naturally.  I will try the following order:
>
> I like this idea a lot.

> Actually, I would keep the sched_runnable(), etc. as #ifndef PREEMPTION, the
> critical_exit() already does that check (modulo concerns above).  Also, I

Testing shows critical_exit() doesn't seem to be doing the preemption.  On
UP, depending on PREEMPTION makes little difference, but on 2-way SMP with
no voluntary yielding, pagezero is too active.  The critical sections don't
seem to be doing much.

> originally wanted to not hold the critical sectino while zeroing the page to
> not impeded interrupts during that operation.  I was actually trying to just
> avoid preempting while holding the lock.  However, given my comments about
> how this harms interrupt latency, maybe this is a bad idea and we should just
> let priority propagation handle that for us.  Moving the context switch is
> probably a good idea though.

The 1 uS extra latency on my main test machine wouldn't matter, but go back
to a 486 with 10MB/S memory and the latency would be a problem -- the 1uS
becomes 400uS, which is a lot even for a 486.

> the reason I wanted to avoid preempting while holding the lock is you can get
> this case:
>
> zeroidle -> some ithread -> some top-half thread in kernel which needs the
> vm page queue mtx -> zeroidle (with top-half thread's priority; until
> mtx_unlock) -> top-half thread in kernel -> zeroidle
>
> which can be many context switches.  By not switching while holding the lock,
> one can reduce this to:
>
> zeroidle -> some ithread -> some top-half thread -> zeroidle
>
> but at the cost of adding latency to "some ithread" and "some top-half thread"

Maybe preemption should be inhibited a bit when any mutex is held.

Here is my current version.  I got tired of using a dynamic enable for
the PREEMPTION ifdef code and removed all the conditionals after the
most recent test showed that the voluntary switch is still needed.

% Index: vm_zeroidle.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/vm/vm_zeroidle.c,v
% retrieving revision 1.47
% diff -u -2 -r1.47 vm_zeroidle.c
% --- vm_zeroidle.c	5 Jun 2007 00:00:57 -0000	1.47
% +++ vm_zeroidle.c	15 Jun 2007 19:30:13 -0000
% @@ -111,5 +111,13 @@
%  		mtx_unlock(&vm_page_queue_free_mtx);
%  		pmap_zero_page_idle(m);
% +		if (sched_runnable()) {
% +			thread_lock(curthread);
% +			critical_exit();
% +			mi_switch(SW_VOL, NULL);
% +			thread_unlock(curthread);
% +		} else
% +			critical_exit();
%  		mtx_lock(&vm_page_queue_free_mtx);
% +		critical_enter();
%  		m->flags |= PG_ZERO;
%  		vm_pageq_enqueue(PQ_FREE + m->pc, m);
% @@ -141,18 +149,14 @@
% 
%  	mtx_lock(&vm_page_queue_free_mtx);
% +	critical_enter();
%  	for (;;) {
%  		if (vm_page_zero_check()) {
%  			vm_page_zero_idle();
% -#ifndef PREEMPTION
% -			if (sched_runnable()) {
% -				thread_lock(curthread);
% -				mi_switch(SW_VOL, NULL);
% -				thread_unlock(curthread);
% -			}
% -#endif
%  		} else {
%  			wakeup_needed = TRUE;
% +			critical_exit();
%  			msleep(&zero_state, &vm_page_queue_free_mtx, 0,
%  			    "pgzero", hz * 300);
% +			critical_enter();
%  		}
%  	}

Bruce