Date: Sun, 30 Mar 2003 18:19:38 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Jeff Roberson
cc: David Malone, Nate Lawson, Mike Silbersack,
    Dag-Erling Smørgrav, src-committers@FreeBSD.org,
    cvs-src@FreeBSD.org, cvs-all@FreeBSD.org
Subject: Re: Checksum/copy
In-Reply-To: <20030329152001.I64602-100000@mail.chesapeake.net>
Message-ID: <20030330161319.O13783@gamplex.bde.org>

On Sat, 29 Mar 2003, Jeff Roberson wrote:

> On Sat, 29 Mar 2003, Bruce Evans wrote:
> > ...
> > - SCHED_ULE breaks scheduling of idleprio processes.  This results in
> >   pagezero being too active.  It costs 1-2% instead of saving 1-2%.
> >   [1-2% for makeworld].
>
> Thanks for the analysis.  I know my +nice values are screwy right now.
> It's actually a pretty interesting problem.  Perhaps you'll have some
> insight.

idleprio is a little different from nice.  I would have expected it to
work right fairly automatically.  At least with the standard scheduler,
idleprio gives td->td_priority values which are lower (numerically
higher) than the values for all non-idleprio processes, so an idleprio
process should never run while there is a runnable non-idleprio
process.

> The basic issue is that threads on run queues in ule must be given a
> slice.  And with a slice, ignoring interactive threads, they are run
> every n times we select a new thread, where n is the number of running
> threads.  That means two things:
>
> 1) +nice threads always get a chance to run.
> 2) Their %cpu is relative to the sum of all slices of all running
>    threads.
>
> #2 is sort of what you want, except that the slice value never reaches
> zero.  In sched_4bsd, if you have a nice priority that is 20 away from
> the lowest priority processes you never get a chance to run.  I'm not
> sure if this scales all the way across.

It scales poorly.  In RELENG_4 or a bit earlier, I changed the scaling
to make a nice difference of precisely 20 or more prevent the
higher-niced process from running.  This gives potential priority
inversion problems even for user nice values [0, 20].  RELENG_4 also
has bogus scaling of system nice values [-20, -1] into what should be
a kernel-only range.

> I know a nice 0 will always prevent a nice 20 from running, but will a
> nice -20 prevent a nice 0 from running?  I believe so.  With nice +19
> and a nice 0 the nice +19 gets approx 2% cpu.
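To make that RELENG_4 cutoff concrete, here is a rough sketch (the
function name is made up; this is not the actual RELENG_4 code):

#include <sys/resource.h>	/* PRIO_MIN, PRIO_MAX */

/*
 * Return nonzero if a thread at nice value "nice" should get no CPU
 * at all.  "least_nice" is the numerically smallest nice value among
 * the runnable timesharing threads.
 */
static int
nice_cutoff(int nice, int least_nice)
{

	/* A difference of 20 (== PRIO_MAX) or more means "never run". */
	return (nice - least_nice >= PRIO_MAX);
}

Under this rule, nice -20 vs. nice +0 is a difference of exactly 20, so
the nice +0 process is shut out entirely, while nice +0 vs. nice +19 is
a difference of 19, so the nice +19 process still gets a small share.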
This seems about right for RELENG_4.  In -current, the priority
inversion problems and the bogus scaling were "fixed" by changing the
scaling so that the difference between nice -20 and nice +20 is
approximately the same as the old difference between nice +0 and nice
+19, so the difference between nice +0 and nice +20 became too small
again (it shouldn't be infinite, since that gives potentially fatal
priority inversion, but it should be large).  Subsequent compression of
the PRI_TIMESHARE range reduced the difference further, so we're back
to approximately the old difference of only 2:1 or 3:1 for nice +0 vs.
nice +20.

> So, in ule, I need a way to approximate this.  The real problem is
> that the drop off point where a process gets 0 cpu time is artificial.
> The algorithm doesn't work linearly down to 0 as it does in
> sched_4bsd.  I need to make slice assignments relative to all other
> processes in the system.  This seems like it may break the O(1)
> properties of the scheduler.

It doesn't actually work linearly in sched_4bsd either.  sched_4bsd
clamps kg_estcpu so that the scaling of kg_estcpu to a priority can be
linear.  This gives huge nonlinearities, especially in the following
cases:

- fork and exit.  kg_estcpu is inherited on fork (bad) and added from
  the child back to the parent in exit (worse).  This wants to cause
  sorcerer's-apprentice behaviour, but the clamping makes it cause
  nonlinearities instead (kg_estcpu ramps up to the limit much faster
  than it should and then sticks there much longer than it should).

- conditions where the load average fluctuates a lot.  When the load
  average is large, priority decay is slow and processes can more
  easily hit the limit, especially when most of the processes that
  caused the large load average have just exited.

Clamping also discards history, so hog processes are given lower
(numerically higher) priorities than they should be, even after the
conditions that cause the nonlinearities have gone away.

> I'm just now thinking that I could assign the slice using the run
> queues to find out how this thread relates to others in the system.
> This all sounds rather complicated.  I'm hoping that I'm missing some
> simple elegant solution that someone may know of.
>
> Any takers?  Comments on nice or slice selection?

Peter Dufault's solution for sched_4bsd, as implemented by me, is to
remove the explicit clamp on kg_estcpu (it is bounded algorithmically
by priority decay, except for the fork+exit problem, which we just hack
around), and then to map kg_estcpu to td_priority more carefully.  In
my version of the latter, each tick in kg_estcpu costs more if the nice
value is numerically larger, according to a table:

static int niceweights[PRIO_MAX - PRIO_MIN + 1] = {
#if 1
	/*
	 * Geometric niceness.  The weight at index i is
	 * floor(2 * 3 * pow(2.0, i / 4.0) + 0.5).
	 */
	6, 7, 8, 10, 12, 14, 17, 20, 24, 29, 34, 40, 48, 57, 68, 81,
	96, 114, 136, 161, 192, 228, 272, 323, 384, 457, 543, 646,
	768, 913, 1086, 1292, 1536, 1827, 2172, 2583, 3072, 3653,
	4344, 5166, 6144,
#else
	/*
	 * Arithmetic niceness.  The weight at index i is
	 * 2 * 2 * 2 * 3 * 3 * 5 * 7 / (40 - i)
	 * (except the one at index 40 is an approximation for infinity).
	 */
	63, 64, 66, 68, 70, 72, 74, 76, 78, 81, 84, 86, 90, 93, 96,
	100, 105, 109, 114, 120, 126, 132, 140, 148, 157, 168, 180,
	193, 210, 229, 252, 280, 315, 360, 420, 504, 630, 840, 1260,
	2520, 20000,
#endif
};

Almost any behaviour can be programmed by changing the table.
(Infinite niceness can't be programmed easily, but that's a feature,
since it prevents priority inversion.)
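As a rough sketch of how such a table can be applied (the names here
are made up, not the ones in my patch), each statclock tick charged to
a thread is weighted by the table entry for its nice value before it
is added to the estcpu-like accumulator:

#include <sys/types.h>
#include <sys/resource.h>	/* PRIO_MIN, PRIO_MAX */

/* Hypothetical per-thread accumulator, analogous to kg_estcpu. */
struct sched_stats {
	u_int	weighted_estcpu;
};

/*
 * Charge one statclock tick to a thread with the given nice value,
 * weighted by the niceweights[] table above so that a tick costs
 * more for nicer (numerically larger nice) threads.
 */
static void
charge_tick(struct sched_stats *ss, int nice)
{

	ss->weighted_estcpu += niceweights[nice - PRIO_MIN];
}

With the geometric table, every 4 steps of niceness doubles the cost of
a tick, so a nice +20 hog accumulates weighted cpu 1024 times as fast
as a nice -20 one.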
To use this method in other schedulers, I think you only need to have
some idea of the time spent in each thread.  Then weight the times
according to the scheduling policies for the threads.  Then scale
weighted times to priorities.

Scaling is the tricky part for sched_4bsd and probably for most
schedulers.  kg_estcpu can get very large, and it has a large variance.
I just compute its maximum in schedcpu and use a value a little larger
than the maximum for scaling.  Sometimes the scale factor changes
significantly (e.g., when all of the processes with a large kg_estcpu
exit), and then most processes move to a different scheduling queue.
This makes the scheduler more O(n) than before but doesn't affect
performance on any of the loads that I have tested.  It seems to be
difficult to keep the scale factor almost constant without
reintroducing the nonlinearities or significantly increasing response
times.

Bruce
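P.S.  A rough sketch of that scaling step (the names and the slack
factor are illustrative, not the actual code): recompute the maximum in
the periodic scan, then map each thread's weighted value linearly into
the timeshare priority range.

#include <sys/types.h>
#include <sys/priority.h>	/* PRI_MIN_TIMESHARE, PRI_MAX_TIMESHARE */

static u_int estcpu_scale;	/* kept a little above the observed max */

/* Called from the periodic scan after visiting every thread. */
static void
update_scale(u_int observed_max)
{

	estcpu_scale = observed_max + observed_max / 8 + 1;
}

/* Map a weighted estcpu value linearly into the timeshare range. */
static u_char
estcpu_to_pri(u_int weighted_estcpu)
{
	u_int pri;

	pri = PRI_MIN_TIMESHARE + (PRI_MAX_TIMESHARE - PRI_MIN_TIMESHARE) *
	    weighted_estcpu / estcpu_scale;
	/* Values can outrun the scale between scans; clamp them. */
	if (pri > PRI_MAX_TIMESHARE)
		pri = PRI_MAX_TIMESHARE;
	return (pri);
}

When estcpu_scale drops a lot (the case described above), most threads
map to different priorities, which is where the extra O(n) work comes
from.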