Date: Wed, 23 Jun 2004 15:03:29 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Julian Elischer
cc: freebsd-current@FreeBSD.org
cc: FreeBSD current users
cc: John Baldwin
Subject: Re: ithread priority question...
Message-ID: <20040623121053.G56410@gamplex.bde.org>

On Tue, 22 Jun 2004, Julian Elischer wrote:

> On Tue, 22 Jun 2004, John Baldwin wrote:
>
> > On Monday 21 June 2004 03:48 am, Julian Elischer wrote:
> > > On Mon, 21 Jun 2004, Bruce Evans wrote:
> > > > On Sun, 20 Jun 2004, Julian Elischer wrote:
> > > > > In swi_add, the priority is multiplied by PPQ.
> > > > > This is really a layering violation, because PPQ should only be
> > > > > known within the scheduler.... but..... "Why multiply by PPQ in
> > > > > the first place?" We are not using the system run queues for
> > > > > interrupt threads.
> > > > >
> > > > > (PPQ = Priorities Per Queue).
> > > > >
> > > > > Without this you can remove runq.h from proc.h and include it
> > > > > only in the scheduler-related files.
> > > >
> > > > I agree that this makes no sense. Apart from the layering violation,
> > > > it seems to just waste priority space. The wastage is not just
> > > > cosmetic, since someone increased the number of SWIs although there
> > > > was no room for expansion.

Oops, this is mostly wrong. I forgot that the same type of run queues is
used for all threads when I wrote this.

> > > > Hardware ithread priorities are also separated by 4. The magic
> > > > number 4 is encoded in their definitions in priority.h. It's not
> > > > clear if the 4 is PPQ or just room for expansion without changing
> > > > the ABI. Preserving this ABI doesn't seem very important.
> > >
> > > Seems pointless to me..
> > > It looks to me that at one stage someone was considering using the
> > > standard run-queue code to make interrupt threads runnable.
> > > They wanted each interrupt thread to be on a different queue and to
> > > use the ffs() code to find the next one to run.
> >
> > That was the intention. One question though: if the ithreads aren't on
> > the system run queues, then which run queues are they on?

Good point.

> Aren't they run from the interrupt?

Nah, they are put on normal queues if they block, at least for SCHED_4BSD.
They just aren't scheduled normally. Run queues are mostly not a property
of the scheduler.

According to "grep PPQ *.c" in kern:

% kern_intr.c:	    (pri * RQ_PPQ) + PI_SOFT, flags, cookiep));

This is the place that you noticed.
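As a minimal sketch of what that multiply buys (illustration only: RQ_PPQ
= 4 matches the run-queue code, but the PI_SOFT value below is made up,
not the kernel's actual constant):

% #include <stdio.h>
%
% #define RQ_PPQ	4	/* priorities per run queue */
% #define PI_SOFT	44	/* hypothetical SWI base priority */
%
% int
% main(void)
% {
% 	int pri;
%
% 	for (pri = 0; pri < 4; pri++) {
% 		/* The computation from kern_intr.c quoted above. */
% 		int td_priority = (pri * RQ_PPQ) + PI_SOFT;
%
% 		printf("swi pri %d -> td_priority %3d -> runq slot %d\n",
% 		    pri, td_priority, td_priority / RQ_PPQ);
% 	}
% 	return (0);
% }

Each SWI level lands in its own run-queue slot; without the factor of
RQ_PPQ, four adjacent levels would collapse into one slot and merely
round-robin against each other.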
% kern_switch.c:	if (ke->ke_rqindex != (newpri / RQ_PPQ)) {
% kern_switch.c:	pri = ke->ke_thread->td_priority / RQ_PPQ;

This is the place that deals with run queues and RQ_PPQ. It is sort of
scheduler-independent. Schedulers must set td_priority so that the
handling of priorities here works right (priorities are fundamentally
scheduler-independent, and run queues are non-fundamentally scheduler-
independent). Setting of the priority in kern_intr.c can be thought of as
a special case of scheduling. The scheduling is trivial (just round-robin
among "equal" priority processes), but kern_intr.c must still know about
RQ_PPQ to interface with kern_switch.c (since "equal" actually means
"equal after division by RQ_PPQ").

% sched_4bsd.c:	    RQ_PPQ) + INVERSE_ESTCPU_WEIGHT - 1)

This is SCHED_4BSD's only explicit awareness of run queues. It knows that
priority differences smaller than RQ_PPQ don't have much effect, so it
wants to arrange that a change in nice values causes a difference of at
least RQ_PPQ, but it needs to limit the difference so that it doesn't make
the priority too large to fit in priority space.

SCHED_4BSD also has implicit knowledge of RQ_PPQ. It maps CPU consumption
to priority changes and must have implicit factors of RQ_PPQ in the
mapping, else its response to CPU consumption would be too slow by the
missing factor. (In fact, its intentional response to CPU consumption is
too slow for other reasons, especially at high load averages, but this is
masked by bugfeatures like rescheduling on every non-fast interrupt, so a
mere factor of RQ_PPQ = 4 might not be noticed.)

This stuff is broken for realtime priorities too. There are 32 realtime
priorities, and there used to be a separate run queue with 32 slots for
them, but now there are only 8 runq slots for realtime priorities, so
there are effectively only 8 of them. Similarly for idletime.

I think the way things should work is:
- only 64 priorities (unless we expand the number of run queues)
- only 8 realtime priorities
- only 8 idletime priorities
- no magic factors of 4 in priority.h, or not-so-magic factors of RQ_PPQ
  elsewhere, or even more magic factors of N/4 (N magic too) in schedulers
- schedulers map scheduling decisions to priorities in not quite the same
  way as now
- schedulers that need the extra resolution given by the bits that used to
  be in the low 2 bits of td_priority need to maintain this internally.
  For SCHED_4BSD, these bits are only used for scheduling user tasks.
  Their only purposes seem to be to avoid fractions in the niceness
  adjustment and to pessimize things by setting TDF_NEEDRESCHED too much
  (see maybe_resched() -- it compares priorities directly but should
  compare them mod RQ_PPQ to avoid generating null context switches).

In 4.4BSD-Lite, there were several priorities that are not a multiple of
RQ_PPQ:

% #define	PSWP	0
% #define	PVM	4
% #define	PINOD	8
% #define	PRIBIO	16
% #define	PVFS	20
% #define	PZERO	22	/* No longer magic, shouldn't be here.  XXX */
% #define	PSOCK	24
% #define	PWAIT	32
% #define	PLOCK	36
% #define	PPAUSE	40
% #define	PUSER	50
% #define	MAXPRI	127	/* Priorities range from 0 through MAXPRI. */

Note the values for PUSER and PZERO. AFAIR, priorities were effectively
equal if they were equal after division by RQ_PPQ in 4.4BSD too.
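To make "effectively equal" concrete, here is a sketch (my illustration,
not kernel code) of the mod-RQ_PPQ comparison that maybe_resched() should
be making, using the 4.4BSD-Lite values above:

% #include <stdio.h>
%
% #define RQ_PPQ	4
% #define PZERO	22
% #define PUSER	50
%
% /*
%  * Priorities land in the same run-queue slot, and so are effectively
%  * equal, when they agree after division by RQ_PPQ.  Lower numeric
%  * value means higher priority.
%  */
% static int
% pri_effectively_higher(int newpri, int curpri)
% {
% 	return (newpri / RQ_PPQ < curpri / RQ_PPQ);
% }
%
% int
% main(void)
% {
% 	/* PUSER = 50 sits mid-slot (slot 12 covers 48..51). */
% 	printf("PUSER slot %d, PZERO slot %d\n",
% 	    PUSER / RQ_PPQ, PZERO / RQ_PPQ);
%
% 	/* 50 vs 51: same slot, so switching would be a null switch. */
% 	printf("reschedule for 50 over 51? %d\n",
% 	    pri_effectively_higher(50, 51));
%
% 	/* 47 vs 50: different slots, so a reschedule is warranted. */
% 	printf("reschedule for 47 over 50? %d\n",
% 	    pri_effectively_higher(47, 50));
% 	return (0);
% }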
I think PUSER was in the middle of a priority bucket only to give subtle
scaling of the mapping of (CPU, nice) to a priority, and PZERO was in the
middle of a priority bucket only so that PUSER - PZERO was a multiple of
RQ_PPQ (or to not break historical magic numbers that were visible in ps
output). The subtlety: most processes start with priority PUSER; since
that is in the middle of a bucket, it only takes half as much CPU as usual
to push it into the next bucket, so the response to small amounts of
accumulated CPU was faster.

There were related subtleties for the scaling of niceness: IIRC, the scale
factor for niceness was 2 in 4.4BSD, so a change of niceness from 0 to 1
pushed the base priority into the next bucket; now it takes a change of
niceness of 4 to have an effect. These subtleties are so subtle that I
couldn't show that they were useful when I objected to the commit that
broke them.

Bruce
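The arithmetic behind that, as I read it (illustration only: the 4.4BSD
base 50 and factor 2 are from the discussion above, but the "now" base of
48 and factor of 1 are my assumptions for the sake of the example):

% #include <stdio.h>
%
% #define RQ_PPQ	4
%
% int
% main(void)
% {
% 	int nice;
%
% 	for (nice = 0; nice <= 4; nice++) {
% 		/* 4.4BSD: base PUSER = 50 (mid-bucket), scale factor 2. */
% 		int old_pri = 50 + 2 * nice;
% 		/* Now (assumed): base on a bucket boundary, factor 1. */
% 		int new_pri = 48 + 1 * nice;
%
% 		printf("nice %d: 4.4BSD slot %d, now slot %d\n",
% 		    nice, old_pri / RQ_PPQ, new_pri / RQ_PPQ);
% 	}
% 	return (0);
% }

With the old numbers, nice 1 already moves the thread to slot 13; with the
new ones, nothing changes until nice reaches 4.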