Date: Wed, 23 Jun 2004 15:03:29 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Julian Elischer
cc: freebsd-current@FreeBSD.org
cc: FreeBSD current users
cc: John Baldwin
Subject: Re: ithread priority question...
Message-ID: <20040623121053.G56410@gamplex.bde.org>

On Tue, 22 Jun 2004, Julian Elischer wrote:

> On Tue, 22 Jun 2004, John Baldwin wrote:
>
> > On Monday 21 June 2004 03:48 am, Julian Elischer wrote:
> > > On Mon, 21 Jun 2004, Bruce Evans wrote:
> > > > On Sun, 20 Jun 2004, Julian Elischer wrote:
> > > > > In swi_add, the priority is multiplied by PPQ.
> > > > > This is really a layering violation, because PPQ should only be
> > > > > known within the scheduler.... but..... "Why multiply by PPQ in
> > > > > the first place?" We are not using the system run queues for
> > > > > interrupt threads.
> > > > >
> > > > > (PPQ = Priorities Per Queue).
> > > > >
> > > > > Without this you can remove runq.h from proc.h and include it
> > > > > only in the scheduler-related files.
> > > >
> > > > I agree that this makes no sense. Apart from the layering violation,
> > > > it seems to just waste priority space. The wastage is not just
> > > > cosmetic, since someone increased the number of SWIs although there
> > > > was no room for expansion.

Oops, this is mostly wrong. I forgot that the same type of run queues is
used for all threads when I wrote this.

> > > > Hardware ithread priorities are also separated by 4. The magic
> > > > number 4 is encoded in their definitions in priority.h. It's not
> > > > clear if the 4 is PPQ or just room for expansion without changing
> > > > the ABI. Preserving this ABI doesn't seem very important.
> > >
> > > Seems pointless to me..
> > > It looks to me that at one stage someone was considering using the
> > > standard run-queue code to make interrupt threads runnable.
> > > They wanted each interrupt thread to be on a different queue and to
> > > use the ffs() code to find the next one to run.
> >
> > That was the intention. One question though: if the ithreads aren't on
> > the system run queues, then which run queues are they on?

Good point.

> Aren't they run from the interrupt?

Nah, they are put on normal queues if they block, at least for SCHED_4BSD.
They just aren't scheduled normally. Run queues are mostly not a property
of the scheduler.

According to "grep PPQ *.c" in kern:

% kern_intr.c:	    (pri * RQ_PPQ) + PI_SOFT, flags, cookiep));

This is the place that you noticed.
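As a minimal sketch of what that multiply buys (illustration only: RQ_PPQ
= 4 matches the run-queue code, but the PI_SOFT value below is made up,
not the kernel's actual constant):

% #include <stdio.h>
%
% #define RQ_PPQ	4	/* priorities per run queue */
% #define PI_SOFT	44	/* hypothetical SWI base priority */
%
% int
% main(void)
% {
% 	int pri;
%
% 	for (pri = 0; pri < 4; pri++) {
% 		/* The computation from kern_intr.c quoted above. */
% 		int td_priority = (pri * RQ_PPQ) + PI_SOFT;
%
% 		printf("swi pri %d -> td_priority %3d -> runq slot %d\n",
% 		    pri, td_priority, td_priority / RQ_PPQ);
% 	}
% 	return (0);
% }

Each SWI level lands in its own run-queue slot; without the factor of
RQ_PPQ, four adjacent levels would collapse into one slot and merely
round-robin against each other.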
% kern_switch.c:	if (ke->ke_rqindex != (newpri / RQ_PPQ)) {
% kern_switch.c:	pri = ke->ke_thread->td_priority / RQ_PPQ;

This is the place that deals with run queues and RQ_PPQ. It is sort of
scheduler-independent. Schedulers must set td_priority so that the
handling of priorities here works right (priorities are fundamentally
scheduler-independent, and run queues are non-fundamentally scheduler-
independent). Setting of the priority in kern_intr.c can be thought of as
a special case of scheduling. The scheduling is trivial (just round-robin
among "equal" priority processes), but kern_intr.c must still know about
RQ_PPQ to interface with kern_switch.c (since "equal" actually means
"equal after division by RQ_PPQ").

% sched_4bsd.c:	    RQ_PPQ) + INVERSE_ESTCPU_WEIGHT - 1)

This is SCHED_4BSD's only explicit awareness of run queues. It knows that
priority differences smaller than RQ_PPQ don't have much effect, so it
wants to arrange that a change in nice values causes a difference of at
least RQ_PPQ, but it needs to limit the difference so that it doesn't make
the priority too large to fit in priority space.

SCHED_4BSD also has implicit knowledge of RQ_PPQ. It maps CPU consumption
to priority changes and must have implicit factors of RQ_PPQ in the
mapping, else its response to CPU consumption would be too slow by the
missing factor. (In fact, its intentional response to CPU consumption is
too slow for other reasons, especially at high load averages, but this is
masked by bugfeatures like rescheduling on every non-fast interrupt, so a
mere factor of RQ_PPQ = 4 might not be noticed.)

This stuff is broken for realtime priorities too. There are 32 realtime
priorities, and there used to be a separate run queue with 32 slots for
them, but now there are only 8 runq slots for realtime priorities, so
there are effectively only 8 of them. Similarly for idletime.

I think the way things should work is:
- only 64 priorities (unless we expand the number of run queues)
- only 8 realtime priorities
- only 8 idletime priorities
- no magic factors of 4 in priority.h, or not-so-magic factors of RQ_PPQ
  elsewhere, or even more magic factors of N/4 (N magic too) in schedulers
- schedulers map scheduling decisions to priorities in not quite the same
  way as now
- schedulers that need the extra resolution given by the bits that used to
  be in the low 2 bits of td_priority need to maintain this internally.
  For SCHED_4BSD, these bits are only used for scheduling user tasks.
  Their only purposes seem to be to avoid fractions in the niceness
  adjustment and to pessimize things by setting TDF_NEEDRESCHED too much
  (see maybe_resched() -- it compares priorities directly but should
  compare them mod RQ_PPQ to avoid generating null context switches).

In 4.4BSD-Lite, there were several priorities that are not a multiple of
RQ_PPQ:

% #define	PSWP	0
% #define	PVM	4
% #define	PINOD	8
% #define	PRIBIO	16
% #define	PVFS	20
% #define	PZERO	22	/* No longer magic, shouldn't be here.  XXX */
% #define	PSOCK	24
% #define	PWAIT	32
% #define	PLOCK	36
% #define	PPAUSE	40
% #define	PUSER	50
% #define	MAXPRI	127	/* Priorities range from 0 through MAXPRI. */

Note the values for PUSER and PZERO. AFAIR, priorities were effectively
equal if they were equal after division by RQ_PPQ in 4.4BSD too.
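To make "effectively equal" concrete, here is a sketch (my illustration,
not kernel code) of the mod-RQ_PPQ comparison that maybe_resched() should
be making, using the 4.4BSD-Lite values above:

% #include <stdio.h>
%
% #define RQ_PPQ	4
% #define PZERO	22
% #define PUSER	50
%
% /*
%  * Priorities land in the same run-queue slot, and so are effectively
%  * equal, when they agree after division by RQ_PPQ.  Lower numeric
%  * value means higher priority.
%  */
% static int
% pri_effectively_higher(int newpri, int curpri)
% {
% 	return (newpri / RQ_PPQ < curpri / RQ_PPQ);
% }
%
% int
% main(void)
% {
% 	/* PUSER = 50 sits mid-slot (slot 12 covers 48..51). */
% 	printf("PUSER slot %d, PZERO slot %d\n",
% 	    PUSER / RQ_PPQ, PZERO / RQ_PPQ);
%
% 	/* 50 vs 51: same slot, so switching would be a null switch. */
% 	printf("reschedule for 50 over 51? %d\n",
% 	    pri_effectively_higher(50, 51));
%
% 	/* 47 vs 50: different slots, so a reschedule is warranted. */
% 	printf("reschedule for 47 over 50? %d\n",
% 	    pri_effectively_higher(47, 50));
% 	return (0);
% }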
I think PUSER was in the middle of a priority bucket only to give subtle
scaling of the mapping of (CPU, nice) to a priority, and PZERO was in the
middle of a priority bucket only so that PUSER - PZERO was a multiple of
RQ_PPQ (or to not break historical magic numbers that were visible in ps
output). The subtlety: most processes start with priority PUSER; since
that is in the middle of a bucket, it only takes half as much CPU as usual
to push it into the next bucket, so the response to small amounts of
accumulated CPU was faster.

There were related subtleties for the scaling of niceness: IIRC, the scale
factor for niceness was 2 in 4.4BSD, so a change of niceness from 0 to 1
pushed the base priority into the next bucket; now it takes a change of
niceness of 4 to have an effect. These subtleties are so subtle that I
couldn't show that they were useful when I objected to the commit that
broke them.

Bruce
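The arithmetic behind that, as I read it (illustration only: the 4.4BSD
base 50 and factor 2 are from the discussion above, but the "now" base of
48 and factor of 1 are my assumptions for the sake of the example):

% #include <stdio.h>
%
% #define RQ_PPQ	4
%
% int
% main(void)
% {
% 	int nice;
%
% 	for (nice = 0; nice <= 4; nice++) {
% 		/* 4.4BSD: base PUSER = 50 (mid-bucket), scale factor 2. */
% 		int old_pri = 50 + 2 * nice;
% 		/* Now (assumed): base on a bucket boundary, factor 1. */
% 		int new_pri = 48 + 1 * nice;
%
% 		printf("nice %d: 4.4BSD slot %d, now slot %d\n",
% 		    nice, old_pri / RQ_PPQ, new_pri / RQ_PPQ);
% 	}
% 	return (0);
% }

With the old numbers, nice 1 already moves the thread to slot 13; with the
new ones, nothing changes until nice reaches 4.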