From owner-freebsd-arch Tue Jul 16 23:52:37 2002
Date: Tue, 16 Jul 2002 23:52:16 -0700
From: Luigi Rizzo
To: arch@freebsd.org
Subject: proposed changes to kern_switch.c and kern_synch.c
Message-ID: <20020716235216.B6785@iguana.icir.org>
Sender: owner-freebsd-arch@FreeBSD.ORG

Hi,

we have implemented a weight-based process scheduler for FreeBSD-stable, and
are trying to port it to -current (in the description below, replace
"process" with "thread/kse" as appropriate for the -current case).

In order to make this work, it is convenient to have all scheduler-specific
functions and data structures in a single file (kern_switch*.c), and the
generic support in another one (kern_synch.c). I believe this was also the
original BSD design when the code was partitioned between the two files.
However, in our current code there are some functions which are
scheduler-specific (see below) but are misplaced.

So I would like to make the following, purely cosmetic, change, identified
as step #1 below, both for consistency with what I believe to be the
original design and to simplify further work in this area. These changes
would go into both -current and -stable; I repeat, they are only cosmetic.
Comments? I have already spoken to Julian, who has no objections.

For those interested, a patch for -current is attached, and the one for
-stable is very similar. (For the record, in -stable we have a fully
functional weight-based process scheduler, where you still control the
weights using "nice", and you can switch between the old and the new
scheduler at any time using a sysctl variable.)

--- Re. the multi-scheduler architecture ---

The general idea is to make the process/thread/kse scheduler a replaceable
piece of the kernel, requiring no modifications to "struct proc", and with
the ability to switch from one scheduler to another at runtime (both for
testing purposes and for whatever need may arise). The way to achieve this
would be the following:

1. identify all scheduler-specific functions, put them in one file
   (e.g. kern_switch.c for the default scheduler), and define function
   pointers for all of them;

2. use one field in "struct proc" as a link to whatever scheduler-specific
   data structures are necessary. In -stable we have the p_pad3 field,
   which can be used for this purpose, much like the if_index field in
   struct ifnet;

3. implement a function which, under control of a sysctl call, activates a
   new scheduler (by initialising all the function pointers to the
   appropriate values) and moves all processes from the old scheduler to
   the new one.

Step #1 is basically a cosmetic change, requiring mostly a move of some
functions from kern_synch.c to kern_switch.c. These are roundrobin(),
curpriority_cmp(), resetpriority(), and parts of schedcpu() and
schedclock(). To make the direction of steps #1 and #3 a bit more concrete,
a rough sketch follows; the actual (purely cosmetic) patch for step #1 is
at the end of this message.
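None of the names below exist anywhere yet -- struct sched_ops,
sched_4bsd_ops and sched_switch_to are invented purely for illustration,
and locking and error handling are only roughed in -- but this is roughly
the shape that steps #1 and #3 would give us:

/*
 * SKETCH ONLY: illustrative names, not part of the attached patch.
 */
struct sched_ops {
	const char *name;
	/* entry points currently living in kern_switch.c */
	void	(*setrunqueue)(struct thread *td);
	void	(*remrunqueue)(struct thread *td);
	/* entry points that step #1 moves into kern_switch.c */
	void	(*resetpriority)(struct ksegrp *kg);
	void	(*schedclock)(struct thread *td);
	void	(*schedcpu)(struct ksegrp *kg);	/* cf. schedcpu1() below */
};

/* The default (historical BSD) scheduler exports its ops from kern_switch.c. */
extern struct sched_ops sched_4bsd_ops;

/* All callers go through this pointer instead of calling the functions directly. */
static struct sched_ops *sched_ops = &sched_4bsd_ops;

/*
 * Step #3: activate a different scheduler at runtime, invoked from a
 * sysctl handler.  Pull every runnable thread off the old scheduler's
 * queues, flip the ops pointer, then hand the threads to the new one.
 * Per-process private state would hang off a spare field in struct proc
 * (p_pad3 in -stable) and would be converted here as well.
 */
static int
sched_switch_to(struct sched_ops *new_ops)
{
	struct proc *p;
	struct ksegrp *kg;
	struct thread *td;

	sx_slock(&allproc_lock);
	mtx_lock_spin(&sched_lock);
	FOREACH_PROC_IN_SYSTEM(p)
		FOREACH_KSEGRP_IN_PROC(p, kg)
			FOREACH_THREAD_IN_GROUP(kg, td)
				if (td->td_state == TDS_RUNQ)
					sched_ops->remrunqueue(td);
	sched_ops = new_ops;
	FOREACH_PROC_IN_SYSTEM(p)
		FOREACH_KSEGRP_IN_PROC(p, kg)
			FOREACH_THREAD_IN_GROUP(kg, td)
				if (td->td_state == TDS_RUNQ)
					sched_ops->setrunqueue(td);
	mtx_unlock_spin(&sched_lock);
	sx_sunlock(&allproc_lock);
	return (0);
}

A sysctl handler would then simply map the value written by the user to one
of the available sched_ops instances and call something like the above;
that is, roughly, what the sysctl switch in the -stable prototype does.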
cheers
luigi

----------------------------------
Index: kern_switch.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_switch.c,v
retrieving revision 1.33
diff -u -r1.33 kern_switch.c
--- kern_switch.c	14 Jul 2002 03:43:33 -0000	1.33
+++ kern_switch.c	16 Jul 2002 22:15:06 -0000
@@ -97,6 +97,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 

 CTASSERT((RQB_BPW * RQB_LEN) == RQ_NQS);
@@ -107,6 +108,9 @@
 static struct runq runq;
 SYSINIT(runq, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, runq_init, &runq)

+static struct callout roundrobin_callout;
+
+static void roundrobin(void *arg);
 static void runq_readjust(struct runq *rq, struct kse *ke);

 /************************************************************************
  * Functions that manipulate runnability from a thread perspective.	*
@@ -442,9 +446,13 @@
 {
	int i;

+	callout_init(&roundrobin_callout, 0);
+
	bzero(rq, sizeof *rq);
	for (i = 0; i < RQ_NQS; i++)
		TAILQ_INIT(&rq->rq_queues[i]);
+
+	roundrobin(NULL);
 }

 /*
@@ -719,3 +727,207 @@
 }
 #endif

+/*
+ * -- formerly in kern_synch.c
+ */
+
+int curpriority_cmp(struct thread *td);
+
+int
+curpriority_cmp(struct thread *td)
+{
+	return (td->td_priority - curthread->td_priority);
+}
+
+/*
+ * Force switch among equal priority processes every 100ms.
+ * We don't actually need to force a context switch of the current process.
+ * The act of firing the event triggers a context switch to softclock() and
+ * then switching back out again which is equivalent to a preemption, thus
+ * no further work is needed on the local CPU.
+ */
+/* ARGSUSED */
+static void
+roundrobin(arg)
+	void *arg;
+{
+
+#ifdef SMP
+	mtx_lock_spin(&sched_lock);
+	forward_roundrobin();
+	mtx_unlock_spin(&sched_lock);
+#endif
+
+	callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
+}
+
+/* calculations for digital decay to forget 90% of usage in 5*loadav sec */
+#define	loadfactor(loadav)	(2 * (loadav))
+#define	decay_cpu(loadfac, cpu)	(((loadfac) * (cpu)) / ((loadfac) + FSCALE))
+extern fixpt_t ccpu;
+#define	CCPU_SHIFT	11	/* XXX dup from kern_synch.c! */
+
+/*
+ * Recompute process priorities, every hz ticks.
+ * MP-safe, called without the Giant mutex.
+ */
+void schedcpu1(struct ksegrp *kg);
+/* ARGSUSED */
+void
+schedcpu1(struct ksegrp *kg)
+{
+	register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
+	struct thread *td;
+	struct kse *ke;
+	int realstathz;
+	int awake;
+
+	realstathz = stathz ? stathz : hz;
+
+	awake = 0;
+	FOREACH_KSE_IN_GROUP(kg, ke) {
+		/*
+		 * Increment time in/out of memory and sleep
+		 * time (if sleeping).  We ignore overflow;
+		 * with 16-bit int's (remember them?)
+		 * overflow takes 45 days.
+		 */
+		/* XXXKSE **WRONG***/
+		/*
+		 * the kse slptimes are not touched in wakeup
+		 * because the thread may not HAVE a KSE
+		 */
+		if ((ke->ke_state == KES_ONRUNQ) ||
+		    ((ke->ke_state == KES_THREAD) &&
+		    (ke->ke_thread->td_state == TDS_RUNNING))) {
+			ke->ke_slptime++;
+		} else {
+			ke->ke_slptime = 0;
+			awake = 1;
+		}
+
+		/*
+		 * pctcpu is only for ps?
+		 * Do it per kse.. and add them up at the end?
+		 * XXXKSE
+		 */
+		ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
+		/*
+		 * If the kse has been idle the entire second,
+		 * stop recalculating its priority until
+		 * it wakes up.
+		 */
+		if (ke->ke_slptime > 1) {
+			continue;
+		}
+
+#if	(FSHIFT >= CCPU_SHIFT)
+		ke->ke_pctcpu += (realstathz == 100) ?
+		    ((fixpt_t) ke->ke_cpticks) <<
+		    (FSHIFT - CCPU_SHIFT) :
+		    100 * (((fixpt_t) ke->ke_cpticks) <<
+		    (FSHIFT - CCPU_SHIFT)) / realstathz;
+#else
+		ke->ke_pctcpu += ((FSCALE - ccpu) *
+		    (ke->ke_cpticks * FSCALE / realstathz)) >>
+		    FSHIFT;
+#endif
+		ke->ke_cpticks = 0;
+	} /* end of kse loop */
+	if (awake == 0) {
+		kg->kg_slptime++;
+	} else {
+		kg->kg_slptime = 0;
+	}
+	kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
+	resetpriority(kg);
+	FOREACH_THREAD_IN_GROUP(kg, td) {
+		int changedqueue;
+		if (td->td_priority >= PUSER) {
+			/*
+			 * Only change the priority
+			 * of threads that are still at their
+			 * user priority.
+			 * XXXKSE This is problematic
+			 * as we may need to re-order
+			 * the threads on the KSEG list.
+			 */
+			changedqueue =
+			    ((td->td_priority / RQ_PPQ) !=
+			    (kg->kg_user_pri / RQ_PPQ));
+
+			td->td_priority = kg->kg_user_pri;
+			if (changedqueue &&
+			    td->td_state == TDS_RUNQ) {
+				/* this could be optimised */
+				remrunqueue(td);
+				td->td_priority =
+				    kg->kg_user_pri;
+				setrunqueue(td);
+			} else {
+				td->td_priority = kg->kg_user_pri;
+			}
+		}
+	}
+}
+
+/*
+ * Compute the priority of a process when running in user mode.
+ * Arrange to reschedule if the resulting priority is better
+ * than that of the current process.
+ */
+void
+resetpriority(kg)
+	register struct ksegrp *kg;
+{
+	register unsigned int newpriority;
+	struct thread *td;
+
+	mtx_lock_spin(&sched_lock);
+	if (kg->kg_pri_class == PRI_TIMESHARE) {
+		newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
+		    NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
+		newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
+		    PRI_MAX_TIMESHARE);
+		kg->kg_user_pri = newpriority;
+	}
+	FOREACH_THREAD_IN_GROUP(kg, td) {
+		maybe_resched(td);	/* XXXKSE silly */
+	}
+	mtx_unlock_spin(&sched_lock);
+}
+
+#if 0
+/*
+ * We adjust the priority of the current process.  The priority of
+ * a process gets worse as it accumulates CPU time.  The cpu usage
+ * estimator (p_estcpu) is increased here.  resetpriority() will
+ * compute a different priority each time p_estcpu increases by
+ * INVERSE_ESTCPU_WEIGHT
+ * (until MAXPRI is reached).  The cpu usage estimator ramps up
+ * quite quickly when the process is running (linearly), and decays
+ * away exponentially, at a rate which is proportionally slower when
+ * the system is busy.  The basic principle is that the system will
+ * 90% forget that the process used a lot of CPU time in 5 * loadav
+ * seconds.  This causes the system to favor processes which haven't
+ * run much recently, and to round-robin among other processes.
+ */
+void
+schedclock(td)
+	struct thread *td;
+{
+	struct kse *ke;
+	struct ksegrp *kg;
+
+	KASSERT((td != NULL), ("schedlock: null thread pointer"));
+	ke = td->td_kse;
+	kg = td->td_ksegrp;
+	ke->ke_cpticks++;
+	kg->kg_estcpu = ESTCPULIM(kg->kg_estcpu + 1);
+	if ((kg->kg_estcpu % INVERSE_ESTCPU_WEIGHT) == 0) {
+		resetpriority(kg);
+		if (td->td_priority >= PUSER)
+			td->td_priority = kg->kg_user_pri;
+	}
+}
+#endif
Index: kern_synch.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.185
diff -u -r1.185 kern_synch.c
--- kern_synch.c	14 Jul 2002 03:43:33 -0000	1.185
+++ kern_synch.c	16 Jul 2002 22:16:46 -0000
@@ -67,6 +67,8 @@
 #include 

+void schedcpu1(struct ksegrp *kg);	/* XXX in kern_switch.c */
+
 static void sched_setup(void *dummy);
 SYSINIT(sched_setup, SI_SUB_KICK_SCHEDULER, SI_ORDER_FIRST, sched_setup, NULL)

@@ -76,7 +78,6 @@
 static struct callout loadav_callout;
 static struct callout schedcpu_callout;
-static struct callout roundrobin_callout;

 struct loadavg averunnable =
	{ {0, 0, 0}, FSCALE };	/* load average, of runnable procs */
@@ -92,7 +93,6 @@
 static void	endtsleep(void *);
 static void	loadav(void *arg);
-static void	roundrobin(void *arg);
 static void	schedcpu(void *arg);

 static int
@@ -135,28 +135,6 @@
 }

 /*
- * Force switch among equal priority processes every 100ms.
- * We don't actually need to force a context switch of the current process.
- * The act of firing the event triggers a context switch to softclock() and
- * then switching back out again which is equivalent to a preemption, thus
- * no further work is needed on the local CPU.
- */
-/* ARGSUSED */
-static void
-roundrobin(arg)
-	void *arg;
-{
-
-#ifdef SMP
-	mtx_lock_spin(&sched_lock);
-	forward_roundrobin();
-	mtx_unlock_spin(&sched_lock);
-#endif
-
-	callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
-}
-
-/*
  * Constants for digital decay and forget:
  *	90% of (p_estcpu) usage in 5 * loadav time
  *	95% of (p_pctcpu) usage in 60 seconds (load insensitive)
@@ -225,7 +203,7 @@
 #define	decay_cpu(loadfac, cpu)	(((loadfac) * (cpu)) / ((loadfac) + FSCALE))

 /* decay 95% of `p_pctcpu' in 60 seconds; see CCPU_SHIFT before changing */
-static fixpt_t	ccpu = 0.95122942450071400909 * FSCALE;	/* exp(-1/20) */
+fixpt_t	ccpu = 0.95122942450071400909 * FSCALE;	/* exp(-1/20) */
 SYSCTL_INT(_kern, OID_AUTO, ccpu, CTLFLAG_RD, &ccpu, 0, "");

 /* kernel uses `FSCALE', userland (SHOULD) use kern.fscale */
@@ -255,105 +233,15 @@
 schedcpu(arg)
	void *arg;
 {
-	register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
-	struct thread *td;
	struct proc *p;
-	struct kse *ke;
	struct ksegrp *kg;
-	int realstathz;
-	int awake;

-	realstathz = stathz ? stathz : hz;
	sx_slock(&allproc_lock);
	FOREACH_PROC_IN_SYSTEM(p) {
		mtx_lock_spin(&sched_lock);
		p->p_swtime++;
		FOREACH_KSEGRP_IN_PROC(p, kg) {
-			awake = 0;
-			FOREACH_KSE_IN_GROUP(kg, ke) {
-				/*
-				 * Increment time in/out of memory and sleep
-				 * time (if sleeping).  We ignore overflow;
-				 * with 16-bit int's (remember them?)
-				 * overflow takes 45 days.
-				 */
-				/* XXXKSE **WRONG***/
-				/*
-				 * the kse slptimes are not touched in wakeup
-				 * because the thread may not HAVE a KSE
-				 */
-				if ((ke->ke_state == KES_ONRUNQ) ||
-				    ((ke->ke_state == KES_THREAD) &&
-				    (ke->ke_thread->td_state == TDS_RUNNING))) {
-					ke->ke_slptime++;
-				} else {
-					ke->ke_slptime = 0;
-					awake = 1;
-				}
-
-				/*
-				 * pctcpu is only for ps?
-				 * Do it per kse.. and add them up at the end?
-				 * XXXKSE
-				 */
-				ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
-				/*
-				 * If the kse has been idle the entire second,
-				 * stop recalculating its priority until
-				 * it wakes up.
-				 */
-				if (ke->ke_slptime > 1) {
-					continue;
-				}
-
-#if	(FSHIFT >= CCPU_SHIFT)
-				ke->ke_pctcpu += (realstathz == 100) ?
-				    ((fixpt_t) ke->ke_cpticks) <<
-				    (FSHIFT - CCPU_SHIFT) :
-				    100 * (((fixpt_t) ke->ke_cpticks) <<
-				    (FSHIFT - CCPU_SHIFT)) / realstathz;
-#else
-				ke->ke_pctcpu += ((FSCALE - ccpu) *
-				    (ke->ke_cpticks * FSCALE / realstathz)) >>
-				    FSHIFT;
-#endif
-				ke->ke_cpticks = 0;
-			} /* end of kse loop */
-			if (awake == 0) {
-				kg->kg_slptime++;
-			} else {
-				kg->kg_slptime = 0;
-			}
-			kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
-			resetpriority(kg);
-			FOREACH_THREAD_IN_GROUP(kg, td) {
-				int changedqueue;
-				if (td->td_priority >= PUSER) {
-					/*
-					 * Only change the priority
-					 * of threads that are still at their
-					 * user priority.
-					 * XXXKSE This is problematic
-					 * as we may need to re-order
-					 * the threads on the KSEG list.
-					 */
-					changedqueue =
-					    ((td->td_priority / RQ_PPQ) !=
-					    (kg->kg_user_pri / RQ_PPQ));
-
-					td->td_priority = kg->kg_user_pri;
-					if (changedqueue &&
-					    td->td_state == TDS_RUNQ) {
-						/* this could be optimised */
-						remrunqueue(td);
-						td->td_priority =
-						    kg->kg_user_pri;
-						setrunqueue(td);
-					} else {
-						td->td_priority = kg->kg_user_pri;
-					}
-				}
-			}
+			schedcpu1(kg);
		} /* end of ksegrp loop */
		mtx_unlock_spin(&sched_lock);
	} /* end of process loop */
@@ -944,32 +832,6 @@
 }

 /*
- * Compute the priority of a process when running in user mode.
- * Arrange to reschedule if the resulting priority is better
- * than that of the current process.
- */
-void
-resetpriority(kg)
-	register struct ksegrp *kg;
-{
-	register unsigned int newpriority;
-	struct thread *td;
-
-	mtx_lock_spin(&sched_lock);
-	if (kg->kg_pri_class == PRI_TIMESHARE) {
-		newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
-		    NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
-		newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
-		    PRI_MAX_TIMESHARE);
-		kg->kg_user_pri = newpriority;
-	}
-	FOREACH_THREAD_IN_GROUP(kg, td) {
-		maybe_resched(td);	/* XXXKSE silly */
-	}
-	mtx_unlock_spin(&sched_lock);
-}
-
-/*
  * Compute a tenex style load average of a quantity on
  * 1, 5 and 15 minute intervals.
  * XXXKSE   Needs complete rewrite when correct info is available.
@@ -1022,11 +884,9 @@
 {

	callout_init(&schedcpu_callout, 1);
-	callout_init(&roundrobin_callout, 0);
	callout_init(&loadav_callout, 0);

	/* Kick off timeout driven events by calling first time. */
-	roundrobin(NULL);
	schedcpu(NULL);
	loadav(NULL);
 }

----- End forwarded message -----

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message