From owner-freebsd-arch Tue Jul 16 23:52:37 2002
Date: Tue, 16 Jul 2002 23:52:16 -0700
From: Luigi Rizzo
To: arch@freebsd.org
Subject: proposed changes to kern_switch.c and kern_synch.c
Message-ID: <20020716235216.B6785@iguana.icir.org>
Sender: owner-freebsd-arch@FreeBSD.ORG

Hi,

we have implemented a weight-based process scheduler for FreeBSD-stable, and
are trying to port it to -current (in the description below, replace
"process" with "thread/kse" as appropriate for the -current case).

In order to make this work, it is convenient to have all scheduler-specific
functions and data structures in a single file (kern_switch*.c), and the
generic support in another one (kern_synch.c). I believe this was also the
original BSD design when the code was partitioned between the two files.
However, in our current code there are some functions which are
scheduler-specific (see below) but are misplaced.

So I would like to make the following, purely cosmetic, change, identified
as step #1 below, both for consistency with what I believe to be the
original design and to simplify further work in this area. These changes
would go into both -current and -stable; I repeat, they are only cosmetic.
Comments? I have already spoken to Julian, who has no objections.

For those interested, a patch for -current is attached, and the one for
-stable is very similar. (For the record, in -stable we have a fully
functional weight-based process scheduler, where you still control the
weights using "nice", and you can switch between the old and the new
scheduler at any time using a sysctl variable.)

--- Re. the multi-scheduler architecture ---

The general idea is to make the process/thread/kse scheduler a replaceable
piece of the kernel, requiring no modifications to "struct proc", and with
the ability to switch from one scheduler to another at runtime (both for
testing purposes and for whatever need may arise). The way to achieve this
would be the following:

1. identify all scheduler-specific functions, put them in one file
   (e.g. kern_switch.c for the default scheduler), and define function
   pointers for all of them;

2. use one field in "struct proc" as a link to whatever scheduler-specific
   data structures are necessary. In -stable we have the p_pad3 field,
   which can be used for this purpose, much like the if_index field in
   struct ifnet;

3. implement a function which, under control of a sysctl call, activates a
   new scheduler (by initialising all the function pointers to the
   appropriate values) and moves all processes from the old scheduler to
   the new one.

Step #1 is basically a cosmetic change, requiring mostly a move of some
functions from kern_synch.c to kern_switch.c. These are roundrobin(),
curpriority_cmp(), resetpriority(), and parts of schedcpu() and
schedclock(). To make the direction of steps #1 and #3 a bit more concrete,
a rough sketch follows; the actual (purely cosmetic) patch for step #1 is
at the end of this message.
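None of the names below exist anywhere yet -- struct sched_ops,
sched_4bsd_ops and sched_switch_to are invented purely for illustration,
and locking and error handling are only roughed in -- but this is roughly
the shape that steps #1 and #3 would give us:

/*
 * SKETCH ONLY: illustrative names, not part of the attached patch.
 */
struct sched_ops {
	const char *name;
	/* entry points currently living in kern_switch.c */
	void	(*setrunqueue)(struct thread *td);
	void	(*remrunqueue)(struct thread *td);
	/* entry points that step #1 moves into kern_switch.c */
	void	(*resetpriority)(struct ksegrp *kg);
	void	(*schedclock)(struct thread *td);
	void	(*schedcpu)(struct ksegrp *kg);	/* cf. schedcpu1() below */
};

/* The default (historical BSD) scheduler exports its ops from kern_switch.c. */
extern struct sched_ops sched_4bsd_ops;

/* All callers go through this pointer instead of calling the functions directly. */
static struct sched_ops *sched_ops = &sched_4bsd_ops;

/*
 * Step #3: activate a different scheduler at runtime, invoked from a
 * sysctl handler.  Pull every runnable thread off the old scheduler's
 * queues, flip the ops pointer, then hand the threads to the new one.
 * Per-process private state would hang off a spare field in struct proc
 * (p_pad3 in -stable) and would be converted here as well.
 */
static int
sched_switch_to(struct sched_ops *new_ops)
{
	struct proc *p;
	struct ksegrp *kg;
	struct thread *td;

	sx_slock(&allproc_lock);
	mtx_lock_spin(&sched_lock);
	FOREACH_PROC_IN_SYSTEM(p)
		FOREACH_KSEGRP_IN_PROC(p, kg)
			FOREACH_THREAD_IN_GROUP(kg, td)
				if (td->td_state == TDS_RUNQ)
					sched_ops->remrunqueue(td);
	sched_ops = new_ops;
	FOREACH_PROC_IN_SYSTEM(p)
		FOREACH_KSEGRP_IN_PROC(p, kg)
			FOREACH_THREAD_IN_GROUP(kg, td)
				if (td->td_state == TDS_RUNQ)
					sched_ops->setrunqueue(td);
	mtx_unlock_spin(&sched_lock);
	sx_sunlock(&allproc_lock);
	return (0);
}

A sysctl handler would then simply map the value written by the user to one
of the available sched_ops instances and call something like the above;
that is, roughly, what the sysctl switch in the -stable prototype does.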
cheers
luigi

----------------------------------
Index: kern_switch.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_switch.c,v
retrieving revision 1.33
diff -u -r1.33 kern_switch.c
--- kern_switch.c	14 Jul 2002 03:43:33 -0000	1.33
+++ kern_switch.c	16 Jul 2002 22:15:06 -0000
@@ -97,6 +97,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 

 CTASSERT((RQB_BPW * RQB_LEN) == RQ_NQS);
@@ -107,6 +108,9 @@
 static struct runq runq;
 SYSINIT(runq, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, runq_init, &runq)

+static struct callout roundrobin_callout;
+
+static void roundrobin(void *arg);
 static void runq_readjust(struct runq *rq, struct kse *ke);

 /************************************************************************
  * Functions that manipulate runnability from a thread perspective.	*
@@ -442,9 +446,13 @@
 {
	int i;

+	callout_init(&roundrobin_callout, 0);
+
	bzero(rq, sizeof *rq);
	for (i = 0; i < RQ_NQS; i++)
		TAILQ_INIT(&rq->rq_queues[i]);
+
+	roundrobin(NULL);
 }

 /*
@@ -719,3 +727,207 @@
 }
 #endif

+/*
+ * -- formerly in kern_synch.c
+ */
+
+int curpriority_cmp(struct thread *td);
+
+int
+curpriority_cmp(struct thread *td)
+{
+	return (td->td_priority - curthread->td_priority);
+}
+
+/*
+ * Force switch among equal priority processes every 100ms.
+ * We don't actually need to force a context switch of the current process.
+ * The act of firing the event triggers a context switch to softclock() and
+ * then switching back out again which is equivalent to a preemption, thus
+ * no further work is needed on the local CPU.
+ */
+/* ARGSUSED */
+static void
+roundrobin(arg)
+	void *arg;
+{
+
+#ifdef SMP
+	mtx_lock_spin(&sched_lock);
+	forward_roundrobin();
+	mtx_unlock_spin(&sched_lock);
+#endif
+
+	callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
+}
+
+/* calculations for digital decay to forget 90% of usage in 5*loadav sec */
+#define	loadfactor(loadav)	(2 * (loadav))
+#define	decay_cpu(loadfac, cpu)	(((loadfac) * (cpu)) / ((loadfac) + FSCALE))
+extern fixpt_t ccpu;
+#define	CCPU_SHIFT	11	/* XXX dup from kern_synch.c! */
+
+/*
+ * Recompute process priorities, every hz ticks.
+ * MP-safe, called without the Giant mutex.
+ */
+void schedcpu1(struct ksegrp *kg);
+/* ARGSUSED */
+void
+schedcpu1(struct ksegrp *kg)
+{
+	register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
+	struct thread *td;
+	struct kse *ke;
+	int realstathz;
+	int awake;
+
+	realstathz = stathz ? stathz : hz;
+
+	awake = 0;
+	FOREACH_KSE_IN_GROUP(kg, ke) {
+		/*
+		 * Increment time in/out of memory and sleep
+		 * time (if sleeping).  We ignore overflow;
+		 * with 16-bit int's (remember them?)
+		 * overflow takes 45 days.
+		 */
+		/* XXXKSE **WRONG***/
+		/*
+		 * the kse slptimes are not touched in wakeup
+		 * because the thread may not HAVE a KSE
+		 */
+		if ((ke->ke_state == KES_ONRUNQ) ||
+		    ((ke->ke_state == KES_THREAD) &&
+		    (ke->ke_thread->td_state == TDS_RUNNING))) {
+			ke->ke_slptime++;
+		} else {
+			ke->ke_slptime = 0;
+			awake = 1;
+		}
+
+		/*
+		 * pctcpu is only for ps?
+		 * Do it per kse.. and add them up at the end?
+		 * XXXKSE
+		 */
+		ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
+		/*
+		 * If the kse has been idle the entire second,
+		 * stop recalculating its priority until
+		 * it wakes up.
+		 */
+		if (ke->ke_slptime > 1) {
+			continue;
+		}
+
+#if	(FSHIFT >= CCPU_SHIFT)
+		ke->ke_pctcpu += (realstathz == 100) ?
+		    ((fixpt_t) ke->ke_cpticks) <<
+		    (FSHIFT - CCPU_SHIFT) :
+		    100 * (((fixpt_t) ke->ke_cpticks) <<
+		    (FSHIFT - CCPU_SHIFT)) / realstathz;
+#else
+		ke->ke_pctcpu += ((FSCALE - ccpu) *
+		    (ke->ke_cpticks * FSCALE / realstathz)) >>
+		    FSHIFT;
+#endif
+		ke->ke_cpticks = 0;
+	} /* end of kse loop */
+	if (awake == 0) {
+		kg->kg_slptime++;
+	} else {
+		kg->kg_slptime = 0;
+	}
+	kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
+	resetpriority(kg);
+	FOREACH_THREAD_IN_GROUP(kg, td) {
+		int changedqueue;
+		if (td->td_priority >= PUSER) {
+			/*
+			 * Only change the priority
+			 * of threads that are still at their
+			 * user priority.
+			 * XXXKSE This is problematic
+			 * as we may need to re-order
+			 * the threads on the KSEG list.
+			 */
+			changedqueue =
+			    ((td->td_priority / RQ_PPQ) !=
+			    (kg->kg_user_pri / RQ_PPQ));
+
+			td->td_priority = kg->kg_user_pri;
+			if (changedqueue &&
+			    td->td_state == TDS_RUNQ) {
+				/* this could be optimised */
+				remrunqueue(td);
+				td->td_priority =
+				    kg->kg_user_pri;
+				setrunqueue(td);
+			} else {
+				td->td_priority = kg->kg_user_pri;
+			}
+		}
+	}
+}
+
+/*
+ * Compute the priority of a process when running in user mode.
+ * Arrange to reschedule if the resulting priority is better
+ * than that of the current process.
+ */
+void
+resetpriority(kg)
+	register struct ksegrp *kg;
+{
+	register unsigned int newpriority;
+	struct thread *td;
+
+	mtx_lock_spin(&sched_lock);
+	if (kg->kg_pri_class == PRI_TIMESHARE) {
+		newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
+		    NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
+		newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
+		    PRI_MAX_TIMESHARE);
+		kg->kg_user_pri = newpriority;
+	}
+	FOREACH_THREAD_IN_GROUP(kg, td) {
+		maybe_resched(td);	/* XXXKSE silly */
+	}
+	mtx_unlock_spin(&sched_lock);
+}
+
+#if 0
+/*
+ * We adjust the priority of the current process.  The priority of
+ * a process gets worse as it accumulates CPU time.  The cpu usage
+ * estimator (p_estcpu) is increased here.  resetpriority() will
+ * compute a different priority each time p_estcpu increases by
+ * INVERSE_ESTCPU_WEIGHT
+ * (until MAXPRI is reached).  The cpu usage estimator ramps up
+ * quite quickly when the process is running (linearly), and decays
+ * away exponentially, at a rate which is proportionally slower when
+ * the system is busy.  The basic principle is that the system will
+ * 90% forget that the process used a lot of CPU time in 5 * loadav
+ * seconds.  This causes the system to favor processes which haven't
+ * run much recently, and to round-robin among other processes.
+ */
+void
+schedclock(td)
+	struct thread *td;
+{
+	struct kse *ke;
+	struct ksegrp *kg;
+
+	KASSERT((td != NULL), ("schedlock: null thread pointer"));
+	ke = td->td_kse;
+	kg = td->td_ksegrp;
+	ke->ke_cpticks++;
+	kg->kg_estcpu = ESTCPULIM(kg->kg_estcpu + 1);
+	if ((kg->kg_estcpu % INVERSE_ESTCPU_WEIGHT) == 0) {
+		resetpriority(kg);
+		if (td->td_priority >= PUSER)
+			td->td_priority = kg->kg_user_pri;
+	}
+}
+#endif
Index: kern_synch.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.185
diff -u -r1.185 kern_synch.c
--- kern_synch.c	14 Jul 2002 03:43:33 -0000	1.185
+++ kern_synch.c	16 Jul 2002 22:16:46 -0000
@@ -67,6 +67,8 @@
 #include 

+void schedcpu1(struct ksegrp *kg);	/* XXX in kern_switch.c */
+
 static void sched_setup(void *dummy);
 SYSINIT(sched_setup, SI_SUB_KICK_SCHEDULER, SI_ORDER_FIRST, sched_setup, NULL)

@@ -76,7 +78,6 @@
 static struct callout loadav_callout;
 static struct callout schedcpu_callout;
-static struct callout roundrobin_callout;

 struct loadavg averunnable =
	{ {0, 0, 0}, FSCALE };	/* load average, of runnable procs */
@@ -92,7 +93,6 @@
 static void	endtsleep(void *);
 static void	loadav(void *arg);
-static void	roundrobin(void *arg);
 static void	schedcpu(void *arg);

 static int
@@ -135,28 +135,6 @@
 }

 /*
- * Force switch among equal priority processes every 100ms.
- * We don't actually need to force a context switch of the current process.
- * The act of firing the event triggers a context switch to softclock() and
- * then switching back out again which is equivalent to a preemption, thus
- * no further work is needed on the local CPU.
- */
-/* ARGSUSED */
-static void
-roundrobin(arg)
-	void *arg;
-{
-
-#ifdef SMP
-	mtx_lock_spin(&sched_lock);
-	forward_roundrobin();
-	mtx_unlock_spin(&sched_lock);
-#endif
-
-	callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
-}
-
-/*
  * Constants for digital decay and forget:
  *	90% of (p_estcpu) usage in 5 * loadav time
  *	95% of (p_pctcpu) usage in 60 seconds (load insensitive)
@@ -225,7 +203,7 @@
 #define	decay_cpu(loadfac, cpu)	(((loadfac) * (cpu)) / ((loadfac) + FSCALE))

 /* decay 95% of `p_pctcpu' in 60 seconds; see CCPU_SHIFT before changing */
-static fixpt_t	ccpu = 0.95122942450071400909 * FSCALE;	/* exp(-1/20) */
+fixpt_t	ccpu = 0.95122942450071400909 * FSCALE;	/* exp(-1/20) */
 SYSCTL_INT(_kern, OID_AUTO, ccpu, CTLFLAG_RD, &ccpu, 0, "");

 /* kernel uses `FSCALE', userland (SHOULD) use kern.fscale */
@@ -255,105 +233,15 @@
 schedcpu(arg)
	void *arg;
 {
-	register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
-	struct thread *td;
	struct proc *p;
-	struct kse *ke;
	struct ksegrp *kg;
-	int realstathz;
-	int awake;

-	realstathz = stathz ? stathz : hz;
	sx_slock(&allproc_lock);
	FOREACH_PROC_IN_SYSTEM(p) {
		mtx_lock_spin(&sched_lock);
		p->p_swtime++;
		FOREACH_KSEGRP_IN_PROC(p, kg) {
-			awake = 0;
-			FOREACH_KSE_IN_GROUP(kg, ke) {
-				/*
-				 * Increment time in/out of memory and sleep
-				 * time (if sleeping).  We ignore overflow;
-				 * with 16-bit int's (remember them?)
-				 * overflow takes 45 days.
-				 */
-				/* XXXKSE **WRONG***/
-				/*
-				 * the kse slptimes are not touched in wakeup
-				 * because the thread may not HAVE a KSE
-				 */
-				if ((ke->ke_state == KES_ONRUNQ) ||
-				    ((ke->ke_state == KES_THREAD) &&
-				    (ke->ke_thread->td_state == TDS_RUNNING))) {
-					ke->ke_slptime++;
-				} else {
-					ke->ke_slptime = 0;
-					awake = 1;
-				}
-
-				/*
-				 * pctcpu is only for ps?
-				 * Do it per kse.. and add them up at the end?
-				 * XXXKSE
-				 */
-				ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
-				/*
-				 * If the kse has been idle the entire second,
-				 * stop recalculating its priority until
-				 * it wakes up.
-				 */
-				if (ke->ke_slptime > 1) {
-					continue;
-				}
-
-#if	(FSHIFT >= CCPU_SHIFT)
-				ke->ke_pctcpu += (realstathz == 100) ?
-				    ((fixpt_t) ke->ke_cpticks) <<
-				    (FSHIFT - CCPU_SHIFT) :
-				    100 * (((fixpt_t) ke->ke_cpticks) <<
-				    (FSHIFT - CCPU_SHIFT)) / realstathz;
-#else
-				ke->ke_pctcpu += ((FSCALE - ccpu) *
-				    (ke->ke_cpticks * FSCALE / realstathz)) >>
-				    FSHIFT;
-#endif
-				ke->ke_cpticks = 0;
-			} /* end of kse loop */
-			if (awake == 0) {
-				kg->kg_slptime++;
-			} else {
-				kg->kg_slptime = 0;
-			}
-			kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
-			resetpriority(kg);
-			FOREACH_THREAD_IN_GROUP(kg, td) {
-				int changedqueue;
-				if (td->td_priority >= PUSER) {
-					/*
-					 * Only change the priority
-					 * of threads that are still at their
-					 * user priority.
-					 * XXXKSE This is problematic
-					 * as we may need to re-order
-					 * the threads on the KSEG list.
-					 */
-					changedqueue =
-					    ((td->td_priority / RQ_PPQ) !=
-					    (kg->kg_user_pri / RQ_PPQ));
-
-					td->td_priority = kg->kg_user_pri;
-					if (changedqueue &&
-					    td->td_state == TDS_RUNQ) {
-						/* this could be optimised */
-						remrunqueue(td);
-						td->td_priority =
-						    kg->kg_user_pri;
-						setrunqueue(td);
-					} else {
-						td->td_priority = kg->kg_user_pri;
-					}
-				}
-			}
+			schedcpu1(kg);
		} /* end of ksegrp loop */
		mtx_unlock_spin(&sched_lock);
	} /* end of process loop */
@@ -944,32 +832,6 @@
 }

 /*
- * Compute the priority of a process when running in user mode.
- * Arrange to reschedule if the resulting priority is better
- * than that of the current process.
- */
-void
-resetpriority(kg)
-	register struct ksegrp *kg;
-{
-	register unsigned int newpriority;
-	struct thread *td;
-
-	mtx_lock_spin(&sched_lock);
-	if (kg->kg_pri_class == PRI_TIMESHARE) {
-		newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
-		    NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
-		newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
-		    PRI_MAX_TIMESHARE);
-		kg->kg_user_pri = newpriority;
-	}
-	FOREACH_THREAD_IN_GROUP(kg, td) {
-		maybe_resched(td);	/* XXXKSE silly */
-	}
-	mtx_unlock_spin(&sched_lock);
-}
-
-/*
  * Compute a tenex style load average of a quantity on
  * 1, 5 and 15 minute intervals.
  * XXXKSE   Needs complete rewrite when correct info is available.
@@ -1022,11 +884,9 @@
 {

	callout_init(&schedcpu_callout, 1);
-	callout_init(&roundrobin_callout, 0);
	callout_init(&loadav_callout, 0);

	/* Kick off timeout driven events by calling first time. */
-	roundrobin(NULL);
	schedcpu(NULL);
	loadav(NULL);
 }

----- End forwarded message -----

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message