Date: Sun, 30 Mar 2003 18:19:38 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Jeff Roberson
cc: David Malone, Nate Lawson, Mike Silbersack,
    Dag-Erling Smørgrav, src-committers@FreeBSD.org,
    cvs-src@FreeBSD.org, cvs-all@FreeBSD.org
Subject: Re: Checksum/copy
In-Reply-To: <20030329152001.I64602-100000@mail.chesapeake.net>
Message-ID: <20030330161319.O13783@gamplex.bde.org>

On Sat, 29 Mar 2003, Jeff Roberson wrote:

> On Sat, 29 Mar 2003, Bruce Evans wrote:
> > ...
> > - SCHED_ULE breaks scheduling of idleprio processes.  This results in
> >   pagezero being too active.  It costs 1-2% instead of saving 1-2%.
> >   [1-2% for makeworld].
>
> Thanks for the analysis.  I know my +nice values are screwy right now.
> It's actually a pretty interesting problem.  Perhaps you'll have some
> insight.

idleprio is a little different from nice.  I would have expected it to
work right fairly automatically.  At least with the standard scheduler,
idleprio gives td->td_priority values which are lower (numerically
higher) than the values for all non-idleprio processes, so an idleprio
process should never run while there is a runnable non-idleprio
process.

> The basic issue is that threads on run queues in ule must be given a
> slice.  And with a slice, ignoring interactive threads, they are run
> every n times we select a new thread, where n is the number of running
> threads.  That means two things:
>
> 1) +nice threads always get a chance to run.
> 2) Their %cpu is relative to the sum of all slices of all running
>    threads.
>
> #2 is sort of what you want, except that the slice value never reaches
> zero.  In sched_4bsd, if you have a nice priority that is 20 away from
> the lowest priority processes you never get a chance to run.  I'm not
> sure if this scales all the way across.

It scales poorly.  In RELENG_4 or a bit earlier, I changed the scaling
to make a nice difference of precisely 20 or more prevent the
higher-niced process from running.  This gives potential priority
inversion problems even for user nice values [0, 20].  RELENG_4 also
has bogus scaling of system nice values [-20, -1] into what should be
a kernel-only range.

> I know a nice 0 will always prevent a nice 20 from running, but will a
> nice -20 prevent a nice 0 from running?  I believe so.  With nice +19
> and a nice 0 the nice +19 gets approx 2% cpu.
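To make that RELENG_4 cutoff concrete, here is a rough sketch (the
function name is made up; this is not the actual RELENG_4 code):

#include <sys/resource.h>	/* PRIO_MIN, PRIO_MAX */

/*
 * Return nonzero if a thread at nice value "nice" should get no CPU
 * at all.  "least_nice" is the numerically smallest nice value among
 * the runnable timesharing threads.
 */
static int
nice_cutoff(int nice, int least_nice)
{

	/* A difference of 20 (== PRIO_MAX) or more means "never run". */
	return (nice - least_nice >= PRIO_MAX);
}

Under this rule, nice -20 vs. nice +0 is a difference of exactly 20, so
the nice +0 process is shut out entirely, while nice +0 vs. nice +19 is
a difference of 19, so the nice +19 process still gets a small share.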
This seems about right for RELENG_4.  In -current, the priority
inversion problems and the bogus scaling were "fixed" by changing the
scaling so that the difference between nice -20 and nice +20 is
approximately the same as the old difference between nice +0 and nice
+19, so the difference between nice +0 and nice +20 became too small
again (it shouldn't be infinite, since that gives potentially fatal
priority inversion, but it should be large).  Subsequent compression of
the PRI_TIMESHARE range reduced the difference further, so we're back
to approximately the old difference of only 2:1 or 3:1 for nice +0 vs.
nice +20.

> So, in ule, I need a way to approximate this.  The real problem is
> that the drop off point where a process gets 0 cpu time is artificial.
> The algorithm doesn't work linearly down to 0 as it does in
> sched_4bsd.  I need to make slice assignments relative to all other
> processes in the system.  This seems like it may break the O(1)
> properties of the scheduler.

It doesn't actually work linearly in sched_4bsd either.  sched_4bsd
clamps kg_estcpu so that the scaling of kg_estcpu to a priority can be
linear.  This gives huge nonlinearities, especially in the following
cases:

- fork and exit.  kg_estcpu is inherited on fork (bad) and added from
  the child back to the parent in exit (worse).  This wants to cause
  sorcerer's-apprentice behaviour, but the clamping makes it cause
  nonlinearities instead (kg_estcpu ramps up to the limit much faster
  than it should and then sticks there much longer than it should).

- conditions where the load average fluctuates a lot.  When the load
  average is large, priority decay is slow and processes can more
  easily hit the limit, especially when most of the processes that
  caused the large load average have just exited.

Clamping also discards history, so hog processes are given lower
(numerically higher) priorities than they should be, even after the
conditions that cause the nonlinearities have gone away.

> I'm just now thinking that I could assign the slice using the run
> queues to find out how this thread relates to others in the system.
> This all sounds rather complicated.  I'm hoping that I'm missing some
> simple elegant solution that someone may know of.
>
> Any takers?  Comments on nice or slice selection?

Peter Dufault's solution for sched_4bsd, as implemented by me, is to
remove the explicit clamp on kg_estcpu (it is bounded algorithmically
by priority decay, except for the fork+exit problem, which we just hack
around), and then to map kg_estcpu to td_priority more carefully.  In
my version of the latter, each tick in kg_estcpu costs more if the nice
value is numerically larger, according to a table:

static int niceweights[PRIO_MAX - PRIO_MIN + 1] = {
#if 1
	/*
	 * Geometric niceness.  The weight at index i is
	 * floor(2 * 3 * pow(2.0, i / 4.0) + 0.5).
	 */
	6, 7, 8, 10, 12, 14, 17, 20, 24, 29, 34, 40, 48, 57, 68, 81,
	96, 114, 136, 161, 192, 228, 272, 323, 384, 457, 543, 646,
	768, 913, 1086, 1292, 1536, 1827, 2172, 2583, 3072, 3653,
	4344, 5166, 6144,
#else
	/*
	 * Arithmetic niceness.  The weight at index i is
	 * 2 * 2 * 2 * 3 * 3 * 5 * 7 / (40 - i)
	 * (except the one at index 40 is an approximation for infinity).
	 */
	63, 64, 66, 68, 70, 72, 74, 76, 78, 81, 84, 86, 90, 93, 96,
	100, 105, 109, 114, 120, 126, 132, 140, 148, 157, 168, 180,
	193, 210, 229, 252, 280, 315, 360, 420, 504, 630, 840, 1260,
	2520, 20000,
#endif
};

Almost any behaviour can be programmed by changing the table.
(Infinite niceness can't be programmed easily, but that's a feature,
since it prevents priority inversion.)
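As a rough sketch of how such a table can be applied (the names here
are made up, not the ones in my patch), each statclock tick charged to
a thread is weighted by the table entry for its nice value before it
is added to the estcpu-like accumulator:

#include <sys/types.h>
#include <sys/resource.h>	/* PRIO_MIN, PRIO_MAX */

/* Hypothetical per-thread accumulator, analogous to kg_estcpu. */
struct sched_stats {
	u_int	weighted_estcpu;
};

/*
 * Charge one statclock tick to a thread with the given nice value,
 * weighted by the niceweights[] table above so that a tick costs
 * more for nicer (numerically larger nice) threads.
 */
static void
charge_tick(struct sched_stats *ss, int nice)
{

	ss->weighted_estcpu += niceweights[nice - PRIO_MIN];
}

With the geometric table, every 4 steps of niceness doubles the cost of
a tick, so a nice +20 hog accumulates weighted cpu 1024 times as fast
as a nice -20 one.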
To use this method in other schedulers, I think you only need to have
some idea of the time spent in each thread.  Then weight the times
according to the scheduling policies for the threads.  Then scale
weighted times to priorities.

Scaling is the tricky part for sched_4bsd and probably for most
schedulers.  kg_estcpu can get very large, and it has a large variance.
I just compute its maximum in schedcpu and use a value a little larger
than the maximum for scaling.  Sometimes the scale factor changes
significantly (e.g., when all of the processes with a large kg_estcpu
exit), and then most processes move to a different scheduling queue.
This makes the scheduler more O(n) than before but doesn't affect
performance on any of the loads that I have tested.  It seems to be
difficult to keep the scale factor almost constant without
reintroducing the nonlinearities or significantly increasing response
times.

Bruce
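P.S.  A rough sketch of that scaling step (the names and the slack
factor are illustrative, not the actual code): recompute the maximum in
the periodic scan, then map each thread's weighted value linearly into
the timeshare priority range.

#include <sys/types.h>
#include <sys/priority.h>	/* PRI_MIN_TIMESHARE, PRI_MAX_TIMESHARE */

static u_int estcpu_scale;	/* kept a little above the observed max */

/* Called from the periodic scan after visiting every thread. */
static void
update_scale(u_int observed_max)
{

	estcpu_scale = observed_max + observed_max / 8 + 1;
}

/* Map a weighted estcpu value linearly into the timeshare range. */
static u_char
estcpu_to_pri(u_int weighted_estcpu)
{
	u_int pri;

	pri = PRI_MIN_TIMESHARE + (PRI_MAX_TIMESHARE - PRI_MIN_TIMESHARE) *
	    weighted_estcpu / estcpu_scale;
	/* Values can outrun the scale between scans; clamp them. */
	if (pri > PRI_MAX_TIMESHARE)
		pri = PRI_MAX_TIMESHARE;
	return (pri);
}

When estcpu_scale drops a lot (the case described above), most threads
map to different priorities, which is where the extra O(n) work comes
from.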