Date: Thu, 28 Feb 2019 16:27:30 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: ian@FreeBSD.org
Cc: bugs@freebsd.org
Subject: Re: [Bug 236096] top shows WCPU numbers greater than 100 percent when using SCHED_BSD
Message-ID: <20190228140651.J990@besplex.bde.org>
In-Reply-To: <bug-236096-227@https.bugs.freebsd.org/bugzilla/>
References: <bug-236096-227@https.bugs.freebsd.org/bugzilla/>

On Thu, 28 Feb 2019 a bug that doesn't want replies@freebsd.org wrote:

> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236096
>
> After switching from SCHED_ULE to SCHED_4BSD I immediately noticed that top

Congratulations on the switch.  SCHED_ULE is slightly better, but I use my
version of SCHED_4BSD and have fixed it to work slightly better than
SCHED_ULE in cases that I care about.  Scheduling is unimportant in most
cases, since under light loads on SMP systems it is easy to find a spare CPU
and under heavy loads it is impossible to find a spare CPU and hard to do
better than choose a non-spare one at random.

> displays wildly inaccurate numbers for WCPU. If you switch the display to
> un-weighted CPU the numbers are mostly right (rarely you'll see a 101% type
> number). This is on a 6-core/12-thread system running make universe:

This was broken almost 5 years ago in r266906.  r267685 is supposed to fix
the percentages going over 100%, but the percentages are still garbage for
4BSD and not too good for ULE and often go over 100% for at least 4BSD.

> last pid: 93675; load averages: 11.55, 12.01, 11.34 up 0+09:47:22 18:07:54
> 277 processes: 12 running, 265 sleeping
> CPU: 0.3% user, 71.7% nice, 5.8% system, 0.1% interrupt, 22.2% idle
> Mem: 2727M Active, 6622M Inact, 144M Laundry, 1978M Wired, 1174M Buf, 533M Free
> Swap: 32G Total, 32G Free
>
>   PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME     WCPU COMMAND
> 93198 ilepore       1  95   20   101M    71M CPU6     6   0:02 1669.79% cc
> 93380 ilepore       1  95   20    80M    50M RUN      8   0:01 1092.95% cc
> 93366 ilepore       1  95   20    80M    51M CPU2     2   0:01 1090.72% cc
> 93381 ilepore       1  95   20    80M    50M CPU0     0   0:01 1075.67% cc
> 93365 ilepore       1  95   20    80M    50M CPU10   10   0:01 1075.13% cc
> 93367 ilepore       1  95   20    80M    50M CPU5     5   0:01 1031.28% cc
> 93378 ilepore       1  95   20    80M    50M CPU9     9   0:01 1027.59% cc
> 93379 ilepore       1  95   20    80M    50M CPU7     7   0:01 1026.13% cc

The bug is essentially division by 0.
More precisely, it is division by (1 - exp(k*t)), where k is not Boltzmann's
constant (it is log(ccpu)) and t is time.  This division corrects from raw
CPU to weighted CPU.  This is not really valid for ULE, and ULE mis-emulates
it by setting ccpu to 0.  log(0) is -Infinity, so when t is 0 the result is
NaN and the percentage is displayed as something like "nan", but for other t
exp(k*t) is 0 so the divisor is 1 and the conversion is null.  WCPU is just
worse than useless for ULE, since it is the same as CPU except when it is
NaN.  This gave the original bug of WCPU taking a long time to ramp up to
100% for a thread that uses 100%.

For 4BSD, ccpu is about 0.95, represented as a scaled (fixed-point) integer.
This corresponds to raw CPU decaying by 95% in 60 seconds.  This is not
broken, but raw CPU is broken by making it almost the same as the correct
WCPU using a bad method, so that applying the correct conversion makes it
start at nearly infinity and stay above 100% for a long time (for 100% actual
use, about 105% after 1 minute by dividing by 0.95).

Further details for 4BSD: when t is 0, (1 - exp(k*t)) is 0, so there must be
a test somewhere to avoid dividing by this.  This test apparently doesn't or
didn't work for avoiding NaNs for ULE.  I observed the NaNs but didn't check
the code.  A special case for t = 0 would work for both, but the
apparently-more-robust check for (1 - exp(k*t)) != 0 fails when the LHS is
NaN.  When t is small, CPU must also be small so that division by nearly 0
doesn't give much larger percentages than 100%.  E.g, if

The main breakage in r266906 was to ignore the kernel's long-term average
("raw") %CPU (except initially) and use the average over the last top update
interval.  This gives the following observable bugs:
- %CPU is broken for its reason for existence of showing the raw kernel %CPU
- ps doesn't have this bug, so %CPU in top is inconsistent with %CPU in ps
- there is little documentation about this, and what there is is misleading.
  ps says that -C gives a "raw" CPU that ignores "resident" time and that
  this normally has no effect.  Actually, the "resident" time has almost no
  effect, but -C gives a very large difference by _not_ dividing by
  (1 - exp(k*t)).  For ULE, however, the details are different.  There -C
  really does normally have no effect, since k is broken.
- NaNs sometimes.  I might have only seen them for ps.
- garbage %WCPU for 4BSD.  You can almost recover by never using %WCPU.  Use
  only %CPU.  It is similar to what %WCPU should be.
- lots of jitter in %CPU and %WCPU.  It is impossible for it to be very
  accurate since it is measured over a short interval.  For a long-lived
  thread taking 100% CPU, the displayed %*CPU is often off by +- a few
  percent, while the kernel's long-term average is stable at nearly 100%
  (a bit below that for 4BSD since 4BSD often reduces it by 5%).

It might be useful to display transients, but this should be on a separate
display named something like %TCPU.  Transients are bad for most purposes,
especially for sorting on %*CPU, since they change the values and/or order
a lot.  The result would be:
- %CPU shows kernel CPU
- %WCPU shows weighted CPU.  This needs to be fixed for ULE.  For ULE, the
  raw CPU is mis-emulated and is more like WCPU, but it doesn't ramp up as
  fast as it should for WCPU, giving the original bug.  A correct emulation
  would emulate 4BSD's raw CPU, with ccpu probably different but not 0, but
  it would be better for ULE's CPU to be fully raw and do better conversions
  to WCPU in userland (don't use the (1 - exp(k*t)) factor for ULE).  ULE
  doesn't keep as much history as 4BSD, but it keeps some, so raw CPU is
  small initially when WCPU should be 100%.
- %TCPU shows transient CPU.

The following quick fix is enough for 4BSD:

XX Index: machine.c
XX ===================================================================
XX --- machine.c	(revision 341138)
XX +++ machine.c	(working copy)
XX @@ -661,7 +661,7 @@
XX  {
XX          const struct kinfo_proc *oldp;
XX  
XX -        if (previous_interval != 0) {
XX +        if (0 && previous_interval != 0) {
XX                  oldp = get_old_proc(pp);
XX                  if (oldp != NULL)
XX                          return ((double)(pp->ki_runtime - oldp->ki_runtime)

This just never uses the transient CPU for any scheduler.  So %WCPU and %CPU
work correctly as in old versions for 4BSD, and %CPU works almost correctly
as in old versions for ULE (it is just not raw enough, and most users don't
understand that it is raw), and %WCPU is broken as in old versions for ULE
(it is just the same as %CPU, so is not what users should expect).

I asked the author of the bug to fix it about a year ago, and provided a
different long explanation than the above and a less-quick fix:

XX Index: machine.c
XX ===================================================================
XX --- machine.c	(revision 331608)
XX +++ machine.c	(working copy)
XX @@ -89,6 +89,7 @@
XX  
XX  /* define what weighted cpu is.  */
XX  #define weighted_cpu(pct, pp) ((pp)->ki_swtime == 0 ? 0.0 : \
XX +                         sched_ule ? (pct) : \
XX                           ((pct) / (1.0 - exp((pp)->ki_swtime * logcpu))))
XX  
XX  /* what we consider to be process size: */

Also, don't waste time calculating 1 using exp(-Inf) for the WCPU &&
SCHED_ULE case.

XX @@ -147,6 +148,7 @@
XX  /* these are retrieved from the kernel in _init */
XX  
XX  static load_avg  ccpu;
XX +static int sched_ule;
XX  
XX  /* these are used in the get_ functions */
XX  
XX @@ -331,6 +333,7 @@
XX          boolean_t carc_en;
XX          size_t size;
XX          struct passwd *pw;
XX +        char name[4];
XX  
XX          size = sizeof(smpmode);
XX          if ((sysctlbyname("machdep.smp_active", &smpmode, &size,
XX @@ -365,6 +368,10 @@
XX          if (kd == NULL)
XX                  return (-1);
XX  
XX +        size = sizeof(name);
XX +        sched_ule = (sysctlbyname("kern.sched.name", &name[0], &size,
XX +            NULL, 0) == 0 && strcmp(name, "ULE") == 0);
XX +
XX          GETSYSCTL("kern.ccpu", ccpu);
XX  
XX          /* this is used in calculating WCPU -- calculate it ahead of time */
XX @@ -715,6 +722,13 @@
XX   * If there was a previous update, use the delta in ki_runtime over
XX   * the previous interval to calculate pctcpu.  Otherwise, fall back
XX   * to using the kernel's ki_pctcpu.
XX + *
XX + * XXX: the kernel's ki_pctcpu is the correct one, but we don't know
XX + * how to scale it to WCPU for SCHED_ULE (we used to scale by SCHED_4BSD's
XX + * factor or 1/(1-exp(k*t) where k = log(ccpu) in all cases.  For
XX + * SCHED_ULE ccpu is 0 so k is -infinity and the factor is 1 which
XX + * doesn't do too much damage).  Actually only clobber the kernel's
XX + * value for SCHED_ULE && WCPU.
XX   */
XX  static double
XX  proc_calc_pctcpu(struct kinfo_proc *pp)
XX @@ -721,7 +735,7 @@
XX  {
XX          const struct kinfo_proc *oldp;
XX  
XX -        if (previous_interval != 0) {
XX +        if (sched_ule && ps.wcpu && previous_interval != 0) {
XX                  oldp = get_old_proc(pp);
XX                  if (oldp != NULL)
XX                          return ((double)(pp->ki_runtime - oldp->ki_runtime)

This doesn't go as far as adding %TCPU or fixing %WCPU properly for ULE, or
fixing ps and other utilities, or fixing the documentation.  top(1)
misdocuments WCPU by saying that it is the weighted CPU and is the same as
ps displays.
It doesn't say what weighting is, or document that this is
scheduler-dependent with different bugs or that the bugs make this not the
same as ps displays.

ps should have an option to display transient %CPU too, and should have
keywords to select cpu, wcpu and tcpu.  It is otherwise more programmable
than top, so can display all of these in narrow displays by omitting other
columns.  Its -C option switches from WCPU (keyword %cpu; column header %CPU)
to CPU without even changing the name in the header.  ps already supports the
confusingly similar keyword "cpu" (column header CPU).  This is even rawer
than actual %CPU, so it shouldn't be in any standard displays, but it is in
"ps l" output while %[W]CPU is not.

CPU is 0 for all processes on freefall now.  That is another bug in ULE.
ULE doesn't use the ts_estcpu variable or logic -- that is 4BSD-specific.
ULE doesn't emulate this either, but always returns 0 in sched_estcpu().
ULE does emulate %CPU.  Its internal state for this is just ts_tick, which
is similar to ts_estcpu in 4BSD (but rawer and averaged over a default of
only 10 seconds instead of retaining 5% after 1 minute for 4BSD).  Returning
this in sched_estcpu() would make sense, but users wouldn't know how to
scale it and it doesn't provide any more information than %CPU, so this
would mainly confuse users by putting strange nonzero values in ps l output.

I now remember trying to fix %CPU for ULE without knowing or wanting to know
much about ULE.  I tried using scaling by (1 - exp(k*t)) to adjust %CPU down
and %WCPU up.  Nothing worked.  The reasons are now clearer.  Since ts_tick
is only an average over 10 seconds, exponential scaling just doesn't apply
to it.  But it seems to ramp up much like %CPU in 4BSD.  That must be just
that after 1 second out of 10, 100% of CPU is seen as only 10% CPU.
Userland should see this 10% and scale it to 100%.  But the API (mostly
struct kinfo) is still specified for 4BSD, so it doesn't contain information
needed for this scaling.
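
The ramp-up arithmetic for a fixed averaging window can be sketched as
below.  This is a hypothetical helper, not anything in the kernel or in top.
It assumes the caller somehow knows both the window length and how long the
thread has existed, which, as noted, the current API does not provide.

```c
#include <assert.h>

/*
 * Hypothetical helper, not an existing kernel or top interface.
 * raw_pct: %CPU averaged over a fixed window of `window` seconds.
 * age:     seconds the thread has existed (assumed to be known).
 */
double
scale_windowed_pct(double raw_pct, double age, double window)
{
	assert(window > 0.0 && age >= 0.0);
	if (age == 0.0)
		return (0.0);		/* nothing measured yet */
	if (age >= window)
		return (raw_pct);	/* window is full, no correction */
	/* Only age/window of the window has been filled so far. */
	return (raw_pct * window / age);
}
```

For the case in the text, a thread that has used 100% of a CPU for 1 second
of a 10-second window reports 10%, and scale_windowed_pct(10.0, 1.0, 10.0)
gives 100.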
The kernel may as well do this scaling itself.  Then its value of 0 for ccpu
would be correct (it means that %CPU is already scaled).

Averaging over 10 seconds in the kernel gives lots of jitter too.  It gives
even more jitter by using tick counts instead of precise runtimes.  Yet
somehow it seems to give less jitter than top with an interval of 1 or 2.
Try top with an interval of 0 to see enormous jitter.  The actual interval
is slightly larger than 0, so top rarely falls back to the old method to
avoid division by 0.  Instead it sees sharp transients.  top should support
more useful short intervals like 0.1, but it misparses 0.1 as 0.  The
interval of 0 is restricted to root.  This is bogus, since anyone can use
even more resources by execing top in a loop.

Bruce