Date: Tue, 1 Feb 2005 09:14:55 -0800 (PST) From: Chris Landauer <cal@aero.org> To: FreeBSD-gnats-submit@FreeBSD.org Subject: kern/76972: 64-bit integer overflow computing user cpu time in calcru() in kern_resource.c Message-ID: <200502011714.j11HEtRJ000644@calamari.aero.org> Resent-Message-ID: <200502011730.j11HUIlW098450@freefall.freebsd.org>
next in thread | raw e-mail | index | archive | help
>Number: 76972 >Category: kern >Synopsis: 64-bit integer overflow computing user cpu time in calcru() in kern_resource.c >Confidential: no >Severity: non-critical >Priority: low >Responsible: freebsd-bugs >State: open >Quarter: >Keywords: >Date-Required: >Class: sw-bug >Submitter-Id: current-users >Arrival-Date: Tue Feb 01 17:30:17 GMT 2005 >Closed-Date: >Last-Modified: >Originator: Chris Landauer >Release: FreeBSD 4.x <and 5.x, and all other i386> >Organization: aerospace integration science center, the aerospace corporation >Environment: System: FreeBSD all i386 <at least; probly on all others, too> >Description: the line numbers below are for 5.1RELEASE - for other releases, only the line numbers are different - the code is the same, so the problem is the same (i've observed it on 5.1R, 5.2.1R, and 5.3R systems, seen it in the code on 4.10R systems, and i have been told that the same problem occurs on all FreeBSD systems since BSD4.4-Lite (!)) in FreeBSD 5.1-RELEASE (and 5.2.1R, and 5.3R, and 4.10R), in file /usr/src/sys/kern/kern_resource.c, lines 657-750 define a function calcru() ("calculate resource usage"), and there is a bug in that function - line 713 in that file is uu = (tu * ut) / tt; which is trying to compute user cpu time (uu) as a proportion (since ut<=tt) of total cpu time (tu), based on the fraction of statclock ticks (ut / tt) that occur in user time - the problem is that for my application, the values of tt and ut (which are counts of statclock ticks) are very large and the value of the intermediate product overflows, even with 64-bit arithmetic, leading to incorrect values (i reported the symptoms to freebsd-hackers and a very helpful description and localization of the problem to calcru() was provided by peter jeremy, who also wrote that it is a very old problem, but only partly known because it is so rarely encountered) >How-To-Repeat: use time in csh or /usr/bin/time or getrusage() to time any program that takes more than a week or so of user or system cpu time to run (they all exhibit the same error of "losing" the user or system time, whichever is larger, and therefore incorrectly reporting the utilization percentage) >Fix: <mostly> it turns out that this problem can be (partly) fixed easily, by simply re-ordering the computations (this idea is due to Doug Ambrisko), replacing lines 711-714 (sorry i don't speak "patch" very well yet): /* Subdivide tu. */ uu = (tu * ut) / tt; su = (tu * st) / tt; iu = tu - uu - su; of the file with the following lines (tt is defined to be ut+st+it earlier in the file, so we want to insist that tu = uu+su+iu): /* Subdivide tu. */ su = (tu * st) / tt; iu = (tu * it) / tt; uu = tu - (su + iu); since st and it are expected to be much smaller than ut for most long programs, the overflow likelihood is greatly reduced - but i think that we can actually do much better - see the full mathematical analysis at http://www.cs.umd.edu/~cal/math/overflow-analysis.txt which also explains the value of the constant TOOBIG (yup, there is a proof there that shows how well the suggested fix *should* work 8-)) this problem can be fixed (more completely) by replacing those same lines with the following lines: /* Subdivide tu. be careful of overflow */ /* we know 0 <= st, it, ut <= tt and 1 <= tt = st+it+ut */ #define TOOBIG 48592008 /* 2^35 * sqrt(2) / 10^3 */ if (tt >= TOOBIG && tu >= tt) { u_int64_t q, r; q = tu / tt; r = tu % tt; su = (q * st) + (r * st) / tt; iu = (q * it) + (r * it) / tt; } else { su = (tu * st) / tt; iu = (tu * it) / tt; } uu = tu - (su + iu); the constant TOOBIG is designed so the calcru() function does not use the complicated form of the equations when the numbers are guaranteed to be small enough not to overflow in the intermediate products this change does not solve all the arithmetic problems (especially when either st or it is very large), but this much suffices for my needs - my assumptions for the analysis are as follows (if any of them is incorrect, please tell me): tu is total time in microseconds tt is total 128hz ticks (tt = ut + st + it) st is system 128hz ticks it is interrupt 128hz ticks ut is user 128hz ticks i assume that all of these counters are non-negative and increasing throughout the life of the process - i assume therefore that (i use "~" to mean "very roughly near") tu ~ tt * 10^6/128 strictly because of the clock rates (machines with other kinds of clock rate ratios will need a different analysis, but the same kind of idea should work there, too, possibly with different constant values for TOOBIG) - in my case ut ~ tt (i have utilizations for this particular program usually well above 90%, often 99%, for weeks at a time), so we see overflow in the old computation when tu * ut >= 2^64, which happens when tt is about ~ 3.8*10^5 seconds ~ 105.45 hours this is the phenomenon i observed visible effect of making the change for long programs (that is, programs that take more than 388 days or so of either system or interrupt cpu time), this change may not help, and will certainly not help after long enough, as shown in the analysis for medium length programs (that is, programs that take more than 105 hours or so of either system or interrupt cpu time, but less than 388 days or so each of system and interrupt cpu time), this change adds both tests from the if statement to every call to calcru(), and does in fact fix the problem for short programs (that is, programs that take less than 105 hours or so each of system and interrupt cpu time), it adds the first test in the if statement to every call to calcru(), and does not otherwise affect the correctness of the computation <philosophy> instrumentation is costly, and correct instrumentation is even more costly, but it's worth it, every time, to get the right answer </philosophy> (unless, of course, you don't want or need it 8-)) both of these changes are being tested (05jan26), and both of them pass the first error threshold - the experimental results and analysis can be found on the web http://www.cs.umd.edu/~cal/math/overflow-analysis.txt more later, cal Dr. Christopher Landauer Aerospace Integration Science Center The Aerospace Corporation, Mail Stop M6/214 P.O.Box 92957 Los Angeles, California 90009-2957, USA cal@aero.org +1 (310) 336-1361 >Release-Note: >Audit-Trail: >Unformatted:
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200502011714.j11HEtRJ000644>