Date: Tue, 25 Jan 2005 07:58:15 -0800 (PST) From: Chris Landauer <cal@rush.aero.org> To: freebsd-hackers@freebsd.org Cc: cal@rush.aero.org Subject: bug in calcru() in kernel: integer overflow computing user time Message-ID: <200501251558.j0PFwFjn029638@calamari.aero.org>
next in thread | raw e-mail | index | archive | help
hihi, all - well, i have an "almost" fix for the problem - read on, ... (this is for discussin before send-pr submission) description in FreeBSD 5.3-RELEASE (and 5.2.1R, and 5.1R), in file /usr/src/sys/kern/kern_resource.c, lines 657-750 define a function calcru(), and there is a bug in that function - line 713 in that file is uu = (tu * ut) / tt, which is trying to compute a proportion (since ut<=tt) - the problem is that for my application, the values of tt and ut are very large and the value of the intermediate product overflows, leading to incorrect values (i reported the symptoms to freebsd-hackers and a very helpful description and localization of the problem to calcru() was provided by peter jeremy, who also wrote that it is a very old, but only partly known, problem) repetition use time in csh or /usr/bin/time or getrusage() to time any program that takes more than a week to run fix (almost) it turns out that this problem can be (partly) fixed by replacing that one line above with the following lines: /* we know 0 <= ut <= tt and 1 <= tt */ if (tu >= tt) { **whatever type they need to be** q, r; q = tu / tt; r = tu % tt; uu = (q * ut) + (r * ut) / tt; } else uu = (tu * ut) / tt this change does not solve all the arithmetic problems (especially when ut is very large), and a similar change is needed for system time, computing su in line 714 of the file, but it suffices for me - analysis (yup, proof that it should work 8-)) - i expect that all of these counters are increasing throughout the life of the process - tu is total time in microseconds ut is user 128hz ticks tt is total 128hz ticks i assume therefore that tu ~ tt * 10^6/128 strictly because of the clock rates (machines with other kinds of clock rate ratios will need a different analysis, but the same idea should work there, too) - in my case ut ~ tt, so we see overflow in the old computation when tu * ut >= 2^64 tt^2 * 10^6/128 >= 2^64 tt * 10^3/8*sqrt(2) >= 2^32 ~ 4 * 10^9 tt >= 32*sqrt(2)*10^6, which is about sqrt(2)*10^6 / 4 ~ 3.54*10^5 seconds, or ~ 98 hours (this is the phenomenon i observed) in the new computation offered above, since we know that ut <= tt, we also know that uu <= tu, and (q * ut) <= uu, so there can be no overflow in the (q * ut) term - therefore, this changed code will overflow only when r is large r ~ tt, and then only when r * ut >= 2^64 tt^2 >= 2^64 tt >= 2^32, which is about ~ 2^25 seconds ~ 9300 hours ~ 388 days (i can live with that for now) for everyone else, it adds the one test in the if statement to every call to calcru() (or two, if system time is similarly fixed) - <philosophy> instrumentation is costly, and correct instrumentation is even more costly, but it's worth it, every time, to get the right answer </philosophy> i'm about to try it out this week - i will report the results when i get some data (which will be a few weeks) i'm thinking about how to solve the problem correctly for really long-running programs, but it's all pretty special case ugly so far more later, cal Dr. Christopher Landauer Aerospace Integration Science Center The Aerospace Corporation, Mail Stop M6/214 P.O.Box 92957 Los Angeles, California 90009-2957, USA e-mail: cal@aero.org Phone: +1 (310) 336-1361
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200501251558.j0PFwFjn029638>