From owner-freebsd-current@FreeBSD.ORG Mon Jun 21 05:01:42 2004 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A518116A4CE; Mon, 21 Jun 2004 05:01:42 +0000 (GMT) Received: from mailout2.pacific.net.au (mailout2.pacific.net.au [61.8.0.85]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2962C43D58; Mon, 21 Jun 2004 05:01:42 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au [61.8.0.87])i5L51e5v008152; Mon, 21 Jun 2004 15:01:41 +1000 Received: from gamplex.bde.org (katana.zip.com.au [61.8.7.246]) i5L51cnl031991; Mon, 21 Jun 2004 15:01:39 +1000 Date: Mon, 21 Jun 2004 15:01:38 +1000 (EST) From: Bruce Evans X-X-Sender: bde@gamplex.bde.org To: Don Lewis In-Reply-To: <200406210258.i5L2wTKF047144@gw.catspoiler.org> Message-ID: <20040621132119.Q8596@gamplex.bde.org> References: <200406210258.i5L2wTKF047144@gw.catspoiler.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII cc: threads@freebsd.org cc: rwatson@freebsd.org cc: current@freebsd.org Subject: Re: calcru: negative time ... followed by freeze X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 21 Jun 2004 05:01:42 -0000 On Sun, 20 Jun 2004, Don Lewis wrote: > On 20 Jun, Robert Watson wrote: > > FYI, this is a Xeon box with two physical processors, each with two > > logical processors, and the problem could well be SMP-related. > > The hang might be SMP-related, but I just started getting the calcru > messages on my UP Athlon XP box. I just upgraded to today's > -CURRENT from a week-old version, and my console got spammed by a bunch > of these messages while I was running portupgrade. > > calcru: negative time of 0 usec for pid 21261 (sh) > calcru: negative time of 0 usec for pid 21261 (sh) > ... > calcru: negative time of 3917 usec for pid 49504 (sh) > calcru: negative time of 3917 usec for pid 49504 (sh) Hmm, these are nonnegative times. The extra messages in -current might be caused by fill_kinfo() now calling calcru(). Misreporting of negative times is by the following too-simple dianostic: % tu = (u_int64_t)tv.tv_sec * 1000000 + tv.tv_usec; % ptu = p->p_uu + p->p_su + p->p_iu; % if (tu < ptu || (int64_t)tu < 0) { % printf("calcru: negative time of %jd usec for pid %d (%s)\n", % (intmax_t)tu, p->p_pid, p->p_comm); % tu = ptu; % } The message is also printed for the tu < ptu case, which is what you are getting. I fixed the messages when I got a lot of them due to a local bug. The local bug was from double rounding for calcru() on child times (which -current doesn't do). I can't see how this could be the problem in -current, since the components of ptu are rounded down and there is a KASSERT that they added up to less than tu in the previous call. Ah, here is a likely cause of the bug in -current: % if (p == curthread->td_proc) { % /* % * Adjust for the current time slice. This is actually fairly % * important since the error here is on the order of a time % * quantum, which is much greater than the sampling error. % * XXXKSE use a different test due to threads on other % * processors also being 'current'. % */ % binuptime(&bt); % bintime_sub(&bt, PCPU_PTR(switchtime)); % bintime_add(&bt, &p->p_runtime); % } else % bt = p->p_runtime; The XXXKSE comment is correct that this might be broken. If the (p != curthread->td_proc) case happens at all for a running process, then it gives a wrong (out of date) timestamp in bt. This wrongness will be detected if calcru() is was called called earlier in the current timeslice and took the other path here. The recent change to fill_kinfo() is quite likely to trigger detection of this bug. fill_kinfo() is often used to iterate over all processes for ps, so it will call calcru() with (p != curthread->td_proc) for all processes other than the one running it, and give a bt that is out of date for all such processes that are actually running. Since there can be at most one running process per CPU, this bug only affects SMP. The call to calcru() from ttyinfo() may be the only other trigger. ttyinfo() picks a process and should rarely or never pick the ithread running it, so it will almost always take the (p != curthread->td_proc) path. Again, this is only a problem for the SMP case since in the !SMP case the picked process must have been switched away from to run the ithread, so it cannot be running. Bruce