From: John Baldwin <jhb@FreeBSD.org>
To: arch@FreeBSD.org
Date: Fri, 1 Oct 2004 11:02:43 -0400
User-Agent: KMail/1.6.2
MIME-Version: 1.0
Content-Disposition: inline
Content-Type: text/plain;
  charset="us-ascii"
Content-Transfer-Encoding: 7bit
Message-Id: <200410011102.43394.jhb@FreeBSD.org>
X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on server.baldwin.cx
Subject: [PATCH] Rework how we store process times in the kernel and
	deferring calcru()
List-Id: Discussion related to FreeBSD architecture
	<freebsd-arch.freebsd.org>

I'll commit this soonish unless there are any objections.  The basic idea is 
to store process resource usage times as raw data (i.e. as bintimes and tick 
counts) for both process usage and child usage, and to calculate the 
timeval-style times only when they are explicitly asked for.  This lets us 
avoid always calling calcru() to calculate the timeval values in exit1(), for 
example.  A more detailed listing of the changes follows:

- Fix the various kern_wait() syscall wrappers to only pass in a rusage
  pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
  don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
  times it needs rather than calling getrusage() twice with associated
  stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
  for user, system, and interrupt time as well as a bintime of the total
  runtime.  A new p_rux field in struct proc replaces the same inline fields
  from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime).  A new p_crux
  field in struct proc contains the "raw" child time usage statistics.
  ruadd() has been changed to handle adding the associated rusage_ext
  structures as well as the values in rusage.  Effectively, the values in
  rusage_ext replace the ru_utime and ru_stime values in struct rusage.  These
  two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
  calculates appropriate timevals for user and system time as well as updating
  the rux_[isu]u fields of a passed in rusage_ext structure.  calcru() uses a
  copy of the process' p_rux structure to compute the timevals after updating
  the runtime appropriately if any of the threads in that process are
  currently executing.  This also includes an additional fix so that calcru()
  now correctly handles threads from the process that are executing on other
  CPUs.  Also, calcru() now locks sched_lock only internally, while doing
  the rux_runtime fixup.  It now requires the caller to hold only the proc
  lock, and calcru1() requires only the proc lock internally.  calcru() also no
  longer allows callers to ask for an interrupt timeval since none of them
  actually did.
- A new calccru() function computes the child system and user timevals by
  calling calcru1() on p_crux.  Note that this means that any code that wants
  child times must now call this function rather than reading from p_cru
  directly.  This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
  in exit1() and kern_wait() are now gone.
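
To make the raw-versus-derived split above a bit more concrete, here is a
minimal userland sketch, not the code from the patch.  The struct name, the
microsecond runtime field (the patch stores a bintime) and both helper names
are assumptions made up for illustration; only the idea of keeping tick counts
plus a total runtime, summing them ruadd()-style, and converting to timevals
on request comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical stand-in for the raw stats; not the kernel's rusage_ext. */
struct rux_sketch {
	uint64_t rux_runtime_us;	/* total runtime (the patch uses a bintime) */
	uint64_t rux_uticks;		/* ticks observed in user mode */
	uint64_t rux_sticks;		/* ticks observed in system mode */
	uint64_t rux_iticks;		/* ticks observed in interrupt mode */
};

/* Fold one set of raw stats into another, e.g. a child's into its parent's. */
static void
rux_add(struct rux_sketch *dst, const struct rux_sketch *src)
{
	dst->rux_runtime_us += src->rux_runtime_us;
	dst->rux_uticks += src->rux_uticks;
	dst->rux_sticks += src->rux_sticks;
	dst->rux_iticks += src->rux_iticks;
}

/*
 * Derive user and system timevals on demand by splitting the total
 * runtime in proportion to the tick counts.  Interrupt ticks take their
 * share of the runtime, but no timeval is produced for them.
 * The multiplications can overflow for very large inputs; a real
 * implementation would have to guard against that.
 */
static void
rux_to_timevals(const struct rux_sketch *rux, struct timeval *up,
    struct timeval *sp)
{
	uint64_t ut, st, tt, uu, su;

	assert(rux != NULL && up != NULL && sp != NULL);
	ut = rux->rux_uticks;
	st = rux->rux_sticks;
	tt = ut + st + rux->rux_iticks;
	if (tt == 0) {
		/* No ticks recorded: charge the whole runtime to system time. */
		st = 1;
		tt = 1;
	}
	uu = rux->rux_runtime_us * ut / tt;
	su = rux->rux_runtime_us * st / tt;
	up->tv_sec = (time_t)(uu / 1000000);
	up->tv_usec = (suseconds_t)(uu % 1000000);
	sp->tv_sec = (time_t)(su / 1000000);
	sp->tv_usec = (suseconds_t)(su % 1000000);
}
```

With this layout an exiting process only needs something like rux_add() to
hand its stats to the parent; the division is deferred until someone actually
asks for timevals.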

As a side effect of storing the raw values, the accuracy of the process timing 
has been improved.  This makes benchmarking somewhat tricky, since with this 
patch user times appear to go way up while system times go way down.  
Thus, the only benchmarks I did were to compare real times and to 
compare the sum of the user and system times to the real times.  Here 
are the results on a kernel w/o debugging (when WITNESS + INVARIANTS were on, 
the extra overhead resulted in no statistical difference between before and 
after).  For real times (100 runs of 10000 fork/wait loops):

x smpng.fast.real
+ proc.fast.real
+--------------------------------------------------------------------------+
|                  +                                                       |
|                  +                                                       |
|                  +   +                                                   |
|                  +   +                                                   |
|                  +   +                                                   |
|                  +   +                                                   |
|                  +   +                                                   |
|                  +   +                 x   x                             |
|                  +   +                 x   x                             |
|                  +   +                 x   x                             |
|                  +   +                 x   x                             |
|                  +   +              x  x   x                             |
|                  +   +              x  x   x                             |
|                  +   +              x  x   x                             |
|                  +   +              x  x   x                             |
|                  +   +              x  x   x  x                          |
|               +  +   +              x  x   x  x                          |
|               +  +   +              x  x   x  x                          |
|               +  +   +              x  x   x  x                          |
|               +  +   +              x  x   x  x                          |
|               +  +   +          x   x  x   x  x                          |
|               +  +   +          x   x  x   x  x                          |
|               +  +   +   +      x   x  x   x  x                          |
|               +  +   +   +      x   x  x   x  x                          |
|               +  +   +   +      x   x  x   x  x   x                      |
|               +  +   +   +  +   *   x  x   x  x   x                      |
|           +   +  +   +   +  +   *   x  x   x  x   x                      |
|           +   +  +   +   +  +   *   x  x   x  x   x                      |
|       +   +   +  +   +   +  +   *   *  x   x  x   x              x       |
|+      +   +   +  +   +   +  +   *   *  *   x  x   x   x          x      x|
|              |___M__A_____|       |____M_A______|                        |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x 100          2.97          3.08          2.99        2.9959   0.018968075
+ 100          2.88          2.99          2.93        2.9362   0.017568337
Difference at 95.0% confidence
        -0.0597 +/- 0.0050674
        -1.99272% +/- 0.169145%
        (Student's t, pooled s = 0.0182816)
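
For anyone who wants to check the summary lines by hand, the following sketch
recomputes the pooled standard deviation and the half-width of the confidence
interval from the N, Avg and Stddev columns above.  The function names are
made up for illustration, and the critical value 1.960 is an assumption (the
large-sample 95% value) that happens to reproduce the printed interval; the
tool that produced the output may pick its value differently.

```c
#include <assert.h>

/*
 * Newton iteration for sqrt(x), x >= 0.  Used only so the sketch builds
 * without linking libm; sqrt() from <math.h> would do equally well.
 * The fixed iteration count is adequate for the small inputs used here.
 */
static double
sqrt_newton(double x)
{
	double r = x > 1.0 ? x : 1.0;
	int i;

	for (i = 0; i < 64; i++)
		r = 0.5 * (r + x / r);
	return (r);
}

/* Pooled standard deviation of two samples, from their sizes and stddevs. */
static double
pooled_stddev(int n1, double s1, int n2, double s2)
{
	assert(n1 > 1 && n2 > 1);
	return (sqrt_newton(((n1 - 1) * s1 * s1 + (n2 - 1) * s2 * s2) /
	    (n1 + n2 - 2)));
}

/* Half-width of the confidence interval for the difference of two means. */
static double
diff_half_width(double t_crit, double sp, int n1, int n2)
{
	return (t_crit * sp * sqrt_newton(1.0 / n1 + 1.0 / n2));
}
```

Calling pooled_stddev(100, 0.018968075, 100, 0.017568337) gives roughly
0.01828, and diff_half_width(1.960, sp, 100, 100) gives roughly 0.00507, in
line with the figures printed above.  The difference itself is just
2.9362 - 2.9959 = -0.0597, or about -1.99% of 2.9959.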

So, close to a 2% improvement.  As for accuracy "improvements", the numbers 
comparing the sum of user + sys to "real" time on the kernel without the 
patch are:

x smpng.fast.real
+ smpng.fast.total
    N           Min           Max        Median           Avg        Stddev
x 100          2.97          3.08          2.99        2.9959   0.018968075
+ 100          2.83          2.93          2.86        2.8601   0.016111668
Difference at 95.0% confidence
        -0.1358 +/- 0.0048779
        -4.53286% +/- 0.162819%
        (Student's t, pooled s = 0.0175979)

And for the kernel with the patch:

x proc.fast.real
+ proc.fast.total
    N           Min           Max        Median           Avg        Stddev
x 100          2.88          2.99          2.93        2.9362   0.017568337
+ 100          2.85          2.96          2.92        2.9201   0.017551943
Difference at 95.0% confidence
        -0.0161 +/- 0.00486742
        -0.548328% +/- 0.165773%
        (Student's t, pooled s = 0.0175601)

Thus, the total counts are closer to the real times with the patch than 
without the patch.  Given that these results were repeated numerous times 
with different benchmarks on an idle box in the same state, I feel that these 
differences indicate an improvement in the accuracy of the accounting.
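
For reference, a fork/wait loop of the kind measured above can be sketched as
follows.  This is not the benchmark program that produced the numbers in this
mail; the function names are hypothetical, and the choice of getrusage() plus
a monotonic clock is an assumption about how one might collect the "real" and
"total" (user + sys) figures.

```c
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() under strict -std */
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Seconds represented by a struct timeval. */
static double
tv_seconds(const struct timeval *tv)
{
	return ((double)tv->tv_sec + (double)tv->tv_usec / 1e6);
}

/* User + system CPU seconds charged so far to `who` (0.0 on error). */
static double
cpu_seconds(int who)
{
	struct rusage ru;

	if (getrusage(who, &ru) != 0)
		return (0.0);
	return (tv_seconds(&ru.ru_utime) + tv_seconds(&ru.ru_stime));
}

/*
 * Fork `iterations` children that exit at once, reaping each in turn.
 * Stores the elapsed wall-clock seconds in *real and the user + system
 * seconds charged to this process and its reaped children in *total.
 * Returns 0 on success, -1 if fork() or waitpid() fails.
 */
static int
fork_wait_loop(int iterations, double *real, double *total)
{
	struct timespec t0, t1;
	double cpu0;
	pid_t pid;
	int i, status;

	assert(real != NULL && total != NULL);
	cpu0 = cpu_seconds(RUSAGE_SELF) + cpu_seconds(RUSAGE_CHILDREN);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iterations; i++) {
		pid = fork();
		if (pid == -1)
			return (-1);
		if (pid == 0)
			_exit(0);	/* child: exit immediately */
		if (waitpid(pid, &status, 0) != pid)
			return (-1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	*real = (double)(t1.tv_sec - t0.tv_sec) +
	    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	*total = cpu_seconds(RUSAGE_SELF) + cpu_seconds(RUSAGE_CHILDREN) - cpu0;
	return (0);
}

/* Gap between accounted and real time, as a percentage of real time. */
static double
accounting_gap_pct(double real, double total)
{
	return ((total - real) / real * 100.0);
}
```

Applied to the averages above, accounting_gap_pct(2.9959, 2.8601) is about
-4.53 for the kernel without the patch and accounting_gap_pct(2.9362, 2.9201)
is about -0.55 with it, which is the comparison the conclusion rests on.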

The patch is at http://www.FreeBSD.org/~jhb/patches/rusage_ext.patch and is 
largely based on a patch originally submitted by bde@.

-- 
John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org