From owner-freebsd-hackers Tue Sep 10 13:13:02 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id NAA10361 for hackers-outgoing; Tue, 10 Sep 1996 13:13:02 -0700 (PDT)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id NAA10349 for ; Tue, 10 Sep 1996 13:12:58 -0700 (PDT)
Received: (from bde@localhost) by godzilla.zeta.org.au (8.6.12/8.6.9) id FAA01711; Wed, 11 Sep 1996 05:53:49 +1000
Date: Wed, 11 Sep 1996 05:53:49 +1000
From: Bruce Evans
Message-Id: <199609101953.FAA01711@godzilla.zeta.org.au>
To: fwmiller@cs.UMD.EDU, terry@lambert.org
Subject: Re: kernel performance
Cc: freebsd-hackers@FreeBSD.org
Sender: owner-hackers@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

>There are performance figure you can get via gprof. These are
                                              ^^^^^
`config -p' and kgmon.  kgmon is just the messenger and gprof is just
the interpreter.

>statistical in nature, so it will be impossible to make a reasonable
>distinction between cache/non-cache cases (which is why statistical
>profiling sucks).

Actually, FreeBSD-current has had non-statistical profiling using
`config -pp', kgmon and gprof for 5 months.  E.g., the following script:

---
sync
kgmon -rB
dd if=/dev/zero of=/dev/null bs=1 count=100000
kgmon -hp
gprof4 -u /kernel >zp
---

gives accurate timing for the parts of the system exercised by per-char
i/o to /dev/zero and /dev/null.  The gprof output is standard, except
that the times are more accurate, e.g.:

---
granularity: each sample hit covers 4 byte(s) for 0.00% of 8.75 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 33.6      2.937    2.937                             _mcount (1987)
 14.8      4.235    1.298                             _mexitcount [4]
 12.7      5.342    1.107                             _cputime [5]
  7.7      6.013    0.671                             _user [9]
  4.2      6.378    0.365   200121     1822    12468  _syscall [3]
  2.6      6.604    0.227   200000     1134     2339  _mmrw [14]
  2.4      6.815    0.210   400106      525      525  _ufs_lock [17]
  2.1      7.002    0.187   400136      468      468  _ufs_unlock [18]
  2.0      7.178    0.176   100007     1762     9471  _read [7]
  2.0      7.353    0.175   100003     1750     8434  _vn_write [8]
  2.0      7.527    0.174   100003     1738     5077  _spec_write [12]
  2.0      7.698    0.171   100003     1712    10146  _write [6]
  1.9      7.868    0.170   100007     1702     7709  _vn_read [10]
  1.8      8.027    0.159   100046     1591     1591  _copyout [19]
  1.8      8.184    0.157   200137      784      784  _copyin [20]
  1.5      8.314    0.130   100000     1302     4635  _spec_read [15]
  1.4      8.438    0.123   202201      609      663  _doreti [21]
  0.9      8.520    0.082   100010      821     2411  _uiomove [16]
  0.9      8.594    0.074   200121      372    12840  _Xsyscall [2]
  0.7      8.655    0.061   100003      613     5690  _ufsspec_write [11]
  0.4      8.693    0.038   100000      377     5011  _ufsspec_read [13]
---

The above uses the Pentium timestamp counter and was run on a P133.  The
version in -current is not quite as accurate.  It doesn't compensate for
the profiling overheads properly, especially at leaf nodes like
ufs_unlock().  My current version is accurate to within one or two cpu
cycles (after compensating for the entire profiling overhead) in simple
cases when there are no cache misses.
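As a minimal user-level sketch of the timestamp-counter idea described
above: read the counter before and after the code of interest and
subtract the measured cost of the counter reads themselves.  This only
illustrates the principle; it is not the kernel's mcount()/mexitcount()/
cputime() code, and the rdtsc() wrapper and toy loop are invented for
the example.

---
#include <stdio.h>

/* Read the Pentium timestamp counter (RDTSC leaves the count in EDX:EAX). */
static unsigned long long
rdtsc(void)
{
	unsigned int lo, hi;

	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32) | lo;
}

int
main(void)
{
	unsigned long long t0, t1, overhead;
	volatile int sink = 0;
	int i;

	/* Measure the cost of the two counter reads themselves. */
	t0 = rdtsc();
	t1 = rdtsc();
	overhead = t1 - t0;

	/* Time the code of interest and compensate for that overhead. */
	t0 = rdtsc();
	for (i = 0; i < 100000; i++)
		sink += i;
	t1 = rdtsc();

	printf("loop took %llu cycles (read overhead %llu cycles)\n",
	    t1 - t0 - overhead, overhead);
	return (0);
}
---

On a superscalar cpu the two back-to-back reads only approximate the
true overhead, which is part of why compensating for it exactly, as
discussed above, is hard.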
>I have non-statistical profiling data starting from the VFS consumer
>layer, and working its way down through the supporting code, but
>excluding some VM and driver effects... it was collected on Win95
>using the Pentium instruction clock using highly modified gprof code
>and compiler generated function entry points + stack hacking to get
>function exit counters.  The Win95 code had all of the gross
>architectural modifications I've been discussing for the past two
>years, so there are some functional bottlenecks removed.  The data
>is proprietary to my employer.

In FreeBSD-current, gprof is essentially unchanged; function entry
points are handled normally (compiling with cc -pg generates calls to
mcount) and extra exit code is generated by compiling with
cc -mprofiler-epilogue.  The stack isn't modified (modifying it might
be faster, but I thought it would be too hard to implement).  Interrupt
entry and exit points and cross-jumping between functions are handled
specially.  Neither the code nor the data is proprietary :-).

Future versions will support non-statistical profiling using any
available counter.  My version currently supports the following
counters:

- standard i8254 counter.  The default in -current on non-Pentiums.
  Its overhead is very high.
- Pentium performance-monitoring counters.  The code for this was
  mostly written by Garrett Wollman.  I have used it mainly to debug
  the high variance in the overhead of the profiling routines.  It
  turned out that two often-used globals and a stack variable sometimes
  collided in the Pentium cache, causing several cache misses in the
  profiling routines.  The cache misses doubled the overheads.

Bruce
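A hand-written sketch of what the entry/exit instrumentation described
above amounts to: "cc -pg" inserts a call to an entry hook (mcount) at
the top of each function, and "cc -mprofiler-epilogue" adds a matching
exit hook before each return.  The hook names and bodies below are
simplified stand-ins invented for the illustration; the real
compiler-generated hooks receive caller/callee addresses and do the
timestamp-counter timing shown earlier.

---
#include <stdio.h>

static unsigned long entries, exits;

/* Simplified stand-in for the entry hook that "cc -pg" would call. */
static void
entry_hook(void)
{
	entries++;
}

/* Simplified stand-in for the exit hook added by the profiler epilogue. */
static void
exit_hook(void)
{
	exits++;
}

static int
profiled_function(int x)
{
	int y;

	entry_hook();		/* what the compiler inserts at function entry */
	y = x * 2;
	exit_hook();		/* what the profiler epilogue inserts before return */
	return (y);
}

int
main(void)
{
	int i, total = 0;

	for (i = 0; i < 10; i++)
		total += profiled_function(i);
	printf("total %d: %lu entries, %lu exits\n", total, entries, exits);
	return (0);
}
---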