Date:      Wed, 11 Sep 1996 05:53:49 +1000
From:      Bruce Evans <bde@zeta.org.au>
To:        fwmiller@cs.UMD.EDU, terry@lambert.org
Cc:        freebsd-hackers@FreeBSD.org
Subject:   Re: kernel performance
Message-ID:  <199609101953.FAA01711@godzilla.zeta.org.au>

>There are performance figure you can get via gprof.  These are
                                              ^^^^^
The figures actually come via `config -p' and kgmon; kgmon is just
the messenger and gprof is just the interpreter.

>statistical in nature, so it will be impossible to make a reasonable
>distinction between cache/non-cache cases (which is why statistical
>profiling sucks).

Actually, FreeBSD-current has had non-statistical profiling using
`config -pp', kgmon and gprof for 5 months.  E.g., the following
script:
---
sync				# flush dirty buffers out of the way
kgmon -rB			# reset the buffers; begin high resolution profiling
dd if=/dev/zero of=/dev/null bs=1 count=100000	# the workload: 100000 1-byte copies
kgmon -hp			# halt profiling and dump the profile to gmon.out
gprof4 -u /kernel >zp		# turn the dump into the report below
---
gives accurate timing for the parts of the system exercised by
per-char i/o to /dev/zero and /dev/null.  The gprof output is
standard, except the times are more accurate, e.g.:
---
granularity: each sample hit covers 4 byte(s) for 0.00% of 8.75 seconds

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 33.6      2.937    2.937                             _mcount (1987)
 14.8      4.235    1.298                             _mexitcount [4]
 12.7      5.342    1.107                             _cputime [5]
  7.7      6.013    0.671                             _user [9]
  4.2      6.378    0.365   200121     1822    12468  _syscall [3]
  2.6      6.604    0.227   200000     1134     2339  _mmrw [14]
  2.4      6.815    0.210   400106      525      525  _ufs_lock [17]
  2.1      7.002    0.187   400136      468      468  _ufs_unlock [18]
  2.0      7.178    0.176   100007     1762     9471  _read [7]
  2.0      7.353    0.175   100003     1750     8434  _vn_write [8]
  2.0      7.527    0.174   100003     1738     5077  _spec_write [12]
  2.0      7.698    0.171   100003     1712    10146  _write [6]
  1.9      7.868    0.170   100007     1702     7709  _vn_read [10]
  1.8      8.027    0.159   100046     1591     1591  _copyout [19]
  1.8      8.184    0.157   200137      784      784  _copyin [20]
  1.5      8.314    0.130   100000     1302     4635  _spec_read [15]
  1.4      8.438    0.123   202201      609      663  _doreti [21]
  0.9      8.520    0.082   100010      821     2411  _uiomove [16]
  0.9      8.594    0.074   200121      372    12840  _Xsyscall [2]
  0.7      8.655    0.061   100003      613     5690  _ufsspec_write [11]
  0.4      8.693    0.038   100000      377     5011  _ufsspec_read [13]
---
The above uses the Pentium timestamp counter and was run on a P133.
The version in -current is not quite as accurate.  It doesn't compensate
for the profiling overheads properly, especially at leaf nodes like
ufs_unlock().  My current version is accurate to within one or two
cpu cycles (after compensating for the entire profiling overhead) in
simple cases when there are no cache misses.
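For concreteness, here is a minimal sketch of the two ingredients:
reading the timestamp counter, and subtracting the calibrated overhead
of the profiling calls before charging time to a function.  The names
and structure are invented for illustration; this is not the actual
-current code.
---
/* RDTSC (opcode 0x0f 0x31) loads the 64-bit cycle count into EDX:EAX;
   on a P133 one tick is about 7.5 nsec. */
static __inline unsigned long long
tsc_read(void)
{
	unsigned long long rv;

	__asm __volatile(".byte 0x0f, 0x31" : "=A" (rv));
	return (rv);
}

/* Calibrated at startup by timing empty mcount()/mexitcount() pairs. */
static unsigned long long prof_overhead;

/* Charge t1 - t0 to a function, net of the profiling overhead. */
static unsigned long long
charge(unsigned long long t0, unsigned long long t1)
{
	unsigned long long elapsed = t1 - t0;

	/* Don't let the correction drive short intervals negative. */
	return (elapsed > prof_overhead ? elapsed - prof_overhead : 0);
}
---
Getting prof_overhead right is exactly the compensation problem
described above; when it is off by even a few cycles, leaf functions
accumulate the error once per call.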

>I have non-statistical profiling data starting from the VFS consumer
>layer, and working its way down through the supporting code, but
>excluding some VM and driver effects... it was collected on Win95
>using the Pentium instruction clock using highly modified gprof code
>and compiler generated function entry points + stack hacking to get
>function exit counters.  The Win95 code had all of the gross
>architectural modifications I've been discussing for the past two
>years, so there are some functional bottlenecks removed.  The data
>is proprietary to my employer.

In FreeBSD-current, gprof is essentially unchanged; function entry
points are handled normally (compiling with cc -pg generates calls
to mcount) and function exit code (calls to mexitcount) is generated
by compiling with cc -mprofiler-epilogue.  The stack isn't modified
(modifying it might be faster but I thought it would be too hard to
implement).  Interrupt entry and exit points and cross-jumping between
functions are handled specially.  Neither the code nor the data is
proprietary :-).
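Conceptually, each profiled function ends up bracketed like this (a
C-level sketch of what the two options arrange; the real hooks are
emitted in assembly and pick up the caller/callee addresses
implicitly, and some_kernel_function is just a made-up example):
---
void	mcount(void);		/* entry hook: records the caller->callee
				   arc and starts timing the callee */
void	mexitcount(void);	/* exit hook: stops timing and charges
				   the elapsed cycles to the callee */

int
some_kernel_function(int arg)
{
	mcount();		/* inserted by cc -pg */
	/* ... function body ... */
	mexitcount();		/* inserted by cc -mprofiler-epilogue */
	return (arg);
}
---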

Future versions will support non-statistical profiling using any
available counter (one way to arrange that is sketched after the list
below).  My version currently supports the following counters:
- standard i8254 counter.  The default in -current on non-Pentiums.  Its
  overhead is very high.
- Pentium performance-monitoring counters.  The code for this was mostly
  written by Garrett Wollman.  I have used it mainly to debug the
  high variance in the overhead of the profiling routines.  It turned
  out that two often-used globals and a stack variable sometimes
  collided in the Pentium cache, causing several cache misses in the
  profiling routines.  The cache misses doubled the overheads.
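One plausible shape for the "any available counter" support (names
invented for illustration, reusing tsc_read() from the earlier sketch):
choose a counter-reading routine once at startup and let the profiling
code call it through a pointer, so mcount()/mexitcount() need not know
which counter is in use.
---
typedef unsigned long long (*cputime_func_t)(void);

static unsigned long long
tsc_cputime(void)
{
	return (tsc_read());	/* the RDTSC sketch shown earlier */
}

static unsigned long long
i8254_cputime(void)
{
	/* Latching and reading the i8254 over i/o ports is elided;
	   the extra port accesses are why its overhead is so high. */
	return (0);
}

static cputime_func_t cputime = i8254_cputime;	/* conservative default */

static void
prof_init(int have_tsc)
{
	if (have_tsc)
		cputime = tsc_cputime;	/* prefer the cheap counter */
}
---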

Bruce


