From owner-freebsd-hackers Tue Sep 10 13:13:02 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id NAA10361 for hackers-outgoing; Tue, 10 Sep 1996 13:13:02 -0700 (PDT)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id NAA10349 for ; Tue, 10 Sep 1996 13:12:58 -0700 (PDT)
Received: (from bde@localhost) by godzilla.zeta.org.au (8.6.12/8.6.9) id FAA01711; Wed, 11 Sep 1996 05:53:49 +1000
Date: Wed, 11 Sep 1996 05:53:49 +1000
From: Bruce Evans
Message-Id: <199609101953.FAA01711@godzilla.zeta.org.au>
To: fwmiller@cs.UMD.EDU, terry@lambert.org
Subject: Re: kernel performance
Cc: freebsd-hackers@FreeBSD.org
Sender: owner-hackers@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

>There are performance figure you can get via gprof. These are
                                              ^^^^^
`config -p' and kgmon.  kgmon is just the messenger and gprof is just
the interpreter.

>statistical in nature, so it will be impossible to make a reasonable
>distinction between cache/non-cache cases (which is why statistical
>profiling sucks).

Actually, FreeBSD-current has had non-statistical profiling using
`config -pp', kgmon and gprof for 5 months.  E.g., the following script:

---
sync
kgmon -rB
dd if=/dev/zero of=/dev/null bs=1 count=100000
kgmon -hp
gprof4 -u /kernel >zp
---

gives accurate timing for the parts of the system exercised by per-char
i/o to /dev/zero and /dev/null.  The gprof output is standard, except
that the times are more accurate, e.g.:

---
granularity: each sample hit covers 4 byte(s) for 0.00% of 8.75 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 33.6      2.937    2.937                             _mcount (1987)
 14.8      4.235    1.298                             _mexitcount [4]
 12.7      5.342    1.107                             _cputime [5]
  7.7      6.013    0.671                             _user [9]
  4.2      6.378    0.365   200121     1822    12468  _syscall [3]
  2.6      6.604    0.227   200000     1134     2339  _mmrw [14]
  2.4      6.815    0.210   400106      525      525  _ufs_lock [17]
  2.1      7.002    0.187   400136      468      468  _ufs_unlock [18]
  2.0      7.178    0.176   100007     1762     9471  _read [7]
  2.0      7.353    0.175   100003     1750     8434  _vn_write [8]
  2.0      7.527    0.174   100003     1738     5077  _spec_write [12]
  2.0      7.698    0.171   100003     1712    10146  _write [6]
  1.9      7.868    0.170   100007     1702     7709  _vn_read [10]
  1.8      8.027    0.159   100046     1591     1591  _copyout [19]
  1.8      8.184    0.157   200137      784      784  _copyin [20]
  1.5      8.314    0.130   100000     1302     4635  _spec_read [15]
  1.4      8.438    0.123   202201      609      663  _doreti [21]
  0.9      8.520    0.082   100010      821     2411  _uiomove [16]
  0.9      8.594    0.074   200121      372    12840  _Xsyscall [2]
  0.7      8.655    0.061   100003      613     5690  _ufsspec_write [11]
  0.4      8.693    0.038   100000      377     5011  _ufsspec_read [13]
---

The above uses the Pentium timestamp counter and was run on a P133.  The
version in -current is not quite as accurate.  It doesn't compensate for
the profiling overheads properly, especially at leaf nodes like
ufs_unlock().  My current version is accurate to within one or two cpu
cycles (after compensating for the entire profiling overhead) in simple
cases when there are no cache misses.
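As a minimal user-level sketch of the timestamp-counter idea described
above: read the counter before and after the code of interest and
subtract the measured cost of the counter reads themselves.  This only
illustrates the principle; it is not the kernel's mcount()/mexitcount()/
cputime() code, and the rdtsc() wrapper and toy loop are invented for
the example.

---
#include <stdio.h>

/* Read the Pentium timestamp counter (RDTSC leaves the count in EDX:EAX). */
static unsigned long long
rdtsc(void)
{
	unsigned int lo, hi;

	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32) | lo;
}

int
main(void)
{
	unsigned long long t0, t1, overhead;
	volatile int sink = 0;
	int i;

	/* Measure the cost of the two counter reads themselves. */
	t0 = rdtsc();
	t1 = rdtsc();
	overhead = t1 - t0;

	/* Time the code of interest and compensate for that overhead. */
	t0 = rdtsc();
	for (i = 0; i < 100000; i++)
		sink += i;
	t1 = rdtsc();

	printf("loop took %llu cycles (read overhead %llu cycles)\n",
	    t1 - t0 - overhead, overhead);
	return (0);
}
---

On a superscalar cpu the two back-to-back reads only approximate the
true overhead, which is part of why compensating for it exactly, as
discussed above, is hard.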
>I have non-statistical profiling data starting from the VFS consumer
>layer, and working its way down through the supporting code, but
>excluding some VM and driver effects... it was collected on Win95
>using the Pentium instruction clock using highly modified gprof code
>and compiler generated function entry points + stack hacking to get
>function exit counters.  The Win95 code had all of the gross
>architectural modifications I've been discussing for the past two
>years, so there are some functional bottlenecks removed.  The data
>is proprietary to my employer.

In FreeBSD-current, gprof is essentially unchanged; function entry
points are handled normally (compiling with cc -pg generates calls to
mcount) and extra exit code is generated by compiling with
cc -mprofiler-epilogue.  The stack isn't modified (modifying it might
be faster, but I thought it would be too hard to implement).  Interrupt
entry and exit points and cross-jumping between functions are handled
specially.  Neither the code nor the data is proprietary :-).

Future versions will support non-statistical profiling using any
available counter.  My version currently supports the following
counters:

- standard i8254 counter.  The default in -current on non-Pentiums.
  Its overhead is very high.
- Pentium performance-monitoring counters.  The code for this was
  mostly written by Garrett Wollman.  I have used it mainly to debug
  the high variance in the overhead of the profiling routines.  It
  turned out that two often-used globals and a stack variable sometimes
  collided in the Pentium cache, causing several cache misses in the
  profiling routines.  The cache misses doubled the overheads.

Bruce
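A hand-written sketch of what the entry/exit instrumentation described
above amounts to: "cc -pg" inserts a call to an entry hook (mcount) at
the top of each function, and "cc -mprofiler-epilogue" adds a matching
exit hook before each return.  The hook names and bodies below are
simplified stand-ins invented for the illustration; the real
compiler-generated hooks receive caller/callee addresses and do the
timestamp-counter timing shown earlier.

---
#include <stdio.h>

static unsigned long entries, exits;

/* Simplified stand-in for the entry hook that "cc -pg" would call. */
static void
entry_hook(void)
{
	entries++;
}

/* Simplified stand-in for the exit hook added by the profiler epilogue. */
static void
exit_hook(void)
{
	exits++;
}

static int
profiled_function(int x)
{
	int y;

	entry_hook();		/* what the compiler inserts at function entry */
	y = x * 2;
	exit_hook();		/* what the profiler epilogue inserts before return */
	return (y);
}

int
main(void)
{
	int i, total = 0;

	for (i = 0; i < 10; i++)
		total += profiled_function(i);
	printf("total %d: %lu entries, %lu exits\n", total, entries, exits);
	return (0);
}
---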