From: Poul-Henning Kamp <phk@critter.freebsd.dk>
To: current@freebsd.org
Date: Tue, 28 Sep 2004 22:08:07 +0200
Message-ID: <8328.1096402087@critter.freebsd.dk>
Subject: A mini-course in benchmark number crunching.

Hi there...

Since we'll be entering silly season benchmark-wise in a few weeks
when 5.3 goes golden, I'll share an interesting benchmark I ran here
today.

The situation: I have a kernel branch in perforce which I would like
to compare to "straight -current" on buildworld performance.

The target system has three disks in addition to the system disk, so
I made a tar copy of a newly checked out src tree and put it on one
of the disks.

I then set up the computer to boot single-user, and created the
following script, which would be run from the console in single-user
mode:

#!/bin/sh

# Always bail on errors.
set -e

# Avoid having cur-dir on one of the traffic disks.
cd /

# In single-user mode root is mounted r/o, so remount it.
mount -o rw -u /

# Unmount and mount the filesystem with the tar file.
# (the unmount is in case we re-run the test)
umount /hex > /dev/null 2>&1 || true
mount /hex

# Always get at least three samples.
# Three or more samples allows us to calculate a standard deviation.
for i in 1 2 3
do
        # In case of rerun: unmount the two filesystems.
        umount /usr/src > /dev/null 2>&1 || true
        umount /usr/obj > /dev/null 2>&1 || true

        # Create filesystems from scratch.
        # This improves repeatability.
        newfs -O 2 -U /dev/ad4 > /dev/null 2>&1
        newfs -O 2 -U /dev/ad6 > /dev/null 2>&1

        # Mount filesystems.
        mount /usr/src
        mount /usr/obj

        # Extract source tree.
        ( cd /usr && tar xf /hex/src.tar )

        # Run test.
        # Note that stdout/stderr is not stored on disk, we are
        # only interested in the last two lines anyway: one to tell
        # us that the result was OK and one with the times.
        (
                cd /usr/src
                /usr/bin/time make -j 12 buildworld 2>&1 | tail -2
        )
done

So, I built the two kernels from the same kernel config file and ran
the test, and got these numbers:

Plain -current:
        1476.48 real    1972.63 user     798.28 sys
        1475.75 real    1965.80 user     814.99 sys
        1482.53 real    1969.07 user     814.13 sys

buf_work branch:
        1472.52 real    1965.67 user     792.49 sys
        1469.86 real    1960.00 user     803.77 sys
        1480.43 real    1958.09 user     814.67 sys

Running src/tools/tools/ministat on the numbers in turn tells us that
there is no statistically significant difference between the two
datasets.
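(A sketch of what such a ministat run could look like, for the real
time column; the commands below are a reconstruction for illustration
rather than the exact commands used, and they assume ministat has
been built from src/tools/tools/ministat and is somewhere in $PATH.
Each file gets one column of numbers from the tables above, one value
per line, and ministat labels each dataset with its file name.  The
user and system times are handled the same way.)

# Reconstruction for illustration: compare the real times of the two kernels.
printf '%s\n' 1476.48 1475.75 1482.53 > _current
printf '%s\n' 1472.52 1469.86 1480.43 > _buf_work
ministat _current _buf_work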
Real time:
x _current
+ _buf_work
[ASCII distribution plot omitted]
    N           Min           Max        Median           Avg        Stddev
x   3       1475.75       1482.53       1476.48     1478.2533     3.7216439
+   3       1469.86       1480.43       1472.52       1474.27     5.4980087
No difference proven at 95.0% confidence

User time:
x _current
+ _buf_work
[ASCII distribution plot omitted]
    N           Min           Max        Median           Avg        Stddev
x   3        1965.8       1972.63       1969.07     1969.1667      3.416026
+   3       1958.09       1965.67          1960     1961.2533     3.9423639
No difference proven at 95.0% confidence

System time:
x _current
+ _buf_work
[ASCII distribution plot omitted]
    N           Min           Max        Median           Avg        Stddev
x   3        798.28        814.99        814.13     809.13333     9.4090931
+   3        792.49        814.67        803.77     803.64333     11.090543
No difference proven at 95.0% confidence

I have, in so many words, proven nothing with my test.

The direct statistical approach assumes that the three runs for each
kernel were run under identical circumstances, but this is not the
case here: the first is run right after a reboot, the other two
sequentially after that.  This means that a large number of tools may
be cached in RAM for the second and third run.

This should not lead us to believe that the second and third run were
under identical circumstances either: the third run may, for
instance, have filled RAM to the extent where things need to be
thrown out again.

Let us look at the real time for the two kernels on a run by run
basis:

        current         buf_work        difference
        1476.48         1472.52              -3.96
        1475.75         1469.86              -5.89
        1482.53         1480.43              -2.10

Hmm, seen this way, buf_work is consistently faster than current by a
fraction of a percent.  The same situation holds for the user time.
The first two runs of system time show the same pattern, but in the
last run buf_work is half a second slower than current.

Eight out of nine doesn't sound bad, and the probability of buf_work
being a tad better than current is probably very high, but we do not
have an actual, statistically proven difference: we have no standard
deviation for the difference.

In a situation like this there are two ways one can proceed in order
to get that statistical proof:

Either run more iterations per boot: increase the three to four, five
or however many are necessary to get a better standard deviation, so
that the direct statistical approach works.  Often it helps to throw
the first iteration out, since it is often atypical (loading a copy
of make(1) etc. into RAM).  But even with 20 iterations it may not be
possible to get the standard deviation narrow enough; for instance, a
cyclic phenomenon relating to RAM/VM contents could spread the
points.

Running only one iteration per boot and doing multiple runs would be
a mistake, though, because that would only measure the performance
right after boot, and that can differ markedly from the real-world
experience.

The correct method is to run multiple runs (at least three) with
three iterations per boot, and then examine the difference for each
iteration separately.
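(A sketch of what that bookkeeping could look like; the file names
and layout are invented for illustration.  The idea is to collect,
for each kernel, the real time of iteration N from every boot into
its own file, one number per line, and then let ministat compare the
two kernels iteration by iteration.)

# Hypothetical example: current.iter1 holds the first-iteration real
# times from three (or more) separate boots of the -current kernel,
# buf_work.iter1 the same for the buf_work kernel, and so on.
for i in 1 2 3
do
        echo "=== iteration $i ==="
        ministat current.iter$i buf_work.iter$i
done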
Three runs allows a standard deviation to be calculated and
"ministat" will do all the hard math for you.

I'm not a very good teacher, but I hope this example can inspire some
less lame benchmarking when people start to compare 5.3-R to other
versions, operating systems etc.

If nothing else, please just remember the first rule of statistics:

        "You can't prove anything without a standard deviation".

Poul-Henning

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.