From owner-freebsd-current Fri Apr  5 15:17:20 1996
Return-Path: owner-current
Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3) id PAA20660 for current-outgoing; Fri, 5 Apr 1996 15:17:20 -0800 (PST)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19]) by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id PAA20648 for ; Fri, 5 Apr 1996 15:17:12 -0800 (PST)
Received: (from bde@localhost) by godzilla.zeta.org.au (8.6.12/8.6.9) id JAA28956; Sat, 6 Apr 1996 09:13:46 +1000
Date: Sat, 6 Apr 1996 09:13:46 +1000
From: Bruce Evans
Message-Id: <199604052313.JAA28956@godzilla.zeta.org.au>
To: asami@CS.Berkeley.EDU, current@freebsd.org
Subject: Re: optimized bzeros found harmful (was: fast memory copy ...)
Cc: hasty@rah.star-gate.com, mrami@minerva.cis.yale.edu, nisha@CS.Berkeley.EDU, tege@matematik.su.se
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Optimizing bzero is almost as difficult as optimizing bcopy.  I benchmarked
all the current kernel bzeros and found that i586_bzero is a pessimization
in all interesting cases on my ASUS P133 system (tsc is the i586 timestamp
counter):

making a kernel (immediately after make clean; reboot):

                  real    user    sys   tsc for bzero   time in bzero
generic_bzero   492.43  377.86  31.03   1,219,568,103           9.195
i486_bzero      488.64  377.04  31.33   1,215,884,698           9.167
i586_bzero      492.96  378.79  32.27   1,617,865,274          12.198

10000 fork-execs of statically linked tiny processes:

                  real    user    sys   tsc for bzero
generic_bzero    25.29    0.30  10.04   0x225ca555
i486_bzero       25.39    0.28  10.02   0x2252d2df
i586_bzero       28.58    0.34  12.00   0x304b123d
i586_bzerox      25.88    0.34  10.17   0x22d93f47

2500 fork-execs of dynamically linked tiny processes:

                  real    user    sys   tsc for bzero
generic_bzero    24.74    8.43  16.15   0x16b85bde
i486_bzero       24.83    8.26  16.39   0x16b6fc72
i586_bzero       26.65    7.59  18.80   0x1f28c253
i586_bzerox      24.81    8.31  16.29   0x16f8e380

i586_bzerox is i586_bzero with the read-before-write (RBW) replaced by a
3-byte pairable instruction.
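For readers who haven't seen the routines being compared, here is a hedged C
sketch of the two ideas under test: a plain word-at-a-time zeroing loop
(roughly what generic_bzero/i486_bzero do with rep stosl) and the i586
read-before-write trick.  The real kernel routines are hand-written i386/i586
assembly (the i586 one uses FPU 64-bit stores), so the function names and the
32-byte line size here are illustrative assumptions, not the actual code:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a generic word-at-a-time bzero: align with byte stores,
 * do the bulk with 32-bit stores, finish the tail with byte stores. */
static void sketch_bzero(void *dst, size_t len)
{
    unsigned char *p = dst;

    while (len > 0 && ((uintptr_t)p & 3) != 0) {   /* head: align to 4 */
        *p++ = 0;
        len--;
    }
    uint32_t *w = (uint32_t *)p;
    while (len >= 4) {                             /* bulk: word stores */
        *w++ = 0;
        len -= 4;
    }
    p = (unsigned char *)w;
    while (len > 0) {                              /* tail bytes */
        *p++ = 0;
        len--;
    }
}

/* Read-before-write, sketched in C: load from each 32-byte cache line
 * before storing to it, so the line is filled into the cache and the
 * stores become cache hits rather than writes straight to memory. */
static void sketch_bzero_rbw(void *dst, size_t len)
{
    unsigned char *p = dst;
    volatile unsigned char sink;

    while (len >= 32) {
        sink = p[0];            /* RBW: prime the cache line */
        (void)sink;
        for (int i = 0; i < 32; i++)
            p[i] = 0;
        p += 32;
        len -= 32;
    }
    while (len > 0) {
        *p++ = 0;
        len--;
    }
}
```

As the numbers above show, priming the line only pays off when the data is
already in some level of cache; for data that misses L2 entirely, the extra
loads are pure overhead.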
The time stamp counts were obtained by adding rdtsc's at the start and before
the returns of all the bzero functions.  This behaviour is consistent with the
data being zeroed usually not being in the L2 cache; RBW is 33% slower in that
case on my system.  Other cases: if the data is in the L2 cache but not in the
L1 cache, then RBW is between 0% and 33% faster; if the data is in the L1
cache, then RBW is 8.5 times faster (740MB/s!).

Bruce
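[For reference, the kind of rdtsc read used for the counts above can be
sketched as a small C helper.  The helper name and the non-x86 fallback are
assumptions for illustration; the kernel instrumentation simply issued rdtsc
at entry and before each return of the bzero functions.]

```c
#include <stdint.h>

/* Read the Pentium time-stamp counter.  rdtsc leaves the low 32 bits
 * of the 64-bit cycle count in eax and the high 32 bits in edx. */
static inline uint64_t read_tsc(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return 0;   /* no TSC on this architecture */
#endif
}
```

Differencing two such reads around a routine gives a cycle count for it;
summing the differences over all calls gives per-function totals like the
"tsc for bzero" columns above.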