From owner-freebsd-current Fri Apr  5 15:17:20 1996
Return-Path: owner-current
Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3) id PAA20660 for current-outgoing; Fri, 5 Apr 1996 15:17:20 -0800 (PST)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19]) by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id PAA20648 for ; Fri, 5 Apr 1996 15:17:12 -0800 (PST)
Received: (from bde@localhost) by godzilla.zeta.org.au (8.6.12/8.6.9) id JAA28956; Sat, 6 Apr 1996 09:13:46 +1000
Date: Sat, 6 Apr 1996 09:13:46 +1000
From: Bruce Evans
Message-Id: <199604052313.JAA28956@godzilla.zeta.org.au>
To: asami@CS.Berkeley.EDU, current@freebsd.org
Subject: Re: optimized bzeros found harmful (was: fast memory copy ...)
Cc: hasty@rah.star-gate.com, mrami@minerva.cis.yale.edu, nisha@CS.Berkeley.EDU, tege@matematik.su.se
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Optimizing bzero is almost as difficult as optimizing bcopy.  I benchmarked
all the current kernel bzeros and found that i586_bzero is a pessimization
in all interesting cases on my ASUS P133 system (tsc is the i586 timestamp
counter):

making a kernel (immediately after make clean; reboot):

                  real    user    sys   tsc for bzero   time in bzero
generic_bzero   492.43  377.86  31.03   1,219,568,103           9.195
i486_bzero      488.64  377.04  31.33   1,215,884,698           9.167
i586_bzero      492.96  378.79  32.27   1,617,865,274          12.198

10000 fork-execs of statically linked tiny processes:

                  real    user    sys   tsc for bzero
generic_bzero    25.29    0.30  10.04   0x225ca555
i486_bzero       25.39    0.28  10.02   0x2252d2df
i586_bzero       28.58    0.34  12.00   0x304b123d
i586_bzerox      25.88    0.34  10.17   0x22d93f47

2500 fork-execs of dynamically linked tiny processes:

                  real    user    sys   tsc for bzero
generic_bzero    24.74    8.43  16.15   0x16b85bde
i486_bzero       24.83    8.26  16.39   0x16b6fc72
i586_bzero       26.65    7.59  18.80   0x1f28c253
i586_bzerox      24.81    8.31  16.29   0x16f8e380

i586_bzerox is i586_bzero with the read-before-write (RBW) replaced by a
3-byte pairable instruction.
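For readers who haven't seen the routines being compared, here is a hedged C
sketch of the two ideas under test: a plain word-at-a-time zeroing loop
(roughly what generic_bzero/i486_bzero do with rep stosl) and the i586
read-before-write trick.  The real kernel routines are hand-written i386/i586
assembly (the i586 one uses FPU 64-bit stores), so the function names and the
32-byte line size here are illustrative assumptions, not the actual code:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a generic word-at-a-time bzero: align with byte stores,
 * do the bulk with 32-bit stores, finish the tail with byte stores. */
static void sketch_bzero(void *dst, size_t len)
{
    unsigned char *p = dst;

    while (len > 0 && ((uintptr_t)p & 3) != 0) {   /* head: align to 4 */
        *p++ = 0;
        len--;
    }
    uint32_t *w = (uint32_t *)p;
    while (len >= 4) {                             /* bulk: word stores */
        *w++ = 0;
        len -= 4;
    }
    p = (unsigned char *)w;
    while (len > 0) {                              /* tail bytes */
        *p++ = 0;
        len--;
    }
}

/* Read-before-write, sketched in C: load from each 32-byte cache line
 * before storing to it, so the line is filled into the cache and the
 * stores become cache hits rather than writes straight to memory. */
static void sketch_bzero_rbw(void *dst, size_t len)
{
    unsigned char *p = dst;
    volatile unsigned char sink;

    while (len >= 32) {
        sink = p[0];            /* RBW: prime the cache line */
        (void)sink;
        for (int i = 0; i < 32; i++)
            p[i] = 0;
        p += 32;
        len -= 32;
    }
    while (len > 0) {
        *p++ = 0;
        len--;
    }
}
```

As the numbers above show, priming the line only pays off when the data is
already in some level of cache; for data that misses L2 entirely, the extra
loads are pure overhead.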
The time stamp counts were obtained by adding rdtsc's at the start and before
the returns of all the bzero functions.  This behaviour is consistent with the
data being zeroed usually not being in the L2 cache; RBW is 33% slower in that
case on my system.  Other cases: if the data is in the L2 cache but not in the
L1 cache, then RBW is between 0% and 33% faster; if the data is in the L1
cache, then RBW is 8.5 times faster (740MB/s!).

Bruce
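[For reference, the kind of rdtsc read used for the counts above can be
sketched as a small C helper.  The helper name and the non-x86 fallback are
assumptions for illustration; the kernel instrumentation simply issued rdtsc
at entry and before each return of the bzero functions.]

```c
#include <stdint.h>

/* Read the Pentium time-stamp counter.  rdtsc leaves the low 32 bits
 * of the 64-bit cycle count in eax and the high 32 bits in edx. */
static inline uint64_t read_tsc(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return 0;   /* no TSC on this architecture */
#endif
}
```

Differencing two such reads around a routine gives a cycle count for it;
summing the differences over all calls gives per-function totals like the
"tsc for bzero" columns above.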