From owner-freebsd-current  Sat Apr  6 09:54:52 1996
Return-Path: owner-current
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id JAA09343
          for current-outgoing; Sat, 6 Apr 1996 09:54:52 -0800 (PST)
Received: from insanus.matematik.su.se (insanus.matematik.su.se [130.237.198.12])
          by freefall.freebsd.org (8.7.3/8.7.3) with ESMTP id JAA09336
          for <current@freebsd.org>; Sat, 6 Apr 1996 09:54:49 -0800 (PST)
Received: from localhost (prudens.matematik.su.se [130.237.198.5]) by insanus.matematik.su.se (8.7.5/8.6.9) with ESMTP id TAA17355; Sat, 6 Apr 1996 19:54:30 +0200 (MET DST)
Message-Id: <199604061754.TAA17355@insanus.matematik.su.se>
X-Address: Department of Mathematics, Stockholm University 
	      S-106 91  Stockholm
	      SWEDEN
X-Phone: int+46 8 162000
X-Fax:   int+46 8 6126717
X-Url:   http://www.matematik.su.se
To: Bruce Evans <bde@zeta.org.au>
cc: asami@cs.berkeley.edu, current@freebsd.org, hasty@rah.star-gate.com,
        mrami@minerva.cis.yale.edu, nisha@cs.berkeley.edu,
        tege@matematik.su.se
Subject: Re: optimized bzeros found harmful (was: fast memory copy ...) 
In-reply-to: Your message of "Sat, 06 Apr 1996 09:13:46 +1000."
             <199604052313.JAA28956@godzilla.zeta.org.au> 
Date: Sat, 06 Apr 1996 19:54:25 +0200
From: Torbjorn Granlund <tege@matematik.su.se>
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

  This behaviour is consistent with the data being zeroed usually not being
  in the L2 cache.  RBW is 33% slower in that case on my system.  Other
  cases: if the data is in the L2 cache but not in the L1 cache, then RBW
  is between 0% and 33% faster; if the data is in the L1 cache, then
  RBW is 8.5 times faster (740MB/s!).

This must be a misunderstanding!

If the data is really in the L1 cache, the read-before-write is wasted and
just contributes to the overhead.

The read-before-write is effective if and only if the data is not in the L1
cache.  In that case, it forces allocation of the cache line in the L1
cache, and thereby allows a 14x peak speedup.
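
To make the mechanism concrete, here is a minimal sketch in C of what a
read-before-write zeroing loop does.  It is not the routine from this
thread: the name rbw_bzero and the 32-byte CACHE_LINE are assumptions made
for illustration, and a real version would be written in assembly and
tuned per CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 32   /* assumed L1 line size; adjust for the target CPU */

/*
 * Zero len bytes at buf, reading from each chunk before writing to it.
 * On a cache without allocate-on-write, the read pulls the line into
 * the L1 cache so that the following stores hit the cache.
 */
void rbw_bzero(void *buf, size_t len)
{
    volatile unsigned char *p = buf;
    size_t off;

    assert(buf != NULL || len == 0);
    for (off = 0; off < len; off += CACHE_LINE) {
        size_t n = len - off < CACHE_LINE ? len - off : CACHE_LINE;

        /* buf may be unaligned, so a chunk can straddle two lines:
         * touch its first and last byte (volatile keeps the loads). */
        (void)p[off];
        (void)p[off + n - 1];
        memset((void *)(p + off), 0, n);
    }
}
```

If the lines are already in the L1 cache, the two loads per chunk do
nothing useful, which is the overhead described above.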

If you observe other behaviour, your timing framework is misleading you.
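
One way to keep a timing framework honest is to control the cache state
before every timed call.  The following is only a sketch under stated
assumptions: time_zero and plain_bzero are hypothetical names, clock() is
coarse so reps and len must be large enough to measure anything, and the
eviction buffer must be larger than the caches you want to clear.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Baseline routine to compare against (plain stores, no reads). */
void plain_bzero(void *buf, size_t len)
{
    memset(buf, 0, len);
}

/*
 * Time reps calls of zero(buf, len) and return the total in seconds.
 * If evict is non-NULL, it is rewritten before every call so that buf
 * starts each call displaced from the caches (the "cold" case).
 * With evict == NULL, buf stays warm after the first call.
 */
double time_zero(void (*zero)(void *, size_t), void *buf, size_t len,
                 int reps, unsigned char *evict, size_t evict_len)
{
    clock_t total = 0;
    int r;

    assert(reps > 0);
    for (r = 0; r < reps; r++) {
        clock_t t0;

        if (evict != NULL)
            memset(evict, r + 1, evict_len);  /* not included in the timing */
        t0 = clock();
        zero(buf, len);
        total += clock() - t0;
    }
    return (double)total / CLOCKS_PER_SEC;
}
```

Running the same routine once with an eviction buffer and once without
separates the cold case from the warm case, so the two are not averaged
into a single figure.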

All other CPUs I know of have caches that do allocate-on-write.