Date:      Fri, 5 Apr 1996 13:19:19 -0800 (PST)
From:      asami@cs.berkeley.edu (Satoshi Asami)
To:        bde@zeta.org.au
Cc:        current@FreeBSD.org, hasty@star-gate.com, nisha@cs.berkeley.edu, tege@matematik.su.se
Subject:   Re: fast memory copy for large data sizes
Message-ID:  <199604052119.NAA25877@silvia.HIP.Berkeley.EDU>
In-Reply-To: <199604052055.GAA23015@godzilla.zeta.org.au> (message from Bruce Evans on Sat, 6 Apr 1996 06:55:42 +1000)

 * Oops.  I put together 5 fast memory copies that don't use floating point
 * registers.  Speeds range from 40K/sec to 340K/sec. on a 133MHz Pentium
 * (ASUS), Triton chipset, 512KB PB cache, 60ns non-EDO main memory.  This
 * is after attempting to minimize the differences caused by the cache
 * state.  Details in other mail.

Cool cool.  This is the kind of response I was waiting for!  Oh by the 
way, the 133MHz Pentium system we tested has 60ns EDO memory.

 * The speed differences are so large and the cache state is so variable
 * that it is easy to create benchmarks showing that all methods are the
 * best :-).  We seemed to have fooled ourselves with the optimized kernel

 :)

 * This seemed like a bad idea.  I added a test using it (just 8 fldl's
 * followed by 8 fstpl's, storing in reverse order - this works for at
 * least all-zero data) and got good results, but I still think it is a bad
 * idea.  

Well, from the numbers below, it certainly seems faster than yours for 
larger sizes even if things are in the L2 cache!

Note that the speed of fldl depends on the actual data.  All-zero data
is faster than random data (to avoid traps, try ((double *)src)[i] =
random()), probably because the all-zero bit pattern can be converted
to floating point (ok, no conversion necessary in this case :) in a
snap.
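In case the parenthetical is too terse, here is a minimal C sketch of that
buffer-filling trick (names are mine, not from our test script): assigning
the result of random() converts an integer to a double, so every bit
pattern in the buffer is a valid, normal floating-point number and fldl
can load it without faulting.

```c
#include <stdlib.h>
#include <stddef.h>

/* Fill the source buffer with non-zero data that is still safe to load
 * via fldl: assigning random() (a long) stores a converted, valid
 * double, so no slot holds a signaling-NaN or other trapping pattern. */
static void fill_trap_safe(void *src, size_t bytes)
{
    double *d = src;
    size_t n = bytes / sizeof(double);
    for (size_t i = 0; i < n; i++)
        d[i] = (double)random();
}
```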

 * 	  Perhaps it can be duplicated by copying via integer registers
 * through the L1 cache.

This is what I don't understand: people keep saying that we can do it
using integer registers, but we simply can't get it to work as fast.
If we can get it to work as fast as our FP copy, I won't utter "fildq"
for the rest of my life, I swear!
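For reference, the shape of the FP copy being argued about (eight fldl's
followed by eight fstpl's, storing in reverse order) looks roughly like
this in C.  This is only a sketch of the structure, not the actual asm
from either of our tests, and a compiler won't necessarily emit fldl/fstpl
for it; it also assumes the size is a multiple of 64 bytes and that the
data forms valid doubles (see the trap discussion below).

```c
#include <stddef.h>

/* C analogue of the 8-fldl/8-fstpl copy: move 64 bytes per iteration
 * through eight double-width temporaries, storing in reverse order.
 * Assumes bytes is a multiple of 64 and src/dst are 8-byte aligned. */
static void fpcopy64(double *dst, const double *src, size_t bytes)
{
    for (size_t i = 0; i < bytes / sizeof(double); i += 8) {
        double t0 = src[i],     t1 = src[i + 1];
        double t2 = src[i + 2], t3 = src[i + 3];
        double t4 = src[i + 4], t5 = src[i + 5];
        double t6 = src[i + 6], t7 = src[i + 7];
        dst[i + 7] = t7; dst[i + 6] = t6;   /* reverse-order stores */
        dst[i + 5] = t5; dst[i + 4] = t4;
        dst[i + 3] = t3; dst[i + 2] = t2;
        dst[i + 1] = t1; dst[i]     = t0;
    }
}
```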

 * >133MHz Pentium (sunrise), Triton chipset, 512KB (pipeline burst) cache:
 * 
 *                                                 new columns
 *                                            vvvvvvvvv  vvvvvvvvv      vvvvvvvv
 * >    size     libc             ours        mine-libc  mine-best(int) mine-fp
 * >      32      N/A         30.517578 MB/s   51493147   98069887

 * >    1024  40.690104 MB/s  51.398026 MB/s   85214370  379715593
                                                         ^^^^^^^^^
                                                      wow

 * >   16384  39.556962 MB/s  52.966102 MB/s   65103489   97472157
 * >   32768  39.506953 MB/s  53.146259 MB/s   66593990   99217964      93604474
 * >   65536  39.457071 MB/s  53.282182 MB/s   61407673   79866591      93721503
 * >  131072  39.457071 MB/s  53.327645 MB/s   65457449   68011573      79960595
 * >  262144  39.345294 MB/s  53.350405 MB/s   51273532   53702491      75576993
 * >  524288  39.044198 MB/s  53.430220 MB/s   49370136   50029142      67400433
 * > 1048576  38.086533 MB/s  53.447354 MB/s   44054746   44095308      58624791
 * > 2097152  37.706680 MB/s  53.387433 MB/s   42742240   42770154      56946700
 * > 4194304  37.628643 MB/s  53.280763 MB/s   43381238   43381238      57727588

 * My tests are obviously not equivalent for small copies - the libc
 * times are about twice as high.  This is because I keep copying the
 * same data.  I want to do this to test in-cache copies.  Not-in-cache
 * copies get tested as a side effect when the buffer is much larger
 * than the cache (L1 or L2).

Ok.  By the way, why is your data lacking smaller sizes for your FP
copy?

 * Your test gives similar times on my system.  It tests the speed of
 * copying data that isn't in the cache.  This seems to be the usual

Yeah, lmbench's mem_cp was giving me ungodly numbers for small-sized
copies until I realized that it's copying the same things over and
over.  Since we were interested in optimizing filesystem throughput
(large sequential reads/writes), that wasn't what we wanted, and I
changed it to walk through a larger buffer.
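The change amounts to something like the sketch below (sizes are made up
for illustration, not the ones from our runs): instead of re-copying one
small block, each memcpy steps forward through a buffer much larger than
the L2 cache, so every copy touches cold data the way a big sequential
read/write would.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define WALKSIZE ((size_t)8 << 20)   /* 8MB walk, well beyond a 512KB L2 */
#define BLOCK    ((size_t)64 << 10)  /* 64KB copied per step */

/* Walk through a large buffer, copying a fresh block each step so the
 * data is never hot in cache.  Returns MB/s, 0.0 if the clock was too
 * coarse to see the copy, or -1.0 on allocation failure. */
static double walk_copy_bandwidth(void)
{
    char *buf = malloc(2 * WALKSIZE);
    if (buf == NULL)
        return -1.0;
    char *dst = buf, *src = buf + WALKSIZE;
    memset(buf, 0, 2 * WALKSIZE);            /* touch the pages first */
    clock_t t0 = clock();
    for (size_t off = 0; off + BLOCK <= WALKSIZE; off += BLOCK)
        memcpy(dst + off, src + off, BLOCK); /* fresh block every step */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    free(buf);
    return secs > 0.0 ? (double)(WALKSIZE >> 20) / secs : 0.0;
}
```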

 * Only if traps are enabled.  Rounding may be a problem.
 :
 * Using 64-bit precision may be enough to avoid rounding problems.
 * fldl is much faster than fildl if the data is in the cache.

Well, we tried disabling the traps too, and got our data mangled. ;)

 * >Please type "make" and it will compile & run the tests.  The output
 * 
 * It didn't :-).  It assumes that "." is in the $PATH.

Duuh, sorry.  Next time I send out a test script, I'll make sure to
put "./" in front of all our programs!

Satoshi


