Date:      Sat, 6 Apr 1996 06:55:42 +1000
From:      Bruce Evans <bde@zeta.org.au>
To:        asami@cs.berkeley.edu, current@FreeBSD.org
Cc:        hasty@rah.star-gate.com, nisha@cs.berkeley.edu, tege@matematik.su.se
Subject:   Re: fast memory copy for large data sizes
Message-ID:  <199604052055.GAA23015@godzilla.zeta.org.au>

>We've put together a fast memory copy that uses floating point
>registers to speed up large transfers.  The original idea was taken

Oops.  I put together 5 fast memory copies that don't use floating point
registers.  Speeds range from 40MB/sec to 340MB/sec on a 133MHz Pentium
(ASUS), Triton chipset, 512KB PB cache, 60ns non-EDO main memory.  This
is after attempting to minimize the differences caused by the cache
state.  Details in other mail.

The speed differences are so large and the cache state is so variable
that it is easy to create benchmarks showing that all methods are the
best :-).  We seem to have fooled ourselves with the optimized kernel
bzeros already.  On the above i586 system, the i586-optimized bzero is
the slowest for compiling the kernel; for fork-exec of small processes
it is significantly the slowest.

>from Amancio Hasty's old post to use floating point registers to move
>8 bytes at a time.  (We tried using integer registers too but with our
>wits we could only get 10MB/s less than the FP case.)

This seemed like a bad idea.  I added a test using it (just 8 fldl's
followed by 8 fstpl's, storing in reverse order - this works for at
least all-zero data) and got good results, but I still think it is a bad
idea.  Perhaps it can be duplicated by copying via integer registers
through the L1 cache.

>133MHz Pentium (sunrise), Triton chipset, 512KB (pipeline burst) cache:

                                                new columns (bytes/sec)
                                           vvvvvvvvv  vvvvvvvvv      vvvvvvvv
>    size     libc             ours        mine-libc  mine-best(int) mine-fp
>      32      N/A         30.517578 MB/s   51493147   98069887
>      64  61.035156 MB/s  30.517578 MB/s   65049070  196997754
>     128  40.690104 MB/s  40.690104 MB/s   74971005  254666769
>     256  40.690104 MB/s  40.690104 MB/s   80998485  327390112
>     512  40.690104 MB/s  48.828125 MB/s   84416182  376524453
>    1024  40.690104 MB/s  51.398026 MB/s   85214370  379715593
>    2048  39.859694 MB/s  51.398026 MB/s   86936111  350385424
>    4096  39.859694 MB/s  52.083333 MB/s   87266431  326943762
>    8192  39.457071 MB/s  52.787162 MB/s   84805486   97567163
>   16384  39.556962 MB/s  52.966102 MB/s   65103489   97472157
>   32768  39.506953 MB/s  53.146259 MB/s   66593990   99217964      93604474
>   65536  39.457071 MB/s  53.282182 MB/s   61407673   79866591      93721503
>  131072  39.457071 MB/s  53.327645 MB/s   65457449   68011573      79960595
>  262144  39.345294 MB/s  53.350405 MB/s   51273532   53702491      75576993
>  524288  39.044198 MB/s  53.430220 MB/s   49370136   50029142      67400433
> 1048576  38.086533 MB/s  53.447354 MB/s   44054746   44095308      58624791
> 2097152  37.706680 MB/s  53.387433 MB/s   42742240   42770154      56946700
> 4194304  37.628643 MB/s  53.280763 MB/s   43381238   43381238      57727588

>As you can see, from a certain size and onwards, it is much faster
>than the libc version.  ("size" is in bytes.)

>The program allocates two 4MB buffers and calls libc's bcopy (which is
>essentially a string move using rep/movsl; see below for more on this)

My tests are obviously not equivalent for small copies - the libc
speeds are about twice as high.  This is because I keep copying the
same data.  I want to do this to test in-cache copies.  Not-in-cache
copies get tested as a side effect when the buffer is much larger
than the cache (L1 or L2).
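
A minimal sketch of the "keep copying the same data" style of test,
for illustration only.  It is not the benchmark used for the numbers
above: copy_rate() is a hypothetical helper, memcpy() stands in for
bcopy(), and clock() is a coarse timer, so short runs may report 0
(too fast to measure).

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: copy `size` bytes `reps` times between the same
 * two buffers and return the throughput in bytes/sec.  Returns -1.0 on
 * bad arguments or allocation failure, and 0.0 if the run was too short
 * for clock() to measure.
 */
double copy_rate(size_t size, unsigned long reps)
{
    unsigned char *src = malloc(size ? size : 1);
    unsigned char *dst = malloc(size ? size : 1);
    double rate = -1.0;

    if (src != NULL && dst != NULL && size > 0 && reps > 0) {
        unsigned long i;
        unsigned char sink = 0;
        clock_t t0, t1;
        double secs;

        /* Touch both buffers so page faults are not timed. */
        memset(src, 0, size);
        memset(dst, 0, size);

        t0 = clock();
        for (i = 0; i < reps; i++) {
            memcpy(dst, src, size);
            /* Volatile read to discourage the compiler from
             * dropping the copy. */
            sink += ((volatile unsigned char *)dst)[0];
        }
        t1 = clock();
        (void)sink;

        secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        rate = secs > 0.0 ? (double)size * (double)reps / secs : 0.0;
    }
    free(src);
    free(dst);
    return rate;
}
```

Calling copy_rate(4096, 100000) repeats a copy that should fit in the
L1 cache; calling it with a size of several megabytes and few
repetitions mostly exercises main memory instead.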

Your test gives similar times on my system.  It tests the speed of
copying data that isn't in the cache.  This seems to be the usual
case for kernel bzeros - that's why the i586 optimizations are
pessimizations.

>operation.  (You can't use fld and fst because they will trap on
>illegal (as a floating point number) bit patterns -- by the way, the

Only if traps are enabled.  Rounding may be a problem.

>Pentium FP regs are 80 bits with a 64-bit mantissa so there's no loss
>of data by using the integer load/store.)

Using 64-bit precision may be enough to avoid rounding problems.
fldl is much faster than fildl if the data is in the cache.
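
The precision argument can be checked in portable C.  This is only a
sketch: the function names are made up for illustration, and it
assumes that the compiler's int64-to-floating conversions behave like
the integer load/store path (true of x87 code generation, but not
guaranteed elsewhere).  It shows that a 64-bit integer survives a trip
through a format with a 64-bit mantissa, while a 53-bit double
mantissa can round it.

```c
#include <float.h>
#include <stdint.h>

/*
 * Returns 1 if v survives int64 -> long double -> int64 unchanged.
 * Only meaningful for all v when LDBL_MANT_DIG >= 64, as on x87.
 */
int roundtrip_long_double(int64_t v)
{
    volatile long double x = (long double)v;  /* volatile: force a real store */
    return (int64_t)x == v;
}

/*
 * Returns 1 if v survives int64 -> double -> int64 unchanged.
 * Do not pass values that round up to 2^63: converting that back
 * to int64_t overflows.
 */
int roundtrip_double(int64_t v)
{
    volatile double x = (double)v;
    return (int64_t)x == v;
}
```

With an IEEE 754 double, roundtrip_double(((int64_t)1 << 53) + 1)
returns 0 because the value rounds to 2^53, whereas
roundtrip_long_double() returns 1 for the same value when
LDBL_MANT_DIG is at least 64.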

>Please type "make" and it will compile & run the tests.  The output

It didn't :-).  It assumes that "." is in the $PATH.

Bruce


