Date:      Fri, 5 Apr 1996 01:35:16 -0800 (PST)
From:      asami@cs.berkeley.edu (Satoshi Asami)
To:        current@freebsd.org
Cc:        nisha@cs.berkeley.edu, tege@matematik.su.se, hasty@rah.star-gate.com
Subject:   fast memory copy for large data sizes
Message-ID:  <199604050935.BAA24263@silvia.HIP.Berkeley.EDU>

We've put together a fast memory copy that uses floating point
registers to speed up large transfers.  The original idea was taken
from Amancio Hasty's old post to use floating point registers to move
8 bytes at a time.  (We tried using integer registers too, but the best
we could manage was about 10MB/s slower than the FP version.)

By the way, we plugged this thing in as a replacement for
copyin/copyout on our ccd testing machine (ccd is a striping disk
driver; see http://stampede.cs.berkeley.edu/ccd/ for details), and
maximum read performance improved from 21MB/s to 24MB/s using 9 disks.
But that's mostly of interest to us, so here's a comparison with the
libc bcopy() (which is essentially the same code as the stock
copyin/copyout).

Here are the kinds of numbers we are seeing, and hope you will see, if
you run the program attached at the end of this mail:

 90MHz Pentium (silvia), SiS chipset, 256KB cache:

    size     libc             ours
      32  15.258789 MB/s   6.103516 MB/s 
      64  20.345052 MB/s  15.258789 MB/s
     128  17.438616 MB/s  15.258789 MB/s
     256  17.438616 MB/s  20.345052 MB/s
     512  17.438616 MB/s  22.194602 MB/s
    1024  17.438616 MB/s  23.251488 MB/s
    2048  17.755682 MB/s  23.818598 MB/s
    4096  17.836758 MB/s  23.390719 MB/s
    8192  17.715420 MB/s  24.112654 MB/s
   16384  17.341842 MB/s  24.338006 MB/s
   32768  17.361111 MB/s  25.080257 MB/s
   65536  17.715420 MB/s  24.423603 MB/s
  131072  17.373176 MB/s  25.237230 MB/s
  262144  17.553714 MB/s  24.723101 MB/s
  524288  17.401594 MB/s  24.951345 MB/s
 1048576  14.804506 MB/s  24.252419 MB/s
 2097152  17.732383 MB/s  24.392326 MB/s
 4194304  17.219484 MB/s  23.491825 MB/s

133MHz Pentium (sunrise), Triton chipset, 512KB (pipeline burst) cache:

    size     libc             ours
      32      N/A         30.517578 MB/s
      64  61.035156 MB/s  30.517578 MB/s
     128  40.690104 MB/s  40.690104 MB/s
     256  40.690104 MB/s  40.690104 MB/s
     512  40.690104 MB/s  48.828125 MB/s
    1024  40.690104 MB/s  51.398026 MB/s
    2048  39.859694 MB/s  51.398026 MB/s
    4096  39.859694 MB/s  52.083333 MB/s
    8192  39.457071 MB/s  52.787162 MB/s
   16384  39.556962 MB/s  52.966102 MB/s
   32768  39.506953 MB/s  53.146259 MB/s
   65536  39.457071 MB/s  53.282182 MB/s
  131072  39.457071 MB/s  53.327645 MB/s
  262144  39.345294 MB/s  53.350405 MB/s
  524288  39.044198 MB/s  53.430220 MB/s
 1048576  38.086533 MB/s  53.447354 MB/s
 2097152  37.706680 MB/s  53.387433 MB/s
 4194304  37.628643 MB/s  53.280763 MB/s

As you can see, from somewhere around 256 to 512 bytes and onwards, it
is much faster than the libc version.  ("size" is in bytes.)

The program allocates two 4MB buffers and calls libc's bcopy (which is
essentially a string move using rep/movsl; see below for more on this)
or our enhanced version (called "unrolled") repeatedly with "size"
increments until all data is copied.  The test program itself is taken
from lmbench, and I added some random garbage initialization and
post-testing to make sure all the data is correctly copied.
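
For those who just want the shape of the test without unpacking the
tarball, here is a rough C sketch of the loop described above.  It is
not the lmbench-derived program itself: the names run_copy_test,
copy_fn and libc_copy are made up for illustration, memmove stands in
for bcopy, and timing with clock() is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE ((size_t)4 * 1024 * 1024)  /* two 4MB buffers */

/* Hypothetical common signature, same argument order as bcopy. */
typedef void (*copy_fn)(const void *src, void *dst, size_t len);

/* Stand-in for the libc routine under test. */
static void libc_copy(const void *src, void *dst, size_t len)
{
    memmove(dst, src, len);
}

/*
 * Copy BUF_SIZE bytes in chunks of "size" bytes using "fn", then verify.
 * Returns MB/s (MB = 2^20 bytes), 0.0 if the clock did not advance, or
 * a negative value on allocation failure, size == 0, or a bad copy.
 */
static double run_copy_test(copy_fn fn, size_t size)
{
    unsigned char *src = malloc(BUF_SIZE);
    unsigned char *dst = malloc(BUF_SIZE);
    double rate = -1.0;
    size_t off, chunk;
    clock_t start, ticks;

    assert(fn != NULL);
    if (src != NULL && dst != NULL && size > 0) {
        for (off = 0; off < BUF_SIZE; off++)    /* random garbage */
            src[off] = (unsigned char)rand();
        memset(dst, 0, BUF_SIZE);

        start = clock();
        for (off = 0; off < BUF_SIZE; off += chunk) {
            chunk = BUF_SIZE - off < size ? BUF_SIZE - off : size;
            fn(src + off, dst + off, chunk);
        }
        ticks = clock() - start;

        if (memcmp(src, dst, BUF_SIZE) == 0)    /* post-test */
            rate = ticks > 0
                ? (BUF_SIZE / 1048576.0) / ((double)ticks / CLOCKS_PER_SEC)
                : 0.0;
    }
    free(src);
    free(dst);
    return rate;
}
```

Calling run_copy_test(libc_copy, size) for size = 32, 64, ... 4194304
would give one column of a table like the ones above.  A single 4MB
pass is short enough that clock() may be too coarse, so a real run
would want to repeat the pass several times.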

I'll include the C program too, but it won't do you much good other
than as a reference to the testing method because we rewrote the
function "unrolled" in assembly language (why such a simple program
gets so mangled by the compiler is a mystery).

The "meat" of each routine is included here in plaintext.  First, the
libc bcopy:

===
	movl	%ecx,%edx
	cld			/* clear direction flag: copy forward */
	shrl	$2,%ecx         /* count /= 4 for word copy */
	rep
	movsl
	movl	%edx,%ecx
	andl	$3,%ecx         /* count %= 4 for the remaining bytes */
	rep
	movsb
===

This is not exactly what it is doing, but I've taken the liberty of
simplifying it for discussion purposes.

The "movs*" instruction, with a "rep" prefix, will copy %ecx ("count")
things from %esi ("source index") to %edi ("destination index").  The
"movsl" is for 32-bit moves (hence the shift right 2 in the count) and
"movsb" is for 8-bit moves (to copy up to 3 bytes left).

Of course, the whole thing could be done as:

===
	rep
	movsb
===

but that's going to be a little slow 'cause it's byte-by-byte moves.
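
In C terms, the simplified libc routine above amounts to something like
the following.  This is only a sketch to show the count/4 and count%4
split; the function name is made up, and the memcpy calls are just a
portable way to spell an unaligned 32-bit load and store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C rendering of the rep/movsl + rep/movsb sequence. */
static void copy_words_then_bytes(const void *src, void *dst, size_t count)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    size_t words = count >> 2;      /* shrl $2,%ecx: count /= 4 */
    size_t tail = count & 3;        /* andl $3,%ecx: count %= 4 */

    assert(src != NULL && dst != NULL);

    while (words--) {               /* rep movsl: 32 bits at a time */
        uint32_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        s += 4;
        d += 4;
    }
    while (tail--)                  /* rep movsb: up to 3 bytes left */
        *d++ = *s++;
}
```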

Here's ours:

===
	cmpl $63,%ecx	/* if less than 64 bytes, go to end */
	jbe L54
	
#	movl %cr0,%edx
#	movl $8, %eax	/* CR0_TS */
#	not %eax
#	andl %eax,%edx	/* clear CR0_TS */
#	movl %edx,%cr0

	subl $108,%esp	/* save all floating point registers */
	fsave (%esp)

	.align 2,0x90
L55:
	fildq 0(%esi)	/* load quadword (64-bit) int into FP registers */
	fildq 8(%esi)
	fildq 16(%esi)
	fildq 24(%esi)
	fildq 32(%esi)
	fildq 40(%esi)
	fildq 48(%esi)
	fildq 56(%esi)
	fxch %st(7)	/* exchange top of stack with 8th position */
	fistpq 0(%edi)  /* store quadword */
	fxch %st(5)
	fistpq 8(%edi)
	fxch %st(3)
	fistpq 16(%edi)
	fxch %st(1)
	fistpq 24(%edi)
	fistpq 32(%edi)
	fistpq 40(%edi)
	fistpq 48(%edi)
	fistpq 56(%edi)
	addl $-64,%ecx
	addl $64,%esi
	addl $64,%edi
	cmpl $63,%ecx
	ja L55

	frstor (%esp)	/* restore FP registers */
	addl $108,%esp

#	andl $8,%edx
#	movl %cr0,%eax
#	orl %edx, %eax	/* reset CR0_TS to the original value */
#	movl %eax,%cr0

L54:
	cld		/* do the rest; at most 63 bytes so we */
	rep		/* don't really care about speed here */
	movsb
===

(Don't worry about the commented-out lines for now; those were
 necessary to temporarily enable FP operations in the kernel.)

This routine works by loading eight bytes at a time into a floating
point register using the fildq (integer load quadword) operation, and
storing them with the fistpq (integer store and pop quadword)
operation.  (You can't use fld and fst because they will trap on
illegal (as a floating point number) bit patterns -- by the way, the
Pentium FP regs are 80 bits with a 64-bit mantissa so there's no loss
of data by using the integer load/store.)

The Pentium FP unit is a stack of 8 registers, hence the "pop" and
fxch thingies.  Also, we save the FP state using fsave and frstor if
we decide to use FP regs.  Since there are 108 bytes to write/read in
this case, the use of this should be limited to large transfers.
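
To make the control flow easier to follow, here is the same structure
as a plain C sketch: eight 64-bit loads, eight 64-bit stores, 64 bytes
per iteration, and a byte loop for whatever is left.  It does not use
the FP registers (so there is no fsave/frstor and it will not show the
speedup); uint64_t temporaries stand in for the FP stack, and the
function name is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C analogue of the unrolled loop; integer regs only. */
static void copy_unrolled64(const void *src, void *dst, size_t count)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    uint64_t q[8];                  /* stands in for the 8 FP regs */
    int i;

    assert(src != NULL && dst != NULL);

    while (count > 63) {            /* cmpl $63,%ecx / ja L55 */
        for (i = 0; i < 8; i++)     /* the eight fildq loads */
            memcpy(&q[i], s + 8 * i, 8);
        for (i = 0; i < 8; i++)     /* the eight fistpq stores */
            memcpy(d + 8 * i, &q[i], 8);
        count -= 64;
        s += 64;
        d += 64;
    }
    while (count--)                 /* L54: at most 63 bytes left */
        *d++ = *s++;
}
```

Doing all the loads before any of the stores mirrors the assembly,
where the fxch instructions are only there to get the values back out
of the stack in the order they went in.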

I'd like people to try the following tarball on their machines, so we
can see if it really works for everybody and not just in California.
Please type "make" and it will compile & run the tests.  The output
is already formatted (like the table you see above) so you can easily
forward it to the list.

Of course, before this can go into the library/kernel, we need to make
sure it is run only on machines with the FP unit (prob. only on
Pentiums), make it work with overlapping copies, etc., but I wanted to
see what people think about it first.
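
On the overlapping-copy point, one common way to handle it (not
something the routine above does yet, and only one possible approach)
is to pick the copy direction from the relative position of the two
buffers.  A byte-at-a-time C sketch, with a made-up function name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical overlap-safe copy: choose direction, then copy bytes. */
static void copy_overlap_safe(const void *src, void *dst, size_t len)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    uintptr_t sa = (uintptr_t)src;
    uintptr_t da = (uintptr_t)dst;

    assert(src != NULL && dst != NULL);

    if (da > sa && da - sa < len) {
        /* dst starts inside the source range: copy from the end
           backward so no byte is overwritten before it is read */
        while (len--)
            d[len] = s[len];
    } else {
        /* otherwise a plain forward copy is safe */
        while (len--)
            *d++ = *s++;
    }
}
```

A fast version would presumably run the 64-byte blocks from the top of
the buffers downward in the backward case.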

Satoshi
-------
begin 644 bcopy.tar.gz
M'XL(`#KD9#$"`^T\:W/;-K;Y2OX*5+%;29%D/B59KK-U4V\F4\?VV,XVO4U&
MI4C*8D.16I)RY+2YO_V>@Q=!2G:<;1XS=\UI9>(`.#@X;X!`)GZZN-YY\%D?
MXA@#UR4/B&GU;1O^XF/PO[Q`^J9MFGW+Z#N$F(9K60^(^^`+/,N\\#)"'GBY
M-X]N:?=V%H;Q@_]WSX3*?Q[.Q_ZBYW^>,4P#I.K<*/^^85A<_J[A.'V0OV6A
M_(U[^7_V9Z>MDS:9O!T+%2!=DD?S11P2@*39-4$%(468%]!P1W\8)7Z\#$*M
M443S*+FT>GZC!'Z?7^<[4!'V9H]U_6$03J,D)!>_GAZBH(-T.8G#"A@$3^(T
MN:0_HD8[?_8_AUH>O0O3:1-;M11<)Q<'1UA/''/7L0U'VVD3Y_F/2)M^E4:!
MMDRR-([#H-G:HP"BA?-%<8U%?>Y%2=/S.\2[:NG"^?@S5(!VV[O:T_^4T"@I
MZ-]E'OIY)T[3!7LC48=,KH$?>[(IGU\[SP!S.\B+/1VHR*/+)`S8[(KY`F!L
M_MI\`N]YD2W]@B"OKKR8%#"VAD,>`Y5:-"5`)?EFG]@M\J>N:=-%!I739EX$
M899U`(+C-E[DWF4X(MLY06:1.)KX?XGI$R^F!`"`O[U*&CCQWXS7P`I-"U=1
MT33Q];VNT1F1?>(5:=2$-B9M@W0T8R\O.(CL[Y/OWGS7(G_]1=;@/W_7:@%:
MAJF]#T[<<F[",;\!Q_,:CB8B`?W$/TC0,9`H-6"G)@9@/U13A2'MUMR+X]1O
MTC;M8_*(F'U[2)&`@.[4#BG_!I$"I=]`)R:*!0@@S9H-UJ^QQDI=`V$C_HH&
MM`#/'JMZM$^&YJ[%2]_ND_^%HHDJ4:&?J<Q&7$S#;L55G2/%M=.F,P+%\^<+
M9+K]ND,:7#<:+=0V@TWQED'%J*8UQ-*&8;3W:(K:95B@;J?3P+MN?EM<=<CQ
MBZ.C%IVFEP3IO%E<]8JK,1H50J=I1IH1(#/('HG(]T0(!$J/'H%2H(2;)4W4
M:-O(UM9OT6OHQ[$BKLF[,$N;0%M'8$';U[@--;:9)=`J6@$>)$SR91:225K,
MJ!X!,H*3\P"8I`7)WWJ+!0R;+@LZ/1HVF]3@:^/4F&PAD]$P&U3#.8<QYA24
M5F7B;-['>^31HX@U*X=Y1`?80;5O1W3,&D3,1D-KUJBK`IQYD2Z0$;1"^K!]
M@XHIC//PHZB1GO43$23');7G5BJ8-U\C88V"*@%R\C4BWM^D>A7%HT[L1N4#
MRUFO1/Q0B>Y,*![U'".*GFP''9+/TF4<D`F4#&L%D26G+ZB=W,/CT)TZ<CEP
MYY9A86KS"<R)S66'>=)>6SA2QHJ=?8PW"F\XH!9N2&-[2B#$@C)&&%@@,%+`
M3DX-:3[!_W&,)L72E?A:`(3D#Y^>T<)QL[!89@EI&E``7\FRCY<O7T+6<1Q"
M',P]"-C%#(8H9L`.7S!H[@4AF4*T"#,4QC,21!CA('#.(.ZE7I`C(B@D$@I2
MSL*\!^%Y%B5OB#=!VT6L`'I60#(2YLEW!9FE,`($XY"DT)?DN=6CB0ZF#KK4
MM4YIY:UZW*\$_GKZP&-4F5>\G4606[%X0QZ#)^T#X:BRJ-YU0V"MNDJK=6-!
M-X4>N;]>A9Z,5X&B`[_9K%0[5OT7Z"K-R;1*)@,ST>0L:`.2A9<1B@*C@=&!
M'Q-_+/RQ\<?!'Q=_^O@S0"=;G_=0F38-*P9:9N9#>L+CC,D!I@!8'&`)@,T!
MM@`X'.`(@,L!K@#T.:`O``,.&+SF`0T(`!`2)``F!Y@"8'&`)0`V!]@"X'"`
M(P`N![@"T.>`O@`,.&"P)S.@;LDFZB]!G#3DB]#+WZEH:2J%(D1'`!*DTN),
M;^?4DVF*`>:_=:W7U`@?W#__Q8^Z_A_V\J^P_C==:V"+_1_#&>#^CV7W^_?K
M_R_Q:+TIN`BM(9;_#?W2]ZVQG\X7``]Z(WT\'E\F2PD9^R.]5X2K0M=Z=/%`
M++UW&:>3F(QI0@?PXGH1:KS8^6&Z3/PB2A.=`4:0B2_S64RVP\D"LI3T"E_S
M18>56=V6(=]8KW%Q!:X-8_M875I`\A)Z5R'U:_K1-#0!>P\7PW)X!'8E9>L4
MIPN5X'11I3==?(#<');T9&O80=!FXM/%C;2OSW"M6XQ947?8Q-%:,(JW*NG!
M=X:VN`(Z.'%*8W\E"?974`Z@G&.'+5>4D'REEK<.5GPHUKHO2JRUK*7$-7EW
M;]49MC9V\X(`R'(J<ZB*S:J(#82`L*Z03EUH?+9":K2HBHT"[J1F\"XFO8%K
MIE4AF<'Z$B;YU<1ZA$P$CYI8*R!\Z,FJ0UG5XA!'=I.L=60_K^P'?'5$QS^2
MG!P-87D?^ISQ+<[>+9YGR[9'PY%0B)+SDS7.VPKG&2,1V.4\/GIB8+V7^U'T
MT=M-IO7*:``*4T'!=DT8W%+@VV5S6P'C$`SJ*-"[KJ,$1E<=Z*:5#&M<*MO1
MDSYVH]MWQLKH&"O'M"F+*XT&:XVL<.@,C0UJBYN/I=9B255:+-_)USA&Q=F`
MQD3E>Q[=JMJ3#:J-W9D7&7,2_?D"QK%Y^S]"<F1*CX8:&K6DHP(%*9W6>)S_
M\Y$YZ`M\?!DI&YBB`O?+2@8!UW8-'<:0TW?$**PY;DJJYM`51GA#>TR%A5D(
MS\SF9!H#;EQT5M8=$#@5!`-7]$\0@0EDHQHS?V<:"FE0.?IH]*:QJ]+G?#Q]
M@PI]-I#A2/JL*GVV<!!;?"N==_7C(H!E7Q"A[RA9K?#?&E:`7<4IH@)&\V7,
MFS(`C1-T7[6)`*&)JJ*RB3'W4!W+J,KZ0\T=T7R3_`T5'V4Q*"M^VY#DEKQS
M@47]D:+I4GW9%O!=U=J5?E@,K8;%+=R_%8`$`5T%LI$/')GS=Y`Y5:9BQG*W
M3&-3`M.5V%@;OL.K:ZLTXY&/28U'*DN5"2J$\#27R/?A&@LM]`P,M<"LS,56
M-71"9+FS'<0W<)XVA&H,R!,F]19N=/A2O>J4Q02H`&&60571>6S!=%Y:Q@8-
MY[RR:LRBF]6RMN[90.TLJ785;WH#4@$P:J/@,D\)!T$DC)5Y``B[/!2PLLOS
M(#\.I!BE?63A(J0\RB?,BIP^[?=N@C.0)LXY+:%L3+^6=4(KZ#\2RF$)@]UL
ME9;D!L_HZRK&)3>I>*D_8NSJ;!`=SCB?93S<88DCI*)'_M1UL70)BK"H[1A-
MVH5EP3(YW&BH2@ZMFAA+E!5E#90&P:HFS)HU"<2"\'5UKC'E$GC2AQDYHQN5
MDR7@:N2U*EX(':I3.M3Y`G3!7>>9-?J/I68;GT)JMO7UI28RY4\B.!MR3]OX
MCP1WJR`VY!2;Q>)^$K$,OKY8Z!;_IY')`&;D?H1,I.6`T8SJO+_%1$I645DX
MQH8P)UVYRK(-(G"L$8^'PD-[,1T,`J1'`V0>\5S),:4_K_AXE>EE?;"Q?CUV
M0>AQZA&."Z//6`$#CZ0$@E(8-\M%81#(Q0%;<=!6IC&@A74=O&%VFZVGM]W:
MDK^2"M>RWHT.4<KTS@@<XT8*E`T:3VS0\#D,JJHLYH-R*7-?H4I3K(1E.YY1
MZ&SCTJ&Z:S7-BT79][8Z*3+W#JN^>MK`$DRWG!]=Q*8+J0_I0BYAV6L0U;<K
M'&6[@JZ@$=852^OZBIO;MEARTZ*ZYJ:`#RRZZ[L^WDHNUBRYV)J@>>RN&9=K
M2+,&N0YOZNT1:`DZNCNJ3]=5ILNH1V!73JP^X3+&B#D+B#IM`;OC%EE]?Z&V
MIR!!M3V%.N,@\1-S[]L\$:2,<\'^]8=\>#\SF*9S`"@BU2$\I/'DS!A?G--3
M:!H>RJ#L>\A6.M)$L*$/7,PJS95]31A!%UIN&J6>`]^%GJ\+$@4QC>+@W\00
M+I$5A]4BG:]2MIQJV;:J9:>&SJGA<TM\*W]&;7A`*\$L&2U!I=(M*X=KE799
M2>FLUIIEK27W&SB`DJT"'*,.&-8`KAR!&T#?X4)G95I$W5&*="NJIB%H'"Y(
M9)KA!_V:AY/R$VI`_5:I/5R=J)HP;P0:(/4I"_.P$&I2I/340)I%EU$"CNK*
MBY=A17D\H3R@L2.^-(*E$-5T6`FMKZ`5_W:K4^LK5B[M%>%=U:#KMLXV@82=
M8TFU<2S?Q;,-*]&FKO9#-3E8\>1`1.(56Z2Q1('6T"6:.^0<P8A7V="NSGJ@
MS)I2C[`NFU8O]M/YG,@/(YVA`J)?1@!R__'VDWW_S98)BO+S?/[]T/EOXIBN
M./_MVO3\O^68]]]_O\CS\)N=293LY#-=C_9A51GZLY0TZ.$A_,:##WZ$J1PK
M2I=9WM#9V9+?R%9$NO#BLN]/Y/4>"5*=$'ZD`*LI`NOQMR9YO!.$5SL);I7]
M1;RW;TCC3W[4[%6#;`\"0K9W<_H]AI!7C0YT[I!76V[K?:.*47YU^B!6@>]5
M@O@DKFC_]W"UR!#7JS:Q?M>#-`GU_UK[?^Z]"?$<P->Q_X'ARO,??=>E]S\<
M]][^OXC]$R%[>NIV(F][Y+H.R[D1X><^B)?GX7P"+_0JB(XI-GZOS.;<+GLY
M?Q%V"FU8UY%LH>OB3<)\7?N!^9S'C_E847(I:RE1U^!P(%F;AEF8^&%#URY]
MGW1/+++UF'3/@4Y.VFB-`O:B$*`,QGO=<330DB+RZ8@IV?I!H!X*W,,/(!]R
M[,@]@-Z(%NLE3A4CA.@$T5'9-'"CC8BHK?]M^Y=W>;Z&_5N6V1?QWW0&)CW_
M9=_;_Q>\_[7U+!@1I@0]OW-%S%Z?F+N[[HXQW+%<8M@CVX#_2#PGAZL%V8(^
MV.T)Z`ZLEV8%:?HM[."0(R_+KLES_U_I=0]DRQL>0&A.E\5B69#+-,QQL<7.
MEO<JE\I(8P(V-^O-RBMEY'MH&*5XGTP!P2+%BZNP90+KSJ#6#F^CP?HHWP"6
ME]2B:1!.M7R9Z.&J"+.$'GBN7%]I[:E5XG`\0!^&21!-)8ZS%^<'3P]K`\$B
M$QR*7[D1I^'A^&9QU=+X+1@HDT>DO!%3.3=?#L-[)^^:*^@*/_1*"?D',<F(
M0+$EFSS_4:,G_NF!?XJ#5_S,*Q"&#W=`M>MH<MU%^')K3YS5/\<:KBHD2=\J
MA^7Y=0X\=]Q$0(M4+P&52)O*<.\@\2+M%JG<"#B'4<4@>!6'GUK.<(&-G4B4
MD'GD9RGP*DV"G%&QI)<(Z>T.2D7]CATP<8I'T6^@CJTK;R0.5KMX)JKY+<73
M(64/.3/UA@-M)63;%N)$*8L*?O7I/3^5S[!SY(4)_QOE58/:5-J\61O;M0M#
MO6!`J[J/^<C[@$H6NH!3%/8V=%A6>BPK79:5/G@AIM;O>Z+0NY&8;K=3'PQO
M)S#.4$;<K\>_3O[/W>YG&N,#\=^V^H82_^GZWW#M^_C_A>,_5P(:_AT>_OL[
MEDDL<V18(ZL2_I6@S9\-@5K4;(C-LM/&$"UKUP.]J(+D(4G5"'Y\,CX].;MX
M?G!Z>G@FX]W%D]/QRX,G%YJ]V[<JT"<GQQ=G)T=885<J?CJX.$"H4V]^?,CP
MN!"1XSQ<'Z.Y'-.[LH[A&/8N;C&SJVW+1-R8"@/*N@UDJ'T=XVY]*:65CN:=
M!Z63J?2U/MSWQ4]LIJ0ZU3OV1')K/>T/]_S7X=FY)GN9,AMBFING_INP(#D`
M(=7RK\AD.86E&TD7F#Z\\W`W/-=5?.<G3WX^.;T8GQT>_*09:^!?SIY='&KF
M>O.??CG3K#7P\<GQH>;H%?B/+_ZI*>E7JU)Y?G%X"I5]7L6UEYS_>N[*=NP"
M\Z1#XC!I:;`8S,,"BP:#E.W8]>`.":""-O2AC">BL2R;0;X:A"MLY[<P*<K\
M6<9*:ZDE3:&J_WB!K'MZ,C[XY>!7C1EE\_S9TX.CL^<=@L<>6WO$B[ULW@17
MWB9]FB_1?XV!7HWG_Q)#>>&7#<,R-5[G!7^`*Y;%;(F'S65Q`EG@VR@H9A+R
M9B)?Y\HK)H5**8XC64*-*#&R=*LE$]NG81)F7@&3GF;IG,SSR]X*+VOZ,SQ/
MSKU/0&:@GB-ZF3R]S+PY06L8GYZ=/,5KC.B8KL(L!YUC%:B[]'XC7LJ#/V>G
M3ZC]-+$,V3LQ64JEO6?O]*^].X!L"&DZC4,O#TF0TCOH81#Q:[&X78,I+]YC
M?>OED,D*VI<YILS9P@<02XJ5Y0B`2U\K_T6-<@9-861(0:O:@,Y$-C#+6C&E
M&RL/7SY3*JT67TOQ"\M`TWCE^<78+)=9O,J/HS`IQM46]QG;_7/_W#_WS_US
.__S]Y_\`^.BV8P!0``!`
`
end
