Date:      Fri, 22 Dec 2006 23:37:53 +1100 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Adrian Chadd <adrian@FreeBSD.org>
Cc:        freebsd-performance@FreeBSD.org, David Xu <davidxu@FreeBSD.org>, Mark Kirkwood <markir@paradise.net.nz>
Subject:   Re: Cached file read performance
Message-ID:  <20061222222757.G18486@delplex.bde.org>
In-Reply-To: <d763ac660612212009x30bab8d6kecec9bc2e49a2b66@mail.gmail.com>
References:  <458B3651.8090601@paradise.net.nz> <458B3E0C.6090104@freebsd.org> <d763ac660612212009x30bab8d6kecec9bc2e49a2b66@mail.gmail.com>

On Fri, 22 Dec 2006, Adrian Chadd wrote:

> On 22/12/06, David Xu <davidxu@freebsd.org> wrote:
>
>> I suspect in such a test, memory copying speed will be a key factor,
>> I don't have number to back up my idea, but I think Linux has lots
>> of tweaks, such as using MMX instruction to copy data.
>
> I had the opportunity to study the AMD Athlon XP Optimisation guide
> and noted their example copy routine, optimised for the chipset, was
> a hell of a lot faster than a straight block copy.
>
> Has anyone here done any similar modifications to optimise
> copyin/copyout? I can't imagine it'd be a bad thing to have.

Sure.  It's a large win mainly in benchmarks.  It's a twisty MD maze.
The MMX method used in Linux (2.6.10 at least) is actually a small loss
on the original poster's machine (a PIII).  The main win is from using
nontemporal writes, but that requires SSE2 and the kernel already uses
these in the most important case (sse2_pagezero(); other cases have
tradeoffs).
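
To make "nontemporal writes" concrete, here is a minimal sketch of a
copy loop built on SSE2 nontemporal stores, written with compiler
intrinsics rather than the kernel's hand-rolled asm; the helper name
and the alignment/size assumptions are illustrative only:

%%%
#include <emmintrin.h>		/* SSE2 intrinsics */
#include <stddef.h>

/*
 * Sketch only: copy with nontemporal (cache-bypassing) stores.
 * Assumes dst/src are 16-byte aligned and len is a multiple of 64.
 */
static void
copy_nt_sse2(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;

	for (; len >= 64; len -= 64, s += 4, d += 4) {
		__m128i x0 = _mm_load_si128(s + 0);
		__m128i x1 = _mm_load_si128(s + 1);
		__m128i x2 = _mm_load_si128(s + 2);
		__m128i x3 = _mm_load_si128(s + 3);

		_mm_stream_si128(d + 0, x0);	/* movntdq: bypass caches */
		_mm_stream_si128(d + 1, x1);
		_mm_stream_si128(d + 2, x2);
		_mm_stream_si128(d + 3, x3);
	}
	_mm_sfence();		/* drain the write-combining buffers */
}
%%%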

Times for some copy methods:

freefall (800MHz PIII), block size 4K (fully cached)
%%%
copy0: 3448148133 B/s (  29001 us) ( 25037077 tsc) (movsl)
copy1: 1840531252 B/s (  54332 us) ( 46333183 tsc) (unroll 16)
copy2: 1571211313 B/s (  63645 us) ( 52383615 tsc) (unroll 16 prefetch)
copy3: 2246932794 B/s (  44505 us) ( 36824018 tsc) (unroll 64 i586-opt)
copy4: 1970554791 B/s (  50747 us) ( 43268191 tsc) (unroll 64 i586-opt prefetch)
copy5: 2117741296 B/s (  47220 us) ( 38651415 tsc) (unroll 64 i586-opx prefetch)
copy6: 1684092760 B/s (  59379 us) ( 48916219 tsc) (unroll 32 prefetch 2)
copy7: 1506746384 B/s (  66368 us) ( 54357751 tsc) (unroll 64 fp++)
copy8: 1574228925 B/s (  63523 us) ( 52241051 tsc) (unroll 128 fp i-prefetch)
copy9: 1579051367 B/s (  63329 us) ( 51821088 tsc) (unroll 64 fp reordered)
copyA: 1625298552 B/s (  61527 us) ( 51037242 tsc) (unroll 256 fp reordered++)
copyB: 1633849261 B/s (  61205 us) ( 50367459 tsc) (unroll 512 fp reordered++)
copyC:  452936367 B/s ( 220781 us) (181557329 tsc) (Terje cksum)
copyD: 1449124640 B/s (  69007 us) ( 56524152 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
copyE: 1525968138 B/s (  65532 us) ( 53339199 tsc) (unroll 64 fp i-prefetch++)
copyF: 1513634002 B/s (  66066 us) ( 54251674 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))
copyG: 3389821831 B/s (  29500 us) ( 23686951 tsc) (memcpy (movsl))
copyK: 3522482088 B/s (  28389 us) ( 23081104 tsc) (movq)
copyL: 3018586815 B/s (  33128 us) ( 27671714 tsc) (movq with prefetchnta)
copyM: 3057441649 B/s (  32707 us) ( 27641525 tsc) (movq with block prefetch)
copya: 2584306603 B/s (  38695 us) ( 31756341 tsc) (~i686_memcpy (movaps))
%%%

movsl/memcpy is simplest and best here.
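
For reference, the movsl method is essentially just the string
instruction; a minimal sketch of a C wrapper (the name and the
multiple-of-4 length assumption are mine) looks like:

%%%
#include <stddef.h>

/*
 * Sketch only: copy using rep movsl, i.e. roughly what copy0 and
 * memcpy amount to here.  Assumes len is a multiple of 4.
 */
static void
copy_movsl(void *dst, const void *src, size_t len)
{
	size_t cnt = len / 4;		/* longwords */

	__asm__ __volatile__("rep movsl"
	    : "+D" (dst), "+S" (src), "+c" (cnt)
	    :
	    : "memory");
}
%%%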

copyL is like the Linux-2.6.10 _mmx_memcpy().  The latter uses `prefetch'
which is older than prefetchnta and differently unportable.  I've never
noticed much difference between these, but the older instruction might
work better on older CPUs like PIII's.  Note that prefetchnta is slower
than explicit block prefetch.  This happens on AthlonXP's too, and IIRC
the XP optimisation guide points this out and uses block prefetch for
its last and biggest copy optimization.  Note that prefetching is just
a loss for the fully cached case.  The main point of interest here is
that block prefetch still beats prefetchnta by an insignificant amount
(it might be expected to lose because it takes more instructions and
the bottleneck in the fully cached case is instruction execution).
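
A rough sketch of the copyL-style loop (movq loads/stores with
prefetchnta), in the spirit of Linux's _mmx_memcpy() but not a copy of
it -- the alignment and length assumptions and the prefetch distance
here are arbitrary:

%%%
#include <mmintrin.h>		/* MMX: __m64, _mm_empty() */
#include <xmmintrin.h>		/* SSE: _mm_prefetch() */
#include <stddef.h>

/*
 * Sketch only: copy 64 bytes per iteration through MMX registers,
 * prefetching the next block with prefetchnta.  Assumes 8-byte
 * alignment and len a multiple of 64.
 */
static void
copy_movq_nta(void *dst, const void *src, size_t len)
{
	__m64 *d = dst;
	const __m64 *s = src;

	for (; len >= 64; len -= 64, s += 8, d += 8) {
		_mm_prefetch((const char *)(s + 8), _MM_HINT_NTA);
		d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
		d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
	}
	_mm_empty();		/* emms: release the MMX/FPU state */
}
%%%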

There aren't many methods using XMM registers here because the methods
are limited to ones that work on machines with only plain SSE, and I
couldn't find any such machines where using either MMX or XMM was any
use.  copyL uses MMX registers; copya uses XMM registers.  copya
loses significantly to movsl/memcpy and to copyL.

freefall, block size 4096K (fully uncached)
%%%
copy0:  199343794 B/s ( 493138 us) (613912221 tsc) (movsl)
copy1:  185455801 B/s ( 530067 us) (636100521 tsc) (unroll 16)
copy2:  181088365 B/s ( 542851 us) (474134548 tsc) (unroll 16 prefetch)
copy3:  183647620 B/s ( 535286 us) (456166441 tsc) (unroll 64 i586-opt)
copy4:  177010464 B/s ( 555357 us) (466214836 tsc) (unroll 64 i586-opt prefetch)
copy5:  176540627 B/s ( 556835 us) (465979821 tsc) (unroll 64 i586-opx prefetch)
copy6:  181682761 B/s ( 541075 us) (457801523 tsc) (unroll 32 prefetch 2)
copy7:  174978240 B/s ( 561807 us) (486332757 tsc) (unroll 64 fp++)
copy8:  192576224 B/s ( 510468 us) (429718012 tsc) (unroll 128 fp i-prefetch)
copy9:  177291074 B/s ( 554478 us) (473659591 tsc) (unroll 64 fp reordered)
copyA:  179384243 B/s ( 548008 us) (476730841 tsc) (unroll 256 fp reordered++)
copyB:  182308792 B/s ( 539217 us) (455082354 tsc) (unroll 512 fp reordered++)
copyC:  132747808 B/s ( 740532 us) (621009558 tsc) (Terje cksum)
copyD:  191875581 B/s ( 512332 us) (434236713 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
copyE:  192663787 B/s ( 510236 us) (430394085 tsc) (unroll 64 fp i-prefetch++)
copyF:  192714776 B/s ( 510101 us) (431859413 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))
copyG:  184343619 B/s ( 533265 us) (451905971 tsc) (memcpy (movsl))
copyK:  182133150 B/s ( 539737 us) (479260121 tsc) (movq)
copyL:  185353345 B/s ( 530360 us) (449688860 tsc) (movq with prefetchnta)
copyM:  187979371 B/s ( 522951 us) (442852446 tsc) (movq with block prefetch)
copya:  185523701 B/s ( 529873 us) (465249860 tsc) (~i686_memcpy (movaps))
%%%

movsl/memcpy is still simplest and best.

Other methods are only slightly slower (except copyC, which does a
checksum in parallel with the read/write; extra operations combined
with copying are free on some machines, but not here, even in the
fully uncached case).

freefall's times may be inaccurate since freefall is loaded, and the
tsc's may be very inaccurate because freefall is SMP, but the following
are very accurate since the machine is unloaded and !SMP:

Athlon XP2600, 193MHz FSB, 8-3-3-2.5 memory (not quite PC3200), block size 4K:
%%%
copy0: 6492646669 B/s (  15402 us) ( 34282451 tsc) (movsl)
copy1: 5815290998 B/s (  17196 us) ( 38282332 tsc) (unroll 16)
copy2: 5099686063 B/s (  19609 us) ( 44504640 tsc) (unroll 16 prefetch)
copy3: 6580229256 B/s (  15197 us) ( 33837406 tsc) (unroll 64 i586-opt)
copy4: 6608931597 B/s (  15131 us) ( 33685684 tsc) (unroll 64 i586-opt prefetch)
copy5: 6620745763 B/s (  15104 us) ( 33624302 tsc) (unroll 64 i586-opx prefetch)
copy6: 5371132452 B/s (  18618 us) ( 41448096 tsc) (unroll 32 prefetch 2)
copy7: 7544303584 B/s (  13255 us) ( 29523386 tsc) (unroll 64 fp++)
copy8: 8178600147 B/s (  12227 us) ( 28765763 tsc) (unroll 128 fp i-prefetch)
copy9: 9280718701 B/s (  10775 us) ( 25879055 tsc) (unroll 64 fp reordered)
copyA: 8625128860 B/s (  11594 us) ( 25817196 tsc) (unroll 256 fp reordered++)
copyB: 8883338723 B/s (  11257 us) ( 25161370 tsc) (unroll 512 fp reordered++)
copyC: 2927478673 B/s (  34159 us) ( 76030527 tsc) (Terje cksum)
copyD: 7751918140 B/s (  12900 us) ( 28727306 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
copyE: 7834514572 B/s (  12764 us) ( 28403840 tsc) (unroll 64 fp i-prefetch++)
copyF: 7818588272 B/s (  12790 us) ( 28475409 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))
copyG: 6419292849 B/s (  15578 us) ( 34670640 tsc) (memcpy (movsl))
copyH: 2950106027 B/s (  33897 us) ( 75467527 tsc) (movntps)
copyI: 2939094286 B/s (  34024 us) ( 77103399 tsc) (movntps with prefetchnta)
copyJ: 2940477064 B/s (  34008 us) ( 77144512 tsc) (movntps with block prefetch)
copyK: 11064366453 B/s (   9038 us) ( 20691582 tsc) (movq)
copyL: 9832816519 B/s (  10170 us) ( 22685200 tsc) (movq with prefetchnta)
copyM: 9853162282 B/s (  10149 us) ( 22599677 tsc) (movq with block prefetch)
copyN: 2950018998 B/s (  33898 us) ( 75452984 tsc) (movntq)
copyO: 2933576156 B/s (  34088 us) ( 77122605 tsc) (movntq with prefetchnta)
copyP: 2885246083 B/s (  34659 us) ( 77147363 tsc) (movntq with block prefetch)
copyQ: 6749442765 B/s (  14816 us) ( 32985677 tsc) (movdqa)
copya: 7504108059 B/s (  13326 us) ( 29680371 tsc) (~i686_memcpy (movaps))
%%%

Now movq is best.  It is almost twice as fast as movsl.  This is
because movsl only issues 32-bit accesses, and the limit on the number
of accesses per cycle is the same whether they are 32-bit or 64-bit,
at least for reads and writes in parallel (AXP's have some read/write
asymmetry that gets in the way of other access mixes; A64's are better
here).

Even the old PI FPU method easily beats movsl.  It was turned off because
it was a large loss on PII's.

There are now some SSE+ methods (movnt*).  These use an AthlonXP
extension of SSE.  They are just a loss in the fully cached case (and
in all cases for small data unless you know that the target shouldn't
be cached).

AthlonXP... block size 4096K:
%%%
copy0:  636873680 B/s ( 154354 us) (344356579 tsc) (movsl)
copy1:  649887944 B/s ( 151263 us) (337326810 tsc) (unroll 16)
copy2:  582949855 B/s ( 168632 us) (376274011 tsc) (unroll 16 prefetch)
copy3:  736911544 B/s ( 133400 us) (315267117 tsc) (unroll 64 i586-opt)
copy4:  683944313 B/s ( 143731 us) (320308617 tsc) (unroll 64 i586-opt prefetch)
copy5:  684006179 B/s ( 143718 us) (320114790 tsc) (unroll 64 i586-opx prefetch)
copy6:  656704054 B/s ( 149693 us) (333513466 tsc) (unroll 32 prefetch 2)
copy7:  675350371 B/s ( 145560 us) (324722661 tsc) (unroll 64 fp++)
copy8:  793971554 B/s ( 123813 us) (276326666 tsc) (unroll 128 fp i-prefetch)
copy9:  679120150 B/s ( 144752 us) (322757764 tsc) (unroll 64 fp reordered)
copyA:  650429743 B/s ( 151137 us) (336686142 tsc) (unroll 256 fp reordered++)
copyB:  686849773 B/s ( 143123 us) (318835219 tsc) (unroll 512 fp reordered++)
copyC:  656370811 B/s ( 149769 us) (333826275 tsc) (Terje cksum)
copyD:  777715366 B/s ( 126401 us) (282197950 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
copyE:  779930499 B/s ( 126042 us) (280900317 tsc) (unroll 64 fp i-prefetch++)
copyF:  773888810 B/s ( 127026 us) (283770359 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))
copyG:  636189490 B/s ( 154520 us) (344278918 tsc) (memcpy (movsl))
copyH: 1056702749 B/s (  93029 us) (207224289 tsc) (movntps)
copyI: 1072590588 B/s (  91651 us) (204188841 tsc) (movntps with prefetchnta)
copyJ: 1395630138 B/s (  70437 us) (156912756 tsc) (movntps with block prefetch)
copyK:  708242075 B/s ( 138800 us) (309879060 tsc) (movq)
copyL:  706770485 B/s ( 139089 us) (311075317 tsc) (movq with prefetchnta)
copyM:  814300625 B/s ( 120722 us) (269160923 tsc) (movq with block prefetch)
copyN: 1076549051 B/s (  91314 us) (203659502 tsc) (movntq)
copyO: 1066898198 B/s (  92140 us) (205514511 tsc) (movntq with prefetchnta)
copyP: 1413551133 B/s (  69544 us) (155496730 tsc) (movntq with block prefetch)
copyQ:  680954822 B/s ( 144362 us) (321945223 tsc) (movdqa)
copya:  710699826 B/s ( 138320 us) (308106574 tsc) (~i686_memcpy (movaps))
%%%

Now the movnt* methods win easily.  Block prefetch wins easily over
prefetchnta.  (Unlike for PIII's, I know that it is preferred to plain
"prefetch".)

Athlon64's behave significantly differently here (details not shown):
- movsl is still quite slow
- movsq/memcpy has the same speed as movq (MMX) and movq (64-bit integer)
- the memory system is better relative to the CPU, so the fully cached case
   is not so much faster, especially with DDR2
- prefetchnta now wins over block prefetch, since the memory system now
   actually understands prefetchnta
- movnt* is a larger win.

Memcpy (movsq) is simplest and best again unless movnt* is used.  amd64
already uses the simplest and best methods except for large
copyin/copyout's, where it should probably use movnt*.  It is unclear
whether a block size of 8K is large -- in cases where the application
actually uses the data, it may be best not to use movnt*.  movnt* for
8K writes is more likely to be right, since in many cases the kernel's
only "use" of the data is to DMA it to a disk drive, and for that it
should never be put in the CPU's caches.
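
For illustration, a nontemporal copy on amd64 could use movnti.  The
loop below is only a sketch (the helper name and the alignment/length
assumptions are mine), using inline asm for the store:

%%%
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch only: copy with movnti (nontemporal 64-bit integer stores)
 * on amd64.  Assumes 8-byte alignment and len a multiple of 32.
 */
static void
copy_movnti(void *dst, const void *src, size_t len)
{
	uint64_t *d = dst;
	const uint64_t *s = src;

	for (; len >= 32; len -= 32, s += 4, d += 4) {
		__asm__ __volatile__("movnti %1, %0" : "=m" (d[0]) : "r" (s[0]));
		__asm__ __volatile__("movnti %1, %0" : "=m" (d[1]) : "r" (s[1]));
		__asm__ __volatile__("movnti %1, %0" : "=m" (d[2]) : "r" (s[2]));
		__asm__ __volatile__("movnti %1, %0" : "=m" (d[3]) : "r" (s[3]));
	}
	__asm__ __volatile__("sfence" ::: "memory");	/* drain WC buffers */
}
%%%

One attraction of movnti over the XMM-based movnt* stores is that it
writes straight from integer registers, so a kernel copy routine can
use it without saving and restoring the FPU/SSE state.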

Bruce


