Date:      Sat, 2 Apr 2005 20:49:41 +1000 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Matthew Dillon <dillon@apollo.backplane.com>
Cc:        bde@FreeBSD.org
Subject:   Re: Fwd: 5-STABLE kernel build with icc broken
Message-ID:  <20050402191529.Q1235@epsplex.bde.org>
In-Reply-To: <200504011804.j31I4Ens059405@apollo.backplane.com>
References:  <423C15C5.6040902@fsn.hu> <20050327133059.3d68a78c@Magellan.Leidinger.net> <5bbfe7d405032823232103d537@mail.gmail.com> <424A23A8.5040109@ec.rr.com><20050330130051.GA4416@VARK.MIT.EDU> <200504010315.j313FGLn056122@apollo.backplane.com> <200504011804.j31I4Ens059405@apollo.backplane.com>

On Fri, 1 Apr 2005, Matthew Dillon wrote:

> :>    The use of the XMM registers is a cpu optimization.  Modern CPUs,
> :>    especially AMD Athlon and Opterons, are more efficient with 128 bit
> :>    moves than with 64 bit moves.   I experimented with all sorts of
> :>    configurations, including the use of special data caching instructions,
> :>    but they had so many special cases and degenerate conditions that
> :>    I found that simply using straight XMM instructions, reading as big
> :>    a glob as possible, then writing the glob, was by far the best solution.
> :
> :Are you sure about that?  The amd64 optimization manual says (essentially)

This is in 25112.PDF section 5.16 ("Interleave Loads and Stores", with
128 bits of loads followed by 128 bits of stores).

> :that big globs are bad, and my benchmarks confirm this.  The best glob size
> :is 128 bits according to my benchmarks.  This can be obtained using 2
> :...
> :
> :Unfortunately (since I want to avoid using both MMX and XMM), I haven't
> :managed to make copying through 64-integer registers work as well.
> :Copying 128 bits at a time using 2 pairs of movq's through integer
> :registers gives only 7.9GB/sec.  movq through MMX is never that slow.
> :However, movdqu through xmm is even slower (7.4GB/sec).

I forgot many of my earlier conclusions when I wrote the above.  The
speeds between 7.4GB/sec and 12.9GB/sec for the fully (L1) cached case
are almost irrelevant.  They basically just tell how well we have
used the instruction bandwidth.  Plain movsq uses it better and gets
15.9GB/sec.  I believe 15.9GB/sec is from saturating the L1 cache.
The CPU is an Athlon64 and its clock frequency is 1994 MHz, and I think
the max L1 cache bandwidth is 16 bytes of combined load and store traffic
per cycle, i.e., an 8-byte load plus an 8-byte store, so the peak copy
rate is 8*1994*10^6 = 15.95GB/sec (disk manufacturers' GB).

Plain movsq is best here for many other cases too...
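
By "plain movsq" I mean essentially just the string instruction.  A
minimal userland sketch, assuming amd64, gcc-style inline asm and a
length that is a multiple of 8 (not the actual benchmark or kernel
code), would be something like:

%%%
#include <stddef.h>

/*
 * Copy len bytes, 8 at a time, with "rep movsq".  len must be a
 * multiple of 8 and the buffers must not overlap.  Illustration only.
 */
static void
copy_movsq(void *dst, const void *src, size_t len)
{
	long d0, d1, d2;

	__asm__ __volatile__ (
		"rep movsq"
		: "=&c" (d0), "=&D" (d1), "=&S" (d2)
		: "0" (len / 8), "1" (dst), "2" (src)
		: "memory");
}
%%%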

> :
> :The fully cached case is too unrepresentative of normal use, and normal
> :(partially cached) use is hard to benchmark, so I normally benchmark
> :the fully uncached case.  For that, movnt* is best for benchmarks but
> :not for general use, and it hardly matters which registers are used.
>
>    Yah, I'm pretty sure.  I tested the fully cached (L1), partially
>    cached (L2), and the fully uncached cases.   I don't have a logic

By the partially cached case, I meant the case where some of the source
and/or target addresses are in the L1 or L2 cache, but you don't really
know the chance that they are there (or should be there after the copy),
so you can only guess the best strategy.

>    analyzer but what I think is happening is that the cpu's write buffer
>    is messing around with the reads and causing extra RAS cycles to occur.
>    I also tested using various combinations of movdqa, movntdq, and
>    prefetcha.

Somehow I'm only seeing small variations from different strategies now,
with all tests done in userland on an Athlon64 system (and on athlonXP
systems for reference).  Using XMM or MMX can be twice as fast on
the AthlonXPs, but movsq is the outright fastest in many cases on
the Athlon64, and is < 5% slower than the fastest in all other cases
(except for the fully uncached case, since it can't do nontemporal
stores), so it is the best general method.

>...
>    I also think there might be some odd instruction pipeline effects
>    that skew the results when only one or two instructions are between
>    the load into an %xmm register and the store from the same register.
>    I tried using 2, 4, and 8 XMM registers.  8 XMM registers seemed to
>    work the best.

I'm getting only small variations from different load/store patterns.
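
The pattern I'm testing against is along the lines of the following
sketch (SSE2 intrinsics; it assumes 16-byte-aligned buffers and a
length that is a multiple of 128, and it is only an illustration of
the "read a big glob, then write it" idea, not your actual routine):

%%%
#include <stddef.h>
#include <emmintrin.h>

/*
 * Load a 128-byte glob into 8 XMM registers, then store it.
 * Illustration only.
 */
static void
copy_xmm_glob(void *dst, const void *src, size_t len)
{
	const __m128i *s = (const __m128i *)src;
	__m128i *d = (__m128i *)dst;
	__m128i x0, x1, x2, x3, x4, x5, x6, x7;

	for (; len >= 128; len -= 128, s += 8, d += 8) {
		/* Read the whole glob first ... */
		x0 = _mm_load_si128(s + 0);
		x1 = _mm_load_si128(s + 1);
		x2 = _mm_load_si128(s + 2);
		x3 = _mm_load_si128(s + 3);
		x4 = _mm_load_si128(s + 4);
		x5 = _mm_load_si128(s + 5);
		x6 = _mm_load_si128(s + 6);
		x7 = _mm_load_si128(s + 7);
		/* ... then write it. */
		_mm_store_si128(d + 0, x0);
		_mm_store_si128(d + 1, x1);
		_mm_store_si128(d + 2, x2);
		_mm_store_si128(d + 3, x3);
		_mm_store_si128(d + 4, x4);
		_mm_store_si128(d + 5, x5);
		_mm_store_si128(d + 6, x6);
		_mm_store_si128(d + 7, x7);
	}
}
%%%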

>
>    Of course, I primarily tested on an Athlon 64 3200+, so YMMV.  (One
>    of the first Athlon 64's, so it has a 1MB L2 cache).

My test system is very similar:

%%%
CPU: AMD Athlon(tm) 64 Processor 3400+ (1994.33-MHz K8-class CPU)
   Origin = "AuthenticAMD"  Id = 0xf48  Stepping = 8
   Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
   AMD Features=0xe0500800<SYSCALL,NX,MMX+,LM,3DNow+,3DNow>
L1 2MB data TLB: 8 entries, fully associative
L1 2MB instruction TLB: 8 entries, fully associative
L1 4KB data TLB: 32 entries, fully associative
L1 4KB instruction TLB: 32 entries, fully associative
L1 data cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative
L1 instruction cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative
L2 2MB unified TLB: 0 entries, disabled/not present
L2 4KB data TLB: 512 entries, 4-way associative
L2 4KB instruction TLB: 512 entries, 4-way associative
L2 unified cache: 1024 kbytes, 64 bytes/line, 1 lines/tag, 16-way associative
%%%

>    The prefetchnta I have commented out seemed to improve performance,
>    but it requires 3dNOW and I didn't want to NOT have an MMX copy mode
>    for cpu's with MMX but without 3dNOW.  Prefetching less than 128 bytes
>    did not help, and prefetching greater than 128 bytes (e.g. 256(%esi))
>    seemed to cause extra RAS cycles.  It was unbelievably finicky, not at
>    all what I expected.

Prefetching is showing some very good effects here, but there are MD
complications:
- the Athlon[32] optimization manual says that block prefetch is sometimes
   better than prefetchnta, and gives examples.  The reason is that you
   can schedule the block prefetch.
- alc@ and/or the Athlon64 optimization manual say that prefetchnta now
   works better.
- testing shows that prefetchnta does work better on my Athlon64 in some
   cases, but in the partially cached case (source in the L2 cache) it
   reduces the bandwidth by almost a factor of 2:

%%%
copyH: 2562223788 B/s ( 390253 us) (778523741 tsc) (movntps)
copyI: 1269129646 B/s ( 787875 us) (1571812294 tsc) (movntps with prefetchnta)
copyJ: 2513196704 B/s ( 397866 us) (793703852 tsc) (movntps with block prefetch)
copyN: 2562020272 B/s ( 390284 us) (778737276 tsc) (movntq)
copyO: 1279569209 B/s ( 781447 us) (1559037466 tsc) (movntq with prefetchnta)
copyP: 2561869298 B/s ( 390307 us) (778732346 tsc) (movntq with block prefetch)
%%%

The machine has PC2700 memory so we can hope for a copy bandwidth of
nearly 2.7GB/sec for repeatedly copying a buffer of size 160K as the
benchmark does, since the buffer should stay in the L2 cache.  We
actually get 2.5+GB/sec here and for all bzero benchmarks using movnt*,
but when we use prefetchnta we get about half this, and not much more
than for the fully uncached case (1.2GB/sec).
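
For reference, the copyI- and copyJ-style loops are along the lines of
the following sketches (SSE intrinsics; the 128-byte prefetch distance
and the 4K block size are only example values, not necessarily what the
benchmark uses):

%%%
#include <stddef.h>
#include <xmmintrin.h>

/*
 * copyI-style: movntps with prefetchnta.  Assumes 16-byte-aligned
 * buffers and a length that is a multiple of 64 bytes.
 */
static void
copy_movntps_nta(float *dst, const float *src, size_t len)
{
	size_t i;

	for (i = 0; i < len / sizeof(float); i += 16) {
		_mm_prefetch((const char *)&src[i] + 128, _MM_HINT_NTA);
		_mm_stream_ps(&dst[i +  0], _mm_load_ps(&src[i +  0]));
		_mm_stream_ps(&dst[i +  4], _mm_load_ps(&src[i +  4]));
		_mm_stream_ps(&dst[i +  8], _mm_load_ps(&src[i +  8]));
		_mm_stream_ps(&dst[i + 12], _mm_load_ps(&src[i + 12]));
	}
	_mm_sfence();		/* order the nontemporal stores */
}

/*
 * copyJ-style: block prefetch.  Touch one word per 64-byte cache line
 * of a block of the source with ordinary loads (so the "prefetch" is
 * scheduled explicitly), then stream the block out with movntps.
 * Assumes a length that is a multiple of the 4K block size.
 */
static void
copy_movntps_blockpf(float *dst, const float *src, size_t len)
{
	volatile float sink;
	size_t i, j;

	for (i = 0; i < len / sizeof(float); i += 1024) {
		for (j = i; j < i + 1024; j += 16)	/* touch the block */
			sink = src[j];
		for (j = i; j < i + 1024; j += 4)	/* then stream it */
			_mm_stream_ps(&dst[j], _mm_load_ps(&src[j]));
	}
	_mm_sfence();
}
%%%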

The corresponding speeds for the fully uncached case (copying 1600K) are:

%%%
copyH: 1061395711 B/s ( 941613 us) (1879293692 tsc) (movntps)
copyI: 1246904647 B/s ( 801524 us) (1599118394 tsc) (movntps with prefetchnta)
copyJ: 1227740822 B/s ( 814035 us) (1624787631 tsc) (movntps with block prefetch)
copyN: 1049642023 B/s ( 952157 us) (1900292204 tsc) (movntq)
copyO: 1247088242 B/s ( 801406 us) (1598888249 tsc) (movntq with prefetchnta)
copyP: 1226714585 B/s ( 814716 us) (1625985669 tsc) (movntq with block prefetch)
%%%

For the fully uncached case, the speeds for simple copying methods are all
about 0.64GB/sec on this machine, and sophisticated methods that don't use
nontemporal writes only improve this to 0.68GB/sec.

Bruce


