Date:      Sat, 6 Oct 2012 08:44:17 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        John Baldwin <jhb@freebsd.org>
Cc:        Garrett Cooper <yanegomi@gmail.com>, Andriy Gapon <avg@freebsd.org>, freebsd-arch@freebsd.org, Konstantin Belousov <kostikbel@gmail.com>, Dag-Erling Smørgrav <des@des.no>, Dimitry Andric <dimitry@andric.com>
Subject:   Re: x86 boot code build
Message-ID:  <20121006072636.V978@besplex.bde.org>
In-Reply-To: <201210051141.16147.jhb@freebsd.org>
References:  <506C385C.3020400@FreeBSD.org> <86a9w1kq94.fsf@ds4.des.no> <20121005133616.GP35915@deviant.kiev.zoral.com.ua> <201210051141.16147.jhb@freebsd.org>

On Fri, 5 Oct 2012, John Baldwin wrote:

> On Friday, October 05, 2012 9:36:16 am Konstantin Belousov wrote:
>> On Fri, Oct 05, 2012 at 03:22:31PM +0200, Dag-Erling Smørgrav wrote:
>>> Konstantin Belousov <kostikbel@gmail.com> writes:
>>>> So what ISA additions do you expect to get advantage of by switching
>>>> to pentium-mmx from 486 ? As I already said, I am not aware of any.
>>>
>>> The TSC, for one.  MMX, and the ability to use MMX registers to copy
>>> data.
>>
>> TSC is used regardless of the compiler flags, we use it if CPU claims
>> that TSC is supported, even in usermode.
>>
>> Compiler never generates MMX copies. More, in kernel, the manual
>> FPU context save/restore is needed around the FPU/MMX register file access.

1. The TSC provides no significant performance advantage for boot code.

    In the kernel it is very difficult to use, and provides few advantages
    on pentium-mmx.  It takes roughly a Core2 or later for the TSC to be
    P-state invariant.  Only then is it not quite so difficult to use, and
    only then does it provide some advantages.

2. MMX for copying data provides no significant performance advantage for
    boot code.

    In the kernel, it is difficult to use, and provides few advantages for
    pentium-mmx.  MMX registers are only 64 bits wide, and the copying
    speed tends to be limited more by (lack of) caches and write buffers
    than by the registers used.  SSE registers provide larger advantages
    by being 128 bits wide.  It takes about an AthlonXP or later to get
    SSE plus enough extensions (at least one movnt* instruction is needed,
    and I think basic SSE doesn't have any).  The best method is very
    machine- and context-dependent.  Someone named des removed my hooks
    for plugging in the best known copying routines at runtime.  I was
    happy to see them gone, since they are too complicated to use.  There
    would have to be about 100 different versions for each of bcopy,
    bzero, copyin and copyout (memcpy and friends are intentionally not
    optimized, since use of them for large data asks for slowness).  I
    only tested about 40 different versions of bcopy and 20 of bzero.
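The shape of such runtime hooks is just a function pointer selected at
boot.  A hypothetical sketch (the names here are invented, and the real
hooks dispatched among many asm variants rather than falling back to
memmove):

```c
/* Hypothetical sketch of runtime-pluggable bcopy selection.  Names are
 * invented for illustration; the real (removed) hooks chose among many
 * hand-written asm copy routines after probing CPU features. */
#include <stddef.h>
#include <string.h>

typedef void (*bcopy_fn)(const void *src, void *dst, size_t len);

static void
bcopy_generic(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);
}

/* Would be reassigned at boot after probing for MMX/SSE/movnt*. */
static bcopy_fn bcopy_impl = bcopy_generic;

void
pluggable_bcopy(const void *src, void *dst, size_t len)
{
	(*bcopy_impl)(src, dst, len);
}
```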

> I agree with kib.  I don't think building i386 releases with > i486 buys
> you much of anything.  Using MMX in the kernel is of dubious value (have to
> be very careful to use it, and when tested in the past by bde@ for things like
> bcopy() and bzero() it wasn't a clear win IIRC).

Here are results of a current run of old test code on core2
(ref10-i386).  Results are shown only for a data size of 4K: for much
smaller sizes, simple methods are best, while for much larger sizes all
reasonable methods are limited by the speed of main memory and cache
overheads and have the same speed, except that ones using movnt* are
faster since they bypass the caches:

% copy0: 12146747898 B/s ( 263445 us) (511794241 tsc) (movsl)

movsl is a good general method, and on this CPU it is almost twice as
fast as all other methods that don't use SSE.  (On Athlon64, some of
the other non-SSE methods are competitive).

% copy1: 7120415120 B/s ( 449412 us) (838775735 tsc) (unroll *4)
% copy2: 5773557468 B/s ( 554251 us) (1032266095 tsc) (unroll *4 prefetch)
% copy3: 4452898768 B/s ( 718633 us) (1338746402 tsc) (unroll *16 i586-opt)
% copy4: 6465613041 B/s ( 494926 us) (921710503 tsc) (unroll *16 i586-opt prefetch)
% copy5: 6328337902 B/s ( 505662 us) (942113053 tsc) (unroll *16 i586-opx prefetch)
% copy6: 4838090285 B/s ( 661418 us) (1231845839 tsc) (unroll *8 prefetch 4)
% copy7: 7290755322 B/s ( 438912 us) (817908588 tsc) (unroll 64 fp++)
% copy8: 6463210196 B/s ( 495110 us) (922004965 tsc) (unroll 128 fp i-prefetch)
% copy9: 7264439208 B/s ( 440502 us) (820443267 tsc) (unroll 64 fp reordered)
% copyA: 7298770613 B/s ( 438430 us) (816486286 tsc) (unroll 256 fp reordered++)
% copyB: 7296257704 B/s ( 438581 us) (816792606 tsc) (unroll 512 fp reordered++)
% copyC:  700413769 B/s (4568728 us) (8509304678 tsc) (Terje cksum)
% copyD: 6266866684 B/s ( 510622 us) (951099730 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
% copyE: 6479962740 B/s ( 493830 us) (919570911 tsc) (unroll 64 fp i-prefetch++)

Raw (i586-optimized) kernel bcopy (copy9) is 12.5% faster than the
non-raw version (copyD) mainly because it is sloppy and doesn't do FPU
state switching.

% copyF: 6463432123 B/s ( 495093 us) (922068252 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))

"new" kernel bcopy has some improvements for Pentium1 related to fxch.  These
make little difference on core2, since the overhead of fxch is pipelined
almost out of existence on core2.

% copyG: 11128460690 B/s ( 287551 us) (535363248 tsc) (memcpy (movsl))
% copyH: 2494210703 B/s (1282971 us) (2389591890 tsc) (movntps)
% copyI: 2283259781 B/s (1401505 us) (2611152194 tsc) (movntps with prefetchnta)
% copyJ: 2246123156 B/s (1424677 us) (2662460521 tsc) (movntps with block prefetch)
% copyK: 13432566418 B/s ( 238227 us) (443974286 tsc) (movq)
% copyL: 11812171705 B/s ( 270907 us) (504438067 tsc) (movq with prefetchnta)
% copyM: 12430515361 B/s ( 257431 us) (479327961 tsc) (movq with block prefetch)

movq (64 bits through MMX registers) gives the same speed as movsl.  But
state switching for MMX would probably cost 12.5% like it does for i586-
optimized kernel bcopy.

% copyQ: 26618974338 B/s ( 120215 us) (223928117 tsc) (movdqa)
% copyR: 21855833459 B/s ( 146414 us) (272801830 tsc) (movdqa with prefetchnta)
% copyS: 22343716179 B/s ( 143217 us) (266771960 tsc) (movdqa with block prefetch)

movdqa (128 bits through SSE registers using an SSE2 instruction) is the
only method tested that is significantly faster than movsl (about twice
as fast).  Here all data is in the L1 cache except possibly for the first
iteration (there are several hundred thousand iterations).
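Written with SSE2 intrinsics instead of the raw asm the tests use, a
movdqa copy loop looks roughly like this (a sketch only: both pointers
must be 16-byte aligned and the length a multiple of 16, and kernel use
would need FPU/SSE context switching around it):

```c
/* Sketch of a movdqa-style copy: 16 bytes per iteration through SSE
 * registers.  Both buffers must be 16-byte aligned and len a multiple
 * of 16.  Compile with -msse2 on ia32; illustrative only. */
#include <emmintrin.h>		/* SSE2 intrinsics */
#include <stddef.h>

void
copy_movdqa(const void *src, void *dst, size_t len)
{
	const __m128i *s = (const __m128i *)src;
	__m128i *d = (__m128i *)dst;
	size_t i;

	for (i = 0; i < len / 16; i++)
		/* Aligned 128-bit load and store: movdqa. */
		_mm_store_si128(&d[i], _mm_load_si128(&s[i]));
}
```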

% copyT: 6627728760 B/s ( 482820 us) (899276378 tsc) (unroll *8 a64-opt)
% copyU: 6441859201 B/s ( 496751 us) (925378496 tsc) (unroll *8 a64-opt with prefetchnta)
% copyV: 6514737558 B/s ( 491194 us) (914475275 tsc) (unroll *8 a64-opt with block prefetch)
% copyW: 2769764215 B/s (1155333 us) (2151805649 tsc) (movnti)
% copyX: 2519306152 B/s (1270191 us) (2365494292 tsc) (movnti with prefetchnta)
% copyY: 2494284581 B/s (1282933 us) (2389247728 tsc) (movnti with block prefetch)

movnti gives the speed of main memory, which is very slow on ref10-i386
(2.7 GB/s).  The source is cached, so the only limit should be writing
to the target; movnti prevents the target from being cached.  If the
data size were larger than all caches, then movnti would be best and we
would hope for a speed of 2.7/2 GB/s; without movnti, we would only
hope for 2.7/3 GB/s (the write-allocate read of the target adds a third
memory transit on top of the source read and the writeback).  It is
very difficult for copy routines or their callers to know whether
movnti should be used, to possibly get this speedup by a factor of 1.5
for large data at the possible cost of a slowdown by a factor of 10 for
small data.  Some systems have relatively faster main memory, and it is
clear that movnti is less good for them.
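Again as an intrinsics sketch rather than the tested asm, a
movnti-style copy is just a loop of non-temporal stores followed by an
sfence to order them:

```c
/* Sketch of a movnti-style copy: non-temporal 32-bit stores bypass the
 * cache, so writing the target does not evict useful data.  dst must
 * be 4-byte aligned and len a multiple of 4.  Compile with -msse2;
 * illustrative only. */
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

void
copy_movnti(const int32_t *src, int32_t *dst, size_t len)
{
	size_t i;

	for (i = 0; i < len / 4; i++)
		_mm_stream_si32(&dst[i], src[i]);	/* movnti */
	_mm_sfence();		/* make the NT stores globally visible */
}
```

Whether this wins depends entirely on whether the target would
otherwise have stayed cached, which is the hard part noted above.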

% copyZ: 18939281846 B/s ( 168961 us) (314562465 tsc) (i686_memcpy( movdqa))
% copya: 19157661568 B/s ( 167035 us) (311110842 tsc) (~i686_memcpy (movaps))

> Also, for the boot code, the most important thing is size.  The text + data +
> stack for /boot/loader has to all fit below 640k (and the first 40k is
> reserved by BTX, so you really only have 600k for that, minus any "low" memory
> consumed by things like PXE ROMs).  That is true even on amd64, and won't be
> any better on x86 until we fully support EFI for booting.

Compiling boot code for newer processors would mainly break it for
emergency use on older processors.

Bruce


