Date:      Sat, 26 Oct 1996 15:53:20 +1000
From:      Bruce Evans <bde@zeta.org.au>
To:        asami@freebsd.org, mark@quickweb.com
Cc:        current@freebsd.org, ejs@bfd.com, michaelv@MindBender.serv.net, rgrimes@gndrsh.aac.dev.com, scrappy@ki.net, smp@freebsd.org
Subject:   Re: Recommendations...
Message-ID:  <199610260553.PAA14727@godzilla.zeta.org.au>

>>  * What low memory bandwidth on the Natoma???  That thing smokes when comparied
>>  * to a 430HX chipset.
>> 
>> That contradicts our findings.  A P5-133 with Triton or Triton II can
>> move 70-80MB/s (depending on EDO or non-EDO), but I can't get more
>> than 45MB/s out of a P6-200 with Natoma/server (at least that's what
>> Intel told us).
>
>That's odd, here are my speeds on a P6-200 with Natoma (440fx)/server
>board straight from intel:
>
>Function      Rate (MB/s)   RMS time     Min time     Max time
>Copy:          76.1639       0.0633       0.0630       0.0648
>Scale:         75.5894       0.0636       0.0635       0.0638
>Add:           81.3670       0.0886       0.0885       0.0887
>Triad:         80.6036       0.0894       0.0893       0.0896

This is because the four rates reported by the STREAM benchmark are scaled
by factors of 2, 2, 3 and 3, respectively (it counts the bytes of every
array read or written), and Natoma is very slow :-).
On a P5-133 with Triton 1 (ASUS P55TP4XE) with non-EDO RAM (66 MHz memory
clock):

Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:          88.7256       0.1446       0.1443       0.1471
Scale:         80.4207       0.1608       0.1592       0.1624
Add:           89.6191       0.2222       0.2142       0.2318
Triad:         88.3433       0.2232       0.2173       0.2318
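
To compare the two tables with plain throughput figures, each reported
rate can be divided by its scaling factor.  The following is only a
sketch of that arithmetic; the function and variable names are made up
for illustration and are not part of STREAM itself.

```python
# Number of arrays each STREAM kernel reads or writes per element; the
# reported rate counts the bytes of every one of them.
STREAM_FACTORS = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def per_array_rate(kernel, reported_mb_s):
    """Reported STREAM rate divided by its scaling factor."""
    return reported_mb_s / STREAM_FACTORS[kernel]


def stream_scale(kernel, throughput_mb_s):
    """Inverse: a plain throughput expressed on the STREAM scale."""
    return throughput_mb_s * STREAM_FACTORS[kernel]


# Rates copied from the two tables above (the dict names are illustrative).
natoma = {"Copy": 76.1639, "Scale": 75.5894, "Add": 81.3670, "Triad": 80.6036}
triton1 = {"Copy": 88.7256, "Scale": 80.4207, "Add": 89.6191, "Triad": 88.3433}

for name, table in (("P6-200/Natoma", natoma), ("P5-133/Triton 1", triton1)):
    for kernel, rate in table.items():
        print(f"{name:16s} {kernel:6s} {per_array_rate(kernel, rate):8.2f} MB/s")
```

For the Copy line this gives about 38 MB/s on the Natoma machine and
about 44 MB/s on the Triton 1 machine.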

This is still slow.  This machine can copy at > 75MB/s throughput or
150 MB/s on the same scale as the STREAM tests.  Getting this throughput
involves prefetching the source bytes a few K at a time and then using
FP operations to store them (and perforce FP operations to load them).
gcc "optimizes" the Copy benchmark to not use FP at all.  This is why the
more complicated Add and Triad benchmarks can be faster.  I guess the more
complicated benchmarks would be sped up to only about 120MB/s by the
same method.  The full memory bandwidth of 176MB/sec (on this system)
isn't quite reachable even for copying because the FPU is too slow
(fistpq takes 6 cycles, which is more than the minimum memory cycle time
and leaves no time for loop overheads).
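
The fistpq limit can be checked with a rough calculation.  This is a
sketch under stated assumptions: a nominal 133 MHz CPU clock for the
P5-133, 8 bytes written per fistpq, and MB meaning 10^6 bytes.  It
ignores the loads, the prefetching and the memory system itself, and the
function name and the overhead parameter are hypothetical.

```python
def store_ceiling_mb_s(cpu_hz, cycles_per_store, bytes_per_store,
                       overhead_cycles=0):
    """Upper bound on store rate if every store costs a fixed cycle count.

    overhead_cycles is an illustrative knob for loop overhead per store;
    it is not a measured value.
    """
    stores_per_second = cpu_hz / (cycles_per_store + overhead_cycles)
    return stores_per_second * bytes_per_store / 1e6


# Assumed: 133 MHz clock, 6 cycles per fistpq (from the text), 8 bytes each.
print(f"no overhead:       {store_ceiling_mb_s(133e6, 6, 8):.1f} MB/s")
# One extra cycle per store already pulls the bound down noticeably.
print(f"1 cycle overhead:  {store_ceiling_mb_s(133e6, 6, 8, 1):.1f} MB/s")
```

With no overhead the bound comes out near 177 MB/s, in the same range as
the 176MB/sec figure above, so any loop overhead at all pushes the
achievable rate below it.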

Bruce


