FreeBSD Mail Archives

Date:      Thu, 23 Mar 2017 20:14:03 +0100
From:      Stefan Esser <se@freebsd.org>
To:        freebsd-amd64@freebsd.org
Subject:   Re: FreeBSD on Ryzen
Message-ID:  <51b6c5d5-fc66-f371-ef54-c3d85a6f2c2d@freebsd.org>
In-Reply-To: <201703222030.v2MKUJJs026400@gw.catspoiler.org>
References:  <201703222030.v2MKUJJs026400@gw.catspoiler.org>

On 22.03.17 at 21:30, Don Lewis wrote:
> I put together a Ryzen 1700X machine over the weekend and installed the
> 12.0-CURRENT r315413 snapshot on it a couple of days ago.  The RAM is
> DDR4 2400.
> 
> First impression is that it's pretty zippy.  Compared to my previous
> fastest machine:
>   CPU: AMD FX-8320E Eight-Core Processor (3210.84-MHz K8-class CPU)
> make -j8 buildworld using tmpfs is a bit more than 2x faster.  Since the
> Ryzen has SMT, its eight cores look like 16 CPUs to FreeBSD, and I get
> almost a 2.6x speedup with -j16 as compared to my old machine.
> 
> I do see that the reported total CPU time increases quite a bit at -j16
> (~19900u) as compared to -j8 (~13600u) so it is running into some
> hardware bottlenecks that are slowing down instruction execution.  It
> could be the resources shared by both SMT threads that share each core,

It is the resources shared by the cores. Under full CPU load, SMT makes
a 3.3 GHz 8 core CPU "simulate" a ~2 GHz 16 core CPU.

The throughput is (to first order) proportional to cores * CPU clock, and
comes out as

	8 * 3.3 = 26.4  vs.  16 * ~2 = ~32  (estimated)

I'm positively surprised by the observed gain of +30% due to SMT. This
seems to match the reported user times:

13,600 /  8 = 1,700 seconds user time per physical core (on average)
19,900 / 16 = 1,244 seconds per virtual (SMT) core

vs. an estimate of the throughput with a CPU with SMT but without any
gain in throughput:

27,200 / 16 = 1,700 seconds per virtual core with ineffective SMT

(i.e. assuming SMT that does not increase effective IPC, resulting
in identical real time compared to the non-SMT case)

This result seems to match the increased performance when going from
-j 8 to -j 16:

27,200 / 19,900 = 1.37  ~  2.6 / 2.0 = 1.3
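
The arithmetic above can be re-checked with a few lines of Python. This is
only a sketch: the constants are the user times quoted in this thread, the
function names are made up for illustration, and it assumes that real time
is roughly user time divided by the number of busy CPUs.

```python
# User CPU seconds for buildworld, as quoted in the thread.
USER_J8 = 13_600    # make -j8: one thread per physical core
USER_J16 = 19_900   # make -j16: two SMT threads per core
CORES = 8
THREADS = 16


def per_cpu_user_time(total_user, cpus):
    """Average user seconds accumulated per (physical or virtual) CPU."""
    return total_user / cpus


def smt_gain(user_single, user_smt, cores, threads):
    """Throughput ratio of the SMT run over the non-SMT run.

    Assumes real time ~ user time / busy CPUs (illustrative assumption).
    """
    return (per_cpu_user_time(user_single, cores)
            / per_cpu_user_time(user_smt, threads))


# User time a -j16 run would report if SMT gave no throughput gain at all.
no_gain_user = per_cpu_user_time(USER_J8, CORES) * THREADS

gain = smt_gain(USER_J8, USER_J16, CORES, THREADS)

print(f"per physical core: {per_cpu_user_time(USER_J8, CORES):.0f} s")
print(f"per SMT thread:    {per_cpu_user_time(USER_J16, THREADS):.0f} s")
print(f"no-gain user time: {no_gain_user:.0f} s")
print(f"estimated SMT gain: {gain:.2f}x")
```

The ratio comes out the same as 27,200 / 19,900 by construction, which is
what gets compared against the observed 2.6 / 2.0 speedup.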

> or it could be cache or memory bandwidth related.  The Ryzen topology is
> a bit complicated. There are two groups of four cores, where each group
> of four cores shares half of the L3 cache, with a slowish interconnect
> bus between the groups.  This probably causes some NUMA-like issues.  I
> wonder if the ULE scheduler could be tweaked to handle this better.

I've been wondering whether it is possible to teach the scheduler about
the above-mentioned effect, i.e. by distinguishing an SMT core that
executes only 1 runnable thread from one that executes 2. The latter
should be assumed to run each thread at an estimated 60% clock (which
makes the two threads together proceed at 120% of the non-SMT speed).
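
To make the 60% / 120% estimate concrete, here is a toy model of it. This
is not ULE code; the names and the 0.6 factor are just the assumptions
from the paragraph above written down.

```python
BASE_CLOCK_GHZ = 3.3       # nominal clock used earlier in this message
SMT_THREAD_FACTOR = 0.6    # estimated speed of each thread when the core runs 2


def thread_speed(runnable_on_core, base=BASE_CLOCK_GHZ,
                 factor=SMT_THREAD_FACTOR):
    """Effective clock seen by one thread on a core running 0, 1 or 2 threads."""
    if runnable_on_core not in (0, 1, 2):
        raise ValueError("this model only covers 2-way SMT")
    if runnable_on_core == 0:
        return 0.0
    return base if runnable_on_core == 1 else base * factor


def core_throughput(runnable_on_core):
    """Combined effective clock of all threads running on one core."""
    return runnable_on_core * thread_speed(runnable_on_core)


ratio = core_throughput(2) / core_throughput(1)
print(f"one thread:  {thread_speed(1):.2f} GHz effective")
print(f"two threads: {thread_speed(2):.2f} GHz effective each")
print(f"core throughput with 2 threads: {ratio:.0%} of the 1-thread case")
```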

OTOH, the lower "effective clock rate" should be irrelevant under high
load (when all cores are executing 2 threads), or under low load, when
some cores are idle (assuming that the scheduler prefers to assign only
1 thread to each core until there are more runnable threads than cores).
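
A small sketch of that assumption, with hypothetical helper names (it is
not how ULE actually places threads): spread runnable threads one per
core first, then check whether busy cores end up with different thread
counts. Only in that mixed case do threads run at different effective
speeds.

```python
def spread_first(runnable, cores, threads_per_core=2):
    """Per-core thread counts when cores are filled one thread at a time.

    Threads beyond the hardware capacity are left waiting and not placed.
    """
    placed = min(runnable, cores * threads_per_core)
    base, extra = divmod(placed, cores)
    return [base + (1 if i < extra else 0) for i in range(cores)]


def is_mixed(loads):
    """True if busy cores differ in how many threads they run."""
    busy = {n for n in loads if n > 0}
    return len(busy) > 1


for runnable in (4, 8, 10, 16):
    loads = spread_first(runnable, cores=8)
    print(runnable, loads, "mixed" if is_mixed(loads) else "uniform")
```

With 8 cores, only thread counts between 9 and 15 produce the mixed case
in this model.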

If you assume that user time accounting is a raw measure of instructions
executed, then assuming a reduced clock rate would lead to "fairer"
results.
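
As an illustration of what "fairer" could mean here, a minimal sketch
(hypothetical function, not existing kernel accounting) that scales the
time a thread spent sharing its core by the assumed 60% factor:

```python
SMT_THREAD_FACTOR = 0.6  # assumed effective clock while sharing a core


def normalized_user_time(seconds_alone, seconds_shared,
                         factor=SMT_THREAD_FACTOR):
    """User time weighted by the effective clock the thread actually got."""
    return seconds_alone + seconds_shared * factor


# Two threads doing the same amount of work under this model:
alone = normalized_user_time(seconds_alone=60, seconds_shared=0)
shared = normalized_user_time(seconds_alone=0, seconds_shared=100)
print(f"ran alone for 60 s:   charged {alone:.0f} s")
print(f"ran shared for 100 s: charged {shared:.0f} s")
```

Without the scaling, the second thread would be charged 100 s for the
same work just because it happened to share its core.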

