Date:      Mon, 23 Sep 2019 13:28:15 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject:   head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Message-ID:  <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com>

Note: I have access to only one FreeBSD amd64 context, and
it is also my only access to a NUMA context: 2 memory
domains. A Threadripper 1950X context. Also: I have only
a head FreeBSD context on any architecture, not 12.x or
before. So I have limited compare/contrast material.

I present the below basically to ask whether the NUMA handling
has been validated, or whether it is going to be, at least for
contexts that might apply to the ThreadRipper 1950X and
analogous contexts. My results suggest it has not been (or
that libc++'s now() times are getting messed up in a way that
merely looks like NUMA mishandling). The evidence is odd
benchmark results based on the mean time per lap, taking the
median of that mean across multiple trials.

I ran a benchmark on both Fedora 30 and FreeBSD 13 on this
1950X and got expected results on Fedora but odd ones on
FreeBSD. The benchmark is a variation on the old HINT
benchmark, covering the old multi-threading variation as
well. I only tried Fedora later because the FreeBSD results
looked odd. The other architectures I tried FreeBSD
benchmarking on did not look odd like this. (powerpc64 on an
old 2-socket PowerMac with 2 cores per socket, aarch64 on a
Cortex-A57 Overdrive 1000 and a Cortex-A53 Pine64+ 2GB, armv7
on a Cortex-A7 Orange Pi+ 2nd Ed. For these I used 4 threads,
not more.)

I tend to write in terms of plots made from the data instead
of the raw benchmark data.

FreeBSD testing based on:
cpuset -l0-15  -n prefer:1
cpuset -l16-31 -n prefer:1

Fedora 30 testing based on:
numactl --preferred 1 --cpunodebind 0
numactl --preferred 1 --cpunodebind 1

While I have more results, I primarily reference the cases
where DSIZE and ISIZE are both unsigned long long and where
both are unsigned long as examples. The variations in results
are not from the type differences on any LP64 architecture.
(But they give an idea of benchmark variability in the test
context.)

The Fedora results solidly show the bandwidth limitation
of using one memory controller. They also show the latency
consequences for the remote memory domain case vs. the
local memory domain case. There is not a lot of
variability between the examples of the 2 type-pairs used
for Fedora.

Not true for FreeBSD on the 1950X:

A) The latency-constrained part of the graph appears to
   normally be using the local memory domain when
   -l0-15 is in use for 8 threads.

B) For 8 threads, both the -l0-15 and the -l16-31 parts
   of the graph that should be bandwidth limited mostly
   show examples that, as far as I can tell, would have
   to involve both memory controllers to reach the
   bandwidth shown. There is also wide variability,
   ranging between the expected 1-controller result
   and, say, what a 2-controller round-robin would be
   expected to produce.

C) Even the single-threaded result shows higher figures
   at larger total bytes for the kernel vectors. Fedora's
   does not.

I think that (B) is the most solid evidence for
something being odd.



For reference for FreeBSD:

# cpuset -g -d 1
domain 1 mask: 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31

head -r352341 allows -n prefer:0 but I happen to have
used -n prefer:1 in these experiments.
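
If one wants to double-check from inside a process which domain
set and policy it ended up with, something like the following
should work (a sketch using cpuset_getdomain(2); I believe I have
the interface details right, but treat them as an assumption):

// Sketch: print the calling process's memory-domain set and
// allocation policy via cpuset_getdomain(2). Minimal error handling.
#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/domainset.h>
#include <cstdio>

int main() {
    domainset_t mask;
    int policy = 0;
    if (cpuset_getdomain(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
                         sizeof(mask), &mask, &policy) != 0) {
        std::perror("cpuset_getdomain");
        return 1;
    }
    // DOMAINSET_POLICY_PREFER is what "-n prefer:1" should select.
    std::printf("policy: %d (DOMAINSET_POLICY_PREFER == %d)\n",
                policy, DOMAINSET_POLICY_PREFER);
    for (int d = 0; d < DOMAINSET_SETSIZE; d++)
        if (DOMAINSET_ISSET(d, &mask))
            std::printf("domain %d is in the set\n", d);
    return 0;
}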

The benchmark was built via devel/g++9 but linked against
system libraries, including libc++. Unfortunately, I'm not
yet ready to distribute the benchmark's source, but I expect
to at some point. I do not expect to ever distribute
binaries. The source code for normal builds involves just
standard C++17 code; such builds are what is involved here.

[The powerpc64 context is a system-clang 8, ELFv1 based
system context, not the usual gcc 4.2.1 based one.]

More notes:

In the 'kernel vectors: total Bytes' vs. 'QUality
Improvement Per Second' graphs, the left-hand side of
the curve is latency limited; the right-hand side is
bandwidth limited for LP64. (The total-Bytes axis uses
log base 2 scaling in the graphs.) Thread creation has
latency, so the 8-thread curves are mostly of interest
for kernel-vectors total bytes of 1 MiByte or more
(say), so that thread creations do not contribute much
to the measured time.

The thread creations are via std::async use.
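
For illustration, the overall shape of a lap measurement is
roughly as follows. (This is a sketch only, not my actual source:
run_kernel, the lap count, and the work split are placeholders.)

// Sketch of the multi-threaded lap timing: spawn the worker threads
// via std::async, time the laps, and report the mean time per lap.
#include <chrono>
#include <cstdio>
#include <future>
#include <vector>

static void run_kernel(std::size_t begin, std::size_t end) {
    // Placeholder for a HINT-style kernel working on [begin, end).
    (void)begin; (void)end;
}

int main() {
    constexpr int nthreads = 8;
    constexpr int laps = 16;
    constexpr std::size_t total_elements = 1u << 20;

    auto start = std::chrono::steady_clock::now();
    for (int lap = 0; lap < laps; lap++) {
        std::vector<std::future<void>> futures;
        for (int t = 0; t < nthreads; t++) {
            std::size_t begin = total_elements * t / nthreads;
            std::size_t end = total_elements * (t + 1) / nthreads;
            futures.push_back(std::async(std::launch::async,
                                         run_kernel, begin, end));
        }
        for (auto& f : futures)
            f.get(); // wait for this lap's threads to finish
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    std::printf("mean seconds per lap: %g\n", elapsed.count() / laps);
    return 0;
}

The real benchmark takes the mean time per lap and then the median
of those means across multiple trials, as noted above.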

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)



