From owner-freebsd-hackers@freebsd.org  Thu Sep 26 03:28:06 2019
Return-Path: <owner-freebsd-hackers@freebsd.org>
Delivered-To: freebsd-hackers@mailman.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
 by mailman.nyi.freebsd.org (Postfix) with ESMTP id 86035F60F5
 for <freebsd-hackers@mailman.nyi.freebsd.org>;
 Thu, 26 Sep 2019 03:28:06 +0000 (UTC)
 (envelope-from marklmi@yahoo.com)
Received: from sonic302-1.consmr.mail.bf2.yahoo.com
 (sonic302-1.consmr.mail.bf2.yahoo.com [74.6.135.40])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 46f0kF42b7z3D6R
 for <freebsd-hackers@freebsd.org>; Thu, 26 Sep 2019 03:28:05 +0000 (UTC)
 (envelope-from marklmi@yahoo.com)
Received: from sonic.gate.mail.ne1.yahoo.com by
 sonic302.consmr.mail.bf2.yahoo.com with HTTP; Thu, 26 Sep 2019 03:28:04 +0000
Received: by smtp424.mail.bf1.yahoo.com (Oath Hermes SMTP Server) with ESMTPA
 ID 48255e095b77b91aef8f4e0ec1b7a90a; 
 Thu, 26 Sep 2019 03:28:01 +0000 (UTC)
Content-Type: text/plain;
	charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.11\))
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n
 prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
From: Mark Millard <marklmi@yahoo.com>
In-Reply-To: <4F565B02-DC0D-4011-8266-D38E02788DD5@yahoo.com>
Date: Wed, 25 Sep 2019 20:27:58 -0700
Cc: freebsd-amd64@freebsd.org,
 freebsd-hackers@freebsd.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <78A4D18C-89E6-48D8-8A99-5FAC4602AE19@yahoo.com>
References: <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com>
 <20190925170255.GA43643@raichu>
 <4F565B02-DC0D-4011-8266-D38E02788DD5@yahoo.com>
To: Mark Johnston <markj@FreeBSD.org>
X-Mailer: Apple Mail (2.3445.104.11)
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
X-List-Received-Date: Thu, 26 Sep 2019 03:28:06 -0000


On 2019-Sep-25, at 19:26, Mark Millard <marklmi at yahoo.com> wrote:

> On 2019-Sep-25, at 10:02, Mark Johnston <markj at FreeBSD.org> wrote:
>
>> On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 wrote:
>>> Note: I have access to only one FreeBSD amd64 context, and
>>> it is also my only access to a NUMA context: 2 memory
>>> domains. A Threadripper 1950X context. Also: I have only
>>> a head FreeBSD context on any architecture, not 12.x or
>>> before. So I have limited compare/contrast material.
>>>
>>> I present the below basically to ask if the NUMA handling
>>> has been validated, or if it is going to be, at least for
>>> contexts that might apply to ThreadRipper 1950X and
>>> analogous contexts. My results suggest it has not been (or
>>> libc++'s now times get messed up such that it looks like
>>> NUMA mishandling, since this is based on odd benchmark
>>> results that involve mean time for laps, using a median
>>> of such across multiple trials).
>>>
>>> I ran a benchmark on both Fedora 30 and FreeBSD 13 on this
>>> 1950X and got expected results on Fedora but odd ones on
>>> FreeBSD. The benchmark is a variation on the old HINT
>>> benchmark, spanning the old multi-threading variation. I
>>> later tried Fedora because the FreeBSD results looked odd.
>>> The other architectures I tried FreeBSD benchmarking with
>>> did not look odd like this. (powerpc64 on an old PowerMac 2
>>> socket with 2 cores per socket, aarch64 Cortex-A57 Overdrive
>>> 1000, Cortex-A53 Pine64+ 2GB, armv7 Cortex-A7 Orange Pi+ 2nd
>>> Ed. For these I used 4 threads, not more.)
>>>
>>> I tend to write in terms of plots made from the data instead
>>> of the raw benchmark data.
>>>
>>> FreeBSD testing based on:
>>> cpuset -l0-15  -n prefer:1
>>> cpuset -l16-31 -n prefer:1
>>>
>>> Fedora 30 testing based on:
>>> numactl --preferred 1 --cpunodebind 0
>>> numactl --preferred 1 --cpunodebind 1
>>>
>>> While I have more results, I reference primarily DSIZE
>>> and ISIZE being unsigned long long and also both being
>>> unsigned long as examples. Variations in results are not
>>> from the type differences for any LP64 architectures.
>>> (But they give an idea of benchmark variability in the
>>> test context.)
>>>
>>> The Fedora results solidly show the bandwidth limitation
>>> of using one memory controller. They also show the latency
>>> consequences for the remote memory domain case vs. the
>>> local memory domain case. There is not a lot of
>>> variability between the examples of the 2 type-pairs used
>>> for Fedora.
>>>
>>> Not true for FreeBSD on the 1950X:
>>>
>>> A) The latency-constrained part of the graph looks to
>>>  normally be using the local memory domain when
>>>  -l0-15 is in use for 8 threads.
>>>
>>> B) Both the -l0-15 and the -l16-31 parts of the
>>>  graph for 8 threads that should be bandwidth
>>>  limited show mostly examples that would have to
>>>  involve both memory controllers for the bandwidth
>>>  to get the results shown as far as I can tell.
>>>  There is also wide variability ranging between the
>>>  expected 1 controller result and, say, what a 2
>>>  controller round-robin would be expected to produce.
>>>
>>> C) Even the single threaded result shows a higher
>>>  result for larger total bytes for the kernel
>>>  vectors. Fedora does not.
>>>
>>> I think that (B) is the most solid evidence for
>>> something being odd.
>>
>> The implication seems to be that your benchmark program is using pages
>> from both domains despite a policy which preferentially allocates pages
>> from domain 1, so you would first want to determine if this is actually
>> what's happening.  As far as I know we currently don't have a good way
>> of characterizing per-domain memory usage within a process.
>>
>> If your benchmark uses a large fraction of the system's memory, you
>> could use the vm.phys_free sysctl to get a sense of how much memory from
>> each domain is free.
>
> The ThreadRipper 1950X has 96 GiBytes of ECC RAM, so 48 GiBytes per memory
> domain. I've never configured the benchmark such that it even reaches
> 10 GiBytes on this hardware. (It stops for a time constraint first,
> based on the values in use for the "adjustable" items.)
>
> . . . (much omitted material) . . .

>
>> Another possibility is to use DTrace to trace the
>> requested domain in vm_page_alloc_domain_after().  For example, the
>> following DTrace one-liner counts the number of pages allocated per
>> domain by ls(1):
>>
>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
>> ...
>> 	0               71
>> 	1               72
>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
>> ...
>> 	1              143
>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
>> ...
>> 	0              143
>
> I'll think about this, although it would give no
> information about which CPUs are executing the
> threads that are allocating or accessing the vectors
> for the integration kernel. So, for example, if the
> threads migrate to or start out on CPUs they
> should not be on, this would not report such.
>
> For such "which CPUs" questions one stab would
> be simply to watch with top while the benchmark
> is running and see which CPUs end up being busy
> vs. which do not. I think I'll try this.

Using top did not show evidence of the wrong
CPUs being actively in use.

My variation of top is unusual in that it also
tracks some maximum observed figures and reports
them, here being:

8804M MaxObsActive, 4228M MaxObsWired, 13G MaxObs(Act+Wir)

(no swap use was reported). This gives a system
level view of about how much RAM was put to use
during the monitoring of the 2 benchmark runs
(-l0-15 and -l16-31). Nowhere near enough was used
to require both memory domains to be in use.
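
As a rough check of that claim, the arithmetic can be sketched as
below. It assumes top's "M" figures are MiB and uses the 48 GiBytes
per memory domain quoted earlier. Adding the two separately observed
maxima gives an upper bound, since the peaks need not have coincided.
The variable names are just for illustration.

```shell
#!/bin/sh
# Maximum observed figures from the top report, taken as MiB (assumption).
active_mib=8804
wired_mib=4228

# 96 GiBytes over 2 memory domains: 48 GiBytes per domain, in MiB.
domain_mib=$((48 * 1024))

# Upper bound on simultaneous use: sum of the two separate maxima.
total_mib=$((active_mib + wired_mib))

# Integer percentage of a single domain that this bound would fill.
pct=$((total_mib * 100 / domain_mib))

echo "active+wired upper bound: ${total_mib} MiB"
echo "one memory domain:        ${domain_mib} MiB"
echo "share of one domain:      about ${pct}%"
```

So even the pessimistic sum stays well inside what a single domain
holds, which is consistent with the 13G MaxObs(Act+Wir) figure.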

Thus, it would appear to be just where the
allocations are made for -n prefer:1 that
matters, at least when a (temporary) thread
does the allocations.

>> This approach might not work for various reasons depending on how
>> exactly your benchmark program works.

I've not tried dtrace yet.
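
One possible variation on the quoted one-liner, not something that
has been run here, would key the aggregation on DTrace's built-in
cpu variable as well as on the requested domain (args[2], as in the
quoted example). That would show which CPUs were executing when the
pages were requested, though still nothing about later accesses.
The function name and ./benchmark are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: needs root and the fbt provider on FreeBSD.
# Counts vm_page_alloc_domain_after() entries per (CPU, requested domain)
# for the given command and its descendants.
trace_alloc_by_cpu() {
    dtrace -n 'fbt::vm_page_alloc_domain_after:entry
        /progenyof($target)/ { @[cpu, args[2]] = count(); }' \
        -c "$*"
}

# Hypothetical usage; ./benchmark stands in for the real program:
# trace_alloc_by_cpu cpuset -l 0-15 -n prefer:1 ./benchmark
```

Each output row would then be a CPU number, a domain, and a count, so
allocations made from CPUs outside the -l list would stand out.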

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)