From owner-freebsd-hackers@freebsd.org Fri Sep 27 03:37:47 2019
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
From: Mark Millard <marklmi@yahoo.com>
To: Mark Johnston
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Date: Thu, 26 Sep 2019 20:37:39 -0700
Message-Id: <6BC5F6BE-5FC3-48FA-9873-B20141FEFDF5@yahoo.com>
In-Reply-To: <2DE123BE-B0F8-43F6-B950-F41CF0DEC8AD@yahoo.com>
References: <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com>
 <20190925170255.GA43643@raichu>
 <4F565B02-DC0D-4011-8266-D38E02788DD5@yahoo.com>
 <78A4D18C-89E6-48D8-8A99-5FAC4602AE19@yahoo.com>
 <26B47782-033B-40C8-B8F8-4C731B167243@yahoo.com>
 <20190926202936.GD5581@raichu>
 <2DE123BE-B0F8-43F6-B950-F41CF0DEC8AD@yahoo.com>

On 2019-Sep-26, at 17:05, Mark Millard wrote:

> On 2019-Sep-26, at 13:29, Mark Johnston wrote:
> 
>> On Wed, Sep 25, 2019 at 10:03:14PM -0700, Mark Millard wrote:
>>> 
>>> On 2019-Sep-25, at 20:27, Mark Millard wrote:
>>> 
>>>> On 2019-Sep-25, at 19:26, Mark Millard wrote:
>>>> 
>>>>> On 2019-Sep-25, at 10:02, Mark Johnston wrote:
>>>>> 
>>>>>> On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 wrote:
>>>>>>> Note: I have access to only one FreeBSD amd64 context, and
>>>>>>> it is also my only access to a NUMA context: 2 memory
>>>>>>> domains. A Threadripper 1950X context. Also: I have only
>>>>>>> a head FreeBSD context on any architecture, not 12.x or
>>>>>>> before. So I have limited compare/contrast material.
>>>>>>> 
>>>>>>> I present the below basically to ask if the NUMA handling
>>>>>>> has been validated, or if it is going to be, at least for
>>>>>>> contexts that might apply to the ThreadRipper 1950X and
>>>>>>> analogous contexts. My results suggest it has not been (or
>>>>>>> that libc++'s now() times get messed up in a way that looks
>>>>>>> like NUMA mishandling, since this is based on odd benchmark
>>>>>>> results that involve mean time for laps, using a median
>>>>>>> of such across multiple trials).
>>>>>>> 
>>>>>>> I ran a benchmark on both Fedora 30 and FreeBSD 13 on this
>>>>>>> 1950X and got expected results on Fedora but odd ones on
>>>>>>> FreeBSD. The benchmark is a variation on the old HINT
>>>>>>> benchmark, spanning the old multi-threading variation.
>>>>>>> I later tried Fedora because the FreeBSD results looked odd.
>>>>>>> The other architectures I tried FreeBSD benchmarking with
>>>>>>> did not look odd like this. (powerpc64 on an old 2-socket
>>>>>>> PowerMac with 2 cores per socket, aarch64 Cortex-A57 Overdrive
>>>>>>> 1000, Cortex-A53 Pine64+ 2GB, armv7 Cortex-A7 Orange Pi+ 2nd
>>>>>>> Ed. For these I used 4 threads, not more.)
>>>>>>> 
>>>>>>> I tend to write in terms of plots made from the data instead
>>>>>>> of the raw benchmark data.
>>>>>>> 
>>>>>>> FreeBSD testing based on:
>>>>>>> cpuset -l0-15 -n prefer:1
>>>>>>> cpuset -l16-31 -n prefer:1
>>>>>>> 
>>>>>>> Fedora 30 testing based on:
>>>>>>> numactl --preferred 1 --cpunodebind 0
>>>>>>> numactl --preferred 1 --cpunodebind 1
>>>>>>> 
>>>>>>> While I have more results, I reference primarily DSIZE
>>>>>>> and ISIZE being unsigned long long and also both being
>>>>>>> unsigned long as examples. Variations in results are not
>>>>>>> from the type differences for any LP64 architectures.
>>>>>>> (But they give an idea of benchmark variability in the
>>>>>>> test context.)
>>>>>>> 
>>>>>>> The Fedora results solidly show the bandwidth limitation
>>>>>>> of using one memory controller. They also show the latency
>>>>>>> consequences for the remote memory domain case vs. the
>>>>>>> local memory domain case. There is not a lot of
>>>>>>> variability between the examples of the 2 type-pairs used
>>>>>>> for Fedora.
>>>>>>> 
>>>>>>> Not true for FreeBSD on the 1950X:
>>>>>>> 
>>>>>>> A) The latency-constrained part of the graph looks to
>>>>>>> normally be using the local memory domain when
>>>>>>> -l0-15 is in use for 8 threads.
>>>>>>> 
>>>>>>> B) Both the -l0-15 and the -l16-31 parts of the
>>>>>>> graph for 8 threads that should be bandwidth
>>>>>>> limited show mostly examples that would have to
>>>>>>> involve both memory controllers for the bandwidth
>>>>>>> to reach the results shown, as far as I can tell.
>>>>>>> There is also wide variability, ranging between the
>>>>>>> expected 1-controller result and, say, what a
>>>>>>> 2-controller round-robin would be expected to produce.
>>>>>>> 
>>>>>>> C) Even the single-threaded result shows a higher
>>>>>>> result for larger total bytes for the kernel
>>>>>>> vectors. Fedora does not.
>>>>>>> 
>>>>>>> I think that (B) is the most solid evidence for
>>>>>>> something being odd.
>>>>>> 
>>>>>> The implication seems to be that your benchmark program is using pages
>>>>>> from both domains despite a policy which preferentially allocates pages
>>>>>> from domain 1, so you would first want to determine if this is actually
>>>>>> what's happening. As far as I know we currently don't have a good way
>>>>>> of characterizing per-domain memory usage within a process.
>>>>>> 
>>>>>> If your benchmark uses a large fraction of the system's memory, you
>>>>>> could use the vm.phys_free sysctl to get a sense of how much memory from
>>>>>> each domain is free.
>>>>> 
>>>>> The ThreadRipper 1950X has 96 GiBytes of ECC RAM, so 48 GiBytes per memory
>>>>> domain. I've never configured the benchmark such that it even reaches
>>>>> 10 GiBytes on this hardware. (It stops for a time constraint first,
>>>>> based on the values in use for the "adjustable" items.)
>>>>> 
>>>>> . . . (much omitted material) . . .
>>>>> 
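(As an aside on the vm.phys_free suggestion quoted above: if the
benchmark did use a large fraction of RAM, the way I would expect to
watch it is simply to sample that sysctl periodically during a run and
compare the per-domain figures across the snapshots, something like:

while true; do date; sysctl vm.phys_free; sleep 10; done

I have not bothered with that here, given how little RAM the benchmark
ends up using.)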
>>>>>> Another possibility is to use DTrace to trace the
>>>>>> requested domain in vm_page_alloc_domain_after(). For example, the
>>>>>> following DTrace one-liner counts the number of pages allocated per
>>>>>> domain by ls(1):
>>>>>> 
>>>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
>>>>>> ...
>>>>>>         0               71
>>>>>>         1               72
>>>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
>>>>>> ...
>>>>>>         1              143
>>>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
>>>>>> ...
>>>>>>         0              143
>>>>> 
>>>>> I'll think about this, although it would give no
>>>>> information about which CPUs are executing the threads
>>>>> that are allocating or accessing the vectors for
>>>>> the integration kernel. So, for example, if the
>>>>> threads migrate to or start out on CPUs they
>>>>> should not be on, this would not report such.
>>>>> 
>>>>> For such "which CPUs" questions, one stab would
>>>>> be simply to watch with top while the benchmark
>>>>> is running and see which CPUs end up being busy
>>>>> vs. which do not. I think I'll try this.
>>>> 
>>>> Using top did not show evidence of the wrong
>>>> CPUs being actively in use.
>>>> 
>>>> My variation of top is unusual in that it also
>>>> tracks some maximum observed figures and reports
>>>> them, here being:
>>>> 
>>>> 8804M MaxObsActive, 4228M MaxObsWired, 13G MaxObs(Act+Wir)
>>>> 
>>>> (No swap use was reported.) This gives a system-level
>>>> view of about how much RAM was put to use during the
>>>> monitoring of the 2 benchmark runs (-l0-15 and -l16-31).
>>>> Nowhere near enough was used to require both memory
>>>> domains to be in use.
>>>> 
>>>> Thus, it would appear to be just where the
>>>> allocations are made for -n prefer:1 that
>>>> matters, at least when a (temporary) thread
>>>> does the allocations.
>>>> 
>>>>>> This approach might not work for various reasons depending on how
>>>>>> exactly your benchmark program works.
>>>> 
>>>> I've not tried dtrace yet.
>>> 
>>> Well, for an example -l0-15 -n prefer:1 run
>>> for just the 8-thread benchmark case . . .
>>> 
>>> dtrace: pid 10997 has exited
>>> 
>>>         0              712
>>>         1          6737529
>>> 
>>> Something is leading to domain 0
>>> allocations, despite -n prefer:1 .
>> 
>> You can get a sense of where these allocations are occurring by changing
>> the probe to capture kernel stacks for domain 0 page allocations:
>> 
>> fbt::vm_page_alloc_domain_after:entry /progenyof($target) && args[2] == 0/{@[stack()] = count();}
>> 
>> One possibility is that these are kernel memory allocations occurring in
>> the context of the benchmark threads. Such allocations may not respect
>> the configured policy since they are not private to the allocating
>> thread. For instance, upon opening a file, the kernel may allocate a
>> vnode structure for that file. That vnode may be accessed by threads
>> from many processes over its lifetime, and may be recycled many times
>> before its memory is released back to the allocator.
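(For reference: that stack-capture probe gets used in the same style of
invocation as the earlier per-domain counting one-liners, i.e. roughly
the following, with a placeholder standing in for the actual benchmark
command line:

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target) && args[2] == 0/{@[stack()] = count();}' -c "cpuset -l0-15 -n prefer:1 <benchmark command>"

The quoted output below is from runs of that general form.)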
> For -l0-15 -n prefer:1 :
> 
> Looks like this reports sys_thr_new activity, sys_cpuset
> activity, and 0xffffffff80bc09bd activity (whatever that
> is). Mostly sys_thr_new activity, over 1300 of them . . .
> 
> dtrace: pid 13553 has exited
> 
>               kernel`uma_small_alloc+0x61
>               kernel`keg_alloc_slab+0x10b
>               kernel`zone_import+0x1d2
>               kernel`uma_zalloc_arg+0x62b
>               kernel`thread_init+0x22
>               kernel`keg_alloc_slab+0x259
>               kernel`zone_import+0x1d2
>               kernel`uma_zalloc_arg+0x62b
>               kernel`thread_alloc+0x23
>               kernel`thread_create+0x13a
>               kernel`sys_thr_new+0xd2
>               kernel`amd64_syscall+0x3ae
>               kernel`0xffffffff811b7600
>                 2
> 
>               kernel`uma_small_alloc+0x61
>               kernel`keg_alloc_slab+0x10b
>               kernel`zone_import+0x1d2
>               kernel`uma_zalloc_arg+0x62b
>               kernel`cpuset_setproc+0x65
>               kernel`sys_cpuset+0x123
>               kernel`amd64_syscall+0x3ae
>               kernel`0xffffffff811b7600
>                 2
> 
>               kernel`uma_small_alloc+0x61
>               kernel`keg_alloc_slab+0x10b
>               kernel`zone_import+0x1d2
>               kernel`uma_zalloc_arg+0x62b
>               kernel`uma_zfree_arg+0x36a
>               kernel`thread_reap+0x106
>               kernel`thread_alloc+0xf
>               kernel`thread_create+0x13a
>               kernel`sys_thr_new+0xd2
>               kernel`amd64_syscall+0x3ae
>               kernel`0xffffffff811b7600
>                 6
> 
>               kernel`uma_small_alloc+0x61
>               kernel`keg_alloc_slab+0x10b
>               kernel`zone_import+0x1d2
>               kernel`uma_zalloc_arg+0x62b
>               kernel`uma_zfree_arg+0x36a
>               kernel`vm_map_process_deferred+0x8c
>               kernel`vm_map_remove+0x11d
>               kernel`vmspace_exit+0xd3
>               kernel`exit1+0x5a9
>               kernel`0xffffffff80bc09bd
>               kernel`amd64_syscall+0x3ae
>               kernel`0xffffffff811b7600
>                 6
> 
>               kernel`uma_small_alloc+0x61
>               kernel`keg_alloc_slab+0x10b
>               kernel`zone_import+0x1d2
>               kernel`uma_zalloc_arg+0x62b
>               kernel`thread_alloc+0x23
>               kernel`thread_create+0x13a
>               kernel`sys_thr_new+0xd2
>               kernel`amd64_syscall+0x3ae
>               kernel`0xffffffff811b7600
>                22
> 
>               kernel`vm_page_grab_pages+0x1b4
>               kernel`vm_thread_stack_create+0xc0
>               kernel`kstack_import+0x52
>               kernel`uma_zalloc_arg+0x62b
>               kernel`vm_thread_new+0x4d
>               kernel`thread_alloc+0x31
>               kernel`thread_create+0x13a
>               kernel`sys_thr_new+0xd2
>               kernel`amd64_syscall+0x3ae
>               kernel`0xffffffff811b7600
>              1324

With sys_thr_new not respecting -n prefer:1 for -l0-15 (especially
for the thread stacks), I looked some at the generated integration
kernel code: it makes significant use of %rsp-based memory accesses
(reads and writes). That would get both memory controllers going in
parallel (the kernel-vector accesses going to the preferred memory
domain while the stack accesses stay in domain 0), so such runs
would not slow down to a single controller's bandwidth as expected.

If round-robin is not respected for thread stacks, and if threads
migrate across CPUs in different memory domains at times, there could
be considerable variability for that context as well. (This may not be
the only way to get different/extra variability for this context.)

Overall: I'd be surprised if this was not contributing to what I
thought was odd about the benchmark results.

> For -l16-31 -n prefer:1 :
> 
> Again, exactly 2. Both being sys_cpuset . . .
> 
> dtrace: pid 13594 has exited
> 
>               kernel`uma_small_alloc+0x61
>               kernel`keg_alloc_slab+0x10b
>               kernel`zone_import+0x1d2
>               kernel`uma_zalloc_arg+0x62b
>               kernel`cpuset_setproc+0x65
>               kernel`sys_cpuset+0x123
>               kernel`amd64_syscall+0x3ae
>               kernel`0xffffffff811b7600
>                 2
> 
>> 
>> Given the low number of domain 0 allocations, I am skeptical that they
>> are responsible for the variability in your results.
>> 
>>> So I tried -l16-31 -n prefer:1 and it got:
>>> 
>>> dtrace: pid 11037 has exited
>>> 
>>>         0                2
>>>         1          8055389
>>> 
>>> (The larger number of allocations is
>>> not a surprise: more work done in
>>> about the same overall time, based on
>>> faster memory access generally.)

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)