Date:      Fri, 27 Sep 2019 13:52:58 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Mark Johnston <markj@FreeBSD.org>
Cc:        freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject:   Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Message-ID:  <08CA4DA1-131C-4B14-BB57-EAA22A8CD5D9@yahoo.com>
In-Reply-To: <20190927192434.GA93180@raichu>
References:  <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com> <20190925170255.GA43643@raichu> <4F565B02-DC0D-4011-8266-D38E02788DD5@yahoo.com> <78A4D18C-89E6-48D8-8A99-5FAC4602AE19@yahoo.com> <26B47782-033B-40C8-B8F8-4C731B167243@yahoo.com> <20190926202936.GD5581@raichu> <2DE123BE-B0F8-43F6-B950-F41CF0DEC8AD@yahoo.com> <6BC5F6BE-5FC3-48FA-9873-B20141FEFDF5@yahoo.com> <20190927192434.GA93180@raichu>



On 2019-Sep-27, at 12:24, Mark Johnston <markj at FreeBSD.org> wrote:

> On Thu, Sep 26, 2019 at 08:37:39PM -0700, Mark Millard wrote:
>>
>>
>> On 2019-Sep-26, at 17:05, Mark Millard <marklmi at yahoo.com> wrote:
>>
>>> On 2019-Sep-26, at 13:29, Mark Johnston <markj at FreeBSD.org> wrote:
>>>> One possibility is that these are kernel memory allocations occurring
>>>> in the context of the benchmark threads.  Such allocations may not
>>>> respect the configured policy since they are not private to the
>>>> allocating thread.  For instance, upon opening a file, the kernel may
>>>> allocate a vnode structure for that file.  That vnode may be accessed
>>>> by threads from many processes over its lifetime, and may be recycled
>>>> many times before its memory is released back to the allocator.
>>>
>>> For -l0-15 -n prefer:1 :
>>>
>>> Looks like this reports sys_thr_new activity, sys_cpuset
>>> activity, and 0xffffffff80bc09bd activity (whatever that
>>> is). Mostly sys_thr_new activity, over 1300 of them . . .
>>>
>>> dtrace: pid 13553 has exited
>>>
>>>
>>>             kernel`uma_small_alloc+0x61
>>>             kernel`keg_alloc_slab+0x10b
>>>             kernel`zone_import+0x1d2
>>>             kernel`uma_zalloc_arg+0x62b
>>>             kernel`thread_init+0x22
>>>             kernel`keg_alloc_slab+0x259
>>>             kernel`zone_import+0x1d2
>>>             kernel`uma_zalloc_arg+0x62b
>>>             kernel`thread_alloc+0x23
>>>             kernel`thread_create+0x13a
>>>             kernel`sys_thr_new+0xd2
>>>             kernel`amd64_syscall+0x3ae
>>>             kernel`0xffffffff811b7600
>>>               2
>>>
>>>             kernel`uma_small_alloc+0x61
>>>             kernel`keg_alloc_slab+0x10b
>>>             kernel`zone_import+0x1d2
>>>             kernel`uma_zalloc_arg+0x62b
>>>             kernel`cpuset_setproc+0x65
>>>             kernel`sys_cpuset+0x123
>>>             kernel`amd64_syscall+0x3ae
>>>             kernel`0xffffffff811b7600
>>>               2
>>>
>>>             kernel`uma_small_alloc+0x61
>>>             kernel`keg_alloc_slab+0x10b
>>>             kernel`zone_import+0x1d2
>>>             kernel`uma_zalloc_arg+0x62b
>>>             kernel`uma_zfree_arg+0x36a
>>>             kernel`thread_reap+0x106
>>>             kernel`thread_alloc+0xf
>>>             kernel`thread_create+0x13a
>>>             kernel`sys_thr_new+0xd2
>>>             kernel`amd64_syscall+0x3ae
>>>             kernel`0xffffffff811b7600
>>>               6
>>>
>>>             kernel`uma_small_alloc+0x61
>>>             kernel`keg_alloc_slab+0x10b
>>>             kernel`zone_import+0x1d2
>>>             kernel`uma_zalloc_arg+0x62b
>>>             kernel`uma_zfree_arg+0x36a
>>>             kernel`vm_map_process_deferred+0x8c
>>>             kernel`vm_map_remove+0x11d
>>>             kernel`vmspace_exit+0xd3
>>>             kernel`exit1+0x5a9
>>>             kernel`0xffffffff80bc09bd
>>>             kernel`amd64_syscall+0x3ae
>>>             kernel`0xffffffff811b7600
>>>               6
>>>
>>>             kernel`uma_small_alloc+0x61
>>>             kernel`keg_alloc_slab+0x10b
>>>             kernel`zone_import+0x1d2
>>>             kernel`uma_zalloc_arg+0x62b
>>>             kernel`thread_alloc+0x23
>>>             kernel`thread_create+0x13a
>>>             kernel`sys_thr_new+0xd2
>>>             kernel`amd64_syscall+0x3ae
>>>             kernel`0xffffffff811b7600
>>>              22
>>>
>>>             kernel`vm_page_grab_pages+0x1b4
>>>             kernel`vm_thread_stack_create+0xc0
>>>             kernel`kstack_import+0x52
>>>             kernel`uma_zalloc_arg+0x62b
>>>             kernel`vm_thread_new+0x4d
>>>             kernel`thread_alloc+0x31
>>>             kernel`thread_create+0x13a
>>>             kernel`sys_thr_new+0xd2
>>>             kernel`amd64_syscall+0x3ae
>>>             kernel`0xffffffff811b7600
>>>            1324
>>
>> With sys_thr_new not respecting -n prefer:1 for
>> -l0-15 (especially for the thread stacks), I
>> looked some at the generated integration kernel
>> code and it makes significant use of %rsp based
>> memory accesses (read and write).
>>
>> That would get both memory controllers going in
>> parallel (the integration kernel's vector accesses
>> go to the preferred memory domain), so it does not
>> slow down as I had expected.
>>
>> If round-robin is not respected for thread stacks,
>> and if threads migrate cpus across memory domains
>> at times, there could be considerable variability
>> for that context as well. (This may not be the
>> only way to have different/extra variability for
>> this context.)
>>
>> Overall: I'd be surprised if this was not
>> contributing to what I thought was odd about
>> the benchmark results.
>
> Your tracing refers to kernel thread stacks though, not the stacks used
> by threads when executing in user mode.  My understanding is that a HINT
> implementation would spend virtually all of its time in user mode, so it
> shouldn't matter much or at all if kernel thread stacks are backed by
> memory from the "wrong" domain.

Looks like I was trying to think about it when I should have been sleeping.
You are correct.

> This also doesn't really explain some of the disparities in the plots
> you sent me.  For instance, you get a much higher peak QUIPS on FreeBSD
> than on Fedora with 16 threads and an interleave/round-robin domain
> selection policy.

True. I suppose there is the possibility that steady_clock's now() results
are somehow off for this type of context, leading to the durations between
successive now() calls coming out on the short side where things look
different.
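
For concreteness, the durations I mean are taken in the usual way sketched
below. The loop and the vector size are just a made-up stand-in, not HINT's
actual integration kernel; the point is only that now() is sampled right
before and right after the timed work, inside the thread doing that work.

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    using clock = std::chrono::steady_clock;

    // Made-up stand-in for work over the integration kernel's vectors;
    // the size is arbitrary, not one of the benchmark's sizes.
    std::vector<double> v(1u << 20, 1.0);

    auto const t0 = clock::now();  // sampled inside the thread doing the work
    double sum = 0.0;
    for (double x : v)
        sum += x;
    auto const t1 = clock::now();  // sampled inside the same thread

    std::chrono::duration<double> const elapsed = t1 - t0;
    std::printf("sum=%g  elapsed=%g s\n", sum, elapsed.count());
    return 0;
}

If now() itself were producing odd values in this sort of context, it would
show up directly in elapsed figures of this form.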

But the left hand side of the single-thread results (smaller memory sizes
for the vectors the integration kernel uses) does not show such a rescaling.
(The single-thread time measurements are taken strictly inside the thread of
execution; no thread creation or the like is counted at any size.) The right
hand side of the single-thread results (larger memory use, making the smaller
cache levels fairly ineffective) does generally show some rescaling, but not
as drastic as in the multi-threaded results.

Both round-robin and prefer:1 showed this behavior for the single-threaded runs.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)



