From: Mark Millard
Date: Fri, 27 Sep 2019 15:22:16 -0700
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
To: Mark Johnston
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
In-Reply-To: <08CA4DA1-131C-4B14-BB57-EAA22A8CD5D9@yahoo.com>

On 2019-Sep-27, at 13:52, Mark Millard wrote:

> On 2019-Sep-27, at 12:24, Mark Johnston wrote:
>
>> On Thu, Sep 26, 2019 at 08:37:39PM -0700, Mark Millard wrote:
>>>
>>>
>>> On 2019-Sep-26, at 17:05, Mark Millard wrote:
>>>
>>>> On 2019-Sep-26, at 13:29, Mark Johnston wrote:
>>>>> One possibility is that these are kernel memory allocations occurring in
>>>>> the context of the benchmark threads. Such allocations may not respect
>>>>> the configured policy since they are not private to the allocating
>>>>> thread. For instance, upon opening a file, the kernel may allocate a
>>>>> vnode structure for that file. That vnode may be accessed by threads
>>>>> from many processes over its lifetime, and may be recycled many times
>>>>> before its memory is released back to the allocator.
>>>>
>>>> For -l0-15 -n prefer:1 :
>>>>
>>>> Looks like this reports sys_thr_new activity, sys_cpuset
>>>> activity, and 0xffffffff80bc09bd activity (whatever that
>>>> is). Mostly sys_thr_new activity, over 1300 of them . . .
>>>>
>>>> dtrace: pid 13553 has exited
>>>>
>>>>
>>>>               kernel`uma_small_alloc+0x61
>>>>               kernel`keg_alloc_slab+0x10b
>>>>               kernel`zone_import+0x1d2
>>>>               kernel`uma_zalloc_arg+0x62b
>>>>               kernel`thread_init+0x22
>>>>               kernel`keg_alloc_slab+0x259
>>>>               kernel`zone_import+0x1d2
>>>>               kernel`uma_zalloc_arg+0x62b
>>>>               kernel`thread_alloc+0x23
>>>>               kernel`thread_create+0x13a
>>>>               kernel`sys_thr_new+0xd2
>>>>               kernel`amd64_syscall+0x3ae
>>>>               kernel`0xffffffff811b7600
>>>>                 2
>>>>
>>>>               kernel`uma_small_alloc+0x61
>>>>               kernel`keg_alloc_slab+0x10b
>>>>               kernel`zone_import+0x1d2
>>>>               kernel`uma_zalloc_arg+0x62b
>>>>               kernel`cpuset_setproc+0x65
>>>>               kernel`sys_cpuset+0x123
>>>>               kernel`amd64_syscall+0x3ae
>>>>               kernel`0xffffffff811b7600
>>>>                 2
>>>>
>>>>               kernel`uma_small_alloc+0x61
>>>>               kernel`keg_alloc_slab+0x10b
>>>>               kernel`zone_import+0x1d2
>>>>               kernel`uma_zalloc_arg+0x62b
>>>>               kernel`uma_zfree_arg+0x36a
>>>>               kernel`thread_reap+0x106
>>>>               kernel`thread_alloc+0xf
>>>>               kernel`thread_create+0x13a
>>>>               kernel`sys_thr_new+0xd2
>>>>               kernel`amd64_syscall+0x3ae
>>>>               kernel`0xffffffff811b7600
>>>>                 6
>>>>
>>>>               kernel`uma_small_alloc+0x61
>>>>               kernel`keg_alloc_slab+0x10b
>>>>               kernel`zone_import+0x1d2
>>>>               kernel`uma_zalloc_arg+0x62b
>>>>               kernel`uma_zfree_arg+0x36a
>>>>               kernel`vm_map_process_deferred+0x8c
>>>>               kernel`vm_map_remove+0x11d
>>>>               kernel`vmspace_exit+0xd3
>>>>               kernel`exit1+0x5a9
>>>>               kernel`0xffffffff80bc09bd
>>>>               kernel`amd64_syscall+0x3ae
>>>>               kernel`0xffffffff811b7600
>>>>                 6
>>>>
>>>>               kernel`uma_small_alloc+0x61
>>>>               kernel`keg_alloc_slab+0x10b
>>>>               kernel`zone_import+0x1d2
>>>>               kernel`uma_zalloc_arg+0x62b
>>>>               kernel`thread_alloc+0x23
>>>>               kernel`thread_create+0x13a
>>>>               kernel`sys_thr_new+0xd2
>>>>               kernel`amd64_syscall+0x3ae
>>>>               kernel`0xffffffff811b7600
>>>>                22
>>>>
>>>>               kernel`vm_page_grab_pages+0x1b4
>>>>               kernel`vm_thread_stack_create+0xc0
>>>>               kernel`kstack_import+0x52
>>>>               kernel`uma_zalloc_arg+0x62b
>>>>               kernel`vm_thread_new+0x4d
>>>>               kernel`thread_alloc+0x31
>>>>               kernel`thread_create+0x13a
>>>>               kernel`sys_thr_new+0xd2
>>>>               kernel`amd64_syscall+0x3ae
>>>>               kernel`0xffffffff811b7600
>>>>              1324
>>>
>>> With sys_thr_new not respecting -n prefer:1 for
>>> -l0-15 (especially for the thread stacks), I
>>> looked some at the generated integration kernel
>>> code and it makes significant use of %rsp based
>>> memory accesses (read and write).
>>>
>>> That would get both memory controllers going in
>>> parallel (kernel vectors accesses to the preferred
>>> memory domain), so not slowing down as expected.
>>>
>>> If round-robin is not respected for thread stacks,
>>> and if threads migrate cpus across memory domains
>>> at times, there could be considerable variability
>>> for that context as well. (This may not be the
>>> only way to have different/extra variability for
>>> this context.)
>>>
>>> Overall: I'd be surprised if this was not
>>> contributing to what I thought was odd about
>>> the benchmark results.
>>
>> Your tracing refers to kernel thread stacks though, not the stacks used
>> by threads when executing in user mode. My understanding is that a HINT
>> implementation would spend virtually all of its time in user mode, so it
>> shouldn't matter much or at all if kernel thread stacks are backed by
>> memory from the "wrong" domain.
>
> Looks like I was trying to think about it when I should have been sleeping.
> You are correct.
>
>> This also doesn't really explain some of the disparities in the plots
>> you sent me. For instance, you get a much higher peak QUIS on FreeBSD
>> than on Fedora with 16 threads and an interleave/round-robin domain
>> selection policy.
>
> True. I suppose that there is the possibility that steady_clock's now() results
> are odd for some reason for the type of context, leading to the durations
> between such being on the short side where things look different.
>
> But the left hand side of the single-thread results (smaller memory sizes for
> the vectors for the integration kernel's use) do not show such a rescaling.
> (The single thread time measurements are strictly inside the thread of
> execution, no thread creation or such counted for any size.) The right hand
> side of the single thread results (larger memory use, making smaller cache
> levels fairly ineffective) do generally show some rescaling, but not as drastic
> as multi-threaded.
>
> Both round-robin and prefer:1 showed such for single threaded.

Just to be explicit about what would be executed in the
FreeBSD kernel . . .

One difference between the single-threaded and
multi-threaded benchmark code is that the multi-threaded
code calls steady_clock's now() from the main thread, so
the measured time includes whatever thread creation
contributes. The single-threaded code calls steady_clock's
now() from inside the same thread that executes the
integration kernel, so thread creation is not counted.

From what I've seen with truss, steady_clock's now() makes
system calls requesting CLOCK_MONOTONIC. That is FreeBSD
kernel code that could contribute somewhat to the measured
time. Having the kernel stack for this on the memory domain
where the time-measuring CPU is, vs. on a remote memory
domain, might make some difference in the duration results.
(But I have no clue specifically what differences to expect
for my context, so it may well not explain much of anything.)

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)
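
To make the single-threaded vs. multi-threaded timing difference described above concrete, here is a minimal C++ sketch of the two arrangements. It is not the benchmark's actual code: work(), time_inside_thread() and time_from_main() are hypothetical names, and work() is only a stand-in for the integration kernel. The point is just where the steady_clock::now() calls sit relative to thread creation.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;
using Seconds = std::chrono::duration<double>;

// Hypothetical stand-in for the integration kernel: touches n doubles.
double work(std::size_t n) {
    std::vector<double> v(n, 1.0);
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Single-threaded style: both now() calls happen inside the thread that
// does the work, so thread creation and join are not part of the duration.
Seconds time_inside_thread(std::size_t n, double& result) {
    Seconds elapsed{0};
    std::thread t([&] {
        const auto start = Clock::now();
        result = work(n);
        elapsed = Clock::now() - start;
    });
    t.join();
    return elapsed;
}

// Multi-threaded style: the main thread calls now() around creating and
// joining the workers, so thread creation cost is part of the duration.
Seconds time_from_main(std::size_t n, unsigned threads,
                       std::vector<double>& results) {
    assert(threads > 0);
    results.assign(threads, 0.0);
    const auto start = Clock::now();
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < threads; ++i)
        pool.emplace_back([&results, n, i] { results[i] = work(n); });
    for (auto& t : pool)
        t.join();
    return Clock::now() - start;
}
```

Calling time_inside_thread(n, r) and time_from_main(n, 16, rs) for the same n and comparing the two durations gives a rough idea of how much of the multi-threaded figure is thread creation and joining rather than the kernel itself. It does not by itself say anything about which memory domain backs the kernel stacks.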