From owner-freebsd-amd64@freebsd.org Sun Sep 22 21:00:40 2019
From: bugzilla-noreply@FreeBSD.org
To: amd64@FreeBSD.org
Subject: Problem reports for amd64@FreeBSD.org that need special attention
Date: Sun, 22 Sep 2019 21:00:38 +0000
Message-Id: <201909222100.x8ML0c0R082450@kenobi.freebsd.org>
List-Id: Porting FreeBSD to the AMD64 platform

To view an individual PR, use:
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=(Bug Id).

The following is a listing of current problems submitted by FreeBSD
users, which need special attention. These represent problem reports
covering all versions, including experimental development code and
obsolete releases.

Status      | Bug Id | Description
------------+--------+---------------------------------------------
In Progress | 239607 | amdtemp: Does not recognize AMD Ryzen 5 3600

1 problem total for which you should take action.
From owner-freebsd-amd64@freebsd.org Mon Sep 23 20:28:21 2019
From: Mark Millard <marklmi@yahoo.com>
To: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Date: Mon, 23 Sep 2019 13:28:15 -0700
Message-Id: <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com>
List-Id: Porting FreeBSD to the AMD64 platform

Note: I have access to only one FreeBSD amd64 context, and it is also
my only access to a NUMA context: 2 memory domains, a ThreadRipper
1950X. Also: I have only a head FreeBSD context on any architecture,
not 12.x or before. So I have limited compare/contrast material.

I present the below basically to ask whether the NUMA handling has
been validated, or whether it is going to be, at least for contexts
that might apply to the ThreadRipper 1950X and analogous contexts. My
results suggest it has not been (or that libc++'s time measurements
get messed up in a way that merely looks like NUMA mishandling; the
evidence is odd benchmark results based on the mean time per lap,
taking the median of those means across multiple trials).

I ran a benchmark on both Fedora 30 and FreeBSD 13 on this 1950X and
got expected results on Fedora but odd ones on FreeBSD. The benchmark
is a variation on the old HINT benchmark, including its old
multi-threading variation. I tried Fedora only after the FreeBSD
results looked odd. The other architectures I tried FreeBSD
benchmarking with did not look odd like this. (powerpc64 on an old
PowerMac, 2 sockets with 2 cores per socket; aarch64 Cortex-A57
Overdrive 1000; Cortex-A53 Pine64+ 2GB; armv7 Cortex-A7 Orange Pi+
2nd Ed. For these I used 4 threads, not more.)

I tend to write in terms of plots made from the data instead of the
raw benchmark data.

FreeBSD testing based on:
cpuset -l0-15  -n prefer:1
cpuset -l16-31 -n prefer:1

Fedora 30 testing based on:
numactl --preferred 1 --cpunodebind 0
numactl --preferred 1 --cpunodebind 1

While I have more results, I reference primarily DSIZE and ISIZE both
being unsigned long long and, separately, both being unsigned long, as
examples. Variations in results are not from the type differences on
any LP64 architecture. (But they give an idea of benchmark variability
in the test context.)
The Fedora results solidly show the bandwidth limitation of using one
memory controller. They also show the latency consequences of the
remote memory domain case vs. the local memory domain case. There is
not a lot of variability between the examples of the 2 type-pairs used
for Fedora.

Not true for FreeBSD on the 1950X:

A) The latency-constrained part of the graph looks to normally be
   using the local memory domain when -l0-15 is in use for 8 threads.

B) Both the -l0-15 and the -l16-31 parts of the graph for 8 threads
   that should be bandwidth limited mostly show examples that, as far
   as I can tell, would have to involve both memory controllers to
   reach the bandwidth shown. There is also wide variability, ranging
   between the expected 1-controller result and, say, what a
   2-controller round-robin would be expected to produce.

C) Even the single-threaded result shows a higher result for larger
   total bytes for the kernel vectors. Fedora does not.

I think that (B) is the most solid evidence for something being odd.

For reference, for FreeBSD:

# cpuset -g -d 1
domain 1 mask: 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31

-r352341 allows prefer:0 as well, but I happen to have used prefer:1
in these experiments.

The benchmark was built via devel/g++9 but linked with system
libraries, including libc++. Unfortunately, I'm not yet ready to
distribute the benchmark's source, but I expect to at some point. I do
not expect to ever distribute binaries. The source code for normal
builds is just standard C++17 code, and such builds are what is
involved here. [The powerpc64 context is a system-clang 8, ELFv1 based
system context, not the usual gcc 4.2.1 based one.]

More notes: in the 'kernel vectors: total Bytes' vs. 'Quality
Improvement Per Second' graphs, the left-hand side of each curve is
latency limited; the right-hand side is bandwidth limited for LP64.
(The total-Bytes axis uses log base 2 scaling in the graphs.)
Thread creation has latency, so the 8-thread curves are mostly of
interest for kernel vectors total bytes being 1 MiByte or more (say),
so that thread creations are not that much of the total contribution
to the measured time. The thread creations are via std::async use.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)

From owner-freebsd-amd64@freebsd.org Wed Sep 25 17:03:02 2019
From: Mark Johnston <markj@freebsd.org>
To: Mark Millard
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Date: Wed, 25 Sep 2019 13:02:55 -0400
Message-ID: <20190925170255.GA43643@raichu>
In-Reply-To: <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com>
List-Id: Porting FreeBSD to the AMD64 platform

On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 wrote:
> Note: I have access to only one
> FreeBSD amd64 context [...]
>
> B) Both the -l0-15 and the -l16-31 parts of the
> graph for 8 threads that should be bandwidth
> limited show mostly examples that would have to
> involve both memory controllers for the bandwidth
> to get the results shown as far as I can tell.
> [...]
> I think that (B) is the most solid evidence for
> something being odd.

The implication seems to be that your benchmark program is using pages
from both domains despite a policy which preferentially allocates
pages from domain 1, so you would first want to determine if this is
actually what's happening. As far as I know we currently don't have a
good way of characterizing per-domain memory usage within a process.

If your benchmark uses a large fraction of the system's memory, you
could use the vm.phys_free sysctl to get a sense of how much memory
from each domain is free.

Another possibility is to use DTrace to trace the requested domain in
vm_page_alloc_domain_after(). For example, the following DTrace
one-liner counts the number of pages allocated per domain by ls(1):

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
...
        0        71
        1        72
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
...
        1       143
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
...
        0       143

This approach might not work for various reasons depending on how
exactly your benchmark program works.

From owner-freebsd-amd64@freebsd.org Thu Sep 26 02:26:54 2019
From: Mark Millard <marklmi@yahoo.com>
To: Mark Johnston
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Date: Wed, 25 Sep 2019 19:26:46 -0700
Message-Id: <4F565B02-DC0D-4011-8266-D38E02788DD5@yahoo.com>
In-Reply-To: <20190925170255.GA43643@raichu>
References: <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com> <20190925170255.GA43643@raichu>
List-Id: Porting FreeBSD to the AMD64 platform
On 2019-Sep-25, at 10:02, Mark Johnston wrote:

> On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 wrote:
>> Note: I have access to only one FreeBSD amd64 context [...]
>>
>> B) Both the -l0-15 and the -l16-31 parts of the
>> graph for 8 threads that should be bandwidth
>> limited show mostly examples that would have to
>> involve both memory controllers for the bandwidth
>> to get the results shown as far as I can tell.
>> [...]
>> I think that (B) is the most solid evidence for
>> something being odd.
> The implication seems to be that your benchmark program is using pages
> from both domains despite a policy which preferentially allocates pages
> from domain 1, so you would first want to determine if this is actually
> what's happening. As far as I know we currently don't have a good way
> of characterizing per-domain memory usage within a process.
>
> If your benchmark uses a large fraction of the system's memory, you
> could use the vm.phys_free sysctl to get a sense of how much memory from
> each domain is free.

The ThreadRipper 1950X has 96 GiBytes of ECC RAM, so 48 GiBytes per memory
domain. I've never configured the benchmark such that it even reaches
10 GiBytes on this hardware. (It stops for a time constraint first, based
on the values in use for the "adjustable" items.)

The benchmark runs the Hierarchical INTegration kernel for a sequence of
larger and larger numbers of cells in the grid that it uses. Each size is
run in isolation before the next is tried, and each gets its own timings.
Each size gets its own kernel vector allocations (and deallocations), with
the trials and the laps within a trial reusing the same memory. Each lap in
each trial gets its own thread creations (and completions). The main thread
combines the results when there are multiple threads involved. (So I'm not
sure of the main thread's behavior relative to the cpuset commands.)

Thus, there are lots of thread creations overall, as well as lots of
allocations of vectors for use in the integration kernel code.

What it looks like to me is that std::async's internal thread creations
are not respecting the cpuset command settings: in a sense, not inheriting
the cpuset information correctly (or such is being ignored).

For reference, the following shows the std::async use for the
multi-threaded case. Normal builds plug in no-op code for
RestrictThreadToCpu(. . ., . . .); // if built for such
(intended as a hook for potential experiments that cpuset can not set up
for multi-threaded runs). I make this point because the call shows up
below but it is not doing anything here.

One std::async use is where the kernel vector memory allocations are done:

    for (HwConcurrencyCount thread{0u}; thread < threads; ++thread)
    {
        auto alloc_thread
            { std::async( std::launch::async
                        , [&threads_kvs, memry]()
                          { threads_kvs.emplace_back(memry); }
                        )
            };
        alloc_thread.wait();
    }

So the main thread is not doing the allocations: created, temporary
threads are.

As for running the trials and laps of the integration kernel for a given
size of grid, each lap creates its own threads:

    for ( auto trials_left{NTRIAL}; 0u < trials_left; --trials_left )
    {
        KernelResults result{};

        for ( auto lap_count_down{laps}; 0u < lap_count_down; --lap_count_down )
        {
            std::vector<std::future<KernelResults>> in_parallel;

            for (HwConcurrencyCount thread{0u}; thread < threads; ++thread)
            {
                in_parallel.push_back
                    ( std::async( std::launch::async
                                , . . . ( thread , ki , threads_kvs[thread] ) . . .
                                )
                    );
            }

            KernelResults lap_result{};

            for (auto& thread : in_parallel) { lap_result.Merge(thread.get()); }

            result = lap_result; // Using the last lap's result
        }

        auto const finish{clock_info.Now()};

        . . . (process the measurement, no threading) . . .
    }

Based on such and each cpuset command that I reported, I'd not expect any
variability in which domains the memory allocations are made from, or in
which domain's CPUs are accessing that memory.

For reference, this is how the kernel vectors are structured:

    template <typename DSIZE, typename ISIZE>
    struct KernelVectors
    {
        using RECTVector = std::vector<RECT<DSIZE>>;
        using ErrsVector = std::vector<DSIZE>;
        using IxesVector = std::vector<ISIZE>; // Holds indexes into rect and errs.

        RECTVector rect;
        ErrsVector errs;
        IxesVector ixes; // indexes into rect and errs.

        KernelVectors() = delete;

        KernelVectors(ISIZE memry) : rect(memry), errs(memry*2), ixes(memry*2) {}

        . . . (irrelevant methods omitted) . . .
    }; // KernelVectors

with (irrelevant comment lines eliminated):

    template <typename DSIZE>
    struct RECT
    {
        DSIZE ahi, // Upper bound via rectangle areas for scx by scy breakdown
              alo, // Lower bound via rectangle areas for scx by scy breakdown
              dx,  // Interval widths, SEE NOTES BELOW.
              flh, // Function values of left coordinates, high
              fll, // Function values of left coordinates, low
              frh, // Function values of right coordinates, high
              frl, // Function values of right coordinates, low
              xl,  // Left x-coordinates of subintervals
              xr;  // Right x-coordinates of subintervals
    }; // RECT

Even the single-threaded integration kernel case executes the kernel
vector memory allocation step and the trials and laps via std::async
instead of using the main thread for such.

Note: The original HINT's copyright holder, Iowa State University
Research Foundation, Inc., indicated that HINT was intended to be
licensed via GPLv2 (not earlier and not later), despite how it was
(inappropriately) distributed without indicating which GPL version back
then. Thus, overall, this variation on HINT is also GPLv2-only in order
to respect the original intent.

> Another possibility is to use DTrace to trace the
> requested domain in vm_page_alloc_domain_after(). For example, the
> following DTrace one-liner counts the number of pages allocated per
> domain by ls(1):
>
> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
> ...
>        0               71
>        1               72
> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
> ...
>        1              143
> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
> ...
>        0              143

I'll think about this, although it would give no information about which
CPUs are executing the threads that are allocating or accessing the
vectors for the integration kernel. So, for example, if the threads
migrate to or start out on CPUs they should not be on, this would not
report such.

For such "which CPUs" questions, one stab would be simply to watch with
top while the benchmark is running and see which CPUs end up being busy
vs. which do not. I think I'll try this.
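The lap/thread structure described above can be condensed into a
self-contained fork-join sketch. This is not the benchmark's real code:
KernelResults, run_lap, and the per-thread work are stand-ins; only the
launch-then-Merge shape mirrors what the benchmark does.

```cpp
#include <future>
#include <vector>

// Stand-in for the benchmark's per-thread result type; only Merge matters.
struct KernelResults {
    long long count{0};
    void Merge(KernelResults const& other) { count += other.count; }
};

// One "lap": launch one std::async task per thread, then the main thread
// merges the partial results, as in the benchmark's inner loop.
KernelResults run_lap(unsigned threads) {
    std::vector<std::future<KernelResults>> in_parallel;
    for (unsigned thread{0u}; thread < threads; ++thread) {
        in_parallel.push_back(std::async(std::launch::async, [thread]() {
            KernelResults r;
            r.count = static_cast<long long>(thread) + 1; // placeholder work
            return r;
        }));
    }
    KernelResults lap_result{};
    for (auto& t : in_parallel) lap_result.Merge(t.get());
    return lap_result; // the main thread holds the combined result
}
```

Note that each call to run_lap creates fresh threads, which is why the
policy-inheritance behavior of std::async's internal thread creation is
the interesting question here.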
> This approach might not work for various reasons depending on how
> exactly your benchmark program works.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)

From: Mark Millard
To: Mark Johnston
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Date: Wed, 25 Sep 2019 20:27:58 -0700
Message-Id: <78A4D18C-89E6-48D8-8A99-5FAC4602AE19@yahoo.com>
In-Reply-To: <4F565B02-DC0D-4011-8266-D38E02788DD5@yahoo.com>

On 2019-Sep-25, at 19:26, Mark Millard wrote:

> On 2019-Sep-25, at 10:02, Mark Johnston wrote:
>
>> . . . (the original report and suggestions, quoted in full above, omitted) . . .
>
> I'll think about this, although it would give no
> information about which CPUs are executing the threads
> that are allocating or accessing the vectors for
> the integration kernel. So, for example, if the
> threads migrate to or start out on CPUs they
> should not be on, this would not report such.
>
> For such "which CPUs" questions one stab would
> be simply to watch with top while the benchmark
> is running and see which CPUs end up being busy
> vs. which do not. I think I'll try this.

Using top did not show evidence of the wrong CPUs being actively in use.

My variation of top is unusual in that it also tracks some maximum
observed figures and reports them, here being:

8804M MaxObsActive, 4228M MaxObsWired, 13G MaxObs(Act+Wir)

(no swap use was reported). This gives a system-level view of about how
much RAM was put to use during the monitoring of the 2 benchmark runs
(-l0-15 and -l16-31). Nowhere near enough was used to require both memory
domains to be in use.

Thus, it would appear to be just where the allocations are made for
-n prefer:1 that matters, at least when a (temporary) thread does the
allocations.

>> This approach might not work for various reasons depending on how
>> exactly your benchmark program works.

I've not tried dtrace yet.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)

From owner-freebsd-amd64@freebsd.org Thu Sep 26 05:03:23 2019
From: Mark Millard
To: Mark Johnston
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Date: Wed, 25 Sep 2019 22:03:14 -0700
Message-Id: <26B47782-033B-40C8-B8F8-4C731B167243@yahoo.com>
In-Reply-To: <78A4D18C-89E6-48D8-8A99-5FAC4602AE19@yahoo.com>

On 2019-Sep-25, at 20:27, Mark Millard wrote:

> On 2019-Sep-25, at 19:26, Mark Millard wrote:
>
>> On 2019-Sep-25, at 10:02,
Mark Johnston wrote:
>>
>>> . . . (the original report and suggestions, quoted in full above, omitted) . . .
>>
>> . . . (my earlier reply, quoted in full above, omitted) . . .
>>
>> For such "which CPUs" questions one stab would
>> be simply to watch with top while the benchmark
>> is running and see which CPUs end up being busy
>> vs. which do not. I think I'll try this.
>
> Using top did not show evidence of the wrong
> CPUs being actively in use.
>
> . . . (much omitted material) . . .
>
> I've not tried dtrace yet.

Well, for an example -l0-15 -n prefer:1 run for just the 8 threads
benchmark case . . .

dtrace: pid 10997 has exited

        0              712
        1          6737529

Something is leading to domain 0 allocations, despite -n prefer:1 .

So I tried -l16-31 -n prefer:1 and it got:

dtrace: pid 11037 has exited

        0                2
        1          8055389

(The larger number of allocations is not a surprise: more work was done
in about the same overall time, based on faster memory access generally.)
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)

From: Mark Johnston
To: Mark Millard
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Date: Thu, 26 Sep 2019 16:29:36 -0400
Message-ID: <20190926202936.GD5581@raichu>
In-Reply-To: <26B47782-033B-40C8-B8F8-4C731B167243@yahoo.com>

On Wed, Sep 25, 2019 at 10:03:14PM -0700, Mark Millard wrote:
> >>>> > >>>> FreeBSD testing based on: > >>>> cpuset -l0-15 -n prefer:1 > >>>> cpuset -l16-31 -n prefer:1 > >>>> > >>>> Fedora 30 testing based on: > >>>> numactl --preferred 1 --cpunodebind 0 > >>>> numactl --preferred 1 --cpunodebind 1 > >>>> > >>>> While I have more results, I reference primarily DSIZE > >>>> and ISIZE being unsigned long long and also both being > >>>> unsigned long as examples. Variations in results are not > >>>> from the type differences for any LP64 architectures. > >>>> (But they give an idea of benchmark variability in the > >>>> test context.) > >>>> > >>>> The Fedora results solidly show the bandwidth limitation > >>>> of using one memory controller. They also show the latency > >>>> consequences for the remote memory domain case vs. the > >>>> local memory domain case. There is not a lot of > >>>> variability between the examples of the 2 type-pairs used > >>>> for Fedora. > >>>> > >>>> Not true for FreeBSD on the 1950X: > >>>> > >>>> A) The latency-constrained part of the graph looks to > >>>> normally be using the local memory domain when > >>>> -l0-15 is in use for 8 threads. > >>>> > >>>> B) Both the -l0-15 and the -l16-31 parts of the > >>>> graph for 8 threads that should be bandwidth > >>>> limited show mostly examples that would have to > >>>> involve both memory controllers for the bandwidth > >>>> to get the results shown as far as I can tell. > >>>> There is also wide variability ranging between the > >>>> expected 1 controller result and, say, what a 2 > >>>> controller round-robin would be expected produce. > >>>> > >>>> C) Even the single threaded result shows a higher > >>>> result for larger total bytes for the kernel > >>>> vectors. Fedora does not. > >>>> > >>>> I think that (B) is the most solid evidence for > >>>> something being odd. 
>>>> The implication seems to be that your benchmark program is using
>>>> pages from both domains despite a policy which preferentially
>>>> allocates pages from domain 1, so you would first want to determine
>>>> if this is actually what's happening. As far as I know we currently
>>>> don't have a good way of characterizing per-domain memory usage
>>>> within a process.
>>>>
>>>> If your benchmark uses a large fraction of the system's memory, you
>>>> could use the vm.phys_free sysctl to get a sense of how much memory
>>>> from each domain is free.
>>>
>>> The ThreadRipper 1950X has 96 GiBytes of ECC RAM, so 48 GiBytes per
>>> memory domain. I've never configured the benchmark such that it even
>>> reaches 10 GiBytes on this hardware. (It stops for a time constraint
>>> first, based on the values in use for the "adjustable" items.)
>>>
>>> . . . (much omitted material) . . .
>>>
>>>> Another possibility is to use DTrace to trace the requested domain
>>>> in vm_page_alloc_domain_after(). For example, the following DTrace
>>>> one-liner counts the number of pages allocated per domain by ls(1):
>>>>
>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
>>>> ...
>>>>         0               71
>>>>         1               72
>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
>>>> ...
>>>>         1              143
>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
>>>> ...
>>>>         0              143
>>>
>>> I'll think about this, although it would give no information about
>>> which CPUs are executing the threads that are allocating or
>>> accessing the vectors for the integration kernel. So, for example,
>>> if the threads migrate to or start out on CPUs they should not be
>>> on, this would not report such.
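[An editorial aside: the aggregation output of the one-liner above is just a domain number followed by a page count per line, so it is easy to post-process. The following is a small sketch (hypothetical helper names, not something from the thread) that tallies such output and reports what fraction of allocations landed in the preferred domain.]

```python
# Hypothetical helpers (not part of the thread): tally the output of a
# dtrace aggregation like @[args[2]] = count(), whose lines look like
#         0              712
#         1          6737529
# i.e. a NUMA domain number followed by an allocation count.


def tally_domains(dtrace_output: str) -> dict:
    """Return {domain: pages} parsed from a dtrace count() aggregation."""
    counts = {}
    for line in dtrace_output.splitlines():
        fields = line.split()
        # Keep only lines that are exactly two integers; skip dtrace
        # status lines and the "..." separators.
        if len(fields) == 2 and all(f.isdigit() for f in fields):
            domain, pages = int(fields[0]), int(fields[1])
            counts[domain] = counts.get(domain, 0) + pages
    return counts


def preferred_fraction(counts: dict, domain: int) -> float:
    """Fraction of page allocations that landed in the given domain."""
    total = sum(counts.values())
    return counts.get(domain, 0) / total if total else 0.0
```

[With the numbers reported later in the thread for the -l0-15 -n prefer:1 run (domain 0: 712, domain 1: 6737529), `preferred_fraction(counts, 1)` is above 0.999, i.e. the stray domain 0 allocations are a tiny fraction of the total.]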
>>>
>>> For such "which CPUs" questions, one stab would be simply to watch
>>> with top while the benchmark is running and see which CPUs end up
>>> being busy vs. which do not. I think I'll try this.
>>
>> Using top did not show evidence of the wrong CPUs being actively in
>> use.
>>
>> My variation of top is unusual in that it also tracks some maximum
>> observed figures and reports them, here being:
>>
>> 8804M MaxObsActive, 4228M MaxObsWired, 13G MaxObs(Act+Wir)
>>
>> (no swap use was reported). This gives a system-level view of about
>> how much RAM was put to use during the monitoring of the 2 benchmark
>> runs (-l0-15 and -l16-31). Nowhere near enough used to require both
>> memory domains to be in use.
>>
>> Thus, it would appear to be just where the allocations are made for
>> -n prefer:1 that matters, at least when a (temporary) thread does the
>> allocations.
>>
>>>> This approach might not work for various reasons depending on how
>>>> exactly your benchmark program works.
>>
>> I've not tried dtrace yet.
>
> Well, for an example -l0-15 -n prefer:1 run for just the 8 threads
> benchmark case . . .
>
> dtrace: pid 10997 has exited
>
>         0              712
>         1          6737529
>
> Something is leading to domain 0 allocations, despite -n prefer:1 .

You can get a sense of where these allocations are occurring by changing
the probe to capture kernel stacks for domain 0 page allocations:

fbt::vm_page_alloc_domain_after:entry /progenyof($target) && args[2] == 0/{@[stack()] = count();}

One possibility is that these are kernel memory allocations occurring in
the context of the benchmark threads. Such allocations may not respect
the configured policy since they are not private to the allocating
thread. For instance, upon opening a file, the kernel may allocate a
vnode structure for that file.
That vnode may be accessed by threads from many processes over its
lifetime, and may be recycled many times before its memory is released
back to the allocator.

Given the low number of domain 0 allocations I am skeptical that they
are responsible for the variability in your results.

> So I tried -l16-31 -n prefer:1 and it got:
>
> dtrace: pid 11037 has exited
>
>         0                2
>         1          8055389
>
> (The larger number of allocations is not a surprise: more work done in
> about the same overall time based on faster memory access generally.)

From owner-freebsd-amd64@freebsd.org Fri Sep 27 00:05:45 2019
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
From: Mark Millard
In-Reply-To: <20190926202936.GD5581@raichu>
Date: Thu, 26 Sep 2019 17:05:38 -0700
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Message-Id: <2DE123BE-B0F8-43F6-B950-F41CF0DEC8AD@yahoo.com>
To: Mark Johnston

On 2019-Sep-26, at 13:29, Mark Johnston wrote:

> . . . (much omitted material) . . .
>
>> Well, for an example -l0-15 -n prefer:1 run for just the 8 threads
>> benchmark case . . .
>>
>> dtrace: pid 10997 has exited
>>
>>         0              712
>>         1          6737529
>>
>> Something is leading to domain 0 allocations, despite -n prefer:1 .
>
> You can get a sense of where these allocations are occurring by
> changing the probe to capture kernel stacks for domain 0 page
> allocations:
>
> fbt::vm_page_alloc_domain_after:entry /progenyof($target) && args[2] == 0/{@[stack()] = count();}
>
> One possibility is that these are kernel memory allocations occurring
> in the context of the benchmark threads. Such allocations may not
> respect the configured policy since they are not private to the
> allocating thread. For instance, upon opening a file, the kernel may
> allocate a vnode structure for that file.
> That vnode may be accessed by threads from many processes over its
> lifetime, and may be recycled many times before its memory is released
> back to the allocator.

For -l0-15 -n prefer:1 :

Looks like this reports sys_thr_new activity, sys_cpuset activity, and
0xffffffff80bc09bd activity (whatever that is). Mostly sys_thr_new
activity, over 1300 of them . . .

dtrace: pid 13553 has exited

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`thread_init+0x22
  kernel`keg_alloc_slab+0x259
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`thread_alloc+0x23
  kernel`thread_create+0x13a
  kernel`sys_thr_new+0xd2
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    2

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`cpuset_setproc+0x65
  kernel`sys_cpuset+0x123
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    2

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`uma_zfree_arg+0x36a
  kernel`thread_reap+0x106
  kernel`thread_alloc+0xf
  kernel`thread_create+0x13a
  kernel`sys_thr_new+0xd2
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    6

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`uma_zfree_arg+0x36a
  kernel`vm_map_process_deferred+0x8c
  kernel`vm_map_remove+0x11d
  kernel`vmspace_exit+0xd3
  kernel`exit1+0x5a9
  kernel`0xffffffff80bc09bd
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    6

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`thread_alloc+0x23
  kernel`thread_create+0x13a
  kernel`sys_thr_new+0xd2
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    22

  kernel`vm_page_grab_pages+0x1b4
  kernel`vm_thread_stack_create+0xc0
  kernel`kstack_import+0x52
  kernel`uma_zalloc_arg+0x62b
  kernel`vm_thread_new+0x4d
  kernel`thread_alloc+0x31
  kernel`thread_create+0x13a
  kernel`sys_thr_new+0xd2
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    1324

For -l16-31 -n prefer:1 :

Again, exactly 2. Both being sys_cpuset . . .

dtrace: pid 13594 has exited

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`cpuset_setproc+0x65
  kernel`sys_cpuset+0x123
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    2

> Given the low number of domain 0 allocations I am skeptical that they
> are responsible for the variability in your results.
>
>> So I tried -l16-31 -n prefer:1 and it got:
>>
>> dtrace: pid 11037 has exited
>>
>>         0                2
>>         1          8055389
>>
>> (The larger number of allocations is not a surprise: more work done
>> in about the same overall time based on faster memory access
>> generally.)

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)
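[An editorial aside: the stack listings above can be summarized mechanically. The sketch below (hypothetical helper, not something from the thread) groups a dtrace `@[stack()] = count()` aggregation by the `kernel`sys_*` frame found in each stack, to show which system calls drive the domain 0 page allocations.]

```python
# Hypothetical post-processing helper (not part of the thread): group a
# dtrace stack aggregation by the sys_* syscall frame in each stack.
# Input format: indented frame lines, a bare count line ending each
# stack, and blank lines between stacks.

from collections import defaultdict


def count_by_syscall(dtrace_output: str) -> dict:
    """Sum per-stack counts, keyed by the sys_* frame (or 'other')."""
    totals = defaultdict(int)
    frames = []
    for line in dtrace_output.splitlines():
        token = line.strip()
        if not token:
            frames = []          # blank line separates stacks
        elif token.isdigit():    # a bare number is the stack's count
            frame = next((f for f in frames if "`sys_" in f), None)
            # Strip the module prefix and the +offset from the frame.
            name = frame.split("`", 1)[1].split("+", 1)[0] if frame else "other"
            totals[name] += int(token)
            frames = []
        else:
            frames.append(token)
    return dict(totals)
```

[Applied to the -l0-15 listing above, this would attribute 1354 allocations to sys_thr_new (2 + 6 + 22 + 1324), 2 to sys_cpuset, and 6 to "other" for the exit1 stack, matching the "over 1300 of them" observation.]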
From owner-freebsd-amd64@freebsd.org Fri Sep 27 03:37:47 2019
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
From: Mark Millard
In-Reply-To: <2DE123BE-B0F8-43F6-B950-F41CF0DEC8AD@yahoo.com>
Date: Thu, 26 Sep 2019 20:37:39 -0700
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Message-Id: <6BC5F6BE-5FC3-48FA-9873-B20141FEFDF5@yahoo.com>
To: Mark Johnston
On 2019-Sep-26, at 17:05, Mark Millard wrote: > On 2019-Sep-26, at 13:29, Mark Johnston wrote: >=20 >> On Wed, Sep 25, 2019 at 10:03:14PM -0700, Mark Millard wrote: >>>=20 >>>=20 >>> On 2019-Sep-25, at 20:27, Mark Millard wrote: >>>=20 >>>> On 2019-Sep-25, at 19:26, Mark Millard = wrote: >>>>=20 >>>>> On 2019-Sep-25, at 10:02, Mark Johnston = wrote: >>>>>=20 >>>>>> On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via = freebsd-amd64 wrote: >>>>>>> Note: I have access to only one FreeBSD amd64 context, and >>>>>>> it is also my only access to a NUMA context: 2 memory >>>>>>> domains. A Threadripper 1950X context. Also: I have only >>>>>>> a head FreeBSD context on any architecture, not 12.x or >>>>>>> before. So I have limited compare/contrast material. >>>>>>>=20 >>>>>>> I present the below basically to ask if the NUMA handling >>>>>>> has been validated, or if it is going to be, at least for >>>>>>> contexts that might apply to ThreadRipper 1950X and >>>>>>> analogous contexts. My results suggest they are not (or >>>>>>> libc++'s now times get messed up such that it looks like >>>>>>> NUMA mishandling since this is based on odd benchmark >>>>>>> results that involve mean time for laps, using a median >>>>>>> of such across multiple trials). >>>>>>>=20 >>>>>>> I ran a benchmark on both Fedora 30 and FreeBSD 13 on this >>>>>>> 1950X got got expected results on Fedora but odd ones on >>>>>>> FreeBSD. The benchmark is a variation on the old HINT >>>>>>> benchmark, spanning the old multi-threading variation. I >>>>>>> later tried Fedora because the FreeBSD results looked odd. >>>>>>> The other architectures I tried FreeBSD benchmarking with >>>>>>> did not look odd like this. (powerpc64 on a old PowerMac 2 >>>>>>> socket with 2 cores per socket, aarch64 Cortex-A57 Overdrive >>>>>>> 1000, CortextA53 Pine64+ 2GB, armv7 Cortex-A7 Orange Pi+ 2nd >>>>>>> Ed. For these I used 4 threads, not more.) 
>>>>>>>=20 >>>>>>> I tend to write in terms of plots made from the data instead >>>>>>> of the raw benchmark data. >>>>>>>=20 >>>>>>> FreeBSD testing based on: >>>>>>> cpuset -l0-15 -n prefer:1 >>>>>>> cpuset -l16-31 -n prefer:1 >>>>>>>=20 >>>>>>> Fedora 30 testing based on: >>>>>>> numactl --preferred 1 --cpunodebind 0 >>>>>>> numactl --preferred 1 --cpunodebind 1 >>>>>>>=20 >>>>>>> While I have more results, I reference primarily DSIZE >>>>>>> and ISIZE being unsigned long long and also both being >>>>>>> unsigned long as examples. Variations in results are not >>>>>>> from the type differences for any LP64 architectures. >>>>>>> (But they give an idea of benchmark variability in the >>>>>>> test context.) >>>>>>>=20 >>>>>>> The Fedora results solidly show the bandwidth limitation >>>>>>> of using one memory controller. They also show the latency >>>>>>> consequences for the remote memory domain case vs. the >>>>>>> local memory domain case. There is not a lot of >>>>>>> variability between the examples of the 2 type-pairs used >>>>>>> for Fedora. >>>>>>>=20 >>>>>>> Not true for FreeBSD on the 1950X: >>>>>>>=20 >>>>>>> A) The latency-constrained part of the graph looks to >>>>>>> normally be using the local memory domain when >>>>>>> -l0-15 is in use for 8 threads. >>>>>>>=20 >>>>>>> B) Both the -l0-15 and the -l16-31 parts of the >>>>>>> graph for 8 threads that should be bandwidth >>>>>>> limited show mostly examples that would have to >>>>>>> involve both memory controllers for the bandwidth >>>>>>> to get the results shown as far as I can tell. >>>>>>> There is also wide variability ranging between the >>>>>>> expected 1 controller result and, say, what a 2 >>>>>>> controller round-robin would be expected produce. >>>>>>>=20 >>>>>>> C) Even the single threaded result shows a higher >>>>>>> result for larger total bytes for the kernel >>>>>>> vectors. Fedora does not. 
>>>>>>>=20 >>>>>>> I think that (B) is the most solid evidence for >>>>>>> something being odd. >>>>>>=20 >>>>>> The implication seems to be that your benchmark program is using = pages >>>>>> from both domains despite a policy which preferentially allocates = pages >>>>>> from domain 1, so you would first want to determine if this is = actually >>>>>> what's happening. As far as I know we currently don't have a = good way >>>>>> of characterizing per-domain memory usage within a process. >>>>>>=20 >>>>>> If your benchmark uses a large fraction of the system's memory, = you >>>>>> could use the vm.phys_free sysctl to get a sense of how much = memory from >>>>>> each domain is free. >>>>>=20 >>>>> The ThreadRipper 1950X has 96 GiBytes of ECC RAM, so 48 GiBytes = per memory >>>>> domain. I've never configured the benchmark such that it even = reaches >>>>> 10 GiBytes on this hardware. (It stops for a time constraint = first, >>>>> based on the values in use for the "adjustable" items.) >>>>>=20 >>>>> . . . (much omitted material) . . . >>>>=20 >>>>>=20 >>>>>> Another possibility is to use DTrace to trace the >>>>>> requested domain in vm_page_alloc_domain_after(). For example, = the >>>>>> following DTrace one-liner counts the number of pages allocated = per >>>>>> domain by ls(1): >>>>>>=20 >>>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry = /progenyof($target)/{@[args[2]] =3D count();}' -c "cpuset -n rr ls" >>>>>> ... >>>>>> 0 71 >>>>>> 1 72 >>>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry = /progenyof($target)/{@[args[2]] =3D count();}' -c "cpuset -n prefer:1 = ls" >>>>>> ... >>>>>> 1 143 >>>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry = /progenyof($target)/{@[args[2]] =3D count();}' -c "cpuset -n prefer:0 = ls" >>>>>> ... >>>>>> 0 143 >>>>>=20 >>>>> I'll think about this, although it would give no >>>>> information which CPUs are executing the threads >>>>> that are allocating or accessing the vectors for >>>>> the integration kernel. 
So, for example, if the >>>>> threads migrate to or start out on cpus they >>>>> should not be on, this would not report such. >>>>>=20 >>>>> For such "which CPUs" questions one stab would >>>>> be simply to watch with top while the benchmark >>>>> is running and see which CPUs end up being busy >>>>> vs. which do not. I think I'll try this. >>>>=20 >>>> Using top did not show evidence of the wrong >>>> CPUs being actively in use. >>>>=20 >>>> My variation of top is unusual in that it also >>>> tracks some maximum observed figures and reports >>>> them, here being: >>>>=20 >>>> 8804M MaxObsActive, 4228M MaxObsWired, 13G MaxObs(Act+Wir) >>>>=20 >>>> (no swap use was reported). This gives a system >>>> level view of about how much RAM was put to use >>>> during the monitoring of the 2 benchmark runs >>>> (-l0-15 and -l16-31). No where near enough used >>>> to require both memory domains to be in use. >>>>=20 >>>> Thus, it would appear to be just where the >>>> allocations are made for -n prefer:1 that >>>> matters, at least when a (temporary) thread >>>> does the allocations. >>>>=20 >>>>>> This approach might not work for various reasons depending on how >>>>>> exactly your benchmark program works. >>>>=20 >>>> I've not tried dtrace yet. >>>=20 >>> Well, for an example -l0-15 -n prefer:1 run >>> for just the 8 threads benchmark case . . . >>>=20 >>> dtrace: pid 10997 has exited >>>=20 >>> 0 712 >>> 1 6737529 >>>=20 >>> Something is leading to domain 0 >>> allocations, despite -n prefer:1 . >>=20 >> You can get a sense of where these allocations are occuring by = changing >> the probe to capture kernel stacks for domain 0 page allocations: >>=20 >> fbt::vm_page_alloc_domain_after:entry /progenyof($target) && args[2] = =3D=3D 0/{@[stack()] =3D count();} >>=20 >> One possibility is that these are kernel memory allocations occurring = in >> the context of the benchmark threads. 
Such allocations may not = respect >> the configured policy since they are not private to the allocating >> thread. For instance, upon opening a file, the kernel may allocate a >> vnode structure for that file. That vnode may be accessed by threads >> from many processes over its lifetime, and may be recycled many times >> before its memory is released back to the allocator. >=20 > For -l0-15 -n prefer:1 : >=20 > Looks like this reports sys_thr_new activity, sys_cpuset > activity, and 0xffffffff80bc09bd activity (whatever that > is). Mostly sys_thr_new activity, over 1300 of them . . . >=20 > dtrace: pid 13553 has exited >=20 >=20 > kernel`uma_small_alloc+0x61 > kernel`keg_alloc_slab+0x10b > kernel`zone_import+0x1d2 > kernel`uma_zalloc_arg+0x62b > kernel`thread_init+0x22 > kernel`keg_alloc_slab+0x259 > kernel`zone_import+0x1d2 > kernel`uma_zalloc_arg+0x62b > kernel`thread_alloc+0x23 > kernel`thread_create+0x13a > kernel`sys_thr_new+0xd2 > kernel`amd64_syscall+0x3ae > kernel`0xffffffff811b7600 > 2 >=20 > kernel`uma_small_alloc+0x61 > kernel`keg_alloc_slab+0x10b > kernel`zone_import+0x1d2 > kernel`uma_zalloc_arg+0x62b > kernel`cpuset_setproc+0x65 > kernel`sys_cpuset+0x123 > kernel`amd64_syscall+0x3ae > kernel`0xffffffff811b7600 > 2 >=20 > kernel`uma_small_alloc+0x61 > kernel`keg_alloc_slab+0x10b > kernel`zone_import+0x1d2 > kernel`uma_zalloc_arg+0x62b > kernel`uma_zfree_arg+0x36a > kernel`thread_reap+0x106 > kernel`thread_alloc+0xf > kernel`thread_create+0x13a > kernel`sys_thr_new+0xd2 > kernel`amd64_syscall+0x3ae > kernel`0xffffffff811b7600 > 6 >=20 > kernel`uma_small_alloc+0x61 > kernel`keg_alloc_slab+0x10b > kernel`zone_import+0x1d2 > kernel`uma_zalloc_arg+0x62b > kernel`uma_zfree_arg+0x36a > kernel`vm_map_process_deferred+0x8c > kernel`vm_map_remove+0x11d > kernel`vmspace_exit+0xd3 > kernel`exit1+0x5a9 > kernel`0xffffffff80bc09bd > kernel`amd64_syscall+0x3ae > kernel`0xffffffff811b7600 > 6 >=20 > kernel`uma_small_alloc+0x61 > kernel`keg_alloc_slab+0x10b > 
kernel`zone_import+0x1d2
> kernel`uma_zalloc_arg+0x62b
> kernel`thread_alloc+0x23
> kernel`thread_create+0x13a
> kernel`sys_thr_new+0xd2
> kernel`amd64_syscall+0x3ae
> kernel`0xffffffff811b7600
> 22
>
> kernel`vm_page_grab_pages+0x1b4
> kernel`vm_thread_stack_create+0xc0
> kernel`kstack_import+0x52
> kernel`uma_zalloc_arg+0x62b
> kernel`vm_thread_new+0x4d
> kernel`thread_alloc+0x31
> kernel`thread_create+0x13a
> kernel`sys_thr_new+0xd2
> kernel`amd64_syscall+0x3ae
> kernel`0xffffffff811b7600
> 1324

With sys_thr_new not respecting -n prefer:1 for
-l0-15 (especially for the thread stacks), I
looked some at the generated integration kernel
code and it makes significant use of %rsp based
memory accesses (read and write).

That would get both memory controllers going in
parallel (kernel vectors accesses to the preferred
memory domain), so not slowing down as expected.

If round-robin is not respected for thread stacks,
and if threads migrate cpus across memory domains
at times, there could be considerable variability
for that context as well. (This may not be the
only way to have different/extra variability for
this context.)

Overall: I'd be surprised if this was not
contributing to what I thought was odd about
the benchmark results.

> For -l16-31 -n prefer:1 :
>
> Again, exactly 2. Both being sys_cpuset . . .
>
> dtrace: pid 13594 has exited
>
>
> kernel`uma_small_alloc+0x61
> kernel`keg_alloc_slab+0x10b
> kernel`zone_import+0x1d2
> kernel`uma_zalloc_arg+0x62b
> kernel`cpuset_setproc+0x65
> kernel`sys_cpuset+0x123
> kernel`amd64_syscall+0x3ae
> kernel`0xffffffff811b7600
> 2
>
>>
>> Given the low number of domain 0 allocations I am skeptical that they
>> are responsible for the variability in your results.
>>
>>> So I tried -l16-31 -n prefer:1 and it got:
>>>
>>> dtrace: pid 11037 has exited
>>>
>>>         0                2
>>>         1          8055389
>>>
>>> (The larger number of allocations is
>>> not a surprise: more work was done in
>>> about the same overall time, based on
>>> generally faster memory access.)
>

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)


From owner-freebsd-amd64@freebsd.org Fri Sep 27 15:39:16 2019
From: "Jukka A. Ukkonen" <jau789@gmail.com>
To: freebsd-amd64@freebsd.org
Subject: i915kms question
Date: Fri, 27 Sep 2019 18:39:10 +0300
Hello all,

I recently bought a 7" TFT display whose native dimensions are 1024x600
pixels, which in text mode is 128x36 characters, assuming the 8x16 font
is used.

It seems that whatever I do, the kernel stubbornly ends up assuming
160x64 dimensions in text mode, which with the 8x16 console font is the
same as 1280x1024 pixels. Obviously some text tends to disappear beyond
the right edge of the screen. Similarly, 28 lines of text roll over the
lower edge of the screen.

The graphics device to which the display device is connected (via HDMI)
is ...

# pciconf -lv vgapci0
vgapci0@pci0:0:2:0: class=0x030000 card=0x0f318086 chip=0x0f318086 rev=0x11 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = 'Atom Processor Z36xxx/Z37xxx Series Graphics & Display'
    class    = display
    subclass = VGA

I have set the following in /boot/loader.conf ...

kern.vt.fb.default_mode="1024x600"
kern.vt.fb.modes.HDMI-A-1="1024x600"
i915kms_load="YES"

Is that an indication that the i915kms module does not support this
particular graphics hardware model?

I also tried starting X11 on the system just to get a bit more info
about what is going on. That did not work out either. I got the
following error messages ...

X.Org X Server 1.18.4
Release Date: 2016-07-19
X Protocol Version 11, Revision 0
Build Operating System: FreeBSD 11.2-RELEASE-p14 amd64
Current Operating System: FreeBSD xyzzy 11.3-STABLE FreeBSD 11.3-STABLE #0 r351588: Thu Aug 29 02:09:40 UTC 2019 root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
Build Date: 21 September 2019 01:41:58AM
Current version of pixman: 0.38.4
        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
        (++) from command line, (!!) notice, (II) informational,
        (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Fri Sep 27 17:46:58 2019
List of video drivers: modesetting
(++) Using config file: "/root/xorg.conf.new"
(==) Using system config directory "/usr/local/share/X11/xorg.conf.d"
(EE)
(EE) Backtrace:
(EE) 0: /usr/local/bin/X (OsInit+0x37a) [0x5baf8a]
(EE) 1: /lib/libthr.so.3 (_pthread_sigmask+0x53e) [0x80260a17e]
(EE) 2: /lib/libthr.so.3 (_pthread_getspecific+0xdef) [0x802609f8f]
(EE) 3: ? (?+0xdef) [0x7ffffffffdf2]
(EE) 4: ? (?+0xdef) [0xdef]
(EE) 5: /usr/local/bin/X (InitOutput+0x128c) [0x482e5c]
(EE) 6: /usr/local/bin/X (remove_fs_handlers+0x3bb) [0x43c7cb]
(EE) 7: /usr/local/bin/X (_start+0x95) [0x425145]
(EE) 8: ? (?+0x95) [0x800845095]
(EE)
(EE) Segmentation fault at address 0x0
(EE)
Fatal server error:
(EE) Caught signal 11 (Segmentation fault). Server aborting
(EE)
(EE) Please consult the The X.Org Foundation support
         at http://wiki.x.org
 for help.
(EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
(EE)
(EE) Server terminated with error (1). Closing log file.

To my eyes that is uninformative. Instead of complaining about what it
cannot do, X falls on its face. What might be the problem(s)? Is there
anything else meaningful I could test to collect more info?
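[A sketch of stock FreeBSD diagnostics that could narrow this down; these commands are not from the original thread, and the exact output will vary by system:]

```
# Is the module actually loaded, and did the drm driver attach to the device?
kldstat | grep -i i915
dmesg | grep -iE 'drm|i915'

# Did the loader tunables take effect? (kenv shows loader.conf settings)
kenv kern.vt.fb.default_mode
kenv kern.vt.fb.modes.HDMI-A-1
```

If dmesg shows no drm attach messages at all, the tunables are irrelevant: the console is still running on the plain framebuffer handed over by the firmware, which would explain the fixed 160x64 text geometry.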
--jau


From owner-freebsd-amd64@freebsd.org Fri Sep 27 19:24:41 2019
Date: Fri, 27 Sep 2019 15:24:34 -0400
From: Mark Johnston
To: Mark Millard
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset
 -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Message-ID: <20190927192434.GA93180@raichu>

On Thu, Sep 26, 2019 at 08:37:39PM -0700, Mark Millard wrote:
>
>
> On 2019-Sep-26, at 17:05, Mark Millard wrote:
>
> > On 2019-Sep-26, at 13:29, Mark Johnston wrote:
> >> One possibility is that these are kernel memory allocations occurring in
> >> the context of the benchmark threads.  Such allocations may not respect
> >> the configured policy since they are not private to the allocating
> >> thread.  For instance, upon opening a file, the kernel may allocate a
> >> vnode structure for that file.  That vnode may be accessed by threads
> >> from many processes over its lifetime, and may be recycled many times
> >> before its memory is released back to the allocator.
> >
> > For -l0-15 -n prefer:1 :
> >
> > Looks like this reports sys_thr_new activity, sys_cpuset
> > activity, and 0xffffffff80bc09bd activity (whatever that
> > is). Mostly sys_thr_new activity, over 1300 of them . . .
> >
> > dtrace: pid 13553 has exited
> >
> > [the full stack listing is quoted upthread; elided here]
> >
> > kernel`vm_page_grab_pages+0x1b4
> > kernel`vm_thread_stack_create+0xc0
> > kernel`kstack_import+0x52
> > kernel`uma_zalloc_arg+0x62b
> > kernel`vm_thread_new+0x4d
> > kernel`thread_alloc+0x31
> > kernel`thread_create+0x13a
> > kernel`sys_thr_new+0xd2
> > kernel`amd64_syscall+0x3ae
> > kernel`0xffffffff811b7600
> > 1324
>
> With sys_thr_new not respecting -n prefer:1 for
> -l0-15 (especially for the thread stacks), I
> looked some at the generated integration kernel
> code and it makes significant use of %rsp based
> memory accesses (read and write).
>
> That would get both memory controllers going in
> parallel (kernel vectors accesses to the preferred
> memory domain), so not slowing down as expected.
>
> If round-robin is not respected for thread stacks,
> and if threads migrate cpus across memory domains
> at times, there could be considerable variability
> for that context as well. (This may not be the
> only way to have different/extra variability for
> this context.)
>
> Overall: I'd be surprised if this was not
> contributing to what I thought was odd about
> the benchmark results.

Your tracing refers to kernel thread stacks though, not the stacks used
by threads when executing in user mode.  My understanding is that a HINT
implementation would spend virtually all of its time in user mode, so it
shouldn't matter much or at all if kernel thread stacks are backed by
memory from the "wrong" domain.

This also doesn't really explain some of the disparities in the plots
you sent me.  For instance, you get a much higher peak QUIS on FreeBSD
than on Fedora with 16 threads and an interleave/round-robin domain
selection policy.
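[The probes used in this exchange can be kept as a small standalone D script. This is a sketch only: "./bench" below is a placeholder for the actual benchmark binary, and it assumes args[2] of vm_page_alloc_domain_after is the domain the page is allocated from, as the per-domain counts quoted in this thread suggest.]

```d
/*
 * domain-alloc.d: count page allocations per NUMA domain for a traced
 * process tree, and record kernel stacks for domain-0 allocations.
 *
 * Sketch only -- "./bench" is a placeholder for the real benchmark.
 * Example use (as root):
 *   dtrace -s domain-alloc.d -c "cpuset -l 0-15 -n prefer:1 ./bench"
 */

fbt::vm_page_alloc_domain_after:entry
/progenyof($target)/
{
	/* tally every page allocation by its domain argument */
	@perdomain[args[2]] = count();
}

fbt::vm_page_alloc_domain_after:entry
/progenyof($target) && args[2] == 0/
{
	/* capture who allocates from domain 0 despite -n prefer:1 */
	@domain0stacks[stack()] = count();
}
```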
From owner-freebsd-amd64@freebsd.org Fri Sep 27 20:53:06 2019
From: Mark Millard <marklmi@yahoo.com>
Message-Id: <08CA4DA1-131C-4B14-BB57-EAA22A8CD5D9@yahoo.com>
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset
 -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Date: Fri, 27 Sep 2019 13:52:58 -0700
In-Reply-To: <20190927192434.GA93180@raichu>
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
To: Mark Johnston

On 2019-Sep-27, at 12:24, Mark Johnston wrote:

> On Thu, Sep 26, 2019 at 08:37:39PM -0700, Mark Millard wrote:
>>
>>
>> On 2019-Sep-26, at 17:05, Mark Millard wrote:
>>
>>> On 2019-Sep-26, at 13:29, Mark Johnston wrote:
>>>> One possibility is that these are kernel memory allocations occurring in
>>>> the context of the benchmark threads.  Such allocations may not respect
>>>> the configured policy since they are not private to the allocating
>>>> thread.  For instance, upon opening a file, the kernel may allocate a
>>>> vnode structure for that file.  That vnode may be accessed by threads
>>>> from many processes over its lifetime, and may be recycled many times
>>>> before its memory is released back to the allocator.
>>>
>>> For -l0-15 -n prefer:1 :
>>>
>>> Looks like this reports sys_thr_new activity, sys_cpuset
>>> activity, and 0xffffffff80bc09bd activity (whatever that
>>> is). Mostly sys_thr_new activity, over 1300 of them . . .
>>>
>>> dtrace: pid 13553 has exited
>>>
>>> [the full stack listing is quoted twice upthread; elided here]
>>>
>>> kernel`vm_page_grab_pages+0x1b4
>>> kernel`vm_thread_stack_create+0xc0
>>> kernel`kstack_import+0x52
>>> kernel`uma_zalloc_arg+0x62b
>>> kernel`vm_thread_new+0x4d
>>> kernel`thread_alloc+0x31
>>> kernel`thread_create+0x13a
>>> kernel`sys_thr_new+0xd2
kernel`amd64_syscall+0x3ae
>>> kernel`0xffffffff811b7600
>>> 1324
>>
>> With sys_thr_new not respecting -n prefer:1 for
>> -l0-15 (especially for the thread stacks), I
>> looked some at the generated integration kernel
>> code and it makes significant use of %rsp based
>> memory accesses (read and write).
>>
>> That would get both memory controllers going in
>> parallel (kernel vectors accesses to the preferred
>> memory domain), so not slowing down as expected.
>>
>> If round-robin is not respected for thread stacks,
>> and if threads migrate cpus across memory domains
>> at times, there could be considerable variability
>> for that context as well. (This may not be the
>> only way to have different/extra variability for
>> this context.)
>>
>> Overall: I'd be surprised if this was not
>> contributing to what I thought was odd about
>> the benchmark results.
>
> Your tracing refers to kernel thread stacks though, not the stacks used
> by threads when executing in user mode.  My understanding is that a HINT
> implementation would spend virtually all of its time in user mode, so it
> shouldn't matter much or at all if kernel thread stacks are backed by
> memory from the "wrong" domain.

Looks like I was trying to think about it when I should have been
sleeping. You are correct.

> This also doesn't really explain some of the disparities in the plots
> you sent me.  For instance, you get a much higher peak QUIS on FreeBSD
> than on Fedora with 16 threads and an interleave/round-robin domain
> selection policy.

True. I suppose there is the possibility that steady_clock's now()
results are odd for some reason in this type of context, leading to
the durations between calls being on the short side, where things look
different.

But the left-hand side of the single-thread results (smaller memory
sizes for the vectors the integration kernel uses) does not show such
a rescaling.
(The single-thread time measurements are strictly inside the thread of
execution; no thread creation or the like is counted for any size.)
The right-hand side of the single-thread results (larger memory use,
making the smaller cache levels fairly ineffective) does generally show
some rescaling, but not as drastic as multi-threaded. Both round-robin
and prefer:1 showed this for single-threaded runs.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)


From owner-freebsd-amd64@freebsd.org Fri Sep 27 22:22:22 2019
From: Mark Millard
Date: Fri, 27 Sep 2019 15:22:16 -0700
To: Mark Johnston
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
On 2019-Sep-27, at 13:52, Mark Millard wrote:

> On 2019-Sep-27, at 12:24, Mark Johnston wrote:
>
>> On Thu, Sep 26, 2019 at 08:37:39PM -0700, Mark Millard wrote:
>>>
>>> On 2019-Sep-26, at 17:05, Mark Millard wrote:
>>>
>>>> On 2019-Sep-26, at 13:29, Mark Johnston wrote:
>>>>
>>>>> One possibility is that these are kernel memory allocations occurring in
>>>>> the context of the benchmark threads.  Such allocations may not respect
>>>>> the configured policy since they are not private to the allocating
>>>>> thread.  For instance, upon opening a file, the kernel may allocate a
>>>>> vnode structure for that file.  That vnode may be accessed by threads
>>>>> from many processes over its lifetime, and may be recycled many times
>>>>> before its memory is released back to the allocator.
>>>>
>>>> For -l0-15 -n prefer:1 :
>>>>
>>>> Looks like this reports sys_thr_new activity, sys_cpuset
>>>> activity, and 0xffffffff80bc09bd activity (whatever that
>>>> is). Mostly sys_thr_new activity, over 1300 of them . . .
>>>>
>>>> dtrace: pid 13553 has exited
>>>>
>>>>         kernel`uma_small_alloc+0x61
>>>>         kernel`keg_alloc_slab+0x10b
>>>>         kernel`zone_import+0x1d2
>>>>         kernel`uma_zalloc_arg+0x62b
>>>>         kernel`thread_init+0x22
>>>>         kernel`keg_alloc_slab+0x259
>>>>         kernel`zone_import+0x1d2
>>>>         kernel`uma_zalloc_arg+0x62b
>>>>         kernel`thread_alloc+0x23
>>>>         kernel`thread_create+0x13a
>>>>         kernel`sys_thr_new+0xd2
>>>>         kernel`amd64_syscall+0x3ae
>>>>         kernel`0xffffffff811b7600
>>>>           2
>>>>
>>>>         kernel`uma_small_alloc+0x61
>>>>         kernel`keg_alloc_slab+0x10b
>>>>         kernel`zone_import+0x1d2
>>>>         kernel`uma_zalloc_arg+0x62b
>>>>         kernel`cpuset_setproc+0x65
>>>>         kernel`sys_cpuset+0x123
>>>>         kernel`amd64_syscall+0x3ae
>>>>         kernel`0xffffffff811b7600
>>>>           2
>>>>
>>>>         kernel`uma_small_alloc+0x61
>>>>         kernel`keg_alloc_slab+0x10b
>>>>         kernel`zone_import+0x1d2
>>>>         kernel`uma_zalloc_arg+0x62b
>>>>         kernel`uma_zfree_arg+0x36a
>>>>         kernel`thread_reap+0x106
>>>>         kernel`thread_alloc+0xf
>>>>         kernel`thread_create+0x13a
>>>>         kernel`sys_thr_new+0xd2
>>>>         kernel`amd64_syscall+0x3ae
>>>>         kernel`0xffffffff811b7600
>>>>           6
>>>>
>>>>         kernel`uma_small_alloc+0x61
>>>>         kernel`keg_alloc_slab+0x10b
>>>>         kernel`zone_import+0x1d2
>>>>         kernel`uma_zalloc_arg+0x62b
>>>>         kernel`uma_zfree_arg+0x36a
>>>>         kernel`vm_map_process_deferred+0x8c
>>>>         kernel`vm_map_remove+0x11d
>>>>         kernel`vmspace_exit+0xd3
>>>>         kernel`exit1+0x5a9
>>>>         kernel`0xffffffff80bc09bd
>>>>         kernel`amd64_syscall+0x3ae
>>>>         kernel`0xffffffff811b7600
>>>>           6
>>>>
>>>>         kernel`uma_small_alloc+0x61
>>>>         kernel`keg_alloc_slab+0x10b
>>>>         kernel`zone_import+0x1d2
>>>>         kernel`uma_zalloc_arg+0x62b
>>>>         kernel`thread_alloc+0x23
>>>>         kernel`thread_create+0x13a
>>>>         kernel`sys_thr_new+0xd2
>>>>         kernel`amd64_syscall+0x3ae
>>>>         kernel`0xffffffff811b7600
>>>>          22
>>>>
>>>>         kernel`vm_page_grab_pages+0x1b4
>>>>         kernel`vm_thread_stack_create+0xc0
>>>>         kernel`kstack_import+0x52
>>>>         kernel`uma_zalloc_arg+0x62b
>>>>         kernel`vm_thread_new+0x4d
>>>>         kernel`thread_alloc+0x31
>>>>         kernel`thread_create+0x13a
>>>>         kernel`sys_thr_new+0xd2
>>>>         kernel`amd64_syscall+0x3ae
>>>>         kernel`0xffffffff811b7600
>>>>        1324
>>>
>>> With sys_thr_new not respecting -n prefer:1 for
>>> -l0-15 (especially for the thread stacks), I
>>> looked some at the generated integration-kernel
>>> code, and it makes significant use of %rsp-based
>>> memory accesses (read and write).
>>>
>>> That would get both memory controllers going in
>>> parallel (the integration kernel's vector accesses
>>> go to the preferred memory domain), so not slowing
>>> down as expected.
>>>
>>> If round-robin is not respected for thread stacks,
>>> and if threads migrate CPUs across memory domains
>>> at times, there could be considerable variability
>>> for that context as well. (This may not be the
>>> only way to have different/extra variability for
>>> this context.)
>>>
>>> Overall: I'd be surprised if this was not
>>> contributing to what I thought was odd about
>>> the benchmark results.
>>
>> Your tracing refers to kernel thread stacks though, not the stacks used
>> by threads when executing in user mode.  My understanding is that a HINT
>> implementation would spend virtually all of its time in user mode, so it
>> shouldn't matter much or at all if kernel thread stacks are backed by
>> memory from the "wrong" domain.
>
> Looks like I was trying to think about it when I should have been
> sleeping. You are correct.
>
>> This also doesn't really explain some of the disparities in the plots
>> you sent me.  For instance, you get a much higher peak QUIPS on FreeBSD
>> than on Fedora with 16 threads and an interleave/round-robin domain
>> selection policy.
>
> True. I suppose that there is the possibility that steady_clock's now()
> results are odd for some reason for this type of context, leading to the
> durations between such calls being on the short side where things look
> different.
>
> But the left-hand side of the single-thread results (smaller memory sizes
> for the vectors for the integration kernel's use) does not show such a
> rescaling. (The single-thread time measurements are strictly inside the
> thread of execution; no thread creation or the like is counted for any
> size.) The right-hand side of the single-thread results (larger memory
> use, making the smaller cache levels fairly ineffective) does generally
> show some rescaling, but not as drastic as multi-threaded.
>
> Both round-robin and prefer:1 showed such for single-threaded.

Just to be explicit about what would be executed in the FreeBSD
kernel . . .

One difference between single-threaded vs. multi-threaded for the
benchmark code is that the multi-threaded case calls steady_clock's
now() from the main thread, counting the time that thread creations
contribute. The single-threaded case calls steady_clock's now() from
inside the same thread that executes the integration kernel, not
counting thread creation.

steady_clock's now() uses system calls requesting CLOCK_MONOTONIC,
from what I've seen with truss.

This would be code involved from the FreeBSD kernel that could
contribute some to the measured time.

Having the kernel stack for this on the memory domain where the
time-measuring CPU is, vs. on a remote memory domain, might make some
difference in duration results. (But I've no clue specifically what to
expect for the differences in my context, so it may well not explain
much of anything.)
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)

From: Mark Millard
Date: Sat, 28 Sep 2019 11:34:15 -0700
To: Mark Johnston
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
On 2019-Sep-27, at 15:22, Mark Millard wrote:

> Just to be explicit about what would be executed in the FreeBSD
> kernel . . .
>
> One difference between single-threaded vs. multi-threaded for the
> benchmark code is that the multi-threaded case calls steady_clock's
> now() from the main thread, counting the time that thread creations
> contribute. The single-threaded case calls steady_clock's now() from
> inside the same thread that executes the integration kernel, not
> counting thread creation.
>
> steady_clock's now() uses system calls requesting CLOCK_MONOTONIC,
> from what I've seen with truss.
>
> This would be code involved from the FreeBSD kernel that could
> contribute some to the measured time.
>
> Having the kernel stack for this on the memory domain where the
> time-measuring CPU is, vs. on a remote memory domain, might make some
> difference in duration results. (But I've no clue specifically what to
> expect for the differences in my context, so it may well not explain
> much of anything.)

In case anyone else is following along: gradually exploring different
contexts is isolating the plot characteristics. The scale difference
vs. Fedora 30 seems to always exist.
But it turns out that the messy right-hand side of the plots (widely
variable QUality Improvement Per Second figures, compared to the expected
structure for QUIPS results) for prefer:N is specific to prefer:1, for
example. (Only 2 memory domains are available in my testing context.)

I've sent Mark Johnston 3 more plots because (not in order of discovery):

A) I discovered that a non-NUMA kernel does not show the variability
   issue for either -l0-15 or -l16-31 for cpuset: both get fairly clean
   results, showing a clear difference between local vs. remote memory
   being involved as well.

B) For the NUMA kernel, prefer:0 is like (A) above: again not widely
   variable. This is unlike the prefer:1 result. So prefer:0 and
   prefer:1 are not near being symmetric (swapping -l0-15 vs. -l16-31
   status as well).

C) The non-NUMA kernel context without CPU restrictions is messy on the
   right-hand side of the plot, like the round-robin results were. Both
   this and round-robin have a subset of the CPU activity that is
   analogous to prefer:1 above, so this may not be surprising, given the
   prefer:1 results.

So, for now, the primary question is why prefer:0 vs. prefer:1 is not
(nearly) symmetric in the benchmark results on the right-hand side of
the plots. ("prefer" with CPU restrictions provides a means of
controlling the behavior and seeing a comparison/contrast.)

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)