From: John Baldwin
To: freebsd-arch@freebsd.org
Cc: Adrian Chadd
Subject: Re: [rfc] enumerating device / bus domain information
Date: Fri, 10 Oct 2014 11:14:50 -0400
Message-ID: <4435143.bthBSP8NlX@ralph.baldwin.cx>
In-Reply-To: <838B58B2-22D6-4AA4-97D5-62E87101F234@bsdimp.com>
References: <838B58B2-22D6-4AA4-97D5-62E87101F234@bsdimp.com>

On Thursday, October 09, 2014 09:53:52 PM Warner Losh wrote:
> On Oct 8, 2014, at 5:12 PM, Adrian Chadd wrote:
> > On 8 October 2014 12:07, Warner Losh wrote:
> >> On Oct 7, 2014, at 7:37 PM, Adrian Chadd wrote:
> >>> Hi,
> >>>
> >>> Right now we're not enumerating any NUMA domain information about
> >>> devices.
> >>>
> >>> The more recent Intel NUMA stuff has some extra affinity
> >>> information for devices that (eventually) will allow us to bind
> >>> kernel/user threads and/or memory allocation to devices to keep
> >>> access local. There's a penalty for DMAing in/out of remote
> >>> memory, so we'll want to figure out what counts as "Local" for
> >>> memory allocation and perhaps constrain the CPU set that worker
> >>> threads for a device run on.
> >>>
> >>> This patch adds a few things:
> >>>
> >>> * it adds a bus_if.m method for fetching the VM domain ID of a
> >>>   given device, or ENOENT if it's not in a VM domain;
> >>
> >> Maybe a default VM domain. All devices are in VM domains :) By
> >> default today, we have only one VM domain, and that's the model
> >> that most of the code expects...
> >
> > Right, and that doesn't change until you compile in with num domains > 1.
>
> The first part of the statement doesn't change when the number of
> domains is more than one. All devices are in a VM domain.
>
> > Then, CPUs and memory have VM domains, but devices may or may not
> > have a VM domain. There's no "default" VM domain defined if num
> > domains > 1.
>
> Please explain how a device cannot have a VM domain. For the
> terminology I'm familiar with, to even get cycles to the device, you
> have to have a memory address (or an I/O port). That memory address
> has to necessarily map to some domain, even if that domain is equally
> sucky to get to from all CPUs (as is the case with I/O ports). While
> there may not be a "default" domain, by virtue of its physical
> location it has to have one.
>
> > The devices themselves don't know about VM domains right now, so
> > there's nothing constraining things like IRQ routing, CPU set,
> > memory allocation, etc. The Isilon team is working on extending the
> > cpuset and allocators to "know" about NUMA and I'm sure this stuff
> > will fall out of whatever they're working on.
>
> Why would the device need to know the domain? Why aren't the IRQs,
> for example, steered to the appropriate CPU? Why doesn't the bus
> handle allocating memory for it in the appropriate place? How does
> this "domain" tie into memory allocation and thread creation?

Because that's not what you always want (though it often is). However,
another reason is that system administrators want to know which domain
their devices are close to. You can sort of figure it out from devinfo
on a modern x86 machine if you squint right, but it isn't super
obvious. I have a followup patch that adds a new per-device '%domain'
sysctl node so that it is easier to see which domain a device is close
to. In real-world experience this can be useful as it lets a
sysadmin/developer know which CPUs to schedule processes on. (Note
that it doesn't always mean you put them close to the device.
Sometimes you have processes that are more important than others, so
you tie those close to the NIC and shove the other ones over to the
"wrong" domain because you don't care if they have higher latency.)
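To make that concrete, here is a small userland sketch of the sort of
thing this enables: read the proposed '%domain' node for a device and
pin the current process onto CPUs near it. This is illustration only,
not part of the patch: the "dev.ix.0.%domain" name and the assumption
that domain 1 covers CPUs 8-15 are made up for the example; the real
domain-to-CPU mapping would have to come from somewhere else (cpuset /
SRAT information).

#include <sys/types.h>
#include <sys/sysctl.h>
#include <sys/cpuset.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
	cpuset_t mask;
	size_t len;
	int domain, i;

	/* Ask the (proposed) per-device node which domain ix0 sits in. */
	len = sizeof(domain);
	if (sysctlbyname("dev.ix.0.%domain", &domain, &len, NULL, 0) != 0)
		err(1, "dev.ix.0.%%domain");
	printf("ix0 is close to VM domain %d\n", domain);

	/*
	 * Pretend we already know that domain 1 covers CPUs 8-15
	 * (made up for this example) and bind this process there.
	 */
	CPU_ZERO(&mask);
	for (i = 8; i < 16; i++)
		CPU_SET(i, &mask);
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
	    sizeof(mask), &mask) != 0)
		err(1, "cpuset_setaffinity");
	return (0);
}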
> > So when I go to add sysctl and other tree knowledge for device ->
> > vm domain mapping I'm going to make them return -1 for "no domain."
>
> Seems like there's too many things lumped together here. First off,
> how can there be no domain? That just hurts my brain. It has to be in
> some domain, or it can't be seen. Maybe this domain is one that sucks
> for everybody to access, maybe it is one that's fast for some CPU or
> package of CPUs to access, but it has to have a domain.

They are not always tied to a single NUMA domain. On some dual-socket
Nehalem/Westmere class machines with per-CPU memory controllers (so 2
NUMA domains) you will have a single I/O hub that is directly connected
to both CPUs. Thus, all memory in the system is equidistant for I/O
(but not for CPU access).

The other problem is that you simply may not know. Not all BIOSes
correctly communicate this information for devices. For example,
certain 1U Romley servers I have worked with properly enumerate
CPU <--> memory relationships in the SRAT table, but they fail to
include the necessary _PXM method in the top-level PCI bus devices
(that correspond to the I/O hub). In that case, returning a domain of
0 may very well be wrong. (In fact, for these particular machines it
mostly _is_ wrong, as the expansion slots are all tied to NUMA domain
1, not 0.)
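To make the ENOENT case concrete, here is a rough sketch (not the
actual patch) of how an ACPI-attached PCI host bridge could implement
the proposed bus method: evaluate _PXM and translate the proximity id
into a VM domain, and report ENOENT rather than guessing domain 0 when
the BIOS left _PXM out. The method name and pxm_to_vm_domain() are
placeholders; the SRAT lookup itself is hand-waved here.

/* Hypothetical sketch only; nothing below is in the tree. */
#include <sys/param.h>
#include <sys/bus.h>
#include <sys/errno.h>

#include <contrib/dev/acpica/include/acpi.h>
#include <dev/acpica/acpivar.h>

/* Placeholder for whatever SRAT-based lookup we end up with. */
static int	pxm_to_vm_domain(int pxm);

static int
acpi_pcib_get_domain(device_t dev, device_t child, int *domain)
{
	ACPI_HANDLE handle;
	int d, pxm;

	/* No handle or no _PXM: the firmware didn't say, so don't guess. */
	handle = acpi_get_handle(dev);
	if (handle == NULL ||
	    ACPI_FAILURE(acpi_GetInteger(handle, "_PXM", &pxm)))
		return (ENOENT);

	d = pxm_to_vm_domain(pxm);
	if (d < 0)
		return (ENOENT);
	*domain = d;
	return (0);
}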
> > (Things will get pretty hilarious later on if we have devices that
> > are "local" to two or more VM domains ..)
>
> Well, devices aren't local to domains, per se. Devices can
> communicate with other components in a system at a given cost. One
> NUMA model is "near" vs "far", where a single near domain exists and
> all the "far" resources are quite costly. Other NUMA models may have
> a wider range of costs, so that some resources are cheap, others are
> a little less cheap, while others are downright expensive depending
> on how far across the fabric of interconnects the messages need to
> travel. While one can model this as a full 1-1 partitioning, that
> doesn't match all of the extant implementations, even today. It is
> easy, but an imperfect match to the underlying realities in many
> cases (though a very good match to x86, which is mostly what we care
> about).

Even x86 already has a notion of multiple layers of cost. You can get
that today if you buy a 4-socket Intel system. It seems you might also
get that if you get a dual-socket Haswell system with more than 8
cores per package (due to the funky split-brain thing on higher core
count Haswells). I believe AMD also ships CPUs that contain 2 NUMA
domains within a single physical package.

Note that the I/O locality issue has become far more pressing over the
past few years on x86. With Nehalem/Westmere, having I/O be remote or
local didn't seem to matter very much (you could only measure very
small differences in latency or throughput between the two scenarios
in my experience). On Romley (Sandy Bridge) and later it can be a very
substantial difference in terms of both latency and throughput.

-- 
John Baldwin