Date:      Fri, 10 Oct 2014 21:07:00 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        Adrian Chadd <adrian@freebsd.org>, freebsd-arch@freebsd.org
Subject:   Re: [rfc] enumerating device / bus domain information
Message-ID:  <20141010180700.GS2153@kib.kiev.ua>
In-Reply-To: <4435143.bthBSP8NlX@ralph.baldwin.cx>
References:  <CAJ-VmokF7Ey0fxaQ7EMBJpCbgFnyOteiL2497Z4AFovc+QRkTA@mail.gmail.com> <CAJ-VmonbGW1JbEiKXJ0sQCFr0+CRphVrSuBhFnh1gq6-X1CFdQ@mail.gmail.com> <838B58B2-22D6-4AA4-97D5-62E87101F234@bsdimp.com> <4435143.bthBSP8NlX@ralph.baldwin.cx>

On Fri, Oct 10, 2014 at 11:14:50AM -0400, John Baldwin wrote:
> On Thursday, October 09, 2014 09:53:52 PM Warner Losh wrote:
> > On Oct 8, 2014, at 5:12 PM, Adrian Chadd <adrian@FreeBSD.org> wrote:
> > > On 8 October 2014 12:07, Warner Losh <imp@bsdimp.com> wrote:
> > >> On Oct 7, 2014, at 7:37 PM, Adrian Chadd <adrian@FreeBSD.org> wrote:
> > >>> Hi,
> > >>> 
> > >>> Right now we're not enumerating any NUMA domain information about
> > >>> devices.
> > >>> 
> > >>> The more recent Intel NUMA stuff has some extra affinity information
> > >>> for devices that (eventually) will allow us to bind kernel/user
> > >>> threads and/or memory allocation to devices to keep access local.
> > >>> There's a penalty for DMAing in/out of remote memory, so we'll want to
> > >>> figure out what counts as "Local" for memory allocation and perhaps
> > >>> constrain the CPU set that worker threads for a device run on.
> > >>> 
> > >>> This patch adds a few things:
> > >>> 
> > >>> * it adds a bus_if.m method for fetching the VM domain ID of a given
> > >>> device; or ENOENT if it's not in a VM domain;
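
For concreteness, a rough caller-side sketch of how a driver could consume
such a method (the bus_get_domain() name and the ENOENT convention are
assumed from the description above, not taken from the actual patch):

	#include <sys/param.h>
	#include <sys/errno.h>
	#include <sys/bus.h>

	/*
	 * Ask the parent bus which VM domain the device is attached to.
	 * A return of -1 here stands for "no usable domain information".
	 */
	static int
	mydrv_find_domain(device_t dev)
	{
		int domain, error;

		/* Assumed accessor wrapping the proposed bus_if.m method. */
		error = bus_get_domain(dev, &domain);
		if (error != 0)
			return (-1);	/* ENOENT: not in a VM domain. */
		return (domain);
	}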
> > >> 
> > >> Maybe a default VM domain. All devices are in VM domains :) By default
> > >> today, we have only one VM domain, and that's the model that most of the
> > >> code expects…
> > > 
> > > Right, and that doesn't change until you compile in with num domains > 1.
> > 
> > The first part of the statement doesn't change when the number of domains
> > is more than one. All devices are in a VM domain.
> > 
> > > Then, CPUs and memory have VM domains, but devices may or may not have
> > > a VM domain. There's no "default" VM domain defined if num domains >
> > > 1.
> > 
> > Please explain how a device cannot have a VM domain? In the
> > terminology I'm familiar with, to even get cycles to the device, you have to
> > have a memory address (or an I/O port). That memory address has to
> > necessarily map to some domain, even if that domain is equally sucky to get
> > to from all CPUs (as is the case with I/O ports). While there may not be a
> > "default" domain, by virtue of its physical location it has to have one.
> > 
> > > The devices themselves don't know about VM domains right now, so
> > > there's nothing constraining things like IRQ routing, CPU set, memory
> > > allocation, etc. The Isilon team is working on extending the cpuset
> > > and allocators to "know" about NUMA and I'm sure this stuff will fall
> > > out of whatever they're working on.
> > 
> > Why would the device need to know the domain? Why aren't the IRQs,
> > for example, steered to the appropriate CPU? Why doesn't the bus handle
> > allocating memory for it in the appropriate place? How does this "domain"
> > tie into memory allocation and thread creation?
> 
> Because that's not what you always want (though it often is).  However,
> another reason is that system administrators want to know which domain
> devices are close to.  You can sort of figure it out from devinfo on a modern
> x86 machine if you squint right, but it isn't super obvious.  I have a followup
> patch that adds a new per-device '%domain' sysctl node so that it is
> easier to see which domain a device is close to.  In real-world experience
> this can be useful as it lets a sysadmin/developer know which CPUs to
> schedule processes on.  (Note that it doesn't always mean you put them
> close to the device.  Sometimes you have processes that are more important 
> than others, so you tie those close to the NIC and shove the other ones over 
> to the "wrong" domain because you don't care if they have higher latency.)
> 
> > > So when I go to add sysctl and other tree knowledge for device -> VM
> > > domain mapping I'm going to make them return -1 for "no domain."
> > 
> > Seems like there's too many things lumped together here. First off, how
> > can there be no domain? That just hurts my brain. It has to be in some
> > domain, or it can't be seen. Maybe this domain is one that sucks for
> > everybody to access, maybe it is one that's fast for some CPU or package of
> > CPUs to access, but it has to have a domain.
> 
> They are not always tied to a single NUMA domain.  On some dual-socket 
> Nehalem/Westmere class machines with per-CPU memory controllers (so 2 NUMA 
> domains) you will have a single I/O hub that is directly connected to both 
> CPUs.  Thus, all memory in the system is equi-distant for I/O (but not for CPU 
> access).
> 
> The other problem is that you simply may not know.  Not all BIOSes correctly 
> communicate this information for devices.  For example, certain 1U Romley 
> servers I have worked with properly enumerate CPU <--> memory relationships in 
> the SRAT table, but they fail to include the necessary _PXM method in the top-
> level PCI bus devices (that correspond to the I/O hub).  In that case, 
> returning a domain of 0 may very well be wrong.  (In fact, for these 
> particular machines it mostly _is_ wrong as the expansion slots are all tied 
> to NUMA domain 1, not 0.)
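
A hedged sketch of the lookup being described, to make the failure mode
concrete (acpi_find_pxm() is an illustrative name, not an existing KPI, and
the 1:1 proximity-domain-to-VM-domain mapping is an assumption):

	#include <sys/param.h>
	#include <sys/errno.h>
	#include <sys/bus.h>

	#include <contrib/dev/acpica/include/acpi.h>
	#include <dev/acpica/acpivar.h>

	static int
	acpi_find_pxm(device_t dev, int *domain)
	{
		ACPI_HANDLE h;
		int pxm;

		h = acpi_get_handle(dev);
		if (h == NULL)
			return (ENOENT);
		/* Missing _PXM means "unknown", not "domain 0". */
		if (ACPI_FAILURE(acpi_GetInteger(h, "_PXM", &pxm)))
			return (ENOENT);
		*domain = pxm;
		return (0);
	}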
> 
> > > (Things will get pretty hilarious later on if we have devices that are
> > > "local" to two or more VM domains ..)
> > 
> > Well, devices aren't local to domains, per se. Devices can communicate with
> > other components in a system at a given cost. One NUMA model is "near" vs
> > "far", where a single near domain exists and all the "far" resources are
> > quite costly. Other NUMA models may have a wider range of costs so that
> > some resources are cheap, others are a little less cheap, while others are
> > downright expensive depending on how far across the fabric of
> > interconnects the messages need to travel. While one can model this as a
> > full 1-1 partitioning, that doesn't match all of the extant
> > implementations, even today. It is easy, but an imperfect match to the
> > underlying realities in many cases (though a very good match to x86, which
> > is mostly what we care about).
> 
> Even x86 already has a notion of multiple layers of cost.  You can get that 
> today if you buy a 4 socket Intel system.  It seems you might also get that if 
> you get a dual socket Haswell system with more than 8 cores per package (due 
> to the funky split-brain thing on higher core count Haswells).  I believe AMD 
> also ships CPUs that contain 2 NUMA domains within a single physical package.
> 
> Note that the I/O thing has become far more urgent in the past few years on 
> x86.  With Nehalem/Westmere, having I/O be remote or local didn't seem to 
> matter very much (you could only measure very small differences in latency or 
> throughput between the two scenarios in my experience).  On Romley (Sandy 
> Bridge) and later it can be a very substantial difference in terms of both 
> latency and throughput.

This nicely augments my note about the unsuitability of an interface that
returns a single VM domain for a given device.  I think it would be more
correct to return a bitset of the 'close enough' VM domains, where the
proximity criterion is either specified explicitly by the caller (e.g.
belongs to, closer than two domains, etc.) or the best bitset is always
returned.  That would solve both the split proximity domain issue and the
multi-uplink south bridge issue.
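
For illustration, a rough sketch of such an interface (every name here is
invented; a plain bitmask is used, assuming the domain count stays small):

	#include <sys/param.h>
	#include <sys/errno.h>
	#include <sys/bus.h>

	typedef uint32_t vm_domainmask_t;	/* one bit per VM domain */

	/* Hypothetical accessor: fill *mask with the domains local to dev. */
	int	bus_get_domain_mask(device_t dev, vm_domainmask_t *mask);

	static int
	pick_local_domain(device_t dev)
	{
		vm_domainmask_t mask;
		int d;

		if (bus_get_domain_mask(dev, &mask) != 0 || mask == 0)
			return (-1);		/* proximity unknown */
		for (d = 0; d < 32; d++)
			if (mask & (1u << d))
				return (d);	/* first "close enough" domain */
		return (-1);
	}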

It might also make sense to add an additional object layer for the HW
proximity domain, which would contain some set of VM domains, and have the
function return such a HW proximity domain.
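
Something along these lines (purely illustrative, none of these names exist
in the tree): a proximity object groups the VM domains reachable at the same
cost, and devices point at the object rather than at a single domain.

	#include <sys/param.h>
	#include <sys/queue.h>

	struct hw_proximity_domain {
		int		hpd_id;		/* proximity domain id */
		uint32_t	hpd_vm_domains;	/* bitmask of member VM domains */
		SLIST_ENTRY(hw_proximity_domain) hpd_link;
	};

	SLIST_HEAD(hw_proximity_domain_list, hw_proximity_domain);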




