Date:      Fri, 10 Oct 2014 16:01:31 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Adrian Chadd <adrian@freebsd.org>, freebsd-arch@freebsd.org
Subject:   Re: [rfc] enumerating device / bus domain information
Message-ID:  <4090343.RYS6GcFkXt@ralph.baldwin.cx>
In-Reply-To: <20141010180700.GS2153@kib.kiev.ua>
References:  <CAJ-VmokF7Ey0fxaQ7EMBJpCbgFnyOteiL2497Z4AFovc+QRkTA@mail.gmail.com> <4435143.bthBSP8NlX@ralph.baldwin.cx> <20141010180700.GS2153@kib.kiev.ua>

On Friday, October 10, 2014 09:07:00 PM Konstantin Belousov wrote:
> On Fri, Oct 10, 2014 at 11:14:50AM -0400, John Baldwin wrote:
> > Even x86 already has a notion of multiple layers of cost.  You can see
> > that today on a 4-socket Intel system.  It seems you may also see it on
> > a dual-socket Haswell system with more than 8 cores per package (due to
> > the funky split-brain thing on higher core count Haswells).  I believe
> > AMD also ships CPUs that contain 2 NUMA domains within a single
> > physical package.
> > 
> > Note that the I/O side has become far more urgent on x86 in the past
> > few years.  On Nehalem/Westmere, whether I/O was remote or local didn't
> > seem to matter very much (in my experience you could only measure very
> > small differences in latency or throughput between the two scenarios).
> > On Romley (Sandy Bridge) and later it can make a very substantial
> > difference in both latency and throughput.
> 
> This nicely reinforces my note about the unsuitability of an interface
> that returns a single VM domain for a given device.  I think it is more
> correct to return a bitset of the 'close enough' VM domains, where the
> proximity criterion is either specified explicitly by the caller (e.g.
> "belongs to", "closer than two domains", etc.) or the function just
> always returns the best bitset.  That would solve both the split
> proximity domains issue and the multi-uplink south bridge issue.
>
> It might also make sense to add an additional object layer, the HW
> proximity domain, which contains some set of VM domains, and have the
> function return such a HW proximity domain.
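
For concreteness, a purely hypothetical sketch of that sort of query might
look like the following.  None of these names exist in the tree today; the
domain set type, the size constant, and the distance argument are just
assumptions to make the idea concrete:

    /*
     * Hypothetical only: a query that returns the set of "close enough"
     * VM domains for a device instead of a single domain.
     */
    #include <sys/param.h>
    #include <sys/bitset.h>
    #include <sys/bus.h>

    #define VM_DOMAINSET_SIZE   8   /* assumed upper bound on VM domains */
    BITSET_DEFINE(vm_domainset, VM_DOMAINSET_SIZE);

    /*
     * Fill '*set' with every VM domain no farther from 'dev' than
     * 'max_distance' (SLIT-style units), or with the best set of
     * domains when 'max_distance' is -1.
     */
    int bus_get_near_domains(device_t dev, int max_distance,
            struct vm_domainset *set);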

I know Jeff has some sort of structure he wants to use for describing NUMA
policies.  Perhaps that is something that can be reused.  However, we
probably need to be further down the road to see what we actually need as
our final interface here.  In particular, I suspect we will have an
orthogonal set of APIs to deal with CPU locality (i.e., give me a cpuset of
all CPUs in domain X or close to domain X, etc.).  To the extent that there
are requests that are not bus-specific, I'd rather have drivers use those
than have everything go through new-bus.  (So that, for example, a
multiqueue NIC driver could bind its queues to CPUs belonging to the same
NUMA domain it is in, rather than always using CPUs 0..N, which is what all
the Intel drivers do currently.  Variations of this could also allow more
intelligent requests like "give me all CPUs close to N that are suitable
for interrupts", which might include only one SMT thread per core.)
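
As a rough sketch of what a driver-side consumer of that could look like:
bus_get_cpus(), LOCAL_CPUS, and the foo_softc layout below are assumptions
made up for illustration, not existing interfaces, while bus_bind_intr(),
the CPU_* macros, and all_cpus are what we already have today.

    /*
     * Sketch only.  bus_get_cpus() is a hypothetical "give me the CPUs
     * local to this device" query; struct foo_softc is a made-up driver.
     */
    #include <sys/param.h>
    #include <sys/bus.h>
    #include <sys/cpuset.h>
    #include <sys/smp.h>            /* all_cpus */

    struct foo_queue {
        struct resource *irq_res;   /* queue's allocated MSI-X interrupt */
    };

    struct foo_softc {
        struct foo_queue queues[8]; /* hypothetical per-queue state */
        int num_queues;
    };

    static void
    foo_bind_queue_interrupts(device_t dev, struct foo_softc *sc)
    {
        cpuset_t set;
        int cpu, i;

        /* Hypothetical query: CPUs considered local to this device. */
        if (bus_get_cpus(dev, LOCAL_CPUS, sizeof(set), &set) != 0 ||
            CPU_EMPTY(&set))
            CPU_COPY(&all_cpus, &set);  /* fall back to all CPUs */

        /* Spread the queue interrupts across the local CPUs only. */
        cpu = CPU_FFS(&set) - 1;
        for (i = 0; i < sc->num_queues; i++) {
            bus_bind_intr(dev, sc->queues[i].irq_res, cpu);
            do {
                cpu = (cpu + 1) % MAXCPU;
            } while (!CPU_ISSET(cpu, &set));
        }
    }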

Also, this is orthogonal to overloading the term "VM domain" to mean
something that is a subset of a given NUMA domain.  Regardless, I think it
probably makes sense to use a different term to describe more finely
grained partitions of NUMA domains.

-- 
John Baldwin


