Date:      Fri, 10 Oct 2014 11:14:50 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        freebsd-arch@freebsd.org
Cc:        Adrian Chadd <adrian@freebsd.org>
Subject:   Re: [rfc] enumerating device / bus domain information
Message-ID:  <4435143.bthBSP8NlX@ralph.baldwin.cx>
In-Reply-To: <838B58B2-22D6-4AA4-97D5-62E87101F234@bsdimp.com>
References:  <CAJ-VmokF7Ey0fxaQ7EMBJpCbgFnyOteiL2497Z4AFovc+QRkTA@mail.gmail.com> <CAJ-VmonbGW1JbEiKXJ0sQCFr0+CRphVrSuBhFnh1gq6-X1CFdQ@mail.gmail.com> <838B58B2-22D6-4AA4-97D5-62E87101F234@bsdimp.com>

On Thursday, October 09, 2014 09:53:52 PM Warner Losh wrote:
> On Oct 8, 2014, at 5:12 PM, Adrian Chadd <adrian@FreeBSD.org> wrote:
> > On 8 October 2014 12:07, Warner Losh <imp@bsdimp.com> wrote:
> >> On Oct 7, 2014, at 7:37 PM, Adrian Chadd <adrian@FreeBSD.org> wrote:
> >>> Hi,
> >>>
> >>> Right now we're not enumerating any NUMA domain information about
> >>> devices.
> >>>
> >>> The more recent intel NUMA stuff has some extra affinity information
> >>> for devices that (eventually) will allow us to bind kernel/user
> >>> threads and/or memory allocation to devices to keep access local.
> >>> There's a penalty for DMAing in/out of remote memory, so we'll want to
> >>> figure out what counts as "Local" for memory allocation and perhaps
> >>> constrain the CPU set that worker threads for a device run on.
> >>>
> >>> This patch adds a few things:
> >>>
> >>> * it adds a bus_if.m method for fetching the VM domain ID of a given
> >>> device; or ENOENT if it's not in a VM domain;
> >>
> >> Maybe a default VM domain. All devices are in VM domains :) By default
> >> today, we have only one VM domain, and that's the model that most of
> >> the code expects...
> >
> > Right, and that doesn't change until you compile in with num domains > 1.
>
> The first part of the statement doesn't change when the number of domains
> is more than one. All devices are in a VM domain.
>
> > Then, CPUs and memory have VM domains, but devices may or may not have
> > a VM domain. There's no "default" VM domain defined if num domains > 1.
>
> Please explain how a device cannot have a VM domain? For the
> terminology I'm familiar with, to even get cycles to the device, you have
> to have a memory address (or an I/O port). That memory address has to
> necessarily map to some domain, even if that domain is equally sucky to
> get to from all CPUs (as is the case with I/O ports). While there may not
> be a "default" domain, by virtue of its physical location it has to have
> one.
>
> > The devices themselves don't know about VM domains right now, so
> > there's nothing constraining things like IRQ routing, CPU set, memory
> > allocation, etc. The isilon team is working on extending the cpuset
> > and allocators to "know" about numa and I'm sure this stuff will fall
> > out of whatever they're working on.
>
> Why would the device need to know the domain? Why aren't the IRQs,
> for example, steered to the appropriate CPU? Why doesn't the bus handle
> allocating memory for it in the appropriate place? How does this "domain"
> tie into memory allocation and thread creation?

Because that's not what you always want (though it often is).  However,
another reason is that system administrators want to know what their
devices are close to.  You can sort of figure it out from devinfo on a
modern x86 machine if you squint right, but it isn't super obvious.  I
have a followup patch that adds a new per-device '%domain' sysctl node so
that it is easier to see which domain a device is close to.  In
real-world experience this can be useful as it lets a sysadmin/developer
know which CPUs to schedule processes on.  (Note that it doesn't always
mean you put them close to the device.  Sometimes you have processes that
are more important than others, so you tie those close to the NIC and
shove the other ones over to the "wrong" domain because you don't care if
they have higher latency.)
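
To make that concrete, here is a rough sketch of what a consumer of the
proposed bus_if.m method could look like.  The accessor name and the exact
shape of the interface are illustrative, not necessarily what the final
patch uses:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/bus.h>

/*
 * Illustrative only: assumes a bus_get_domain()-style accessor wrapping
 * the proposed bus_if.m method, returning ENOENT when no NUMA domain
 * information is available for the device.
 */
static int
foo_attach(device_t dev)
{
	int domain, error;

	error = bus_get_domain(dev, &domain);
	if (error == ENOENT) {
		/* No NUMA domain known for this device. */
		domain = -1;
	} else if (error != 0)
		return (error);

	if (domain >= 0)
		device_printf(dev, "NUMA domain %d\n", domain);
	/* ... pin worker threads / allocate DMA memory near 'domain' ... */
	return (0);
}

With the follow-up sysctl patch the same information would then show up as
a read-only per-device leaf (something like dev.foo.0.%domain, to pick a
hypothetical driver name), which is the part that matters to sysadmins.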

> > So when I go to add sysctl and other tree knowledge for device -> vm
> > domain mapping I'm going to make them return -1 for "no domain."
>
> Seems like there's too many things lumped together here. First off, how
> can there be no domain. That just hurts my brain. It has to be in some
> domain, or it can't be seen. Maybe this domain is one that sucks for
> everybody to access, maybe it is one that's fast for some CPU or package
> of CPUs to access, but it has to have a domain.

They are not always tied to a single NUMA domain.  On some dual-socket
Nehalem/Westmere class machines with per-CPU memory controllers (so 2 NUMA
domains) you will have a single I/O hub that is directly connected to both
CPUs.  Thus, all memory in the system is equi-distant for I/O (but not for
CPU access).

The other problem is that you simply may not know.  Not all BIOSes
correctly communicate this information for devices.  For example, certain
1U Romley servers I have worked with properly enumerate CPU <--> memory
relationships in the SRAT table, but they fail to include the necessary
_PXM method in the top-level PCI bus devices (that correspond to the I/O
hub).  In that case, returning a domain of 0 may very well be wrong.  (In
fact, for these particular machines it mostly _is_ wrong as the expansion
slots are all tied to NUMA domain 1, not 0.)
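
For illustration, on the ACPI side the lookup for a host-PCI bridge boils
down to something like the sketch below.  acpi_get_handle() and
acpi_GetInteger() are existing helpers; the pxm-to-VM-domain translation
helper is a made-up name standing in for whatever the SRAT parsing code
ends up exporting:

#include <sys/param.h>
#include <sys/bus.h>

#include <contrib/dev/acpica/include/acpi.h>
#include <dev/acpica/acpivar.h>

/*
 * Sketch: derive a VM domain for an ACPI-enumerated host-PCI bridge.
 * acpi_map_pxm_to_vm_domainid() is an illustrative name for a helper
 * that translates an ACPI proximity domain (from _PXM/SRAT) into a VM
 * domain id, returning -1 if it can't.
 */
static int
acpi_pcib_get_domain(device_t pcib, int *domain)
{
	ACPI_HANDLE handle;
	int d, pxm;

	handle = acpi_get_handle(pcib);
	if (handle == NULL)
		return (ENOENT);

	/* _PXM is optional, and plenty of BIOSes simply omit it. */
	if (ACPI_FAILURE(acpi_GetInteger(handle, "_PXM", &pxm)))
		return (ENOENT);

	d = acpi_map_pxm_to_vm_domainid(pxm);
	if (d < 0)
		return (ENOENT);

	*domain = d;
	return (0);
}

The point of returning ENOENT rather than guessing is exactly the
broken-BIOS case above: on those Romley boxes a guessed 0 would be
actively misleading.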

> > (Things will get pretty hilarious later on if we have devices that are
> > "local" to two or more VM domains ..)
>
> Well, devices aren't local to domains, per se. Devices can communicate
> with other components in a system at a given cost. One NUMA model is
> "near" vs "far" where a single near domain exists and all the "far"
> resources are quite costly. Other NUMA models may have a wider range of
> costs so that some resources are cheap, others are a little less cheap,
> while others are down right expensive depending on how far across the
> fabric of interconnects the messages need to travel. While one can model
> this as a full 1-1 partitioning, that doesn't match all of the extant
> implementations, even today. It is easy, but an imperfect match to the
> underlying realities in many cases (though a very good match to x86,
> which is mostly what we care about).

Even x86 already has a notion of multiple layers of cost.  You can get
that today if you buy a 4 socket Intel system.  It seems you might also
get that if you get a dual socket Haswell system with more than 8 cores
per package (due to the funky split-brain thing on higher core count
Haswells).  I believe AMD also ships CPUs that contain 2 NUMA domains
within a single physical package.

Note that the I/O locality issue has become far more urgent in the past
few years on x86.  On Nehalem/Westmere, having I/O be remote or local
didn't seem to matter very much (you could only measure very small
differences in latency or throughput between the two scenarios in my
experience).  On Romley (Sandy Bridge) and later it can be a very
substantial difference in terms of both latency and throughput.

-- 
John Baldwin


