Date:      Sat, 26 Jul 2014 13:11:28 -0700
From:      Adrian Chadd <adrian@freebsd.org>
To:        Jeff Roberson <jeff@freebsd.org>
Cc:        "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, Andrew Bates <andrewbates09@gmail.com>
Subject:   Re: Working on NUMA support
Message-ID:  <CAJ-Vmom-wWZLCuuAEKDO1vuaGaSQM-=4e3xoh3OeVibc6m9Z8A@mail.gmail.com>
In-Reply-To: <00E55D89-BDD1-41AD-BBF6-6752B90E8324@ccsys.com>
References:  <CAPi5LmkRO4QLbR2JQV8FuT=jw2jjcCRbP8jT0kj1g8Ks+7jv8A@mail.gmail.com> <CAJ-VmonJPT-NUSi=Wnu7a0oNwe8V=LQMZ-fZGriC7H44edRVLg@mail.gmail.com> <CAPi5Lm=8Z3fh_vxKY26qC3oEv1Ap+RvFGRAOhRosF5UEnDTVpw@mail.gmail.com> <00E55D89-BDD1-41AD-BBF6-6752B90E8324@ccsys.com>

Hi all!

Has there been any further progress on this?

I've been working on making the receive-side scaling (RSS) support usable by
mere mortals, and I've reached the point where I'm going to need NUMA
awareness in the 10GbE/40GbE drivers for the hardware I have access to.

Right now I'm more interested in the kernel driver/allocator side of
things, so (there's a rough code sketch after this list):

* when bringing up a NIC, figure out which CPUs are "most local" to run on;
* for each NIC queue, figure out which bus/memory resources are "most
local" for things the NIC itself touches, like descriptors and packet
memory (e.g. mbufs);
* for each NIC queue, figure out which resources are "most local" for
driver-private structures that the NIC doesn't touch (e.g. per-queue
state);
* for each RSS bucket, figure out which resources are "most local" for
things like packet memory (mbufs), TCP/UDP/inpcb control structures,
etc.
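
To make that concrete, here's roughly the shape of bring-up path I'm
imagining for per-queue setup. Everything named "mydrv" is a placeholder,
and bus_get_domain() / malloc_domainset() / DOMAINSET_PREF() stand for
whatever domain-query and domain-aware-allocation KPI we end up with --
treat this as a sketch, not something you can compile today:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bus.h>
#include <sys/domainset.h>
#include <sys/malloc.h>
#include <sys/smp.h>

/* All "mydrv" names are placeholders, not a real driver. */
struct mydrv_stats {
	uint64_t	rx_pkts;
	uint64_t	tx_pkts;
};

struct mydrv_queue {
	struct resource		*irq_res;	/* queue MSI-X vector, set up elsewhere */
	struct mydrv_stats	*stats;		/* driver-private; NIC never touches this */
};

/* Stand-in: a real version would walk the domain's CPU set. */
static int
mydrv_pick_cpu_in_domain(int domain, int qidx)
{
	return (qidx % mp_ncpus);
}

static int
mydrv_setup_queue(device_t dev, struct mydrv_queue *q, int qidx)
{
	int cpu, domain, error;

	/* Which NUMA domain is the NIC attached to? */
	if (bus_get_domain(dev, &domain) != 0)
		domain = 0;			/* no topology info; fall back */

	/* Driver-private per-queue state: prefer the device's domain. */
	q->stats = malloc_domainset(sizeof(*q->stats), M_DEVBUF,
	    DOMAINSET_PREF(domain), M_WAITOK | M_ZERO);

	/* Run this queue's interrupt on a CPU in the same domain. */
	cpu = mydrv_pick_cpu_in_domain(domain, qidx);
	error = bus_bind_intr(dev, q->irq_res, cpu);
	if (error != 0)
		device_printf(dev, "queue %d: bind to CPU %d failed (%d)\n",
		    qidx, cpu, error);

	return (0);
}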

I had a chat with jhb yesterday and he reminded me that y'all at
Isilon have been looking into this.

He described a few interesting cases from the kernel side to me.

* On architectures with external IO controllers, the path cost from an
IO device to each CPU may be (almost) equivalent, so there isn't a huge
penalty for allocating things on the "wrong" CPU. I think it'd still be
nice to get CPU-local affinity where possible so we can fully
parallelise DRAM access, but we can play with this and see.
* On architectures with CPU-integrated IO controllers, there's a large
penalty for doing inter-CPU IO,
* .. but there's not such a huge penalty for doing inter-CPU memory access.

Given that, we may find that we should always put the IO resources
local to the CPU the device is attached to, even if we decide to run
some or all of the IO processing for the device on another CPU. I.e.,
any RAM that the IO device is doing data or descriptor DMA into should
be local to that device. John said that in his experience the penalty
for a non-local CPU touching memory was much less than the penalty for
device DMA crossing QPI.
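
As a strawman for the "DMA memory stays device-local, compute can move"
split, descriptor ring allocation could look something like this -- again
a sketch: bus_dma_tag_set_domain() stands in for whatever domain-steering
hook busdma grows, and the ring geometry and types are made up:

#include <sys/param.h>
#include <sys/bus.h>
#include <machine/bus.h>

/* Placeholder ring geometry; not a real driver. */
#define	MYDRV_RING_BYTES	(512 * 16)	/* 512 descriptors, 16 bytes each */

struct mydrv_ring {
	bus_dma_tag_t	 tag;
	bus_dmamap_t	 map;
	void		*vaddr;
};

static int
mydrv_alloc_ring(device_t dev, struct mydrv_ring *r)
{
	int domain, error;

	if (bus_get_domain(dev, &domain) != 0)
		domain = 0;

	error = bus_dma_tag_create(bus_get_dma_tag(dev),
	    PAGE_SIZE, 0,			/* alignment, boundary */
	    BUS_SPACE_MAXADDR, BUS_SPACE_MAXADDR,
	    NULL, NULL,				/* filter, filterarg */
	    MYDRV_RING_BYTES, 1, MYDRV_RING_BYTES,
	    0, NULL, NULL, &r->tag);
	if (error != 0)
		return (error);

	/*
	 * Steer the ring's backing pages to the device's domain.  The
	 * worker thread / ithread for this queue can still be bound to
	 * whatever CPU we (or the admin) decide is best.
	 */
	error = bus_dma_tag_set_domain(r->tag, domain);
	if (error != 0)
		device_printf(dev, "no domain steering; using default\n");

	return (bus_dmamem_alloc(r->tag, &r->vaddr,
	    BUS_DMA_WAITOK | BUS_DMA_ZERO | BUS_DMA_COHERENT, &r->map));
}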

So the tricky bit is figuring that out and expressing it all in a way
that lets us do memory allocation and CPU binding in a more
topology-aware way. The other half of the trick is allowing it to be
easily overridden by a curious developer or system administrator who
wants to experiment with different policies.
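
For the override side, even a bog-standard per-driver sysctl/tunable would
be enough to start with -- the knob name below is made up purely for
illustration:

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

/*
 * Made-up knob:
 *   hw.mydrv.numa_domain = -1  -> use whatever the bus says is local
 *   hw.mydrv.numa_domain >= 0  -> force that domain for this driver
 */
static int mydrv_numa_domain = -1;

SYSCTL_NODE(_hw, OID_AUTO, mydrv, CTLFLAG_RD, 0, "mydrv NUMA policy");
SYSCTL_INT(_hw_mydrv, OID_AUTO, numa_domain, CTLFLAG_RWTUN,
    &mydrv_numa_domain, 0, "Force NUMA domain (-1 = device-local)");

static int
mydrv_choose_domain(device_t dev)
{
	int domain;

	if (mydrv_numa_domain >= 0)
		return (mydrv_numa_domain);
	if (bus_get_domain(dev, &domain) != 0)
		domain = 0;
	return (domain);
}

Every allocation and binding decision in the driver would then go through
mydrv_choose_domain() (or a per-queue equivalent) instead of hard-coding
the bus answer, so a curious person can flip the policy and measure the
difference.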

Now, I'm very specifically only addressing the low-level kernel IO /
memory allocation requirements here. There are other things to worry
about up in userland; I think you're trying to address those in your
KPI descriptions.

Thoughts?


-a


