Date:      Thu, 19 Feb 2015 10:37:28 -0500
From:      John Baldwin <jhb@freebsd.org>
To:        Adrian Chadd <adrian@freebsd.org>
Cc:        "freebsd-arch@freebsd.org" <arch@freebsd.org>
Subject:   Re: RFC: bus_get_cpus(9)
Message-ID:  <6147240.5Rne9DUXyM@ralph.baldwin.cx>
In-Reply-To: <CAJ-VmokFzuN2Q4ds6zV2M8rd%2B%2BbZ5M0oTaiFFBYueTekCL0s0w@mail.gmail.com>
References:  <1848011.eGOHhpCEMm@ralph.baldwin.cx> <CAJ-VmokFzuN2Q4ds6zV2M8rd%2B%2BbZ5M0oTaiFFBYueTekCL0s0w@mail.gmail.com>

On Thursday, February 19, 2015 07:28:14 AM Adrian Chadd wrote:
> Hi,
> 
> On 19 February 2015 at 06:46, John Baldwin <jhb@freebsd.org> wrote:
> > One of the next steps for NUMA device-awareness is a way to let drivers
> > know which CPUs are ideal to use for interrupts (and in particular this
> > is targeted at multiqueue NICs that want to create a TX/RX ring pair per
> > CPU).  However, for modern Intel systems at least, it is usually best to
> > use CPUs from the physical processor package that contains the I/O hub
> > that a device connects to (e.g. to allow DDIO to work).
> > 
> > The PoC API I came up with is a new bus method called bus_get_cpus() that
> > returns a requested cpuset for a given device.  It accepts an enum for the
> > second parameter that says the type of cpuset being requested.  Currently
> > two values are supported:
> >  - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest
> >    to the device when NUMA is enabled)
> >  
> >  - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
> > 
> > For a NIC driver the expectation is that the driver will call
> > 'bus_get_cpus(dev, INTR_CPUS, &set)' and create queues for each of the
> > CPUs in 'set'.  (In my current patchset I have updated igb(4) to use this
> > approach.)
> > 
> > For systems that do not support NUMA (or if it is not enabled in the
> > kernel
> > config), LOCAL_CPUS is mapped to 'all_cpus' by default in the 'root_bus'
> > driver.  INTR_CPUS is also mapped to 'all_cpus' by default.
> > 
> > The x86 interrupt code maintains its own set of interrupt CPUs which this
> > patch now exposes via INTR_CPUS in the x86 nexus driver.
> > 
> > The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
> > LOCAL_CPUS set when _PXM exists and NUMA is enabled.  They also AND the
> > global INTR_CPUS set from the nexus driver with the per-domain set from
> > _PXM to generate a local INTR_CPUS set for child devices.
> > 
> > The current patch can be found here:
> > 
> > https://github.com/bsdjhb/freebsd/compare/bsdjhb:master...numa_bus_get_cpus
> > 
> > It includes a few other fixes besides the implementation of bus_get_cpus()
> > (and some things have already been committed such as
> > 
> > taskqueue_start_threads_cpuset() and CPU_COUNT()):
> >  - It fixes the x86 interrupt code to exclude modern SMT threads from the
> >    default interrupt set.  (Previously only Pentium 4-era HTT threads
> >    were excluded.)
> >  
> >  - It has a sample conversion of igb(4) to this interface (albeit ugly,
> >    using #if's).
> > 
> > Longer term I think I would like to make the INTR_CPUS thing a bit more
> > formal.  In particular, Solaris allows you to alter the set of CPUs that
> > handle interrupts via prctl (or a tool named something close to that).  I
> > think I would like to have a dedicated global cpuset for that (but not
> > named "2", it would be a new WHICH level).  That would allow userland to
> > use cpuset to alter the set of CPUs that handle interrupts in case you
> > wanted to use SMT for example.  I think if we do this that all ithreads
> > would have their cpusets hang off of this set instead of the root set
> > (which would also remove some of the recent special case handling for
> > ithreads I believe).  The one uglier part about this is that we should
> > probably then have a way to notify drivers that INTR_CPUS changed so that
> > they could try to cope gracefully.  I think that's a bit of a longer
> > horizon thing, but for now I think bus_get_cpus() is a good next step.
> > 
> > What do other folks think?  (And yes, I know it needs a manpage before it
> > goes in, but I'd rather get the API agreed on before polishing that.)
> 
> So I'd rather have something slightly more descriptive about the
> iteration of cpu ids and cpusets for each queue in a driver.
> 
> Eg, you're iterating over the interrupt set for a CPU, but it's still
> up to the driver to walk the cpuset array and figure out which queue
> goes to which CPU.
> 
> For RSS I'm going to take this stuff and have the driver call into RSS
> to do this. It'll provide the deviceid, and it'll return:
> 
> * how many queues?
> * for each queue id, what is the cpuset the interrupts/taskqueues should run
> on.
> 
> It already does this, but it currently has no idea about the
> underlying device topology.
> 
> (Later on for rebalancing we'll want to have those cpusets returned be
> some top level cpusets per RSS bucket that we can change and have
> everything cascade "right", but that's a later thing to worry about.)
> 
> For the RSS and non-RSS case, I can see situations where the admin may
> wish to define a mapping from queues to cpusets that isn't necessarily
> the 1:1 mapping you've done. For example, saying "I want 16 queues,
> but I have four CPUs, here's how they're mapped." Right now drivers do
> round-robin, but it's hard coded. If we have an iteration API like
> what exists for RSS then we can hide the policy config in the bus code
> and have it check kenv/hints at boot time for what the config should
> be.
> 
> So I'd like to have the API be a little higher level so we can do
> interesting things with it.

There's nothing preventing the RSS code from calling bus_get_cpus() internally 
to populate the info it returns in its APIs.

That is, I imagine something like:

#ifdef RSS
	queue_info = fetch_rss_info(dev);
	for (queue in queue_info) {
		create queue for CPU queue->cpu
	}
#else
	/* Use bus_get_cpus directly and do 1:1 */
#endif

That is, I think RSS should provide a layer on top of new-bus, not be a 
bus_foo API.  At some point all drivers might only have the #ifdef RSS case 
and not use bus_get_cpus() directly at all, but it doesn't seem like the RSS 
API is quite there yet.

-- 
John Baldwin


