Date: Thu, 19 Feb 2015 20:49:34 +0300
From: Slawa Olhovchenkov
To: John Baldwin
Cc: arch@freebsd.org
Subject: Re: RFC: bus_get_cpus(9)
Message-ID: <20150219174934.GB46228@zxy.spb.ru>
References: <1848011.eGOHhpCEMm@ralph.baldwin.cx>
In-Reply-To: <1848011.eGOHhpCEMm@ralph.baldwin.cx>
List-Id: Discussion related to FreeBSD architecture

On Thu, Feb 19, 2015 at 09:46:35AM -0500, John Baldwin wrote:

> One of the next steps for NUMA device-awareness is a way to let drivers know
> which CPUs are ideal to use for interrupts (and in particular this is targeted
> at multiqueue NICs that want to create a TX/RX ring pair per CPU).
> However, for modern Intel systems at least, it is usually best to use CPUs
> from the physical processor package that contains the I/O hub that a device
> connects to (e.g. to allow DDIO to work).
>
> The PoC API I came up with is a new bus method called bus_get_cpus() that
> returns a requested cpuset for a given device.  It accepts an enum for the
> second parameter that says the type of cpuset being requested.  Currently two
> values are supported:
>
> - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to the
>   device when NUMA is enabled)
> - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
>
> For a NIC driver the expectation is that the driver will call
> 'bus_get_cpus(dev, INTR_CPUS, &set)' and create queues for each of the CPUs in
> 'set'.  (In my current patchset I have updated igb(4) to use this approach.)
>
> For systems that do not support NUMA (or if it is not enabled in the kernel
> config), LOCAL_CPUS is mapped to 'all_cpus' by default in the 'root_bus'
> driver.  INTR_CPUS is also mapped to 'all_cpus' by default.
>
> The x86 interrupt code maintains its own set of interrupt CPUs which this
> patch now exposes via INTR_CPUS in the x86 nexus driver.
>
> The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
> LOCAL_CPUS set when _PXM exists and NUMA is enabled.  They also AND the global
> INTR_CPUS set from the nexus driver with the per-domain set from _PXM to
> generate a local INTR_CPUS set for child devices.
>
> The current patch can be found here:
>
> https://github.com/bsdjhb/freebsd/compare/bsdjhb:master...numa_bus_get_cpus
>
> It includes a few other fixes besides the implementation of bus_get_cpus() (and
> some things have already been committed such as
> taskqueue_start_threads_cpuset() and CPU_COUNT()):
>
> - It fixes the x86 interrupt code to exclude modern SMT threads from the
>   default interrupt set.  (Previously only Pentium 4-era HTT threads were
>   excluded.)
> - It has a sample conversion of igb(4) to this interface (albeit ugly using
>   #if's).
>
> Longer term I think I would like to make the INTR_CPUS thing a bit more
> formal.  In particular, Solaris allows you to alter the set of CPUs that
> handle interrupts via prctl (or a tool named something close to that).  I
> think I would like to have a dedicated global cpuset for that (but not named
> "2", it would be a new WHICH level).  That would allow userland to use cpuset
> to alter the set of CPUs that handle interrupts in case you wanted to use SMT
> for example.  I think if we do this that all ithreads would have their cpusets
> hang off of this set instead of the root set (which would also remove some of
> the recent special case handling for ithreads I believe).  The one uglier part
> about this is that we should probably then have a way to notify drivers that
> INTR_CPUS changed so that they could try to cope gracefully.  I think that's a
> bit of a longer horizon thing, but for now I think bus_get_cpus() is a good
> next step.
>
> What do other folks think?  (And yes, I know it needs a manpage before it goes
> in, but I'd rather get the API agreed on before polishing that.)

I already do this by hand using cpuset. Some setups need to dedicate one
CPU set to interrupt handling and another CPU set to some application.
Because the application may not allow modification, we need cpuset-aware
arithmetic, i.e. a utility that can answer queries like 'the CPU set not
used by the interrupt handlers of devices ix0 and ix1'.
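
To make the requested arithmetic concrete, here is a rough userland sketch of
the query 'CPUs not used by the interrupt handlers of the given devices'. It
is only an illustration under stated assumptions: cpumask_t and the
cpumask_*() helpers are hypothetical names, not existing FreeBSD interfaces,
a 64-bit mask stands in for cpuset_t, and the per-device interrupt sets are
supplied by the caller rather than obtained from bus_get_cpus().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for cpuset_t: bit N set means CPU N is in the set. */
typedef uint64_t cpumask_t;

/* Set containing only 'cpu' (this sketch is limited to 64 CPUs). */
static cpumask_t
cpumask_of(int cpu)
{
	assert(cpu >= 0 && cpu < 64);
	return ((cpumask_t)1 << cpu);
}

/* Set containing CPUs 0 .. ncpus-1, i.e. a stand-in for 'all_cpus'. */
static cpumask_t
cpumask_first_n(int ncpus)
{
	assert(ncpus >= 0 && ncpus <= 64);
	return (ncpus == 64 ? ~(cpumask_t)0 : ((cpumask_t)1 << ncpus) - 1);
}

/*
 * CPUs in 'all' that are not in the interrupt set of any listed device:
 * all AND NOT (intr[0] OR intr[1] OR ...).
 */
static cpumask_t
cpumask_not_intr(cpumask_t all, const cpumask_t *intr, int ndev)
{
	cpumask_t used = 0;

	for (int i = 0; i < ndev; i++)
		used |= intr[i];
	return (all & ~used);
}
```

For example, on an 8-CPU machine where ix0 takes interrupts on CPUs 0-1 and
ix1 on CPUs 1-2, cpumask_not_intr() leaves CPUs 3-7 for the application. A
real utility would presumably operate on cpuset_t with the CPU_* macros
instead of a fixed 64-bit mask.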