Date:      Sat, 11 Jul 2015 15:21:37 +0000 (UTC)
From:      Adrian Chadd <adrian@FreeBSD.org>
To:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   svn commit: r285387 - in head: lib/libc/sys share/man/man4 sys/conf sys/kern sys/sys sys/vm usr.bin usr.bin/numactl
Message-ID:  <201507111521.t6BFLcrv039934@repo.freebsd.org>

Author: adrian
Date: Sat Jul 11 15:21:37 2015
New Revision: 285387
URL: https://svnweb.freebsd.org/changeset/base/285387

Log:
  Add an initial NUMA affinity/policy configuration for threads and processes.
  
  This is based on work done by jeff@ and jhb@, as well as the numa.diff
  patch that has been circulating when someone asks for first-touch NUMA
  on -10 or -11.
  
  * Introduce a simple set of VM policy and iterator types.
  * tie the policy types into the vm_phys path for now, mirroring how
    the initial first-touch allocation work was enabled.
  * add syscalls to control changing thread and process defaults.
  * add a global NUMA VM domain policy.
  * implement a simple cascade policy order - if a thread policy exists, use it;
    if a process policy exists, use it; use the default policy.
  * processes inherit policies from their parent processes, threads inherit
    policies from their parent threads.
  * add a simple tool (numactl) to query and modify default thread/process
    policies.
  * add documentation for the new syscalls, for numa and for numactl.
  * re-enable first-touch NUMA by default, now that policies can be
    set in a variety of ways.
  
  This is only relevant for very specific workloads.
  
  This doesn't pretend to be a final NUMA solution.
  
  The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
  'sysctl vm.default_policy=rr'.
  
  This is only relevant if MAXMEMDOM is set to something other than 1.
  I.e., if you're using GENERIC or a modified kernel without NUMA enabled,
  this is a glorified no-op for you.
  
  Thank you to Norse Corp for giving me access to rather large
  (for FreeBSD!) NUMA machines in order to develop and verify this.
  
  Thank you to Dell for providing me with dual socket sandybridge
  and westmere v3 hardware to do NUMA development with.
  
  Thank you to Scott Long at Netflix for providing me with access
  to the two-socket, four-domain haswell v3 hardware.
  
  Thank you to Peter Holm for running the stress testing suite
  against the NUMA branch during various stages of development!
  
  Tested:
  
  * MIPS (regression testing; non-NUMA)
  * i386 (regression testing; non-NUMA GENERIC)
  * amd64 (regression testing; non-NUMA GENERIC)
  * westmere, 2 socket (thankyou norse!)
  * sandy bridge, 2 socket (thankyou dell!)
  * ivy bridge, 2 socket (thankyou norse!)
  * westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
  * haswell, 2 socket (thankyou norse!)
  * haswell v3, 2 socket (thankyou dell)
  * haswell v3, 2x18 core (thankyou scott long / netflix!)
  
  * Peter Holm ran a stress test suite on this work and found one
    issue, but has not been able to verify it (it doesn't look NUMA
    related, and he only saw it once over many testing runs.)
  
  * I've tested bhyve instances running in fixed NUMA domains and cpusets;
    all seems to work correctly.
  
  Verified:
  
  * intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
    NUMA policies for processes under test.
  
  Review:
  
  This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
  as well as privately and via emails to freebsd-arch@.  The git history
  with specific attributes is available at https://github.com/erikarn/freebsd/
  in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
  
  This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
  wblock) but has not achieved a clear consensus.  My hope is that with further
  exposure and testing more functionality can be implemented and evaluated.
  
  Notes:
  
  * The VM doesn't handle unbalanced domains very well, and if you have an overly
    unbalanced memory setup whilst under high memory pressure, VM page allocation
    may fail leading to a kernel panic.  This was a problem in the past, but it's
    much more easily triggered now with these tools.
  
  * This work only controls the path through vm_phys; it doesn't yet strongly/predictably
    affect contigmalloc, KVA placement, UMA, etc.  So, driver placement of memory
    isn't really guaranteed in any way.  That's next on my plate.
  
  Sponsored by:	Norse Corp, Inc.; Dell

Added:
  head/lib/libc/sys/numa_getaffinity.2   (contents, props changed)
  head/share/man/man4/numa.4   (contents, props changed)
  head/sys/kern/kern_numa.c   (contents, props changed)
  head/sys/sys/_vm_domain.h   (contents, props changed)
  head/sys/sys/numa.h   (contents, props changed)
  head/sys/vm/vm_domain.c   (contents, props changed)
  head/sys/vm/vm_domain.h   (contents, props changed)
  head/usr.bin/numactl/
  head/usr.bin/numactl/Makefile   (contents, props changed)
  head/usr.bin/numactl/numactl.1   (contents, props changed)
  head/usr.bin/numactl/numactl.c   (contents, props changed)
Modified:
  head/lib/libc/sys/Makefile.inc
  head/lib/libc/sys/Symbol.map
  head/share/man/man4/Makefile
  head/sys/conf/files
  head/sys/kern/init_main.c
  head/sys/kern/init_sysent.c
  head/sys/kern/kern_exit.c
  head/sys/kern/kern_fork.c
  head/sys/kern/kern_thr.c
  head/sys/kern/kern_thread.c
  head/sys/sys/proc.h
  head/sys/vm/vm_phys.c
  head/sys/vm/vm_phys.h
  head/usr.bin/Makefile

Modified: head/lib/libc/sys/Makefile.inc
==============================================================================
--- head/lib/libc/sys/Makefile.inc	Sat Jul 11 13:07:50 2015	(r285386)
+++ head/lib/libc/sys/Makefile.inc	Sat Jul 11 15:21:37 2015	(r285387)
@@ -235,6 +235,7 @@ MAN+=	abort2.2 \
 	nanosleep.2 \
 	nfssvc.2 \
 	ntp_adjtime.2 \
+	numa_getaffinity.2 \
 	open.2 \
 	pathconf.2 \
 	pdfork.2 \
@@ -395,6 +396,7 @@ MLINKS+=mount.2 nmount.2 \
 MLINKS+=mq_receive.2 mq_timedreceive.2
 MLINKS+=mq_send.2 mq_timedsend.2
 MLINKS+=ntp_adjtime.2 ntp_gettime.2
+MLINKS+=numa_getaffinity.2 numa_setaffinity.2
 MLINKS+=open.2 openat.2
 MLINKS+=pathconf.2 fpathconf.2
 MLINKS+=pathconf.2 lpathconf.2

Modified: head/lib/libc/sys/Symbol.map
==============================================================================
--- head/lib/libc/sys/Symbol.map	Sat Jul 11 13:07:50 2015	(r285386)
+++ head/lib/libc/sys/Symbol.map	Sat Jul 11 15:21:37 2015	(r285387)
@@ -400,6 +400,8 @@ FBSD_1.4 {
 	futimens;
 	ppoll;
 	utimensat;
+	numa_setaffinity;
+	numa_getaffinity;
 };
 
 FBSDprivate_1.0 {

Added: head/lib/libc/sys/numa_getaffinity.2
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/lib/libc/sys/numa_getaffinity.2	Sat Jul 11 15:21:37 2015	(r285387)
@@ -0,0 +1,197 @@
+.\" Copyright (c) 2008 Christian Brueffer
+.\" Copyright (c) 2008 Jeffrey Roberson
+.\" Copyright (c) 2015 Adrian Chadd
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" $FreeBSD$
+.\"
+.Dd May 7, 2015
+.Dt NUMA_GETAFFINITY 2
+.Os
+.Sh NAME
+.Nm numa_getaffinity ,
+.Nm numa_setaffinity
+.Nd manage NUMA affinity
+.Sh LIBRARY
+.Lb libc
+.Sh SYNOPSIS
+.In sys/param.h
+.In sys/numa.h
+.Ft int
+.Fn numa_getaffinity "cpuwhich_t which" "id_t id" "struct vm_domain_policy_entry *policy"
+.Ft int
+.Fn numa_setaffinity "cpuwhich_t which" "id_t id" "const struct vm_domain_policy_entry *policy"
+.Sh DESCRIPTION
+.Fn numa_getaffinity
+and
+.Fn numa_setaffinity
+allow the manipulation of NUMA policies available to processes and threads.
+These functions may manipulate NUMA policies that contain many processes
+or affect only a single object.
+.Pp
+Valid values for the
+.Fa which
+argument are documented in
+.Xr cpuset 2 .
+These arguments specify which object set are used.
+Only
+.Dv CPU_WHICH_TID
+and
+.Dv CPU_WHICH_PID
+can be manipulated.
+.Pp
+The
+.Fa policy
+entry contains a vm_domain_policy_entry with the following fields:
+.Bd -literal
+struct vm_domain_policy_entry {
+    vm_domain_policy_type_t policy;   /* VM policy */
+    int domain;   /* VM domain, if applicable */
+}
+.Ed
+.Fa vm_domain_policy_type_t policy
+is one these:
+.Bl -tag -width VM_POLICY_NONE
+.It Dv VM_POLICY_NONE
+Reset the domain back to none.
+Any parent object NUMA domain policy will apply.
+The only valid value for
+.Dv domain
+is -1.
+.It Dv VM_POLICY_ROUND_ROBIN
+Select round-robin policy.
+Pages will be allocated round-robin from each VM domain in order.
+The only valid value for
+.Dv domain
+is -1.
+.It Dv VM_POLICY_FIXED_DOMAIN
+Select fixed-domain only policy.
+Pages will be allocated from the given
+.Dv domain
+which must be set to a valid VM domain.
+Pages will not be allocated from another domain if
+.Dv domain
+is out of free pages.
+.It Dv VM_POLICY_FIXED_DOMAIN_ROUND_ROBIN
+Select fixed-domain only policy.
+Pages will be allocated from
+.Dv domain
+which must be set to a valid VM domain.
+If page allocation fails, pages will be round-robin
+allocated from another domain if
+.Dv domain
+is out of free pages.
+.It Dv VM_POLICY_FIRST_TOUCH
+Select first-touch policy.
+Pages will be allocated from the NUMA domain which the thread
+is currently scheduled upon.
+Pages will not be allocated from another domain if the current domain
+is out of free pages.
+The only valid value for
+.Dv domain
+is -1.
+.It Dv VM_POLICY_FIRST_TOUCH_ROUND_ROBIN
+Select first-touch policy.
+Pages will be allocated from the NUMA domain which the thread
+is currently scheduled upon.
+Pages will be allocated round-robin from another domain if the
+current domain is out of free pages.
+The only valid value for
+.Dv domain
+is -1.
+.El
+.Pp
+Note that the VM might assign some pages from other domains.
+For example, if an existing page allocation is covered by a superpage
+allocation.
+.Pp
+.Fn numa_getaffinity
+retrieves the
+NUMA policy from the object specified by
+.Fa which
+and
+.Fa id
+and stores it in the space provided by
+.Fa policy .
+.Pp
+.Fn numa_setaffinity
+attempts to set the NUMA policy for the object specified by
+.Fa which
+and
+.Fa id
+to the policy in
+.Fa policy .
+.Sh RETURN VALUES
+.Rv -std
+.Sh ERRORS
+.Va errno
+can contain these error codes:
+.Bl -tag -width Er
+.It Bq Er EINVAL
+The
+.Fa level
+or
+.Fa which
+argument was not a valid value.
+.It Bq Er EINVAL
+The
+.Fa policy
+argument specified when calling
+.Fn numa_setaffinity
+did not contain a valid policy.
+.It Bq Er EFAULT
+The policy pointer passed was invalid.
+.It Bq Er ESRCH
+The object specified by the
+.Fa id
+and
+.Fa which
+arguments could not be found.
+.It Bq Er ERANGE
+The
+.Fa domain
+in the given policy
+was out of the range of possible VM domains available.
+.It Bq Er EPERM
+The calling process did not have the credentials required to complete the
+operation.
+.El
+.Sh SEE ALSO
+.Xr cpuset 1 ,
+.Xr numactl 1 ,
+.Xr cpuset 2 ,
+.Xr cpuset_getaffinity 2 ,
+.Xr cpuset_getid 2 ,
+.Xr cpuset_setaffinity 2 ,
+.Xr cpuset_setid 2 ,
+.Xr pthread_affinity_np 3 ,
+.Xr pthread_attr_affinity_np 3 ,
+.Xr numa 4
+.Sh HISTORY
+The
+.Nm
+family of system calls first appeared in
+.Fx 11.0 .
+.Sh AUTHORS
+.An Adrian Chadd Aq Mt adrian@FreeBSD.org
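
The DESCRIPTION in the man page above ties each policy type to the
values of "domain" it accepts. The following is a minimal userland
sketch of those rules, not the kernel's vm_domain_policy_validate(),
whose body is not part of this excerpt and may differ. The enum and
struct are local copies of the definitions this commit adds in
sys/sys/_vm_domain.h; entry_matches_manpage() and its ndomains
parameter (standing in for vm.ndomains) are hypothetical names used
only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local copy of the policy types from sys/sys/_vm_domain.h. */
typedef enum {
	VM_POLICY_NONE,
	VM_POLICY_ROUND_ROBIN,
	VM_POLICY_FIXED_DOMAIN,
	VM_POLICY_FIXED_DOMAIN_ROUND_ROBIN,
	VM_POLICY_FIRST_TOUCH,
	VM_POLICY_FIRST_TOUCH_ROUND_ROBIN,
	VM_POLICY_MAX
} vm_domain_policy_type_t;

struct vm_domain_policy_entry {
	vm_domain_policy_type_t policy;
	int domain;
};

/*
 * Hypothetical helper: return true if the entry follows the rules
 * stated in the man page.  ndomains is an assumed stand-in for the
 * number of VM domains reported by vm.ndomains.
 */
static bool
entry_matches_manpage(const struct vm_domain_policy_entry *e, int ndomains)
{
	assert(e != NULL);

	switch (e->policy) {
	case VM_POLICY_NONE:
	case VM_POLICY_ROUND_ROBIN:
	case VM_POLICY_FIRST_TOUCH:
	case VM_POLICY_FIRST_TOUCH_ROUND_ROBIN:
		/* "The only valid value for domain is -1." */
		return (e->domain == -1);
	case VM_POLICY_FIXED_DOMAIN:
	case VM_POLICY_FIXED_DOMAIN_ROUND_ROBIN:
		/* "domain ... must be set to a valid VM domain." */
		return (e->domain >= 0 && e->domain < ndomains);
	default:
		/* VM_POLICY_MAX and anything beyond it is not a policy. */
		return (false);
	}
}
```

On a two-domain machine this sketch would accept
{ VM_POLICY_FIXED_DOMAIN, 1 } and reject { VM_POLICY_FIXED_DOMAIN, 2 }
as well as { VM_POLICY_ROUND_ROBIN, 0 }.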

Modified: head/share/man/man4/Makefile
==============================================================================
--- head/share/man/man4/Makefile	Sat Jul 11 13:07:50 2015	(r285386)
+++ head/share/man/man4/Makefile	Sat Jul 11 15:21:37 2015	(r285387)
@@ -364,6 +364,7 @@ MAN=	aac.4 \
 	nsp.4 \
 	${_ntb.4} \
 	null.4 \
+	numa.4 \
 	${_nvd.4} \
 	${_nvme.4} \
 	${_nvram.4} \

Added: head/share/man/man4/numa.4
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/share/man/man4/numa.4	Sat Jul 11 15:21:37 2015	(r285387)
@@ -0,0 +1,172 @@
+.\" Copyright (c) 2015 Adrian Chadd <adrian@FreeBSD.org>
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" $FreeBSD$
+.\"
+.Dd May 10, 2015
+.Dt NUMA 4
+.Os
+.Sh NAME
+.Nm NUMA
+.Nd Non-Uniform Memory Access
+.Sh SYNOPSIS
+.Cd options SMP
+.Cd options MAXMEMDOM=16
+.Pp
+.In sys/numa.h
+.In sys/cpuset.h
+.In sys/bus.h
+.Sh DESCRIPTION
+Non-Uniform Memory Access is a computer architecture design which
+involves unequal costs between processors, memory and IO devices
+in a given system.
+.Pp
+In a
+.Nm
+architecture, the latency to access specific memory or IO devices
+depends upon which processor the memory or device is attached to.
+Accessing memory local to a processor is faster than accessing memory
+that is connected to one of the other processors.
+.Pp
+.Nm
+is enabled when the
+.Cd MAXMEMDOM
+option is used in a kernel configuration
+file and is set to a value greater than 1.
+.Pp
+Thread and process
+.Nm
+policies are controlled with the
+.Xr numa_setaffinity 2
+and
+.Xr numa_getaffinity 2
+syscalls.
+.Pp
+The
+.Xr numactl 1
+tool is available for starting processes with a non-default
+policy, or to change the policy of an existing thread or process.
+.Pp
+Systems with non-uniform access to I/O devices may mark those devices
+with the local VM domain identifier.
+Drivers can find out their local domain information by calling
+.Xr bus_get_domain 9 .
+.Ss MIB Variables
+The operation of
+.Nm
+is controlled and exposes information with these
+.Xr sysctl 8
+MIB variables:
+.Pp
+.Bl -tag -width indent -compact
+.It Va vm.ndomains
+The number of VM domains which have been detected.
+.Pp
+.It Va vm.default_policy
+The default VM domain allocation policy.
+Defaults to "first-touch-rr".
+The valid values are "first-touch", "first-touch-rr",
+"rr", where "rr" is a short-hand for "round-robin."
+See
+.Xr numa_setaffinity 2
+for more information about the available policies.
+.Pp
+.It Va vm.phys_locality
+A table indicating the relative cost of each VM domain to each other.
+A value of 10 indicates equal cost.
+A value of -1 means the locality map is not available or no
+locality information is available.
+.Pp
+.It Va vm.phys_segs
+The map of physical memory, grouped by VM domain.
+.El
+.Sh IMPLEMENTATION NOTES
+The current
+.Nm
+implementation is VM-focused.
+The hardware
+.Nm
+domains are mapped into a contiguous, non-sparse
+VM domain space, starting from 0.
+Thus, VM domain information (for example, the domain identifier) is not
+necessarily the same as is found in the hardware specific information.
+.Pp
+The
+.Nm
+allocation policies are implemented as a policy and iterator in
+.Pa sys/vm/vm_domain.c
+and
+.Pa sys/vm/vm_domain.h .
+Policy information is available in both struct thread and struct proc.
+Processes inherit
+.Nm
+policy from parent processes and threads inherit
+.Nm
+policy from parent threads.
+Note that threads do not explicitly inherit their
+.Nm
+policy from processes.
+Instead, if no thread policy is set, the system
+will fall back to the process policy.
+.Pp
+For now,
+.Nm
+domain policies only influence physical page allocation in
+.Pa sys/vm/vm_phys.c .
+This is useful for userland memory allocation, but not for kernel
+and driver memory allocation.
+These features will be implemented in future work.
+.Sh SEE ALSO
+.Xr numactl 1 ,
+.Xr numa_getaffinity 2 ,
+.Xr numa_setaffinity 2 ,
+.Xr bus_get_domain 9
+.Sh HISTORY
+.Nm
+first appeared in
+.Fx 9.0
+as a first-touch allocation policy with a fail-over to round-robin allocation
+and was not configurable.
+It was then modified in
+.Fx 10.0
+to implement a round-robin allocation policy and was also not configurable.
+.Pp
+The
+.Xr numa_getaffinity 2
+and
+.Xr numa_setaffinity 2
+syscalls first appeared in
+.Fx 11.0 .
+.Pp
+The
+.Xr numactl 1
+tool first appeared in
+.Fx 11.0 .
+.Sh AUTHORS
+This manual page written by
+.An Adrian Chadd Aq Mt adrian@FreeBSD.org .
+.Sh NOTES
+No statistics are kept to indicate how often
+.Nm
+allocation policies succeed or fail.
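
The IMPLEMENTATION NOTES above describe a fallback order: a thread
policy is used if one is set, otherwise the process policy, otherwise
the system default. The following is a minimal sketch of that
cascade, assuming that VM_POLICY_NONE means "nothing set at this
level" (the numa_setaffinity(2) text says a parent policy applies in
that case). effective_policy() is a hypothetical name; the lookup in
sys/vm/vm_phys.c is not part of this excerpt and may be structured
differently.

```c
#include <assert.h>
#include <stddef.h>

/* Local copy of the policy types from sys/sys/_vm_domain.h. */
typedef enum {
	VM_POLICY_NONE,
	VM_POLICY_ROUND_ROBIN,
	VM_POLICY_FIXED_DOMAIN,
	VM_POLICY_FIXED_DOMAIN_ROUND_ROBIN,
	VM_POLICY_FIRST_TOUCH,
	VM_POLICY_FIRST_TOUCH_ROUND_ROBIN,
	VM_POLICY_MAX
} vm_domain_policy_type_t;

struct vm_domain_policy_entry {
	vm_domain_policy_type_t policy;
	int domain;
};

/*
 * Hypothetical helper: choose which of the three policies applies.
 * Thread beats process, process beats the global default.
 */
static const struct vm_domain_policy_entry *
effective_policy(const struct vm_domain_policy_entry *thread_pol,
    const struct vm_domain_policy_entry *proc_pol,
    const struct vm_domain_policy_entry *default_pol)
{
	assert(thread_pol != NULL && proc_pol != NULL && default_pol != NULL);

	if (thread_pol->policy != VM_POLICY_NONE)
		return (thread_pol);
	if (proc_pol->policy != VM_POLICY_NONE)
		return (proc_pol);
	return (default_pol);
}
```

With the documented default of "first-touch-rr", a thread and process
that both carry VM_POLICY_NONE would resolve to
{ VM_POLICY_FIRST_TOUCH_ROUND_ROBIN, -1 } in this sketch.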

Modified: head/sys/conf/files
==============================================================================
--- head/sys/conf/files	Sat Jul 11 13:07:50 2015	(r285386)
+++ head/sys/conf/files	Sat Jul 11 15:21:37 2015	(r285387)
@@ -3017,6 +3017,7 @@ kern/kern_module.c		standard
 kern/kern_mtxpool.c		standard
 kern/kern_mutex.c		standard
 kern/kern_ntptime.c		standard
+kern/kern_numa.c		standard
 kern/kern_osd.c			standard
 kern/kern_physio.c		standard
 kern/kern_pmc.c			standard
@@ -4043,6 +4044,7 @@ vm/vm_pager.c			standard
 vm/vm_phys.c			standard
 vm/vm_radix.c			standard
 vm/vm_reserv.c			standard
+vm/vm_domain.c			standard
 vm/vm_unix.c			standard
 vm/vm_zeroidle.c		standard
 vm/vnode_pager.c		standard

Modified: head/sys/kern/init_main.c
==============================================================================
--- head/sys/kern/init_main.c	Sat Jul 11 13:07:50 2015	(r285386)
+++ head/sys/kern/init_main.c	Sat Jul 11 15:21:37 2015	(r285387)
@@ -87,6 +87,7 @@ __FBSDID("$FreeBSD$");
 #include <vm/vm_param.h>
 #include <vm/pmap.h>
 #include <vm/vm_map.h>
+#include <vm/vm_domain.h>
 #include <sys/copyright.h>
 
 #include <ddb/ddb.h>
@@ -496,6 +497,10 @@ proc0_init(void *dummy __unused)
 	td->td_flags = TDF_INMEM;
 	td->td_pflags = TDP_KTHREAD;
 	td->td_cpuset = cpuset_thread0();
+	vm_domain_policy_init(&td->td_vm_dom_policy);
+	vm_domain_policy_set(&td->td_vm_dom_policy, VM_POLICY_NONE, -1);
+	vm_domain_policy_init(&p->p_vm_dom_policy);
+	vm_domain_policy_set(&p->p_vm_dom_policy, VM_POLICY_NONE, -1);
 	prison0_init();
 	p->p_peers = 0;
 	p->p_leader = p;

Modified: head/sys/kern/init_sysent.c
==============================================================================
--- head/sys/kern/init_sysent.c	Sat Jul 11 13:07:50 2015	(r285386)
+++ head/sys/kern/init_sysent.c	Sat Jul 11 15:21:37 2015	(r285387)
@@ -588,4 +588,6 @@ struct sysent sysent[] = {
 	{ AS(ppoll_args), (sy_call_t *)sys_ppoll, AUE_POLL, NULL, 0, 0, 0, SY_THR_STATIC },	/* 545 = ppoll */
 	{ AS(futimens_args), (sy_call_t *)sys_futimens, AUE_FUTIMES, NULL, 0, 0, SYF_CAPENABLED, SY_THR_STATIC },	/* 546 = futimens */
 	{ AS(utimensat_args), (sy_call_t *)sys_utimensat, AUE_FUTIMESAT, NULL, 0, 0, SYF_CAPENABLED, SY_THR_STATIC },	/* 547 = utimensat */
+	{ AS(numa_getaffinity_args), (sy_call_t *)sys_numa_getaffinity, AUE_NULL, NULL, 0, 0, 0, SY_THR_STATIC },	/* 548 = numa_getaffinity */
+	{ AS(numa_setaffinity_args), (sy_call_t *)sys_numa_setaffinity, AUE_NULL, NULL, 0, 0, 0, SY_THR_STATIC },	/* 549 = numa_setaffinity */
 };

Modified: head/sys/kern/kern_exit.c
==============================================================================
--- head/sys/kern/kern_exit.c	Sat Jul 11 13:07:50 2015	(r285386)
+++ head/sys/kern/kern_exit.c	Sat Jul 11 15:21:37 2015	(r285387)
@@ -86,6 +86,7 @@ __FBSDID("$FreeBSD$");
 #include <vm/vm_map.h>
 #include <vm/vm_page.h>
 #include <vm/uma.h>
+#include <vm/vm_domain.h>
 
 #ifdef KDTRACE_HOOKS
 #include <sys/dtrace_bsd.h>
@@ -950,6 +951,11 @@ proc_reap(struct thread *td, struct proc
 #ifdef MAC
 	mac_proc_destroy(p);
 #endif
+	/*
+	 * Free any domain policy that's still hiding around.
+	 */
+	vm_domain_policy_cleanup(&p->p_vm_dom_policy);
+
 	KASSERT(FIRST_THREAD_IN_PROC(p),
 	    ("proc_reap: no residual thread!"));
 	uma_zfree(proc_zone, p);

Modified: head/sys/kern/kern_fork.c
==============================================================================
--- head/sys/kern/kern_fork.c	Sat Jul 11 13:07:50 2015	(r285386)
+++ head/sys/kern/kern_fork.c	Sat Jul 11 15:21:37 2015	(r285387)
@@ -80,6 +80,7 @@ __FBSDID("$FreeBSD$");
 #include <vm/vm_map.h>
 #include <vm/vm_extern.h>
 #include <vm/uma.h>
+#include <vm/vm_domain.h>
 
 #ifdef KDTRACE_HOOKS
 #include <sys/dtrace_bsd.h>
@@ -405,6 +406,7 @@ do_fork(struct thread *td, int flags, st
 	bcopy(&p1->p_startcopy, &p2->p_startcopy,
 	    __rangeof(struct proc, p_startcopy, p_endcopy));
 	pargs_hold(p2->p_args);
+
 	PROC_UNLOCK(p1);
 
 	bzero(&p2->p_startzero,
@@ -497,6 +499,14 @@ do_fork(struct thread *td, int flags, st
 	if (p1->p_flag & P_PROFIL)
 		startprofclock(p2);
 
+	/*
+	 * Whilst the proc lock is held, copy the VM domain data out
+	 * using the VM domain method.
+	 */
+	vm_domain_policy_init(&p2->p_vm_dom_policy);
+	vm_domain_policy_localcopy(&p2->p_vm_dom_policy,
+	    &p1->p_vm_dom_policy);
+
 	if (flags & RFSIGSHARE) {
 		p2->p_sigacts = sigacts_hold(p1->p_sigacts);
 	} else {

Added: head/sys/kern/kern_numa.c
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/sys/kern/kern_numa.c	Sat Jul 11 15:21:37 2015	(r285387)
@@ -0,0 +1,170 @@
+/*-
+ * Copyright (c) 2015, Adrian Chadd <adrian@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/sysproto.h>
+#include <sys/jail.h>
+#include <sys/kernel.h>
+#include <sys/lock.h>
+#include <sys/malloc.h>
+#include <sys/mutex.h>
+#include <sys/priv.h>
+#include <sys/proc.h>
+#include <sys/refcount.h>
+#include <sys/sched.h>
+#include <sys/smp.h>
+#include <sys/syscallsubr.h>
+#include <sys/cpuset.h>
+#include <sys/sx.h>
+#include <sys/queue.h>
+#include <sys/libkern.h>
+#include <sys/limits.h>
+#include <sys/bus.h>
+#include <sys/interrupt.h>
+
+#include <vm/uma.h>
+#include <vm/vm.h>
+#include <vm/vm_page.h>
+#include <vm/vm_param.h>
+#include <vm/vm_phys.h>
+#include <vm/vm_domain.h>
+
+int
+sys_numa_setaffinity(struct thread *td, struct numa_setaffinity_args *uap)
+{
+	int error;
+	struct vm_domain_policy vp;
+	struct thread *ttd;
+	struct proc *p;
+	struct cpuset *set;
+
+	set = NULL;
+	p = NULL;
+
+	/*
+	 * Copy in just the policy information into the policy
+	 * struct.  Userland only supplies vm_domain_policy_entry.
+	 */
+	error = copyin(uap->policy, &vp.p, sizeof(vp.p));
+	if (error)
+		goto out;
+
+	/*
+	 * Ensure the seq number is zero - otherwise seq.h
+	 * may get very confused.
+	 */
+	vp.seq = 0;
+
+	/*
+	 * Validate policy.
+	 */
+	if (vm_domain_policy_validate(&vp) != 0) {
+		error = EINVAL;
+		goto out;
+	}
+
+	/*
+	 * Go find the desired proc/tid for this operation.
+	 */
+	error = cpuset_which(uap->which, uap->id, &p,
+	    &ttd, &set);
+	if (error)
+		goto out;
+
+	/* Only handle CPU_WHICH_TID and CPU_WHICH_PID */
+	/*
+	 * XXX if cpuset_which is called with WHICH_CPUSET and NULL cpuset,
+	 * it'll return ESRCH.  We should just return EINVAL.
+	 */
+	switch (uap->which) {
+	case CPU_WHICH_TID:
+		vm_domain_policy_copy(&ttd->td_vm_dom_policy, &vp);
+		break;
+	case CPU_WHICH_PID:
+		vm_domain_policy_copy(&p->p_vm_dom_policy, &vp);
+		break;
+	default:
+		error = EINVAL;
+		break;
+	}
+
+	PROC_UNLOCK(p);
+out:
+	if (set)
+		cpuset_rel(set);
+	return (error);
+}
+
+int
+sys_numa_getaffinity(struct thread *td, struct numa_getaffinity_args *uap)
+{
+	int error;
+	struct vm_domain_policy vp;
+	struct thread *ttd;
+	struct proc *p;
+	struct cpuset *set;
+
+	set = NULL;
+	p = NULL;
+
+	error = cpuset_which(uap->which, uap->id, &p,
+	    &ttd, &set);
+	if (error)
+		goto out;
+
+	/* Only handle CPU_WHICH_TID and CPU_WHICH_PID */
+	/*
+	 * XXX if cpuset_which is called with WHICH_CPUSET and NULL cpuset,
+	 * it'll return ESRCH.  We should just return EINVAL.
+	 */
+	switch (uap->which) {
+	case CPU_WHICH_TID:
+		vm_domain_policy_localcopy(&vp, &ttd->td_vm_dom_policy);
+		break;
+	case CPU_WHICH_PID:
+		vm_domain_policy_localcopy(&vp, &p->p_vm_dom_policy);
+		break;
+	default:
+		error = EINVAL;
+		break;
+	}
+	if (p)
+		PROC_UNLOCK(p);
+	/*
+	 * Copy out only the vm_domain_policy_entry part.
+	 */
+	if (error == 0)
+		error = copyout(&vp.p, uap->policy, sizeof(vp.p));
+out:
+	if (set)
+		cpuset_rel(set);
+	return (error);
+}
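
The data flow in sys_numa_setaffinity() above can be modelled in
userland roughly as follows: only the vm_domain_policy_entry part
comes from the caller, the sequence number is forced to zero, and
only the thread and process targets are accepted. This is a
simplified sketch, not the kernel code: memcpy() stands in for
copyin(), the validation and cpuset_which() lookup steps are left
out, and every model_* name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Local copy of the policy types from sys/sys/_vm_domain.h. */
typedef enum {
	VM_POLICY_NONE,
	VM_POLICY_ROUND_ROBIN,
	VM_POLICY_FIXED_DOMAIN,
	VM_POLICY_FIXED_DOMAIN_ROUND_ROBIN,
	VM_POLICY_FIRST_TOUCH,
	VM_POLICY_FIRST_TOUCH_ROUND_ROBIN,
	VM_POLICY_MAX
} vm_domain_policy_type_t;

struct vm_domain_policy_entry {
	vm_domain_policy_type_t policy;
	int domain;
};

/* Stand-in for the kernel wrapper; seq is a plain integer here. */
struct model_policy {
	unsigned int seq;
	struct vm_domain_policy_entry p;
};

/* Stand-ins for CPU_WHICH_TID, CPU_WHICH_PID and any other value. */
enum model_which { MODEL_WHICH_TID, MODEL_WHICH_PID, MODEL_WHICH_OTHER };

/* One thread and its owning process, each with its own policy slot. */
struct model_target {
	struct model_policy thread_pol;
	struct model_policy proc_pol;
};

static int
model_setaffinity(struct model_target *t, enum model_which which,
    const struct vm_domain_policy_entry *user_entry)
{
	struct model_policy vp;

	assert(t != NULL && user_entry != NULL);

	/* Only the entry part is supplied by the caller. */
	memcpy(&vp.p, user_entry, sizeof(vp.p));
	/* The sequence number always starts from zero. */
	vp.seq = 0;

	switch (which) {
	case MODEL_WHICH_TID:
		t->thread_pol = vp;
		return (0);
	case MODEL_WHICH_PID:
		t->proc_pol = vp;
		return (0);
	default:
		/* Anything other than a thread or process is rejected. */
		return (EINVAL);
	}
}
```

Setting a thread policy in this model leaves the process slot
untouched, which matches the separate td_vm_dom_policy and
p_vm_dom_policy fields used by the syscall.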

Modified: head/sys/kern/kern_thr.c
==============================================================================
--- head/sys/kern/kern_thr.c	Sat Jul 11 13:07:50 2015	(r285386)
+++ head/sys/kern/kern_thr.c	Sat Jul 11 15:21:37 2015	(r285387)
@@ -54,6 +54,8 @@ __FBSDID("$FreeBSD$");
 #include <sys/umtx.h>
 #include <sys/limits.h>
 
+#include <vm/vm_domain.h>
+
 #include <machine/frame.h>
 
 #include <security/audit/audit.h>
@@ -254,6 +256,13 @@ create_thread(struct thread *td, mcontex
 	thread_unlock(td);
 	if (P_SHOULDSTOP(p))
 		newtd->td_flags |= TDF_ASTPENDING | TDF_NEEDSUSPCHK;
+
+	/*
+	 * Copy the existing thread VM policy into the new thread.
+	 */
+	vm_domain_policy_localcopy(&newtd->td_vm_dom_policy,
+	    &td->td_vm_dom_policy);
+
 	PROC_UNLOCK(p);
 
 	tidhash_add(newtd);

Modified: head/sys/kern/kern_thread.c
==============================================================================
--- head/sys/kern/kern_thread.c	Sat Jul 11 13:07:50 2015	(r285386)
+++ head/sys/kern/kern_thread.c	Sat Jul 11 15:21:37 2015	(r285387)
@@ -60,6 +60,7 @@ __FBSDID("$FreeBSD$");
 #include <vm/vm.h>
 #include <vm/vm_extern.h>
 #include <vm/uma.h>
+#include <vm/vm_domain.h>
 #include <sys/eventhandler.h>
 
 SDT_PROVIDER_DECLARE(proc);
@@ -351,6 +352,7 @@ thread_alloc(int pages)
 		return (NULL);
 	}
 	cpu_thread_alloc(td);
+	vm_domain_policy_init(&td->td_vm_dom_policy);
 	return (td);
 }
 
@@ -380,6 +382,7 @@ thread_free(struct thread *td)
 	cpu_thread_free(td);
 	if (td->td_kstack != 0)
 		vm_thread_dispose(td);
+	vm_domain_policy_cleanup(&td->td_vm_dom_policy);
 	uma_zfree(thread_zone, td);
 }
 

Added: head/sys/sys/_vm_domain.h
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/sys/sys/_vm_domain.h	Sat Jul 11 15:21:37 2015	(r285387)
@@ -0,0 +1,61 @@
+/*-
+ * Copyright (c) 2015 Adrian Chadd <adrian@FreeBSD.org>.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer,
+ *    without modification.
+ * 2. Redistributions in binary form must reproduce at minimum a disclaimer
+ *    similar to the "NO WARRANTY" disclaimer below ("Disclaimer") and any
+ *    redistribution must be conditioned upon including a substantially
+ *    similar Disclaimer requirement for further binary redistribution.
+ *
+ * NO WARRANTY
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTIBILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY,
+ * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+ * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
+ * THE POSSIBILITY OF SUCH DAMAGES.
+ *
+ * $FreeBSD$
+ */
+#ifndef	__SYS_VM_DOMAIN_H__
+#define	__SYS_VM_DOMAIN_H__
+
+#include <sys/seq.h>
+
+typedef enum {
+	VM_POLICY_NONE,
+	VM_POLICY_ROUND_ROBIN,
+	VM_POLICY_FIXED_DOMAIN,
+	VM_POLICY_FIXED_DOMAIN_ROUND_ROBIN,
+	VM_POLICY_FIRST_TOUCH,
+	VM_POLICY_FIRST_TOUCH_ROUND_ROBIN,
+	VM_POLICY_MAX
+} vm_domain_policy_type_t;
+
+struct vm_domain_policy_entry {
+	vm_domain_policy_type_t policy;
+	int domain;
+};
+
+struct vm_domain_policy {
+	seq_t seq;
+	struct vm_domain_policy_entry p;
+};
+
+#define VM_DOMAIN_POLICY_STATIC_INITIALISER(vt, vd) \
+	{ .seq = 0, \
+	  .p.policy = vt, \
+	  .p.domain = vd }
+
+#endif	/* __SYS_VM_DOMAIN_H__ */
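
As an illustration of the header above, this is how the
VM_DOMAIN_POLICY_STATIC_INITIALISER macro might be used to set up a
policy at compile time. It is a sketch under stated assumptions:
seq_t is replaced by a local typedef because sys/seq.h is not
included here, and example_default and example_default_entry() are
hypothetical names, not identifiers from the commit.

```c
#include <assert.h>

/* Assumption: local stand-in for the seq_t type from sys/seq.h. */
typedef unsigned int seq_t;

/* Local copies of the definitions from sys/sys/_vm_domain.h. */
typedef enum {
	VM_POLICY_NONE,
	VM_POLICY_ROUND_ROBIN,
	VM_POLICY_FIXED_DOMAIN,
	VM_POLICY_FIXED_DOMAIN_ROUND_ROBIN,
	VM_POLICY_FIRST_TOUCH,
	VM_POLICY_FIRST_TOUCH_ROUND_ROBIN,
	VM_POLICY_MAX
} vm_domain_policy_type_t;

struct vm_domain_policy_entry {
	vm_domain_policy_type_t policy;
	int domain;
};

struct vm_domain_policy {
	seq_t seq;
	struct vm_domain_policy_entry p;
};

#define VM_DOMAIN_POLICY_STATIC_INITIALISER(vt, vd) \
	{ .seq = 0, \
	  .p.policy = vt, \
	  .p.domain = vd }

/* Hypothetical example: a first-touch-with-round-robin-fallback policy. */
static struct vm_domain_policy example_default =
    VM_DOMAIN_POLICY_STATIC_INITIALISER(VM_POLICY_FIRST_TOUCH_ROUND_ROBIN, -1);

/* Hypothetical accessor: hand out only the userland-visible entry part. */
static const struct vm_domain_policy_entry *
example_default_entry(void)
{
	/* The initialiser always starts the sequence number at zero. */
	assert(example_default.seq == 0);
	return (&example_default.p);
}
```

The macro relies on C99 designated initialisers, so the nested
".p.policy" and ".p.domain" designators fill in the embedded entry
without naming the other fields.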

Added: head/sys/sys/numa.h
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/sys/sys/numa.h	Sat Jul 11 15:21:37 2015	(r285387)
@@ -0,0 +1,41 @@
+/*
+ * Copyright (c) 2015 Adrian Chadd <adrian@FreeBSD.org>.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 4. Neither the name of the University nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+#ifndef	__SYS_NUMA_H__
+#define	__SYS_NUMA_H__
+
+#include <sys/_vm_domain.h>
+
+extern	int numa_setaffinity(cpuwhich_t which, id_t id,
+	    struct vm_domain_policy_entry *vd);
+extern	int numa_getaffinity(cpuwhich_t which, id_t id,
+	    struct vm_domain_policy_entry *vd);
+
+#endif	/* __SYS_NUMA_H__ */

Modified: head/sys/sys/proc.h
==============================================================================
--- head/sys/sys/proc.h	Sat Jul 11 13:07:50 2015	(r285386)
+++ head/sys/sys/proc.h	Sat Jul 11 15:21:37 2015	(r285387)
@@ -63,6 +63,7 @@
 #endif
 #include <sys/ucontext.h>
 #include <sys/ucred.h>
+#include <sys/_vm_domain.h>
 #include <machine/proc.h>		/* Machine-dependent proc substruct. */
 
 /*
@@ -217,6 +218,7 @@ struct thread {
 	struct turnstile *td_turnstile;	/* (k) Associated turnstile. */
 	struct rl_q_entry *td_rlqe;	/* (k) Associated range lock entry. */
 	struct umtx_q   *td_umtxq;	/* (c?) Link for when we're blocked. */
+	struct vm_domain_policy td_vm_dom_policy;	/* (c) current numa domain policy */
 	lwpid_t		td_tid;		/* (b) Thread ID. */
 	sigqueue_t	td_sigqueue;	/* (c) Sigs arrived, not delivered. */
 #define	td_siglist	td_sigqueue.sq_signals
@@ -606,6 +608,7 @@ struct proc {
 	uint64_t	p_prev_runtime;	/* (c) Resource usage accounting. */
 	struct racct	*p_racct;	/* (b) Resource accounting. */
 	u_char		p_throttled;	/* (c) Flag for racct pcpu throttling */
+	struct vm_domain_policy p_vm_dom_policy;	/* (c) process default VM domain, or -1 */
 	/*
 	 * An orphan is the child that has been re-parented to the
 	 * debugger as a result of attaching to it.  Need to keep

Added: head/sys/vm/vm_domain.c
==============================================================================
--- /dev/null	00:00:00 1970	(empty, because file is newly added)
+++ head/sys/vm/vm_domain.c	Sat Jul 11 15:21:37 2015	(r285387)
@@ -0,0 +1,374 @@
+/*-
+ * Copyright (c) 2015 Adrian Chadd <adrian@FreeBSD.org>.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer,
+ *    without modification.
+ * 2. Redistributions in binary form must reproduce at minimum a disclaimer
+ *    similar to the "NO WARRANTY" disclaimer below ("Disclaimer") and any
+ *    redistribution must be conditioned upon including a substantially
+ *    similar Disclaimer requirement for further binary redistribution.
+ *
+ * NO WARRANTY
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY,
+ * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+ * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
+ * THE POSSIBILITY OF SUCH DAMAGES.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include "opt_vm.h"
+#include "opt_ddb.h"
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/lock.h>
+#include <sys/kernel.h>
+#include <sys/malloc.h>
+#include <sys/mutex.h>
+#if MAXMEMDOM > 1
+#include <sys/proc.h>
+#endif
+#include <sys/queue.h>
+#include <sys/rwlock.h>
+#include <sys/sbuf.h>
+#include <sys/sysctl.h>
+#include <sys/tree.h>
+#include <sys/vmmeter.h>
+#include <sys/seq.h>
+
+#include <ddb/ddb.h>
+
+#include <vm/vm.h>
+#include <vm/vm_param.h>
+#include <vm/vm_kern.h>
+#include <vm/vm_object.h>
+#include <vm/vm_page.h>
+#include <vm/vm_phys.h>
+
+#include <vm/vm_domain.h>
+
+static __inline int
+vm_domain_rr_selectdomain(void)
+{
+#if MAXMEMDOM > 1
+	struct thread *td;
+
+	td = curthread;
+
+	td->td_dom_rr_idx++;
+	td->td_dom_rr_idx %= vm_ndomains;
+	return (td->td_dom_rr_idx);
+#else
+	return (0);
+#endif
+}
+
+/*
+ * This implements a very simple set of VM domain memory allocation
+ * policies and iterators.
+ */
+
+/*
+ * A VM domain policy describes the desired placement of memory
+ * allocations across VM domains.  Iterators implement searching
+ * through VM domains in a specific order.
+ */
+
+/*
+ * When setting a policy, the caller must establish their own
+ * exclusive write protection for the contents of the domain
+ * policy.
+ */
+int
+vm_domain_policy_init(struct vm_domain_policy *vp)
+{
+
+	bzero(vp, sizeof(*vp));
+	vp->p.policy = VM_POLICY_NONE;
+	vp->p.domain = -1;
+	return (0);
+}
+
+int
+vm_domain_policy_set(struct vm_domain_policy *vp,
+    vm_domain_policy_type_t vt, int domain)
+{
+
+	seq_write_begin(&vp->seq);
+	vp->p.policy = vt;
+	vp->p.domain = domain;
+	seq_write_end(&vp->seq);
+	return (0);
+}
+
+/*
+ * Take a local copy of a policy.
+ *
+ * The destination policy isn't write-barriered; this is used
+ * for doing local copies into something that isn't shared.

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
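
The visible part of vm_domain.c shows two mechanisms: a per-thread round-robin index that wraps at the number of domains, and policy updates bracketed by seq_write_begin()/seq_write_end(). The following is a rough, single-threaded userland model of both. `struct example_thread`, `example_seq_t`, `struct example_policy` and all `example_*` functions are hypothetical. The model omits the memory barriers a real sequence counter needs, and the reader shown is a generic retry pattern, not the truncated copy function from the commit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct thread; only the field used here. */
struct example_thread {
	int td_dom_rr_idx;
};

/* Advance the per-thread index and wrap it at the number of domains. */
static int
example_rr_selectdomain(struct example_thread *td, int ndomains)
{
	assert(td != NULL && ndomains > 0);
	td->td_dom_rr_idx++;
	td->td_dom_rr_idx %= ndomains;
	return (td->td_dom_rr_idx);
}

/* Model of a sequence counter: odd while a write is in progress. */
typedef unsigned int example_seq_t;

struct example_policy {
	example_seq_t seq;
	int policy;
	int domain;
};

/* Writer: bump the counter before and after touching the fields. */
static void
example_policy_set(struct example_policy *vp, int policy, int domain)
{
	vp->seq++;		/* write begins, counter is odd */
	vp->policy = policy;
	vp->domain = domain;
	vp->seq++;		/* write ends, counter is even again */
}

/*
 * Reader: returns 1 if the snapshot is consistent, 0 if a write was
 * in progress or completed meanwhile and the caller should retry.
 */
static int
example_policy_snapshot(const struct example_policy *vp, int *policy,
    int *domain)
{
	example_seq_t start = vp->seq;

	if (start & 1)
		return (0);
	*policy = vp->policy;
	*domain = vp->domain;
	return (vp->seq == start);
}
```

With four domains and a starting index of 0, successive calls to `example_rr_selectdomain()` yield 1, 2, 3, 0; with a single domain the result is always 0, matching the non-NUMA branch in the diff.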


