Date:      Fri, 22 Feb 2008 12:34:13 -1000 (HST)
From:      Jeff Roberson <jroberson@chesapeake.net>
To:        Robert Watson <rwatson@FreeBSD.org>
Cc:        Daniel Eischen <deischen@FreeBSD.org>, arch@FreeBSD.org, David Xu <davidxu@FreeBSD.org>, Andrew Gallatin <gallatin@cs.duke.edu>
Subject:   Re: getaffinity/setaffinity and cpu sets.
Message-ID:  <20080222121253.N920@desktop>
In-Reply-To: <20080221092011.J52922@fledge.watson.org>
References:  <20071219211025.T899@desktop> <18311.49715.457070.397815@grasshopper.cs.duke.edu> <20080112182948.F36731@fledge.watson.org> <20080112170831.A957@desktop> <Pine.GSO.4.64.0801122240510.15683@sea.ntplx.net> <20080112194521.I957@desktop> <20080219234101.D920@desktop> <20080220101348.D44565@fledge.watson.org> <20080220005030.Y920@desktop> <20080220105333.G44565@fledge.watson.org> <47BCEFDB.5040207@freebsd.org> <20080220175532.Q920@desktop> <20080220213253.A920@desktop> <20080221092011.J52922@fledge.watson.org>

On Thu, 21 Feb 2008, Robert Watson wrote:

> On Wed, 20 Feb 2008, Jeff Roberson wrote:
>
>> I also have a 'cpuset' command which can run a new program with a given cpu 
>> set, view and modify sets of arbitrary pids.  This is all working and I can 
>> supply patches if anyone is interested.  I have to implement 4BSD support 
>> before I can commit.
>> 
>> I have a proposal for solaris style processor sets which I think is simple 
>> and sufficient for most cases.  It involves the following new syscalls:
>> 
>> int cpuset(void);
>> int setcpuset(pid_t pid, int setid);
>> int getcpuset(pid_t pid);
>> 
>> The notion would be that you can create a new numbered cpuset with 
>> cpuset(). You can modify or inspect its affinity with get/setaffinity above 
>> and the CPU_WHICH_SET argument.  The cpuset exists as long as there are 
>> members of the set.  Sort of like a process group or session.  The 
>> {get,set}cpuset calls can inspect or modify the state.
>> 
>> This set would not be modifiable by user processes or by processes in a 
>> jail. It would create the restriction that differs between 'avail' and 
>> 'sys' above. Processes would be able to directly bind to any processor 
>> within the set. Changing the set would apply to all processes in the set. 
>> The cpuset would be per-process while the mask is per-thread.  Set 
>> membership is inherited on fork().
>> 
>> In solaris sets can be named and have a more complete management api.  I'm 
>> not really interested in implementing all of that but I believe what I have 
>> outlined here would be a subset of this and no code/syscalls would be wasted.
>> 
>> Comments?  Objections?  I'm fairly pleased with this arrangement now.
>
> Just to put a few notes from our conversation on IRC in e-mail:
>
> - I think I'd prefer int cpuset(cpuset_t *set), int getcpuset(pid_t, cpuset_t
>  *) so that we don't mix up ID's and return values.  More recent interfaces
>  tend to do this, I believe, and it means that the prototype, even if not the
>  ABI, remains the same if the set identifier changes in the future.

Ok, this is a good suggestion and I did this.  This is actually my 
preferred method as well but most syscalls don't follow this pattern and I 
was trying to make it look syscallish.
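
As a rough illustration of the out-parameter style, here is a userland 
sketch, not the syscall itself.  The names sk_cpuset_t and sk_cpuset() are 
made up, and the id counter merely stands in for whatever the kernel would 
allocate.  The point is only that the return value reports success or 
failure while the identifier travels through the pointer.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the identifier type the thread calls cpuset_t. */
typedef struct { int id; } sk_cpuset_t;

static int sk_next_id = 1;    /* stand-in for kernel-side id allocation */

/*
 * Out-parameter style: the return value carries only success (0) or
 * failure (-1, errno set); the new set's identifier goes through *set.
 */
int
sk_cpuset(sk_cpuset_t *set)
{
    if (set == NULL) {
        errno = EFAULT;
        return (-1);
    }
    assert(sk_next_id < INT_MAX);    /* the sketch never recycles ids */
    set->id = sk_next_id++;
    return (0);
}
```

Since callers only ever handle sk_cpuset_t, the identifier could later 
become something other than an int without the prototype changing.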

>
> - You don't mention what happens if a process's cpu set changes to preclude a
>  CPU the process has a thread with affinity for.  Online, you suggested
>  SIGKILL, and I thought maybe a new SIGCPUGONE with a default SIGKILL action
>  might be a friendlier model.  We should see what Solaris and others do here
>  though.  I like the idea that the affinity is a guarantee in userspace
>  because it means that you can rely on it; I'm OK with the idea that your
>  thread always runs on the CPUs you have affinity for unless in the
>  SIGCPUGONE handler :-).

I could also reject changes to the cpuset if they leave a thread with 
nothing to run on.  It might be confusing for the administrator and hard 
to tell them which thread caused the problem.  However, it might be nicer 
than killing a thread as well.

Another option would be to expel the offending thread from the set that 
is in violation and reparent it to the real system root along with a 
syslog message or similar.  If the administrator addressed the problem 
with the set he could then reassign the grouping.

This is what I would most like comments about.  Should we have a force 
mode?  Which of these behaviors sound best to you?
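
To make the reject and expel options concrete, here is a minimal userland 
sketch, not kernel code.  It assumes a 64-bit mask with one bit per CPU, 
and sk_thread, strand_policy and sk_change_set() are names invented for the 
example.  A thread counts as stranded when its mask shares no CPU with the 
new set mask.  Under the reject policy the change fails with nothing 
modified; under the expel policy the thread is only flagged, standing in 
for reparenting it to the system root.  The signal-based option is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sk_mask_t;    /* one bit per CPU; assumes <= 64 CPUs */

enum strand_policy { STRAND_REJECT, STRAND_EXPEL };

struct sk_thread {
    sk_mask_t affinity;    /* per-thread mask */
    bool      expelled;    /* flagged for reparenting to the system root */
};

/* A thread is stranded when no CPU in its mask remains in the set. */
static bool
sk_stranded(sk_mask_t set_mask, const struct sk_thread *td)
{
    return (!td->expelled && (set_mask & td->affinity) == 0);
}

/* Returns 0 if the set mask was changed, -1 if the change was rejected. */
int
sk_change_set(sk_mask_t *set_mask, sk_mask_t new_mask,
    struct sk_thread *threads, size_t n, enum strand_policy policy)
{
    size_t i;

    assert(set_mask != NULL && new_mask != 0);

    /* Pass 1: under STRAND_REJECT, refuse before touching anything. */
    if (policy == STRAND_REJECT)
        for (i = 0; i < n; i++)
            if (sk_stranded(new_mask, &threads[i]))
                return (-1);

    /* Pass 2: commit, expelling any thread the new mask strands. */
    for (i = 0; i < n; i++)
        if (sk_stranded(new_mask, &threads[i]))
            threads[i].expelled = true;
    *set_mask = new_mask;
    return (0);
}
```

A force mode would amount to the caller choosing STRAND_EXPEL (or a kill 
variant) instead of STRAND_REJECT.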

>
> - It would be nice to be able to use CPU sets in jail as well, suggesting a
>  hierarchical model with some sort of tagging so you know what CPU sets were
>  created in a jail such that you know whether they can be changed in a jail.
>  While I recognize this makes things a lot more tricky, I think we should
>  basically be planning more carefully with respect to virtualization when we
>  add new interfaces, since it's a widely used feature, and the current set of
>  "stragglers" unsupported in Jail is growing rather than shrinking.

I have implemented a hierarchical model.  Each thread has a pointer to the 
cpuset that it's in.  If it makes a local modification via setaffinity() 
it gets an anonymous cpuset that is a child of the set assigned to the 
process.  This anonymous set will also be inherited across fork/thread 
creation.

In the present model some nodes are marked as roots.  To query the 
'system' cpus available we walk up from the current node until we find a 
root; that root's mask is the 'system' set.  A thread may not break out of 
its system set.  A process may join the root set but it may not modify a 
root that is one of its ancestors.  Jails would create a new root.  A 
process outside of the jail can modify the set of processors in the jail 
but a process within the jail/root may not.
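
The walk up to the root can be sketched roughly as below.  This is an 
illustration, not the patch: struct sk_set and both function names are 
invented, the mask is assumed to fit in 64 bits, and a jail's root is 
assumed to be parented to the system root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sk_set {
    struct sk_set *parent;     /* NULL only for the top-level root */
    bool           is_root;    /* marks a root node (system or jail) */
    uint64_t       mask;       /* one bit per CPU */
};

/* Walk up from a set until a node marked as root is found. */
const struct sk_set *
sk_find_root(const struct sk_set *s)
{
    assert(s != NULL);
    while (!s->is_root) {
        assert(s->parent != NULL);    /* every chain must end in a root */
        s = s->parent;
    }
    return (s);
}

/*
 * True if 'target' is a root at or above 'from', i.e. a root that a
 * process living in 'from' must not be allowed to modify.
 */
bool
sk_root_above(const struct sk_set *from, const struct sk_set *target)
{
    const struct sk_set *s;

    for (s = from; s != NULL; s = s->parent)
        if (s == target && s->is_root)
            return (true);
    return (false);
}
```

With this shape the 'system' mask for a thread is sk_find_root(set)->mask, 
and a process outside the jail passes the sk_root_above() check for the 
jail's root because that root is not among its ancestors.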

The next level down from the root is the assigned set.  The root may be an 
assigned set or this may be a subset of the root.  Processes may create 
sets which are parented back to their root and may include any processors 
within their root.  The mask of the assigned set is returned as 
'available' processors.

This gives a 1 to 3 level hierarchy. The root, an assigned set, and an 
anonymous set.  Any of these but the root may be omitted.  There is no 
current way for userland to create subsets of assigned sets to permit 
further nesting.  I'm not sure I see value in it right now and it opens 
the possibility of unbounded tree depth.

Anonymous sets are immutable as they are shared and changes only apply to 
the thread/pid in the WHICH argument and not others which have inherited 
from it.  Anonymous sets have no id and may not be specifically 
manipulated via a setid.  You must refer to the process/thread.  From the 
administration point of view they don't exist.
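
Roughly, the anonymous sets behave like copy-on-write.  The sketch below 
is a userland illustration under assumptions of my choosing: a 64-bit mask, 
a plain reference count, named sets owned by the caller, and made-up names 
(sk_set, sk_inherit, sk_setaffinity).  A local change never edits the 
shared anonymous set; the thread gets a fresh one parented to the assigned 
set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct sk_set {
    struct sk_set *parent;     /* the assigned set, for an anonymous set */
    uint64_t       mask;       /* one bit per CPU */
    int            refs;       /* threads and child sets holding this set */
    bool           anonymous;
};

struct sk_thread { struct sk_set *set; };

/* Drop one reference; free an anonymous set once nobody uses it. */
static void
sk_rele(struct sk_set *s)
{
    assert(s->refs > 0);
    if (--s->refs == 0 && s->anonymous) {
        struct sk_set *parent = s->parent;

        free(s);
        sk_rele(parent);
    }
}

/* fork()/thread creation: the child shares the parent's set pointer. */
void
sk_inherit(struct sk_thread *child, const struct sk_thread *parent)
{
    child->set = parent->set;
    child->set->refs++;
}

/* Local change: leave the shared set alone and attach a new anonymous one. */
int
sk_setaffinity(struct sk_thread *td, uint64_t mask)
{
    struct sk_set *base = td->set->anonymous ? td->set->parent : td->set;
    struct sk_set *n;

    /* The new mask must be a non-empty subset of the assigned set. */
    if (mask == 0 || (mask & ~base->mask) != 0)
        return (-1);
    if ((n = malloc(sizeof(*n))) == NULL)
        return (-1);
    n->parent = base;
    n->mask = mask;
    n->refs = 1;
    n->anonymous = true;
    base->refs++;          /* the new anonymous set references its parent */
    sk_rele(td->set);      /* others sharing the old set keep it unchanged */
    td->set = n;
    return (0);
}
```

The anonymous set carries no id here either, so the only handle on it is 
the thread that points at it.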

When a set is modified we walk down the children recursively and apply the 
new mask.  This is done with a global set lock under which all 
modifications and tree operations are performed.  The td_cpuset pointer is 
protected by thread_lock(), and the set it points to may be read without 
a lock.  This allows certain kinds of races but I believe they are all 
safe.
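
The downward walk could look something like the sketch below.  It is an 
illustration only: the child/sibling layout and the names are invented, 
the mask is assumed to fit in 64 bits, and I assume that "applying" the new 
mask means intersecting each child with its parent's new mask.  The caller 
is assumed to hold the global set lock; no locking is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sk_set {
    struct sk_set *first_child;
    struct sk_set *next_sibling;
    uint64_t       mask;          /* one bit per CPU */
};

/*
 * Install 'mask' on 's', then narrow every descendant to the
 * intersection of its old mask and its parent's new mask.
 * Returns how many sets in the subtree ended up with an empty mask,
 * so the caller can decide what to do about stranded members.
 */
int
sk_apply(struct sk_set *s, uint64_t mask)
{
    struct sk_set *c;
    int emptied;

    assert(s != NULL);
    emptied = (mask == 0) ? 1 : 0;
    s->mask = mask;
    for (c = s->first_child; c != NULL; c = c->next_sibling)
        emptied += sk_apply(c, c->mask & mask);
    return (emptied);
}
```

With at most three levels the recursion depth is bounded, which is one 
reason not to allow arbitrary nesting.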

Hopefully I explained that well enough for people to follow.  I realize 
it's a lot of text but it's fairly simple bookkeeping code.  This is all 
implemented and I'm debugging now.

>
> - There's still no way to specify an affinity policy rather than explicit
>  affinity, but if our CPU set model is sufficiently general, that might be a
>  vehicle to do that.  I.e., cpuset_setpolicy() rather than setting a mask.

Yes, I think this is orthogonal and can be addressed separately.  I'm not 
sure how many userland programs are smart enough, or even able, to make 
determinations about their cache behavior, however.  We should open another 
discussion once this one is done.

>
> - In the interests of boring API changes, recent APIs tend to prefix the
>  method on the object name.  Have you thought about cpuset_create(),
>  cpuset_foo(), etc?  That reduces the chances of interfering with application
>  namespaces.  I think, anyway. :-).

Yes, I prefer that as well.  As I mentioned, syscalls have tended to favor 
brevity.  I'm fine with changing that trend.

>
> I need to ponder the proposal a little more, ideally over a hot beverage this 
> morning, and will follow up if I have further thoughts.  Thanks for working 
> on this, BTW -- affinity is well-overdue for FreeBSD.

A little more to ponder now!  Your feedback is much appreciated.

I believe the present hierarchical model satisfies the jail requirements 
of restricting cpus in the jail while still allowing the jail to create 
sets.

The unanswered questions are:

1)  What to do about sets that strand threads; the options are described 
above.
2)  Are people ok with the transient nature of sets?
3)  Does anyone want to help with man pages, administrative tools, etc?  I 
have a prototype tool called 'cpuset' that fully exercises the API but is 
probably ugly.  Will post details soon.

>
> Robert N M Watson
> Computer Laboratory
> University of Cambridge
>


