Date:      Fri, 22 Feb 2008 13:52:54 -1000 (HST)
From:      Jeff Roberson <jroberson@chesapeake.net>
To:        Brooks Davis <brooks@freebsd.org>
Cc:        Daniel Eischen <deischen@freebsd.org>, arch@freebsd.org, Robert Watson <rwatson@freebsd.org>, David Xu <davidxu@freebsd.org>, Andrew Gallatin <gallatin@cs.duke.edu>
Subject:   Re: getaffinity/setaffinity and cpu sets.
Message-ID:  <20080222134923.M920@desktop>
In-Reply-To: <20080222231245.GA28788@lor.one-eyed-alien.net>
References:  <20080112194521.I957@desktop> <20080219234101.D920@desktop> <20080220101348.D44565@fledge.watson.org> <20080220005030.Y920@desktop> <20080220105333.G44565@fledge.watson.org> <47BCEFDB.5040207@freebsd.org> <20080220175532.Q920@desktop> <20080220213253.A920@desktop> <20080221092011.J52922@fledge.watson.org> <20080222121253.N920@desktop> <20080222231245.GA28788@lor.one-eyed-alien.net>

On Fri, 22 Feb 2008, Brooks Davis wrote:

> On Fri, Feb 22, 2008 at 12:34:13PM -1000, Jeff Roberson wrote:
>>
>> On Thu, 21 Feb 2008, Robert Watson wrote:
>>
>>> On Wed, 20 Feb 2008, Jeff Roberson wrote:
>>>
>>>> I also have a 'cpuset' command which can run a new program with a given
>>>> cpu set, view and modify sets of arbitrary pids.  This is all working and
>>>> I can supply patches if anyone is interested.  I have to implement 4BSD
>>>> support before I can commit.
>>>> I have a proposal for solaris style processor sets which I think is
>>>> simple and sufficient for most cases.  It involves the following new
>>>> syscalls:
>>>> int cpuset(void);
>>>> int setcpuset(pid_t pid, int setid);
>>>> int getcpuset(pid_t pid);
>>>> The notion would be that you can create a new numbered cpuset with
>>>> cpuset(). You can modify or inspect its affinity with get/setaffinity
>>>> above and the CPU_WHICH_SET argument.  The cpuset exists as long as there
>>>> are members of the set.  Sort of like a process group or session.  The
>>>> {get,set}cpuset calls can inspect or modify the state.
>>>> This set would not be modifiable by user processes or by processes in a
>>>> jail. It would create the restriction that differs between 'avail' and
>>>> 'sys' above. Processes would be able to directly bind to any processor
>>>> within the set. Changing the set would apply to all processes in the set.
>>>> The cpuset would be per-process while the mask is per-thread.  Set
>>>> membership is inherited on fork().
>>>> In solaris sets can be named and have a more complete management api.
>>>> I'm not really interested in implementing all of that but I believe what
>>>> I have outlined here would be a subset of this and no code/syscalls would
>>>> be wasted.
>>>> Comments?  Objections?  I'm fairly pleased with this arrangement now.
>>>
>>> Just to put a few notes from our conversation on IRC in e-mail:
>>>
>>> - I think I'd prefer int cpuset(cpuset_t *set), int getcpuset(pid_t,
>>>  cpuset_t *) so that we don't mix up ID's and return values.  More recent
>>>  interfaces tend to do this, I believe, and it means that the prototype,
>>>  even if not the ABI, remains the same if the set identifier changes in
>>>  the future.
>>
>> Ok, this is a good suggestion and I did this.  This is actually my
>> preferred method as well but most syscalls don't follow this pattern and I
>> was trying to make it look syscallish.
>>
>>> - You don't mention what happens if a process's cpu set changes to
>>>  preclude a CPU the process has a thread with affinity for.  Online, you
>>>  suggested SIGKILL, and I thought maybe a new SIGCPUGONE with a default
>>>  SIGKILL action might be a friendlier model.  We should see what Solaris
>>>  and others do here though.  I like the idea that the affinity is a
>>>  guarantee in userspace because it means that you can rely on it; I'm OK
>>>  with the idea that your thread always runs on the CPUs you have affinity
>>>  for unless in the SIGCPUGONE handler :-).
>>
>> I could also reject changes to the cpuset if they leave a thread with
>> nothing to run on.  It might be confusing for the administrator and hard to
>> tell them which thread caused the problem.  However, it might be nicer than
>> killing a thread as well.
>>
>> Another option would be to expel the offending thread from the set that is
>> in violation and reparent it to the real system root along with a syslog
>> message or similar.  If the administrator addressed the problem with the
>> set he could then reassign the grouping.
>>
>> This is what I would most like comments about.  Should we have a force
>> mode?  Which of these behaviors sound best to you?
>
> Refusing by default and reparenting when forced sounds right to me.
> There might also be some value in adding the ability to signal all
> processes/threads bound to a cpu set so you can kill them if that's what you
> want to do.
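
A minimal user-space sketch of the refuse-by-default, reparent-when-forced
behavior discussed above.  This is not the kernel implementation: the struct
names, set_modify_model(), and the choice to narrow the affinity of surviving
threads to the new mask are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the real kernel structures. */
struct set_model {
	uint64_t mask;			/* cpus in this set, one bit per cpu */
};

struct thread_model {
	uint64_t affinity;		/* per-thread cpu mask */
	struct set_model *set;		/* set the thread belongs to */
};

enum { SET_REFUSED = -1 };

/*
 * Change 'set' to 'newmask'.  A thread is stranded when its affinity shares
 * no cpu with 'newmask'.  Without 'force' the change is refused and nothing
 * is modified.  With 'force' each stranded thread is reparented to 'sysroot'
 * and given the root's mask (assumption); other members are narrowed to the
 * new mask.  Returns the number of reparented threads, or SET_REFUSED.
 */
static int
set_modify_model(struct set_model *set, struct set_model *sysroot,
    uint64_t newmask, struct thread_model *td, size_t ntd, int force)
{
	size_t i;
	int moved = 0;

	assert(set != NULL && sysroot != NULL);

	/* First pass: look for stranded threads before touching anything. */
	if (!force) {
		for (i = 0; i < ntd; i++)
			if (td[i].set == set && (td[i].affinity & newmask) == 0)
				return (SET_REFUSED);
	}

	/* Second pass: commit the new mask and fix up every member. */
	set->mask = newmask;
	for (i = 0; i < ntd; i++) {
		if (td[i].set != set)
			continue;
		if ((td[i].affinity & newmask) == 0) {
			td[i].set = sysroot;
			td[i].affinity = sysroot->mask;
			moved++;
		} else {
			td[i].affinity &= newmask;
		}
	}
	return (moved);
}
```

Doing the check as a separate pass keeps the refused case free of side
effects, so an administrator can retry with force without cleaning up first.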

This is where I'm leaning as well: refuse by default, reparent when forced.
A cpuset_signal() would have to walk all processes to determine which ones
belong to that set, however.  There are no back pointers from sets to threads.
Still, that's not too terrible given that it would be very infrequent.
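
To illustrate the cost, here is a rough user-space model of that walk.  The
struct layouts and cpuset_signal_model() are hypothetical, and treating
threads in child sets of the target as members is an assumption of the
sketch rather than something decided in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real kernel structures. */
struct cpuset_model {
	int id;				/* set id, illustrative only */
	struct cpuset_model *parent;	/* NULL for a root set */
};

struct proc_model {
	int pid;
	struct cpuset_model *set;	/* forward pointer: process -> set */
	int last_signal;		/* records the signal "delivered" */
};

/* Return nonzero if 'set' is 'target' or a descendant of it. */
static int
set_is_within(const struct cpuset_model *set,
    const struct cpuset_model *target)
{
	for (; set != NULL; set = set->parent)
		if (set == target)
			return (1);
	return (0);
}

/*
 * With only forward pointers, finding the members of a set means a linear
 * scan over every process.  Returns the number of processes signalled.
 */
static int
cpuset_signal_model(struct proc_model *procs, size_t nprocs,
    const struct cpuset_model *target, int sig)
{
	size_t i;
	int hits = 0;

	assert(target != NULL);
	for (i = 0; i < nprocs; i++) {
		if (set_is_within(procs[i].set, target)) {
			procs[i].last_signal = sig;
			hits++;
		}
	}
	return (hits);
}
```

The scan is linear in the number of processes times the tree depth, and the
depth is at most three in the model described in this thread.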

>
>>> - It would be nice to be able to use CPU sets in jail as well, suggesting
>>>  a hierarchal model with some sort of tagging so you know what CPU sets
>>>  were created in a jail such that you know whether they can be changed in
>>>  a jail.  While I recognize this makes things a lot more tricky, I think
>>>  we should basically be planning more carefully with respect to
>>>  virtualization when we add new interfaces, since it's a widely used
>>>  feature, and the current set of "stragglers" unsupported in Jail is
>>>  growing rather than shrinking.
>>
>> I have implemented a hierarchical model.  Each thread has a pointer to the
>> cpuset that it's in.  If it makes a local modification via setaffinity() it
>> gets an anonymous cpuset that is a child of the set assigned to the
>> process.  This anonymous set will also be inherited across fork/thread
>> creation.
>>
>> In this model presently there are nodes marked as root.  To query the
>> 'system' cpus available we walk up from the current node until we find a
>> root.  These are the 'system' set.  A thread may not break out of its
>> system set.  A process may join the root set but it may not modify a root
>> that is a parent.  Jails would create a new root.  A process outside of the
>> jail can modify the set of processors in the jail but a process within the
>> jail/root may not.
>>
>> The next level down from the root is the assigned set.  The root may be an
>> assigned set or this may be a subset of the root.  Processes may create
>> sets which are parented back to their root and may include any processors
>> within their root.  The mask of the assigned set is returned as 'available'
>> processors.
>>
>> This gives a 1 to 3 level hierarchy. The root, an assigned set, and an
>> anonymous set.  Any of these but the root may be omitted.  There is no
>> current way for userland to create subsets of assigned sets to permit
>> further nesting.  I'm not sure I see value in it right now and it gives the
>> possibility of unbounded tree depth.
>>
>> Anonymous sets are immutable as they are shared and changes only apply to
>> the thread/pid in the WHICH argument and not others which have inherited
>> from it.  Anonymous sets have no id and may not be specifically manipulated
>> via a setid.  You must refer to the process/thread.  From the
>> administration point of view they don't exist.
>>
>> When a set is modified we walk down the children recursively and apply the
>> new mask.  This is done with a global set lock under which all
>> modifications and tree operations are performed.  The td_cpuset pointer is
>> protected by thread_lock() and the set may be read without a lock. This
>> gives the possibility for certain kinds of races but I believe they are all
>> safe.
>>
>> Hopefully I explained that well enough for people to follow.  I realize
>> it's a lot of text but it's fairly simple bookkeeping code.  This is all
>> implemented and I'm debugging now.
>
> One place I'd like to implement CPU affinity is in the Sun Grid Engine
> execution daemon.  I think anonymous sets would not be sufficient there
> because the model allows new tasks to be started on a particular node at
> any time during a parallel job.  I'd have to do some more digging in the
> code to be entirely certain.  I think the fewer limits we place on the
> hierarchy, the better off we'll be unless there are compelling complexity
> reasons to avoid them.

With the anonymous set you can bind any thread to any cpu that is visible 
to it.  How would this not work?
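
As a rough sketch of what I mean, modeled in user space: struct cset and the
function names are made up for illustration, and "visible" is assumed here to
mean the mask of the assigned set.  A new anonymous set is filled in on every
call because anonymous sets are shared and immutable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one node in the set tree. */
struct cset {
	uint64_t mask;		/* cpus in this set, one bit per cpu */
	struct cset *parent;	/* NULL at the top of the tree */
	int is_root;		/* marks a 'system' set, e.g. a jail root */
	int is_anon;		/* anonymous per-thread set, has no id */
};

/* Walk up from 's' to the nearest root: the 'system' set. */
static const struct cset *
cset_root(const struct cset *s)
{
	assert(s != NULL);
	while (!s->is_root && s->parent != NULL)
		s = s->parent;
	return (s);
}

/* The set assigned to the process: skip the anonymous level if present. */
static struct cset *
cset_assigned(struct cset *s)
{
	if (s->is_anon && s->parent != NULL)
		return (s->parent);
	return (s);
}

/*
 * Model of a per-thread setaffinity().  Fills in the caller-supplied 'anon'
 * as a child of the assigned set and points the thread at it.  Returns -1
 * without changing anything if 'mask' is empty or names a cpu outside the
 * assigned set.
 */
static int
thread_setaffinity_model(struct cset **td_set, struct cset *anon,
    uint64_t mask)
{
	struct cset *parent = cset_assigned(*td_set);

	if (mask == 0 || (mask & ~parent->mask) != 0)
		return (-1);
	anon->mask = mask;
	anon->parent = parent;
	anon->is_root = 0;
	anon->is_anon = 1;
	*td_set = anon;
	return (0);
}
```

Note that a later call is checked against the assigned set, not against the
thread's previous anonymous mask, so a thread can move to any cpu it can see.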

>
>>> - There's still no way to specify an affinity policy rather than explicit
>>>  affinity, but if our CPU set model is sufficiently general, that might be
>>>  a vehicle to do that.  I.e., cpuset_setpolicy() rather than setting a
>>>  mask.
>>
>> Yes, I think this is orthogonal and can be addressed separately.  I'm not
>> sure how many userland programs are smart enough or even capable of making
>> determinations about their cache behavior however.  We should open another
>> discussion once this one is done.
>>
>>>
>>> - In the interests of boring API changes, recent APIs tend to prefix the
>>>  method on the object name.  Have you thought about cpuset_create(),
>>>  cpuset_foo(), etc?  That reduces the chances of interfering with
>>>  application namespaces.  I think, anyway. :-).
>>
>> Yes, I prefer that as well; as I mentioned, syscalls tended to favor
>> brevity.  I'm fine with changing that trend.
>>
>>>
>>> I need to ponder the proposal a little more, ideally over a hot beverage
>>> this morning, and will follow up if I have further thoughts.  Thanks for
>>> working on this, BTW -- affinity is well-overdue for FreeBSD.
>>
>> A little more to ponder now!  Your feedback is much appreciated.
>>
>> I believe the present hierarchical model satisfies the jail requirements of
>> restricting cpus in the jail while still allowing the jail to create sets.
>>
>> The unanswered questions are:
>>
>> 1)  What to do about sets that strand threads, options described above.
>> 2)  Are people ok with the transient nature of sets?
>> 3)  Does anyone want to help with man pages, administrative tools, etc?  I
>> have a prototype tool called 'cpuset' that fully exercises the api but is
>> probably ugly.  Will post details soon.
>
> I could help with some of this as it furthers a funded project at work.

I will provide patches soon.  It would be great to have a developer with a 
user's perspective to look at some of the details and especially the 
administration side of things.  I think someone else has offered to help 
with man pages but I need to double check.

Thanks,
Jeff

>
> -- Brooks
>


