Date:      Sat, 23 Feb 2008 11:21:33 -1000 (HST)
From:      Jeff Roberson <jroberson@chesapeake.net>
To:        Brooks Davis <brooks@freebsd.org>
Cc:        Daniel Eischen <deischen@freebsd.org>, arch@freebsd.org, Robert Watson <rwatson@freebsd.org>, David Xu <davidxu@freebsd.org>, Andrew Gallatin <gallatin@cs.duke.edu>
Subject:   Re: getaffinity/setaffinity and cpu sets.
Message-ID:  <20080223111659.K920@desktop>
In-Reply-To: <20080223194047.GB38485@lor.one-eyed-alien.net>
References:  <20080220101348.D44565@fledge.watson.org> <20080220005030.Y920@desktop> <20080220105333.G44565@fledge.watson.org> <47BCEFDB.5040207@freebsd.org> <20080220175532.Q920@desktop> <20080220213253.A920@desktop> <20080221092011.J52922@fledge.watson.org> <20080222121253.N920@desktop> <20080222231245.GA28788@lor.one-eyed-alien.net> <20080222134923.M920@desktop> <20080223194047.GB38485@lor.one-eyed-alien.net>

On Sat, 23 Feb 2008, Brooks Davis wrote:

> On Fri, Feb 22, 2008 at 01:52:54PM -1000, Jeff Roberson wrote:
>> On Fri, 22 Feb 2008, Brooks Davis wrote:
>>
>>> On Fri, Feb 22, 2008 at 12:34:13PM -1000, Jeff Roberson wrote:
>>>>
>>>> On Thu, 21 Feb 2008, Robert Watson wrote:
>>>>
>>>>> On Wed, 20 Feb 2008, Jeff Roberson wrote:
>
>>>>> - It would be nice to be able to use CPU sets in jail as well,
>>>>>  suggesting a hierarchical model with some sort of tagging so you
>>>>>  know what CPU sets were created in a jail such that you know
>>>>>  whether they can be changed in a jail.  While I recognize this
>>>>>  makes things a lot more tricky, I think we should basically be
>>>>>  planning more carefully with respect to virtualization when we add
>>>>>  new interfaces, since it's a widely used feature, and the current
>>>>>  set of "stragglers" unsupported in Jail is growing rather than
>>>>>  shrinking.
>>>>
>>>> I have implemented a hierarchical model.  Each thread has a pointer to
>>>> the cpuset that it's in.  If it makes a local modification via
>>>> setaffinity() it gets an anonymous cpuset that is a child of the set
>>>> assigned to the process.  This anonymous set will also be inherited
>>>> across fork/thread creation.
>>>>
>>>> In this model presently there are nodes marked as root.  To query the
>>>> 'system' cpus available we walk up from the current node until we find a
>>>> root.  These are the 'system' set.  A thread may not break out of its
>>>> system set.  A process may join the root set but it may not modify a root
>>>> that is a parent.  Jails would create a new root.  A process outside of
>>>> the jail can modify the set of processors in the jail but a process
>>>> within the jail/root may not.
>>>>
>>>> The next level down from the root is the assigned set.  The root may be
>>>> an assigned set or this may be a subset of the root.  Processes may
>>>> create sets which are parented back to their root and may include any
>>>> processors within their root.  The mask of the assigned set is returned
>>>> as 'available' processors.
>>>>
>>>> This gives a 1 to 3 level hierarchy: the root, an assigned set, and an
>>>> anonymous set.  Any of these but the root may be omitted.  There is no
>>>> current way for userland to create subsets of assigned sets to permit
>>>> further nesting.  I'm not sure I see value in it right now and it gives
>>>> the possibility of unbounded tree depth.
>>>>
>>>> Anonymous sets are immutable as they are shared and changes only apply to
>>>> the thread/pid in the WHICH argument and not others which have inherited
>>>> from it.  Anonymous sets have no id and may not be specifically
>>>> manipulated via a setid.  You must refer to the process/thread.  From the
>>>> administration point of view they don't exist.
>>>>
>>>> When a set is modified we walk down the children recursively and apply
>>>> the new mask.  This is done with a global set lock under which all
>>>> modifications and tree operations are performed.  The td_cpuset pointer
>>>> is protected under the thread_lock() and may read the set without a
>>>> lock.  This gives the possibility for certain kinds of races but I
>>>> believe they are all safe.
>>>>
>>>> Hopefully I explained that well enough for people to follow.  I realize
>>>> it's a lot of text but it's fairly simple bookkeeping code.  This is all
>>>> implemented and I'm debugging now.
>>>
>>> One place I'd like to implement CPU affinity is in the Sun Grid Engine
>>> execution daemon.  I think anonymous sets would not be sufficient there
>>> because the model allows new tasks to be started on a particular node at
>>> any time during a parallel job.  I'd have to do some more digging in the
>>> code to be entirely certain.  I think the fewer limits we place on the
>>> hierarchy, the better off we'll be unless there are compelling complexity
>>> reasons to avoid them.
>>
>> With the anonymous set you can bind any thread to any cpu that is visible
>> to it.  How would this not work?
>
> I'm still trying to wrap my head around the anonymous sets.  Is the idea
> that once you are in an anonymous set, you can't expand it, or can you
> expand out as far as the assigned set?  I'd like for parallel jobs to
> be allocated a set of cpus that they can't change, but still be able
> to make their own decisions about thread affinity if they desire (for
> example OpenMPI has some support for this so processes stay put and in
> theory benefit from positive cache effects).  If that's feasible in
> this model, I'm happy with it.  I think we should keep in mind that these
> SGE execution daemons might be sitting inside jails. ;-)

Ah, when I said the anonymous sets were immutable, that only means that 
they are copy-on-write.  Because you can't know who shares a copy via fork 
or thread creation you must make a new set each time you write.
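
A minimal C sketch of that copy-on-write behaviour follows.  It is not
the kernel code: the struct and function names (cpuset_sketch,
thread_sketch, set_new, thread_inherit, thread_setaffinity) are
hypothetical, the mask is limited to 64 cpus, and checking the new mask
against the assigned set is an assumption about how far a thread may
expand.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified types; not the FreeBSD kernel structures. */
struct cpuset_sketch {
	uint64_t mask;			/* one bit per cpu, at most 64 cpus */
	int refs;			/* threads pointing directly at this set */
	int anonymous;			/* 1 for an anonymous per-thread set */
	struct cpuset_sketch *parent;	/* assigned set this one derives from */
};

struct thread_sketch {
	struct cpuset_sketch *set;
};

static struct cpuset_sketch *
set_new(struct cpuset_sketch *parent, uint64_t mask, int anonymous)
{
	struct cpuset_sketch *s = malloc(sizeof(*s));

	if (s == NULL)
		return (NULL);
	s->mask = mask;
	s->refs = 1;
	s->anonymous = anonymous;
	s->parent = parent;
	return (s);
}

/* fork/thread creation: the child shares the parent's set, no copy. */
static void
thread_inherit(struct thread_sketch *child, const struct thread_sketch *parent)
{
	assert(parent->set != NULL);
	child->set = parent->set;
	child->set->refs++;
}

/*
 * Local affinity change.  A shared set is never written in place; the
 * thread gets a fresh anonymous set parented to the assigned set.
 * Returns 0 on success, -1 if the mask is empty, leaves the assigned
 * set (assumed limit), or allocation fails.
 */
static int
thread_setaffinity(struct thread_sketch *td, uint64_t mask)
{
	struct cpuset_sketch *old = td->set;
	struct cpuset_sketch *base = old->anonymous ? old->parent : old;
	struct cpuset_sketch *fresh;

	if (mask == 0 || (mask & ~base->mask) != 0)
		return (-1);
	fresh = set_new(base, mask, 1);
	if (fresh == NULL)
		return (-1);
	/* Drop our reference; free an anonymous set nobody else shares. */
	if (--old->refs == 0 && old->anonymous)
		free(old);
	td->set = fresh;
	return (0);
}
```

With this shape, a thread that inherited an anonymous set and then
changes its own affinity leaves the set still used by its parent
untouched.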

I made the anonymous sets so that the parent would have a list of all
derived child sets, so that modifications to the parent would be
reflected in the children.  This also means that the scheduler only has to
look at one bitmap to determine the available cpus for a thread.
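
The child list and the single-bitmap lookup can be sketched in C as
below.  The names (cpuset_node, node_add_child, node_set_mask,
node_cpu_allowed) are hypothetical, locking is left out, and narrowing
each child by intersection is an assumption; in particular the sketch
does nothing about a child whose mask becomes empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical set node; each parent keeps a list of its derived sets. */
struct cpuset_node {
	uint64_t mask;				/* effective cpus, at most 64 */
	struct cpuset_node *first_child;	/* head of the child list */
	struct cpuset_node *next_sibling;	/* next set with the same parent */
};

/* Link a derived set onto its parent's child list. */
static void
node_add_child(struct cpuset_node *parent, struct cpuset_node *child)
{
	child->next_sibling = parent->first_child;
	parent->first_child = child;
}

/*
 * Change a set's mask and walk down the children recursively so each
 * derived set stays within its parent.  Assumption: a child is narrowed
 * to the intersection with the new mask; an empty result (a stranded
 * thread) is not handled here.
 */
static void
node_set_mask(struct cpuset_node *set, uint64_t mask)
{
	struct cpuset_node *c;

	set->mask = mask;
	for (c = set->first_child; c != NULL; c = c->next_sibling)
		node_set_mask(c, c->mask & mask);
}

/*
 * Scheduler-side check: only the thread's own set is consulted, because
 * parent changes have already been pushed down into it.
 */
static int
node_cpu_allowed(const struct cpuset_node *set, unsigned int cpu)
{
	assert(cpu < 64);
	return ((int)((set->mask >> cpu) & 1u));
}
```

Pushing changes down at modification time keeps the lookup to a single
bit test, at the cost of a tree walk on every change to a parent.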

>
>>>>> - There's still no way to specify an affinity policy rather than
>>>>>  explicit affinity, but if our CPU set model is sufficiently general,
>>>>>  that might be a vehicle to do that.  I.e., cpuset_setpolicy() rather
>>>>>  than setting a mask.
>>>>
>>>> Yes, I think this is orthogonal and can be addressed separately.  I'm not
>>>> sure how many userland programs are smart enough or even capable of
>>>> making determinations about their cache behavior however.  We should
>>>> open another discussion once this one is done.
>>>>
>>>>>
>>>>> - In the interests of boring API changes, recent APIs tend to prefix the
>>>>>  method on the object name.  Have you thought about cpuset_create(),
>>>>>  cpuset_foo(), etc?  That reduces the chances of interfering with
>>>>>  application namespaces.  I think, anyway. :-).
>>>>
>>>> Yes, I prefer that as well.  As I mentioned, syscalls tended to favor
>>>> brevity.  I'm fine with changing that trend.
>>>>
>>>>>
>>>>> I need to ponder the proposal a little more, ideally over a hot beverage
>>>>> this morning, and will follow up if I have further thoughts.  Thanks for
>>>>> working on this, BTW -- affinity is well-overdue for FreeBSD.
>>>>
>>>> A little more to ponder now!  Your feedback is much appreciated.
>>>>
>>>> I believe the present hierarchical model satisfies the jail requirements
>>>> of restricting cpus in the jail while still allowing the jail to create
>>>> sets.
>>>>
>>>> The unanswered questions are:
>>>>
>>>> 1)  What to do about sets that strand threads, options described above.
>>>> 2)  Are people ok with the transient nature of sets?
>>>> 3)  Does anyone want to help with man pages, administrative tools, etc?
>>>>     I have a prototype tool called 'cpuset' that fully exercises the api
>>>>     but is probably ugly.  Will post details soon.
>>>
>>> I could help with some of this as it furthers a funded project at work.
>>
>> I will provide patches soon.  It would be great to have a developer with a
>> user's perspective to look at some of the details and especially the
>> administration side of things.  I think someone else has offered to help
>> with man pages but I need to double check.
>
> Cool.  If you can get some basics out by late Sunday afternoon (CST) I
> should be able to look at it and think about it on the plane Monday.

I can definitely do that.  I'm just debugging now.

>
> -- Brooks
>



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20080223111659.K920>