Date:      Fri, 22 Feb 2008 17:12:46 -0600
From:      Brooks Davis <brooks@freebsd.org>
To:        Jeff Roberson <jroberson@chesapeake.net>
Cc:        Daniel Eischen <deischen@freebsd.org>, arch@freebsd.org, Robert Watson <rwatson@freebsd.org>, David Xu <davidxu@freebsd.org>, Andrew Gallatin <gallatin@cs.duke.edu>
Subject:   Re: getaffinity/setaffinity and cpu sets.
Message-ID:  <20080222231245.GA28788@lor.one-eyed-alien.net>
In-Reply-To: <20080222121253.N920@desktop>
References:  <20080112194521.I957@desktop> <20080219234101.D920@desktop> <20080220101348.D44565@fledge.watson.org> <20080220005030.Y920@desktop> <20080220105333.G44565@fledge.watson.org> <47BCEFDB.5040207@freebsd.org> <20080220175532.Q920@desktop> <20080220213253.A920@desktop> <20080221092011.J52922@fledge.watson.org> <20080222121253.N920@desktop>


On Fri, Feb 22, 2008 at 12:34:13PM -1000, Jeff Roberson wrote:
>
> On Thu, 21 Feb 2008, Robert Watson wrote:
>
>> On Wed, 20 Feb 2008, Jeff Roberson wrote:
>>
>>> I also have a 'cpuset' command which can run a new program with a given
>>> cpu set, view and modify sets of arbitrary pids.  This is all working and
>>> I can supply patches if anyone is interested.  I have to implement 4BSD
>>> support before I can commit.
>>>
>>> I have a proposal for solaris style processor sets which I think is
>>> simple and sufficient for most cases.  It involves the following new
>>> syscalls:
>>>
>>>   int cpuset(void);
>>>   int setcpuset(pid_t pid, int setid);
>>>   int getcpuset(pid_t pid);
>>>
>>> The notion would be that you can create a new numbered cpuset with
>>> cpuset(). You can modify or inspect its affinity with get/setaffinity
>>> above and the CPU_WHICH_SET argument.  The cpuset exists as long as there
>>> are members of the set.  Sort of like a process group or session.  The
>>> {get,set}cpuset calls can inspect or modify the state.
>>>
>>> This set would not be modifiable by user processes or by processes in a
>>> jail. It would create the restriction that differs between 'avail' and
>>> 'sys' above. Processors would be able to directly bind to any processor
>>> within the set. Changing the set would apply to all processes in the set.
>>> The cpuset would be per-process while the mask is per-thread.  Sets
>>> involvement is inherited on fork().
>>>
>>> In solaris sets can be named and have a more complete management api.
>>> I'm not really interested in implementing all of that but I believe what
>>> I have outlined here would be subset of this and no code/syscalls would
>>> be wasted.
>>>
>>> Comments?  Objections?  I'm fairly pleased with this arrangement now.
>>
>> Just to put a few notes from our conversation on IRC in e-mail:
>>
>> - I think I'd prefer int cpuset(cpuset_t *set), int getcpuset(pid_t,
>>   cpuset_t *) so that we don't mix up ID's and return values.  More recent
>>   interfaces tend to do this, I believe, and it means that the prototype,
>>   even if not the ABI, remains the same if the set identifier changes in
>>   the future.
>
> Ok, this is a good suggestion and I did this.  This is actually my
> preferred method as well but most syscalls don't follow this pattern and I
> was trying to make it look syscallish.
>
>> - You don't mention what happens if a process's cpu set changes to
>>   preclude a CPU the process has a thread with affinity for.  Online, you
>>   suggested SIGKILL, and I thought maybe a new SIGCPUGONE with a default
>>   SIGKILL action might be a friendlier model.  We should see what Solaris
>>   and others do here though.  I like the idea that the affinity is a
>>   guarantee in userspace because it means that you can rely on it; I'm OK
>>   with the idea that your thread always runs on the CPUs you have affinity
>>   for unless in the SIGCPUGONE handler :-).
>
> I could also reject changes to the cpuset if they leave a thread with
> nothing to run on.  It might be confusing for the administrator and hard to
> tell them which thread caused the problem.  However, it might be nicer than
> killing a thread as well.
>
> Another option would be to expel the offending thread from the set that is
> in violation and reparent it to the real system root along with a syslog
> message or similar.  If the administrator addressed the problem with the
> set he could then reassign the grouping.
>
> This is what I would most like comments about.  Should we have a force
> mode?  Which of these behaviors sound best to you?

Refusing by default and reparenting when forced sounds right to me.  There
might also be some value in adding the ability to signal all
processes/threads bound to a cpu set so you can kill them if that's what you
want to do.
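
To make the "refuse by default, reparent when forced" idea concrete, here
is a minimal userland sketch in C.  It is not kernel code and not the
proposed implementation: the toy_thread structure, the 64-bit mask type,
the toy_set_change() function, and the choice to clip surviving threads'
masks to the new set are all assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy types; these are NOT the FreeBSD kernel structures. */
typedef uint64_t mask_t;		/* one bit per CPU, at most 64 CPUs */

struct toy_thread {
	int	tid;
	mask_t	affinity;		/* CPUs the thread asked to run on */
	int	in_set;			/* 1 while still a member of the set */
};

enum { SET_OK = 0, SET_REFUSED = -1 };

/*
 * Try to change a set's mask.  A thread is "stranded" when its affinity
 * shares no CPU with the new mask.  Without force the change is refused
 * and nothing is modified.  With force, stranded threads are expelled
 * (in_set = 0 stands in for reparenting to the system root) and the
 * remaining threads have their affinity clipped to the new mask.
 */
static int
toy_set_change(mask_t *setmask, mask_t newmask, struct toy_thread *thr,
    int n, int force, int *expelled)
{
	int i, stranded = 0;

	assert(setmask != NULL && expelled != NULL);
	*expelled = 0;

	/* First pass: count threads the new mask would strand. */
	for (i = 0; i < n; i++)
		if (thr[i].in_set && (thr[i].affinity & newmask) == 0)
			stranded++;
	if (stranded > 0 && !force)
		return (SET_REFUSED);

	/* Second pass: expel stranded threads, clip the rest. */
	for (i = 0; i < n; i++) {
		if (!thr[i].in_set)
			continue;
		if ((thr[i].affinity & newmask) == 0) {
			thr[i].in_set = 0;
			(*expelled)++;
		} else
			thr[i].affinity &= newmask;
	}
	*setmask = newmask;
	return (SET_OK);
}
```

A caller would first try with force set to 0, report the refusal to the
administrator, and only retry with force set to 1 on explicit request.  A
"signal everything in the set" operation would be a similar loop over the
members.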

>> - It would be nice to be able to use CPU sets in jail as well, suggesting
>>   a hierarchal model with some sort of tagging so you know what CPU sets
>>   were created in a jail such that you know whether they can be changed in
>>   a jail.  While I recognize this makes things a lot more tricky, I think
>>   we should basically be planning more carefully with respect to
>>   virtualization when we add new interfaces, since it's a widely used
>>   feature, and the current set of "stragglers" unsupported in Jail is
>>   growing rather than shrinking.
>
> I have implemented a hierarchical model.  Each thread has a pointer to the
> cpuset that it's in.  If it makes a local modification via setaffinity() it
> gets an anonymous cpuset that is a child of the set assigned to the
> process.  This anonymous set will also be inherited across fork/thread
> creation.
>
> In this model presently there are nodes marked as root.  To query the
> 'system' cpus available we walk up from the current node until we find a
> root.  These are the 'system' set.  A thread may not break out of its
> system set.  A process may join the root set but it may not modify a root
> that is a parent.  Jails would create a new root.  A process outside of the
> jail can modify the set of processors in the jail but a process within the
> jail/root may not.
>
> The next level down from the root is the assigned set.  The root may be an
> assigned set or this may be a subset of the root.  Processes may create
> sets which are parented back to their root and may include any processors
> within their root.  The mask of the assigned set is returned as 'available'
> processors.
>
> This gives a 1 to 3 level hierarchy. The root, an assigned set, and an
> anonymous set.  Any of these but the root may be omitted.  There is no
> current way for userland to create subsets of assigned sets to permit
> further nesting.  I'm not sure I see value in it right now and it gives the
> possibility of unbound tree depth.
>
> Anonymous sets are immutable as they are shared and changes only apply to
> the thread/pid in the WHICH argument and not others which have inherited
> from it.  Anonymous sets have no id and may not be specifically manipulated
> via a setid.  You must refer to the process/thread.  From the
> administration point of view they don't exist.
>
> When a set is modified we walk down the children recursively and apply the
> new mask.  This is done with a global set lock under which all
> modifications and tree operations are performed.  The td_cpuset pointer is
> protected under the thread_lock() and may read the set without a lock. This
> gives the possibility for certain kinds of races but I believe they are all
> safe.
>
> Hopefully I explained that well enough for people to follow.  I realize
> it's a lot of text but it's fairly simple book keeping code.  This is all
> implemented and I'm debugging now.

One place I'd like to implement CPU affinity is in the Sun Grid Engine
execution daemon.  I think anonymous sets would not be sufficient there
because the model allows new tasks to be started on a particular node at
any time during a parallel job.  I'd have to do some more digging in the
code to be entirely certain.  I think the fewer limits we place on the
hierarchy, the better off we'll be unless there are compelling complexity
reasons to avoid them.
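
For readers following the hierarchy discussion, the two tree operations
Jeff describes in the quoted text (walk up to the nearest root to find the
'system' set, walk down recursively to apply a new mask) can be sketched in
a few lines of C.  This is only an illustration under stated assumptions:
the toy_set structure, the fixed child array, and the function names are
hypothetical, nothing here is taken from the actual patch, and the global
set lock is omitted entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	TOY_MAXCHILD	4		/* arbitrary limit for the sketch */

/* Hypothetical node type; NOT the structure used in the real patch. */
struct toy_set {
	uint64_t	 mask;		/* one bit per CPU */
	int		 is_root;	/* system root or jail root */
	struct toy_set	*parent;
	struct toy_set	*child[TOY_MAXCHILD];
	int		 nchild;
};

/* Attach child under parent. */
static void
toy_link(struct toy_set *parent, struct toy_set *child)
{
	assert(parent->nchild < TOY_MAXCHILD);
	child->parent = parent;
	parent->child[parent->nchild++] = child;
}

/* Walk up from a node until a node marked as root is found. */
static struct toy_set *
toy_find_root(struct toy_set *s)
{
	while (!s->is_root && s->parent != NULL)
		s = s->parent;
	return (s);
}

/* Install a mask and clip every descendant to it, recursively. */
static void
toy_apply(struct toy_set *s, uint64_t newmask)
{
	int i;

	s->mask = newmask;
	for (i = 0; i < s->nchild; i++)
		toy_apply(s->child[i], s->child[i]->mask & newmask);
}

/* Refuse a mask that reaches outside the parent, else apply it. */
static int
toy_modify(struct toy_set *s, uint64_t newmask)
{
	if (s->parent != NULL && (newmask & ~s->parent->mask) != 0)
		return (-1);
	toy_apply(s, newmask);
	return (0);
}
```

Nothing in this shape limits the depth of the tree: deeper nesting only
adds levels to the recursion in toy_apply().  A jail root would simply be
a node with is_root set that also has a parent, so toy_find_root() stops
there rather than at the system root.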

>> - There's still no way to specify an affinity policy rather than explicit
>>   affinity, but if our CPU set model is sufficiently general, that might be
>>   a vehicle to do that.  I.e., cpuset_setpolicy() rather than setting a
>>   mask.
>
> Yes, I think this is orthogonal and can be addressed separately.  I'm not
> sure how many userland programs are smart enough or even capable of making
> determinations about their cache behavior however.  We should open another
> discussion once this one is done.
>
>>
>> - In the interests of boring API changes, recent APIs tend to prefix the
>>   method on the object name.  Have you thought about cpuset_create(),
>>   cpuset_foo(), etc?  That reduces the chances of interfering with
>>   application namespaces.  I think, anyway. :-).
>
> Yes, I prefer that as well, as I mentioned syscalls tended to favor
> brevity.  I'm fine with changing that trend.
>
>>
>> I need to ponder the proposal a little more, ideally over a hot beverage
>> this morning, and will follow up if I have further thoughts.  Thanks for
>> working on this, BTW -- affinity is well-overdue for FreeBSD.
>
> A little more to ponder now!  Your feedback is much appreciated.
>
> I believe the present hierarchical model satisfies the jail requirements of
> restricting cpus in the jail while still allowing the jail to create sets.
>
> The unanswered questions are:
>
> 1)  What to do about sets that strand threads, options described above.
> 2)  Are people ok with the transient nature of sets?
> 3)  Does anyone want to help with man pages, administrative tools, etc?  I
>     have a prototype tool called 'cpuset' that fully exercises the api but is
>     probably ugly.  Will post details soon.

I could help with some of this as it furthers a funded project at work.

-- Brooks



