From: Brooks Davis
To: Jeff Roberson
Cc: Daniel Eischen, arch@freebsd.org, Robert Watson, David Xu, Andrew Gallatin
Date: Fri, 22 Feb 2008 17:12:46 -0600
Subject: Re: getaffinity/setaffinity and cpu sets.
List-Id: Discussion related to FreeBSD architecture

On Fri, Feb 22, 2008 at 12:34:13PM -1000, Jeff Roberson wrote:
>
> On Thu, 21 Feb 2008, Robert Watson wrote:
>
>> On Wed, 20 Feb 2008, Jeff Roberson wrote:
>>
>>> I also have a 'cpuset' command which can run a new program with a
>>> given cpu set, view and modify sets of arbitrary pids. This is all
>>> working and I can supply patches if anyone is interested. I have to
>>> implement 4BSD support before I can commit.
>>>
>>> I have a proposal for Solaris style processor sets which I think is
>>> simple and sufficient for most cases. It involves the following new
>>> syscalls:
>>>
>>> int cpuset(void);
>>> int setcpuset(pid_t pid, int setid);
>>> int getcpuset(pid_t pid);
>>>
>>> The notion would be that you can create a new numbered cpuset with
>>> cpuset(). You can modify or inspect its affinity with
>>> get/setaffinity above and the CPU_WHICH_SET argument. The cpuset
>>> exists as long as there are members of the set. Sort of like a
>>> process group or session. The {get,set}cpuset calls can inspect or
>>> modify the state.
>>>
>>> This set would not be modifiable by user processes or by processes
>>> in a jail. It would create the restriction that differs between
>>> 'avail' and 'sys' above. Processes would be able to directly bind to
>>> any processor within the set. Changing the set would apply to all
>>> processes in the set. The cpuset would be per-process while the mask
>>> is per-thread. Sets involvement is inherited on fork().
>>>
>>> In Solaris sets can be named and have a more complete management api.
>>> I'm not really interested in implementing all of that but I believe
>>> what I have outlined here would be a subset of this and no
>>> code/syscalls would be wasted.
>>>
>>> Comments? Objections? I'm fairly pleased with this arrangement now.
>>
>> Just to put a few notes from our conversation on IRC in e-mail:
>>
>> - I think I'd prefer int cpuset(cpuset_t *set), int getcpuset(pid_t,
>>   cpuset_t *) so that we don't mix up IDs and return values. More
>>   recent interfaces tend to do this, I believe, and it means that the
>>   prototype, even if not the ABI, remains the same if the set
>>   identifier changes in the future.
>
> Ok, this is a good suggestion and I did this. This is actually my
> preferred method as well but most syscalls don't follow this pattern
> and I was trying to make it look syscallish.
>
>> - You don't mention what happens if a process's cpu set changes to
>>   preclude a CPU the process has a thread with affinity for. Online,
>>   you suggested SIGKILL, and I thought maybe a new SIGCPUGONE with a
>>   default SIGKILL action might be a friendlier model. We should see
>>   what Solaris and others do here though. I like the idea that the
>>   affinity is a guarantee in userspace because it means that you can
>>   rely on it; I'm OK with the idea that your thread always runs on
>>   the CPUs you have affinity for unless in the SIGCPUGONE handler :-).
>
> I could also reject changes to the cpuset if they leave a thread with
> nothing to run on. It might be confusing for the administrator and
> hard to tell them which thread caused the problem. However, it might
> be nicer than killing a thread as well.
>
> Another option would be to expel the offending thread from the set
> that is in violation and reparent it to the real system root along
> with a syslog message or similar.
> If the administrator addressed the problem with the set he could then
> reassign the grouping.
>
> This is what I would most like comments about. Should we have a force
> mode? Which of these behaviors sound best to you?

It seems to me that refusing by default and reparenting when forced
sounds right. There might also be some value in adding the ability to
signal all processes/threads bound to a cpu set so you can kill them if
that's what you want to do.

>> - It would be nice to be able to use CPU sets in jail as well,
>>   suggesting a hierarchical model with some sort of tagging so you
>>   know what CPU sets were created in a jail such that you know
>>   whether they can be changed in a jail. While I recognize this makes
>>   things a lot more tricky, I think we should basically be planning
>>   more carefully with respect to virtualization when we add new
>>   interfaces, since it's a widely used feature, and the current set
>>   of "stragglers" unsupported in Jail is growing rather than
>>   shrinking.
>
> I have implemented a hierarchical model. Each thread has a pointer to
> the cpuset that it's in. If it makes a local modification via
> setaffinity() it gets an anonymous cpuset that is a child of the set
> assigned to the process. This anonymous set will also be inherited
> across fork/thread creation.
>
> In this model presently there are nodes marked as root. To query the
> 'system' cpus available we walk up from the current node until we find
> a root. These are the 'system' set. A thread may not break out of its
> system set. A process may join the root set but it may not modify a
> root that is a parent. Jails would create a new root. A process
> outside of the jail can modify the set of processors in the jail but a
> process within the jail/root may not.
>
> The next level down from the root is the assigned set.
> The root may be an assigned set or this may be a subset of the root.
> Processes may create sets which are parented back to their root and
> may include any processors within their root. The mask of the assigned
> set is returned as 'available' processors.
>
> This gives a 1 to 3 level hierarchy. The root, an assigned set, and an
> anonymous set. Any of these but the root may be omitted. There is no
> current way for userland to create subsets of assigned sets to permit
> further nesting. I'm not sure I see value in it right now and it gives
> the possibility of unbounded tree depth.
>
> Anonymous sets are immutable as they are shared and changes only apply
> to the thread/pid in the WHICH argument and not others which have
> inherited from it. Anonymous sets have no id and may not be
> specifically manipulated via a setid. You must refer to the
> process/thread. From the administration point of view they don't
> exist.
>
> When a set is modified we walk down the children recursively and apply
> the new mask. This is done with a global set lock under which all
> modifications and tree operations are performed. The td_cpuset pointer
> is protected under the thread_lock() and may read the set without a
> lock. This gives the possibility for certain kinds of races but I
> believe they are all safe.
>
> Hopefully I explained that well enough for people to follow. I realize
> it's a lot of text but it's fairly simple bookkeeping code. This is
> all implemented and I'm debugging now.

One place I'd like to implement CPU affinity is in the Sun Grid Engine
execution daemon. I think anonymous sets would not be sufficient there
because the model allows new tasks to be started on a particular node at
any time during a parallel job. I'd have to do some more digging in the
code to be entirely certain.
I think the fewer limits we place on the hierarchy, the better off we'll
be unless there are compelling complexity reasons to avoid them.

>> - There's still no way to specify an affinity policy rather than
>>   explicit affinity, but if our CPU set model is sufficiently
>>   general, that might be a vehicle to do that. I.e.,
>>   cpuset_setpolicy() rather than setting a mask.
>
> Yes, I think this is orthogonal and can be addressed separately. I'm
> not sure how many userland programs are smart enough or even capable
> of making determinations about their cache behavior however. We should
> open another discussion once this one is done.
>
>>
>> - In the interests of boring API changes, recent APIs tend to prefix
>>   the method with the object name. Have you thought about
>>   cpuset_create(), cpuset_foo(), etc? That reduces the chances of
>>   interfering with application namespaces. I think, anyway. :-).
>
> Yes, I prefer that as well, as I mentioned syscalls tended to favor
> brevity. I'm fine with changing that trend.
>
>>
>> I need to ponder the proposal a little more, ideally over a hot
>> beverage this morning, and will follow up if I have further thoughts.
>> Thanks for working on this, BTW -- affinity is well-overdue for
>> FreeBSD.
>
> A little more to ponder now! Your feedback is much appreciated.
>
> I believe the present hierarchical model satisfies the jail
> requirements of restricting cpus in the jail while still allowing the
> jail to create sets.
>
> The unanswered questions are:
>
> 1) What to do about sets that strand threads, options described above.
> 2) Are people ok with the transient nature of sets?
> 3) Does anyone want to help with man pages, administrative tools, etc?
>    I have a prototype tool called 'cpuset' that fully exercises the
>    api but is probably ugly. Will post details soon.

I could help with some of this as it furthers a funded project at work.
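
To make the "refuse by default" behavior concrete, here is a rough
user-space sketch of the hierarchy as Jeff describes it: walk up to find
the root, walk down to apply a new mask, and reject a change that would
strand a descendant. This is only an illustration under stated
assumptions. The struct layout and every name in it (cpumask_t,
SET_ROOT, MAX_CHILDREN, cpuset_attach, cpuset_root, cpuset_would_strand,
cpuset_apply, cpuset_modify) are hypothetical and are not taken from the
actual patch. Locking, set ids, and the forced reparenting case are left
out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model, not the kernel's actual structures. */
typedef uint64_t cpumask_t;		/* one bit per CPU, at most 64 CPUs */

#define	SET_ROOT	0x1		/* node is a root (system or jail) */
#define	MAX_CHILDREN	8

struct cpuset {
	cpumask_t	 mask;		/* CPUs members may run on */
	int		 flags;
	struct cpuset	*parent;	/* NULL for the top-level root */
	struct cpuset	*children[MAX_CHILDREN];
	int		 nchildren;
};

/* Link child under parent, clipping the child's mask to the parent's. */
int
cpuset_attach(struct cpuset *parent, struct cpuset *child)
{
	if (parent->nchildren == MAX_CHILDREN ||
	    (child->mask & parent->mask) == 0)
		return (-1);
	child->parent = parent;
	child->mask &= parent->mask;
	parent->children[parent->nchildren++] = child;
	return (0);
}

/* Walk up from any node until a root is found: the 'system' set. */
struct cpuset *
cpuset_root(struct cpuset *set)
{
	while ((set->flags & SET_ROOT) == 0 && set->parent != NULL)
		set = set->parent;
	return (set);
}

/* Would narrowing set to mask leave any descendant with no CPUs? */
int
cpuset_would_strand(const struct cpuset *set, cpumask_t mask)
{
	for (int i = 0; i < set->nchildren; i++) {
		cpumask_t sub = set->children[i]->mask & mask;

		if (sub == 0 || cpuset_would_strand(set->children[i], sub))
			return (1);
	}
	return (0);
}

/* Apply mask to set and recursively narrow every descendant. */
void
cpuset_apply(struct cpuset *set, cpumask_t mask)
{
	assert(mask != 0);
	set->mask = mask;
	for (int i = 0; i < set->nchildren; i++)
		cpuset_apply(set->children[i], set->children[i]->mask & mask);
}

/* Change a set's mask; refuse if it escapes the parent or strands a child. */
int
cpuset_modify(struct cpuset *set, cpumask_t mask)
{
	if (mask == 0 ||
	    (set->parent != NULL && (mask & ~set->parent->mask) != 0) ||
	    cpuset_would_strand(set, mask))
		return (-1);
	cpuset_apply(set, mask);
	return (0);
}
```

With a root covering CPUs 0-3, an assigned set on CPUs 0-1, and an
anonymous set pinned to CPU 0, calling cpuset_modify() on the assigned
set with mask 0x2 is refused because the anonymous set would be left
with nothing to run on. A force mode would presumably replace that
refusal with reparenting, which this sketch does not attempt.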
-- Brooks