Date: Sat, 23 Feb 2008 15:35:07 -0600
From: Brooks Davis
To: Jeff Roberson
Cc: Daniel Eischen, arch@freebsd.org, Robert Watson, David Xu, Andrew Gallatin
Subject: Re: getaffinity/setaffinity and cpu sets.
Message-ID: <20080223213507.GD39699@lor.one-eyed-alien.net>
In-Reply-To: <20080223111659.K920@desktop>

On Sat, Feb 23, 2008 at 11:21:33AM -1000, Jeff Roberson wrote:
>
> On Sat, 23 Feb 2008, Brooks Davis wrote:
>
>> On Fri, Feb 22, 2008 at 01:52:54PM -1000, Jeff Roberson wrote:
>>> On Fri, 22 Feb 2008, Brooks Davis wrote:
>>>
>>>> On Fri, Feb 22, 2008 at 12:34:13PM -1000, Jeff Roberson wrote:
>>>>>
>>>>> On Thu, 21 Feb 2008, Robert Watson wrote:
>>>>>
>>>>>> On Wed, 20 Feb 2008, Jeff Roberson wrote:
>>
>>>>>> - It would be nice to be able to use CPU sets in jail as well,
>>>>>>   suggesting a hierarchical model with some sort of tagging so
>>>>>>   you know what CPU sets were created in a jail such that you
>>>>>>   know whether they can be changed in a jail. While I recognize
>>>>>>   this makes things a lot more tricky, I think we should
>>>>>>   basically be planning more carefully with respect to
>>>>>>   virtualization when we add new interfaces, since it's a widely
>>>>>>   used feature, and the current set of "stragglers" unsupported
>>>>>>   in Jail is growing rather than shrinking.
>>>>>
>>>>> I have implemented a hierarchical model. Each thread has a pointer
>>>>> to the cpuset that it's in. If it makes a local modification via
>>>>> setaffinity() it gets an anonymous cpuset that is a child of the
>>>>> set assigned to the process. This anonymous set will also be
>>>>> inherited across fork/thread creation.
>>>>>
>>>>> In this model presently there are nodes marked as root. To query
>>>>> the 'system' cpus available we walk up from the current node until
>>>>> we find a root.
>>>>> These are the 'system' set. A thread may not break out of its
>>>>> system set. A process may join the root set but it may not modify
>>>>> a root that is a parent. Jails would create a new root. A process
>>>>> outside of the jail can modify the set of processors in the jail
>>>>> but a process within the jail/root may not.
>>>>>
>>>>> The next level down from the root is the assigned set. The root
>>>>> may be an assigned set or this may be a subset of the root.
>>>>> Processes may create sets which are parented back to their root
>>>>> and may include any processors within their root. The mask of the
>>>>> assigned set is returned as 'available' processors.
>>>>>
>>>>> This gives a 1 to 3 level hierarchy: the root, an assigned set,
>>>>> and an anonymous set. Any of these but the root may be omitted.
>>>>> There is no current way for userland to create subsets of assigned
>>>>> sets to permit further nesting. I'm not sure I see value in it
>>>>> right now and it gives the possibility of unbounded tree depth.
>>>>>
>>>>> Anonymous sets are immutable as they are shared and changes only
>>>>> apply to the thread/pid in the WHICH argument and not others which
>>>>> have inherited from it. Anonymous sets have no id and may not be
>>>>> specifically manipulated via a setid. You must refer to the
>>>>> process/thread. From the administration point of view they don't
>>>>> exist.
>>>>>
>>>>> When a set is modified we walk down the children recursively and
>>>>> apply the new mask. This is done with a global set lock under
>>>>> which all modifications and tree operations are performed. The
>>>>> td_cpuset pointer is protected by thread_lock(), and the set
>>>>> itself may be read without a lock. This gives the possibility for
>>>>> certain kinds of races but I believe they are all safe.
>>>>>
>>>>> Hopefully I explained that well enough for people to follow. I
>>>>> realize it's a lot of text but it's fairly simple bookkeeping
>>>>> code. This is all implemented and I'm debugging now.
>>>>
>>>> One place I'd like to implement CPU affinity is in the Sun Grid
>>>> Engine execution daemon. I think anonymous sets would not be
>>>> sufficient there because the model allows new tasks to be started
>>>> on a particular node at any time during a parallel job. I'd have to
>>>> do some more digging in the code to be entirely certain. I think
>>>> the fewer limits we place on the hierarchy, the better off we'll be
>>>> unless there are compelling complexity reasons to avoid them.
>>>
>>> With the anonymous set you can bind any thread to any cpu that is
>>> visible to it. How would this not work?
>>
>> I'm still trying to wrap my head around the anonymous sets. Is the
>> idea that once you are in an anonymous set, you can't expand it, or
>> can you expand out as far as the assigned set? I'd like for parallel
>> jobs to be allocated a set of cpus that they can't change, but still
>> be able to make their own decisions about thread affinity if they
>> desire (for example OpenMPI has some support for this so processes
>> stay put and in theory benefit from positive cache effects). If
>> that's feasible in this model, I'm happy with it. I think we should
>> keep in mind that these SGE execution daemons might be sitting inside
>> jails. ;-)
>
> Ah, when I said the anonymous sets were immutable, that only means
> that they are copy-on-write. Because you can't know who shares a copy
> via fork or thread creation you must make a new set each time you
> write.
>
> I made the anonymous sets so that the parent would have a list of all
> derivative children sets so that modifications to the parent would be
> reflected in the child.
> This also means that the scheduler only has to look at one bitmap to
> determine the available cpus for a thread.

I think the anonymous sets seem like a good idea. One solution to my
problem might be to make changing your current set to something that is
not a subset of your parent (or maybe your current set?) privileged.

-- Brooks
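
For reference, the bookkeeping discussed in this thread can be sketched
in a few lines of Python. This is only a rough illustration under stated
assumptions, not the code in the patch: the CpuSet class, its root() and
apply() methods, and the setaffinity() helper are hypothetical names,
masks are plain integers with one bit per CPU, and "applying the new
mask" to children is assumed to mean intersecting it with each child's
mask. Locking is left out entirely.

```python
class CpuSet:
    """Hypothetical node in the cpuset tree; not the kernel structure."""

    def __init__(self, mask, parent=None, is_root=False):
        self.mask = mask          # bitmask, one bit per CPU
        self.parent = parent
        self.is_root = is_root    # marks a 'system' set, e.g. one per jail
        self.children = []        # derivative sets, so changes can propagate
        if parent is not None:
            parent.children.append(self)

    def root(self):
        """Walk up from this node until a node marked as root is found."""
        node = self
        while not node.is_root and node.parent is not None:
            node = node.parent
        return node

    def apply(self, mask):
        """Narrow this set, then push the restriction down recursively."""
        self.mask &= mask
        for child in self.children:
            child.apply(self.mask)


def setaffinity(assigned, mask):
    """Copy-on-write: never edit a shared set, return a fresh anonymous one.

    Assumption: a mask that is not a subset of the assigned set is simply
    refused here; the thread only suggests making that case privileged.
    """
    if mask == 0 or mask & ~assigned.mask:
        raise PermissionError("mask is not a subset of the assigned set")
    return CpuSet(mask, parent=assigned)


root = CpuSet(0xFF, is_root=True)        # CPUs 0-7
assigned = CpuSet(0x0F, parent=root)     # CPUs 0-3
anon_a = setaffinity(assigned, 0x03)     # a thread pins itself to CPUs 0-1
anon_b = setaffinity(assigned, 0x0C)     # another picks CPUs 2-3
assigned.apply(0x06)                     # administrator shrinks to CPUs 1-2
print(hex(assigned.mask), hex(anon_a.mask), hex(anon_b.mask))
```

Because every anonymous set ends up holding the already-intersected
mask, a scheduler in this sketch would only need to consult the one
mask attached to the thread's own set. A real implementation would also
have to decide what happens when a child's mask becomes empty, which
this sketch does not handle.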