Date:      Mon, 2 Jul 2001 09:36:38 -0500
From:      "Michael C . Wu" <keichii@iteration.net>
To:        Alfred Perlstein <bright@sneakerz.org>
Cc:        smp@freebsd.org
Subject:   Re: per cpu runqueues, cpu affinity and cpu binding.
Message-ID:  <20010702093638.B96996@peorth.iteration.net>
In-Reply-To: <20010702003213.I84523@sneakerz.org>; from bright@sneakerz.org on Mon, Jul 02, 2001 at 12:32:13AM -0500
References:  <20010702003213.I84523@sneakerz.org>

Hi Alfred,

First of all, we have two different types of processor affinity:
1. user-specified CPU binding, as you have implemented;
2. system-wide processor affinity, transparent to all users,
   which I see some work toward below.

In SMPng, IMHO, if we can do (2) well, a lot of the
performance problems can be solved.

Another problem is the wide variety of workloads we have to support.
For example, the right implementation of (2) for a system with many
PCI devices will be very different from one for a system intended to
run an Oracle database or an HTTP server.

I don't think doing per-thread affinity is a good idea, because
we want to keep threads lightweight.

You may want to take a look at this URL about processor affinity: :)
http://www.isi.edu/lsam/tools/autosearch/load_balancing/19970804.html

On Mon, Jul 02, 2001 at 12:32:13AM -0500, Alfred Perlstein scribbled:
| ) The cpu affinity seems to actually buy performance, I've seen
| seconds taken off user/sys time when doing kernel compiles with
| this.  Of course if people were to provide their own micro-benchmarks
| it would assist in determining the utility of this work.
| 
| ) The binding is not very flexible.  You can only bind to one cpu,
| not a group of cpus, nor can you prohibit a process from running
| on any particular cpu.  Suggestions would be appreciated.

How many CPUs do we want to scale to? And how many can we?
Binding to more than 3 or 4 CPUs defeats the whole purpose of affinity
unless we have a mega >32 CPU machine.  We want to hit the L2 and L3
caches, and hopefully the L1.  On AMD, PPC, and Alpha, the L1 is
sufficiently big that it may retain some of a process's state from its
previous run.  On IA-64/Pentium III/Pentium IV, the L1 is so small
that worrying about it makes no sense.  Hence I suggest handling
affinity at the L2/L3 level, keeping the L1 only slightly in mind.

Doug Rabson and I had several lengthy conversations regarding this.
Perhaps he and others can give some input too.

| ) It somewhat butchers the nice functional interface that Jake did
| because it accesses a global, namely the per-cpu queues are a
| global.  I plan on fixing this.
| 
| ) Input on how affinity/binding could be improved (along with code
| examples) would be appreciated.  Please don't say "I would do it
| this way" unless your mail happens to contain an algorithm that
| clearly maps to some code. :)

Lots are available; please see the URL above.  When I get
back to Austin and get settled, I will search for one that has worked
well, since I researched the topic in April for another reason.

| The current way it is implemented is that for unbound processes
| there is a double linkage, basically an unbound process will be on
| both the cpu it last ran on and the global queue.  A certain weight
| is assigned to tip the scales in favor of running a process that's
| last ran on a particular cpu, basically 4 * RQ_PPQ (see the mod to

Is there a special reason for choosing 4 * RQ_PPQ?

| runq_choose()), this could be adjusted in order to give either
| higher priority processes a boost, or a process that last ran on
| the cpu pulling it off the runqueue a boost.
| 
| Bound processes only exist on the per-cpu queue that they are bound
| to.
| 
| What I'd actually prefer is no global queue, when schedcpu() is
| called it would balance out the processes amongst the per-cpu
| queues, or if a particular cpu realized it was stuck with a lot of
| high or low priority processes while another cpu is occupied with
| the opposite it would attempt to migrate or steal depending on the
| type of imbalance going on.  Suggestions on how to do this would
| also be appreciated. :)

An actual empirical measurement is required in this case.
When can we justify the cache-performance loss of switching to another
CPU?  In addition, once a process is switched to another CPU,
we want to keep it there.

| The attached bindcpu.c program will need sys/pioctl.h installed to
| compile, once compiled and the kernel is rebuilt (don't forget
| modules as the size of proc has changed) you can use it to bind
| processes like so:
| 
| ./bindcpu <curproc|pid> 1  # bind curproc/pid to cpu 1
| ./bindcpu <curproc|pid> -1 # unbind

This interface may not be the best approach. We can figure this out later.

| Index: fs/procfs/procfs_vnops.c
| ===================================================================
| RCS file: /home/ncvs/src/sys/fs/procfs/procfs_vnops.c,v
| retrieving revision 1.98
| diff -u -r1.98 procfs_vnops.c
| --- fs/procfs/procfs_vnops.c	2001/05/25 16:59:04	1.98
| +++ fs/procfs/procfs_vnops.c	2001/07/01 16:48:51

| +
| +	if ((p->p_sflag & PS_BOUND) == 0) {
| +		cpu = p->p_lastcpu;
| +		if (cpu < 0 || cpu >= mp_ncpus)
| +			cpu = PCPU_GET(cpuid);
| +		p->p_rqcpu = cpu;
| +		runq_setbit(rq, pri);
| +		rqh = &rq->rq_queues[pri];
| +		CTR4(KTR_RUNQ, "runq_add: p=%p pri=%d %d rqh=%p",
| +		    p, p->p_pri.pri_level, pri, rqh);
| +		TAILQ_INSERT_TAIL(rqh, p, p_procq);
| +	} else {
| +		CTR2(KTR_RUNQ, "runq_add: proc %p bound to cpu %d",
| +		    p, (int)p->p_rqcpu);
| +		cpu = p->p_rqcpu;
| +	}

I recall a better algorithm in the almighty TAOCP.  I will look
it up when I get back.

| +	cpu = PCPU_GET(cpuid);
| +	pricpu = runq_findbit(&runqcpu[cpu]);
| +	pri = runq_findbit(rq);
| +	CTR2(KTR_RUNQ, "runq_choose: pri=%d cpupri=%d", pri, pricpu);
| +	if (pricpu != -1 && (pricpu <= pri + 4 * RQ_PPQ || pri == -1)) {
| +		pri = pricpu;
| +		rqh = &runqcpu[cpu].rq_queues[pri];
| +	} else if (pri != -1) {
| +		rqh = &rq->rq_queues[pri];
| +	} else {
| +		CTR1(KTR_RUNQ, "runq_choose: idleproc pri=%d", pri);
| +		return (PCPU_GET(idleproc));
| +	}

Do you intend the algorithm to stay this simple, or are you going to
change it in the future?


Thank you,
Michael

-- 
+-----------------------------------------------------------+
| keichii@iteration.net         | keichii@freebsd.org       |
| http://iteration.net/~keichii | Yes, BSD is a conspiracy. |
+-----------------------------------------------------------+
