Date:      Fri, 21 Jun 2002 00:58:33 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Julian Elischer <julian@elischer.org>
Cc:        Gary Thorpe <gat7634@hotmail.com>, freebsd-arch@freebsd.org
Subject:   Re: multiple threads for interrupts
Message-ID:  <3D12DCA9.A5EB415C@mindspring.com>
References:  <Pine.BSF.4.21.0206202313570.34612-100000@InterJet.elischer.org>

Julian Elischer wrote:
> Per CPU scheduling queues are I think over-rated..
> it turns out that you need to have quite a lot of leakage between the
> queues to get a real priority based system, otherwise you
> have processor A running a sequence of threads that are lower in priority
> than threads in Processor B's run queue. To get around this you need
> to let processor A steal threads from Processor B's queue.
> Once you have this then it rapidly escalates into a mess. It may be
> worthwhile if you have 64 or 32 processors but for less than that it
> complicates things more than you win..

Scheduling priority is not a fascist dictatorship.  You sort of
imply that it should be, where *every* process with priority "A"
*must* be scheduled before *every* process with priority "B".

It's perfectly fine to make the decisions statistically.  The
current system has user specified priorities which are merely
hints to the system.

I will go further: it is *preferable* to make the decision
statistically... you are less likely to damage locality by
migrating processes for reasons of a short term imbalance.
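
For illustration, one way to make the decision statistically is a
weighted random draw, as in lottery scheduling; here is a minimal
userland sketch (the structures and the use of rand() are invented
for the example, not the actual scheduler):

	#include <stdlib.h>

	/* Stand-in for the real process structure. */
	struct proc {
		struct proc	*p_next;
		int		p_weight;	/* derived from priority */
	};

	/*
	 * Pick the next process by a weighted random draw over a run
	 * queue: higher priority wins more often, but lower priority
	 * processes are never starved outright.  total_weight is the
	 * (positive) sum of p_weight over the whole queue.
	 */
	struct proc *
	lottery_pick(struct proc *head, int total_weight)
	{
		int winner = rand() % total_weight;
		struct proc *p;

		for (p = head; p != NULL; p = p->p_next) {
			winner -= p->p_weight;
			if (winner < 0)
				return (p);
		}
		return (head);	/* only reached if weights don't sum up */
	}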


The idea of a "pull model" for migration is inherently flawed.
It requires that one CPU's scheduler be able to reach into the
most contended region of another CPU's scheduler queue (the head,
where policy dictates that the next process that *should* be run
is located).  Doing this implies a globally accessible lock,
locally asserted and deasserted on each local access, on every
CPU participating in the protocol.
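
For contrast, here is roughly what the pull model forces on you,
as a userland sketch (the names are invented, and a pthread mutex
stands in for a kernel lock): the thief must take a lock that every
CPU in the system can contend for, and that lock guards the
victim's queue head, the hottest part of its scheduler:

	#include <pthread.h>
	#include <stddef.h>

	struct proc { struct proc *p_next; };

	struct cpu_rq {
		pthread_mutex_t	lock;	/* globally reachable lock */
		struct proc	*head;	/* the contended queue head */
	};

	/* Pull a process from another CPU's run queue. */
	struct proc *
	steal_from(struct cpu_rq *victim)
	{
		struct proc *p;

		pthread_mutex_lock(&victim->lock);	/* cross-CPU contention */
		p = victim->head;
		if (p != NULL)
			victim->head = p->p_next;
		pthread_mutex_unlock(&victim->lock);
		return (p);
	}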


> > This solves the affinity problem.
> 
> yes, but it introduces synchronisation problems that were not there before..
> 
> You still have to lock the other queues while you check that you are not
> running low priority threads to the disadvantage of high priority ones
> elsewhere, and then since any processor can want to look at any other
> processor's queue, you get lock-order problems.
>
> These can be worked around but it's not a trivial problem, and
> I'm not implementing them for the first version of KSEs.

I thought we were on the third version... milestone 3.  And it's
totally independent of KSEs, I think.

Personally, I think it *is* a trivial problem.  Here is some
pseudo-code for a per-CPU scheduler loop:

	/*
	 * Migrate processes that have been *pushed* to us (if any)
	 * into the local scheduler queue.
	 */
	IF cpu_1_migrate_queue != NULL
		LOCK(cpu_1_migrate_queue)
		local_migrate_queue = cpu_1_migrate_queue
		cpu_1_migrate_queue = NULL
		UNLOCK(cpu_1_migrate_queue)
		FOR_EACH_ELEMENT(local_migrate_queue)
			ADD_TO_LOCAL_SCHEDULER()
	ENDIF

	/*
	 * Examine the per CPU load figure of merit; this figure of
	 * merit is statistical; therefore examination of it need
	 * not be locked.
	 */
	min = this_cpu_figure_of_merit
	mincpu = invalid
	FOR_EACH(per_cpu_figure_of_merit)
		IF min > figure
			mincpu = figure_cpu
			min = figure
		ENDIF
	IF (this_cpu_figure_of_merit - min) > acceptable_difference
		/*
		 * Migrate a process from this CPU to the least
		 * loaded CPU.
		 */
		process = REMOVE(cpu_1_scheduler_queue)
		LOCK(mincpu_migrate_queue)
		INSERT(process, mincpu_migrate_queue)
		UNLOCK(mincpu_migrate_queue)
	ENDIF

	/*
	 * Update stats.
	 */
	this_cpu_figure_of_merit = XXX

	/*
	 * Do the per CPU scheduling.
	 */
	DOSCHEDULE()

...no locks required for the scheduler, unless a migration is in
progress, and the contention is never global; it's only ever
between two CPUs: the one migrating from, and the one being
migrated to.

This is a push model.  It resolves all of the issues you raise,
except that of migration policy, which is based on the figure
of merit calculation and the threshold for the acceptable level
of difference.  For threads, there may also be an issue of wanting
negative affinity for a given thread group (process), but this is
easily handled inside this model.
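
For the skeptical, here is the same loop rendered as compilable
userland C (pthread mutexes stand in for per-CPU spin locks, the
figure of merit is whatever load metric you choose, and all the
names are for the sketch only):

	#include <pthread.h>
	#include <stddef.h>

	#define NCPU	4

	struct proc { struct proc *p_next; };

	struct cpu {
		pthread_mutex_t	migrate_lock;
		struct proc	*migrate_queue;	/* pushed to us by others */
		struct proc	*run_queue;	/* local only, never locked */
		int		merit;		/* load figure of merit */
	};

	static struct cpu cpus[NCPU];
	static const int acceptable_difference = 2;

	/* Call once at startup. */
	static void
	sched_init(void)
	{
		int i;

		for (i = 0; i < NCPU; i++)
			pthread_mutex_init(&cpus[i].migrate_lock, NULL);
	}

	static void
	sched_loop(int self)
	{
		struct cpu *c = &cpus[self];
		struct proc *p, *list;
		int i, min, mincpu;

		/* Drain processes pushed to us into the local queue. */
		if (c->migrate_queue != NULL) {
			pthread_mutex_lock(&c->migrate_lock);
			list = c->migrate_queue;
			c->migrate_queue = NULL;
			pthread_mutex_unlock(&c->migrate_lock);
			while ((p = list) != NULL) {
				list = p->p_next;
				p->p_next = c->run_queue;
				c->run_queue = p;
			}
		}

		/* Unlocked, statistical scan of the per-CPU merits. */
		min = c->merit;
		mincpu = -1;
		for (i = 0; i < NCPU; i++) {
			if (cpus[i].merit < min) {
				min = cpus[i].merit;
				mincpu = i;
			}
		}

		/* Push one process to the least loaded CPU, if any. */
		if (mincpu >= 0 && c->merit - min > acceptable_difference &&
		    (p = c->run_queue) != NULL) {
			c->run_queue = p->p_next;
			pthread_mutex_lock(&cpus[mincpu].migrate_lock);
			p->p_next = cpus[mincpu].migrate_queue;
			cpus[mincpu].migrate_queue = p;
			pthread_mutex_unlock(&cpus[mincpu].migrate_lock);
		}

		/* Update our own merit; run the local queue (omitted). */
	}

Note that sched_loop() takes another CPU's migrate_lock only when
it actually pushes, so two CPUs contend only during a migration,
exactly as described above.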


> > I'm not sure the affinity fix solves the NETISR problem, because I
> > think that the issue there is that the affinity you want in that
> > case is mbuf<->cpu affinity.  Basically, if you take the network
> > interrupt on CPU 3, then you want to run the NETISR code that is
> > associated with the protocol processing on CPU 3, as well, to avoid
> > cache busting.
> 
> You have both data and instruction caches to consider..
> 2 processors running the net stack means that
> you are flushing stuff out of the instruction caches of 2 processors.
> This would degrade other stuff.

Yes, it would, which is why I dislike it.  If you look at the
work Ingo Molnar has accomplished with getting the interrupt
processing, the TCP stack, and all the interesting data to live
inside the CPU cache for the "Tux" in-kernel HTTP server, you
will see what can be accomplished when you don't risk ping-ponging
processes all over the place for no good reason.
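
To make the mbuf<->cpu affinity above concrete, a sketch (the
per-CPU queue, curcpu() and the scheduling helper are invented
names for the example, not an existing API): keep the protocol
work on the CPU that took the interrupt, so the mbuf and the
stack text stay in that CPU's caches:

	#define NCPU	4

	struct mbuf;				/* opaque here */
	struct netisr_q { struct mbuf *head; };	/* per-CPU work queue */

	extern struct netisr_q	netisr_queue[NCPU];
	extern int		curcpu(void);	/* CPU taking the interrupt */
	extern void		enqueue(struct netisr_q *, struct mbuf *);
	extern void		schednetisr_local(int);	/* kick one CPU only */

	void
	ether_input_affine(struct mbuf *m)
	{
		int cpu = curcpu();

		enqueue(&netisr_queue[cpu], m);	/* per-CPU, not global */
		schednetisr_local(cpu);		/* no cross-CPU bounce */
	}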


> > The way I would suggest doing this is to run the protocol processing
> > up to user space at interrupt time (LRP).  This gets rid of NETISR.
> 
> maybe..

Yes, it's a big "maybe".  Load up 4.3 and try it with and
without LRP, and then make a decision when you have numbers
in hand.


> > What this all boils down to is that you should only permit receive
> > data interrupts to occur at the rate that you can move the data from
> > the wire, all the way through the system, to completion.
> 
> if it were a netgraph problem, I would push the data as far as I could
> without hitting a blocking lock, at which point it would be handled
> asynchronously thereafter..

Yes.  That's one of the problems with the earlier suggestion
that TCP and IP should be implemented as netgraph nodes.  There
will be at least a unidirectional stall which will have to occur
during the processing at the boundary between the protocol layers,
in order to allow the processing to run to completion.  You really
can't involve netgraph (or Streams, on SVR4) without causing such
a stall.  The stall introduced when the Streams based ODI drivers
replaced the native drivers for UnixWare 2.0 (SVR4.2) ended up
costing us a full 35% latency increase.
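
To sketch the "as far as it can" approach (the names below stand
in for real locking and queueing; they are not an existing API):
the point where the trylock fails is exactly where the boundary
stall shows up:

	#include <stddef.h>

	struct mbuf;
	struct layer {
		struct layer	*next;	/* next protocol layer up */
		int		lock;	/* stand-in for a real mutex */
		struct mbuf	*(*process)(struct mbuf *);
	};

	extern int	try_lock(int *);	/* invented helpers */
	extern void	unlock(int *);
	extern void	enqueue_deferred(struct layer *, struct mbuf *);

	/* Run layers in-line at interrupt time until one would block. */
	void
	input_to_completion(struct mbuf *m, struct layer *l)
	{
		for (; l != NULL && m != NULL; l = l->next) {
			if (!try_lock(&l->lock)) {
				/* The boundary stall: defer the rest. */
				enqueue_deferred(l, m);
				return;
			}
			m = l->process(m);	/* this layer, right now */
			unlock(&l->lock);
		}
	}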


> > I'm convinced that CPU affinity needs to happen.
> 
> Somewhat, though I'm coming to believe that per cpu queues are not
> the way to do it..

Any approach which requires direct modification of the scheduler
policy to achieve affinity is intrinsically broken.  Linux may
claim that thread group affinity works, but it is very easy to find
degenerate cases which will result either in starvation deadlock
for the non-group member processes, or in unnecessary thrashing
for the group members.  The benefit of the FreeBSD use of scheduler
activations -- that once a quantum is allocated, it remains
allocated until it is used up, or there is no more work to do --
can easily be lost if you try to enforce strict priority ordering
between all threads.


> > I'm also convinced that, for the most part, running NETISR in
> > kernel threads, rather than to completion at interrupt, is the
> > wrong way to go.
> 
> Mach did it that way. (threads).
> 
> It's not necessarily right to push all the way up in one go.
> My personal opinion is that it should go 'as far as it can'
> and that that may not be all the way up all the time....

MACH is widely acknowledged to have failed in a lot of its goals,
particularly with threading, but also with the necessity for
protection domain crossing for external pagers, and the need to
support message passing (now that it turns out that all modern
computers are not Northern Telecom DV1's, and so message passing
ends up not being supported by the hardware bus).  Chorus got a
lot of things right that MACH ended up getting wrong.  Neither
one of them is a sterling example of why one should use threads
to handle interrupts.


> > To me, it looks like a lot of people are believing something is
> > better because they are being told that it is better, not because
> > they have personally measured it, gotten better numbers, and have
> > proven to themselves that those better numbers were a result of
> > what they thought they were measuring, rather than an artifact
> > that could be exploited in the old code, as well.
> 
> one could say that about PCPU queues.....

On the other hand, we have proof by demonstration from Sequent
and SGI on per CPU run queues, with real statistics and many
research papers...

Can you point me at one or two on kernel threads for interrupt
processing?  8-).

-- Terry
