Date: Mon, 27 Nov 2000 20:02:47 -0500
From: Dan Eischen <deischen@gdeb.com>
To: arch@freebsd.org
Cc: julian@elischer.org, jasone@canonware.com
Subject: Re: Threads (KSE etc) comments

I promised to post my thoughts on the User Thread Scheduler.  Here they are, hopefully in some presentable form.

--
Dan Eischen

----------------------------------------------------------------------

1. Overview

This is a discussion of some of the issues in the User Thread Scheduler in the FreeBSD KSE project.  It doesn't go into much detail on the design and implementation, but focuses more on the scheduling models and the (userland) API.

2. Definitions

Kernel Scheduled Entity (KSE) - The entity in which user threads are scheduled.  A process may have multiple KSEs in which to schedule threads.  A KSE may be viewed as a virtual processor by the UTS.  The number of KSEs available for threads with process contention scope may be limited to the actual number of processors in the system (TBD).

Kernel Scheduled Entity Group (KSEG) - A group of KSEs that, as outlined so far, contains the shared quantum and priority.  This is currently under discussion.

Scheduling Allocation Domain - The number of processors over which threads can be scheduled.  This really corresponds to how many KSEs are allocated to the process.

User Thread Scheduler (UTS) - Responsible for scheduling user-level threads over the available processors (KSEs).

3. Scheduling Models

Threads are scheduled according to their contention scope and allocation domain.  The contention scope, as defined by POSIX, is either system or process contention scope (PTHREAD_SCOPE_SYSTEM and PTHREAD_SCOPE_PROCESS respectively).

Each contention scope system thread is bound to its own private KSE.  This KSE has its own quantum and priority (if it needs a KSEG to obtain that, so be it) such that the thread competes on a system-wide basis with other system scope threads.  Contention scope system threads need not be part of any scheduling queue since there are no other threads in competition for processing time on that KSE.

Contention scope process threads are more interesting in that there are several possible scheduling schemes depending on the allocation domain and on the number of quanta granted by the kernel (not including quanta granted for scope system threads).
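For reference, the contention scope itself is selected through the standard POSIX thread attribute interface.  A minimal sketch (plain POSIX, nothing KSE-specific; under the model above, the system scope request is what would give a thread its own KSE):

    /*
     * Minimal sketch: selecting a thread's contention scope with the
     * standard POSIX attribute interface.  The default (process scope)
     * leaves the thread to the UTS.
     */
    #include <pthread.h>
    #include <stdio.h>

    static void *
    worker(void *arg)
    {
        return (arg);
    }

    int
    main(void)
    {
        pthread_attr_t attr;
        pthread_t td;

        pthread_attr_init(&attr);
        if (pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM) != 0)
            fprintf(stderr, "system scope not supported\n");
        pthread_create(&td, &attr, worker, NULL);
        pthread_join(td, NULL);
        pthread_attr_destroy(&attr);
        return (0);
    }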
3.1 Single Queue, Single Allocation Domain

This is the common case where there is only 1 CPU and only 1 quantum granted to handle scope process threads.  All runnable threads are placed in the same queue and compete for processor time on one schedulable entity (with 1 quantum).

3.2 Single Queue, Multiple Allocation Domain

Here we have N schedulable entities over which threads can be executed.  All runnable threads are placed in the same queue and are scheduled onto the N schedulable entities as they become available.  When a thread blocks or completes in a schedulable entity, another thread is pulled from the single run queue for execution.

If a KSE is preempted while running a thread, and in lieu of any hint from the application as to the binding of threads to KSEs, the UTS would only resume that thread on the next KSE if its priority was (strictly) greater than the priority of the next thread in the run queue, or if the preempted thread was in a critical region.

Whether or not the schedulable entities have their own quantum is not known at this point, but under the current design it is possible (the UTS could create N KSEG/KSE pairs to allow this instead of N KSEs within 1 KSEG).

3.3 Multiple Queue, Multiple Allocation Domain

Again we have N schedulable entities over which threads can be executed, but in this case there are also multiple (up to N) scheduling queues.  In this model, there may exist a run queue for each schedulable entity, and threads may be bound to a particular entity.  As with the single queue model above, whether or not the schedulable entities have their own quantum is not yet known, but it is possible with the current design.

How does the UTS decide which threads get bound to each of the N schedulable entities?  At the least, the application should have the ability to decide this.  In lieu of any hint from the application, the UTS could also provide some method of (soft) binding threads to the N schedulable entities and optimize for maximum CPU utilization and minimum thread reassignment.  My thought is that we concentrate on keeping it simple for now, and allow for this possibility later.

For the case where not all threads are bound to a specific KSE by the application, there could be a global run queue from which unbound threads are taken.  In this case there would be as many as N+1 scheduling queues, with each KSE taking a peek at the priority of the global run queue before deciding to take a thread from its own run queue.  The global run queue would not have to be locked (unless adding/removing a thread), so this would only add the overhead of a couple of instructions to examine its priority.  You might also have KSEs that don't have any bound threads, in which case they wouldn't have a run queue and would always obtain threads from the global run queue.

Yet another option would be to disallow binding of threads to the original (main) KSE and only allow binding to other KSEs.  All unbound threads would be executed on the main KSE (and any other KSE which does not have bound threads).  Any KSE that has bound threads would only execute those threads.
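As a rough illustration of the N+1 queue idea in 3.3, the per-KSE selection might look something like the sketch below.  Everything here (uthread, runq, kse_choose_next) is invented for the example and assumes a larger numeric value means higher priority; this is not actual UTS code.

    #include <stddef.h>

    /* Invented, minimal types for illustration only. */
    struct uthread {
        int prio;                   /* larger value == runs first */
        struct uthread *next;       /* queue is kept sorted by prio */
    };

    struct runq {
        struct uthread *head;
    };

    /* Peek at the best priority without dequeueing (no lock needed). */
    static int
    runq_top_prio(const struct runq *rq)
    {
        return (rq->head != NULL ? rq->head->prio : -1);
    }

    /* Dequeue the best thread; a real UTS would lock the global queue here. */
    static struct uthread *
    runq_get(struct runq *rq)
    {
        struct uthread *td = rq->head;

        if (td != NULL)
            rq->head = td->next;
        return (td);
    }

    /*
     * "N+1 queue" selection: a KSE with bound threads peeks at the
     * global (unbound) queue and only takes from it when the best
     * unbound thread out-ranks the best bound thread.
     */
    static struct uthread *
    kse_choose_next(struct runq *bound, struct runq *global)
    {
        if (runq_top_prio(global) > runq_top_prio(bound))
            return (runq_get(global));
        return (runq_get(bound));
    }

The point is only that peeking at the global queue is a compare on its top priority; its lock would be taken only when a thread is actually added or removed.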
4. API

For the most part, the POSIX API is sufficient for our needs.  But if we want to allow application control of how threads are assigned and scheduled on the KSEs, we could define the following set of interfaces:

pthread_setconcurrency() - This is the POSIX interface to set the concurrency level.  It requests the desired concurrency level and informs the UTS as to how many KSEs are to be requested from the kernel (which the kernel may limit).  Whether or not this also allows additional quantum remains to be seen.  Certainly, the UTS could create an additional KSEG/KSE pair for each level of concurrency above 1 in order to achieve additional quantum under the current design.  This function does not change/reflect the schedulable entities for system scope threads.  The limited concurrency level is returned.  If we wanted this routine to act the same as it does under Solaris, then it would actually request the number of entities with quantum and priority (KSEGs as currently defined).

thr_create(...) - This would be an alternative to using pthread_setconcurrency() to set the number of KSEs.  This function allows an application to specify additional attributes for thread creation.  Solaris allows additional flags to be specified, notably THR_NEW_LWP and THR_BOUND.  The effect of specifying THR_BOUND is the same as specifying PTHREAD_SCOPE_SYSTEM, but specifying THR_NEW_LWP (and omitting THR_BOUND) allocates an additional LWP that can be used to schedule unbound (scope process) threads.  We could provide a similar flag, THR_NEW_KSE (or THR_NEW_KSEG), to tell the UTS to request an additional KSE (or KSEG) to be used to schedule scope process threads.  I'm not too keen on this interface as an alternative to pthread_setconcurrency(), but perhaps it has some merit if we want a Solaris-like API.

_kse_self() - Returns the current KSE ID, where the integer ID ranges from 0 (for the original KSE of the main process) to M-1 (where M is the total number of KSEs in the process).  Solaris provides a similar function, _lwp_self().

pthread_bind_np(pthread_t pthread, int kse_id) - Binds the given thread to the specified KSE.  A kse_id of -1 refers to the current KSE.  I suppose this could also be called _kse_bind() depending on how you look at it.

_kse_bind(int kse_id, int processor) - Binds a KSE to a particular processor.  This might also be called _cpu_bind() if you use _kse_bind instead of pthread_bind_np.  Solaris has processor_bind(), which can handle both LWPs and PIDs, and pset_bind(), which allows binding of LWPs or PIDs to processor sets.  This is probably getting a little ahead of ourselves, but it is something to think about anyway.

For a moment, let's make the assumption that each KSE has a priority and quantum, or that we always use a KSEG with one KSE to achieve the same effect.  We _could_ now present an interface that is very similar to that provided by Solaris.  True, perhaps a KSE (or KSEG) is not as heavy as an LWP in Solaris, but that is just an internal implementation issue.  To the application, they look like very much the same thing.  If we provide an API that is very similar to Solaris's, porting Solaris applications would be trivial.  Again, something to think about.
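To make the proposal concrete, the declarations might look roughly like this.  None of these interfaces exist yet; the thr_create() signature simply mirrors Solaris's, and the flag values and argument types are guesses for illustration only.

    /*
     * Hypothetical declarations only; none of these interfaces exist
     * yet, and the flag values and argument types are guesses.
     */
    #include <pthread.h>
    #include <stddef.h>

    /* Solaris-style creation flags (names taken from the proposal). */
    #define THR_BOUND    0x0001  /* same effect as PTHREAD_SCOPE_SYSTEM */
    #define THR_NEW_KSE  0x0002  /* ask the UTS for one more KSE (or KSEG) */

    /* Alternative creation interface, mirroring Solaris thr_create(). */
    int thr_create(void *stack_base, size_t stack_size,
        void *(*start_func)(void *), void *arg, long flags,
        pthread_t *new_thread);

    int _kse_self(void);                            /* current KSE ID, 0..M-1 */
    int pthread_bind_np(pthread_t pthread, int kse_id); /* -1 == current KSE */
    int _kse_bind(int kse_id, int processor);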
5. Interaction of Existing Scheduling Interfaces

We currently have the following interfaces that affect process scheduling:

    setpriority()
    rtprio()
    sched_setparam()
    sched_setscheduler()
    any others?

In a threaded process, I think these should operate on the entity that contains the quantum and priority, not on the process.  Whether that is a KSE or a KSEG, I don't know.  If it is a KSEG, then that's the only case I can see for forcing the threads library to know anything about KSEGs.  Still, the kernel is responsible for setting these priorities, not the UTS, so it wouldn't be strictly necessary for the UTS to have any knowledge of KSEGs.

6. Summary

I'd like some resolution as to what interfaces the threads library should provide to the application.  I've outlined some of my thoughts above and I'd like some feedback.

My biggest question is whether we want to provide the ability for a threaded application to request more scheduling time (aside from PTHREAD_SCOPE_SYSTEM threads).  I've already seen applications that always use PTHREAD_SCOPE_SYSTEM when creating threads, I suppose mostly to obtain as much CPU time as possible.  At USENIX, Jason and I attended a BOF on threads, and it was kind of amazing to me that folks seemed to prefer the LinuxThreads model.  Given this attitude, I don't think it makes sense to try to restrict an application to only 1 quantum (or N, where N = the number of CPUs); we'll just end up with applications that always use system scope threads.

Do we want to provide a method of binding threads to KSEs, and KSEs to processors?  Binding threads to KSEs isn't really that hard to implement in the UTS, and I wouldn't think it too difficult for the kernel to bind KSEs to processors either (?).  Some KSEs may be automatically bound to processors while others might not be: for instance, KSEs allocated for system scope threads, or KSEs allocated (for process scope threads) above and beyond the number of CPUs (assuming we allow this).

I'd like to resolve these issues (and any others?) very soon so I can concentrate on more of the UTS details (like the communication channel between the kernel and the UTS).  At some point it may be worthwhile to have a telecon or an IRC session (never tried it), because resolving this over the mailing list could take too long.
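To illustrate the binding question, this is how an application might combine these interfaces if we adopted them.  Only pthread_setconcurrency() and the other pthread calls are standard; pthread_bind_np() and _kse_bind() are the proposed (still hypothetical) interfaces from section 4.

    /*
     * Hypothetical usage only: ask for a second KSE, keep one worker
     * thread on that KSE, and pin the KSE to processor 1.  The two
     * binding functions are proposed interfaces and do not exist yet.
     */
    #include <pthread.h>

    int pthread_bind_np(pthread_t pthread, int kse_id);    /* proposed */
    int _kse_bind(int kse_id, int processor);               /* proposed */

    static void *
    worker(void *arg)
    {
        return (arg);               /* ... do work ... */
    }

    int
    main(void)
    {
        pthread_t td;

        pthread_setconcurrency(2);  /* request a second KSE from the UTS */
        pthread_create(&td, NULL, worker, NULL);
        pthread_bind_np(td, 1);     /* run this thread only on KSE 1 */
        _kse_bind(1, 1);            /* and keep KSE 1 on processor 1 */
        pthread_join(td, NULL);
        return (0);
    }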