Date:      Mon, 13 Dec 1999 19:43:50 -0800 (PST)
From:      Julian Elischer <julian@whistle.com>
To:        Terry Lambert <tlambert@primenet.com>
Cc:        "Daniel M. Eischen" <eischen@vigrid.com>, adsharma@sharmas.dhs.org, arch@freebsd.org
Subject:   Re: Recent face to face threads talks.
Message-ID:  <Pine.BSF.4.10.9912131901420.31923-100000@current1.whistle.com>
In-Reply-To: <199912140237.TAA01908@usr08.primenet.com>

Today Archie and I tried to "mentally implement" this system.
The basic architecture remains the same, but we ended up having
to make one glaring change (seen below).

On Tue, 14 Dec 1999, Terry Lambert wrote:

[stuff deleted]

> 
> KSEs have become what I would call an async system call blocking
> call context, or "blocking call context" -- BCC, for short.
> They are intended to be allocated out of a pool, or preferably,
> out of a per CPU resource pool, so as to avoid requiring The Big
> Giant Lock(tm), except when draining or refilling the pool at the
> high water mark or low water mark on the pool, respectively (this
> is precisely what Dynix did to achieve much higher granularity,
> at the cost of maintaining a two-layer malloc() system, with the
> possibility of some memory being "lost" to the non-empty pools;
> it is a brilliant design, IMO).
> 
> KSEs are container objects for sleeping on their addresses and
> for the kernel stack per outstanding blocked system call.  The
> stack _may_ be given back prior to the KSE itself becoming
> unreferenced.  A KSE must also have queue pointers for putting
> it on the sleep queues, and for maintaining a linked list in
> the pools.
> 
> Editorial note:	A "Q" is what I would have called the KSE.  A
> 		kernel schedulable entity implies an interaction
> 		with the scheduler, but we are probably going to
> 		get stuck with the current definitions.  If so,
> 		I suggest the term SR for "Scheduler Reservation"
> 		is a better fit.
> 

Yes, maybe we should rename KSEs to be "Kernel Sleepable objects" ;-)
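
To make the pool idea above concrete, here is a minimal sketch of a per-CPU
BCC/KSE pool with high and low water marks, along the lines Terry describes.
All of the names (kse_alloc, kse_pool_refill, the watermark constants) are
invented for illustration; only the refill/drain path would ever need the
global lock, and that path is not shown.

#include <sys/queue.h>

#define KSE_POOL_LOW    8               /* refill from the global pool below this */
#define KSE_POOL_HIGH   64              /* drain back to the global pool above this */

struct kse {
        TAILQ_ENTRY(kse) k_link;        /* pool / sleep queue linkage */
        void            *k_stack;       /* kernel stack; may be released early */
};

struct kse_pool {                       /* one of these per CPU */
        TAILQ_HEAD(, kse) kp_free;
        int               kp_count;
};

static void kse_pool_refill(struct kse_pool *kp); /* takes the global lock; not shown */

static struct kse *
kse_alloc(struct kse_pool *kp)          /* touches only this CPU's pool */
{
        struct kse *k;

        if (kp->kp_count < KSE_POOL_LOW)
                kse_pool_refill(kp);
        k = TAILQ_FIRST(&kp->kp_free);
        if (k != NULL) {
                TAILQ_REMOVE(&kp->kp_free, k, k_link);
                kp->kp_count--;
        }
        return (k);
}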

Archie and I have spent about 4 hours today arguing about some of this, and
we have come up with a proposal that we will put forward in the next day,
once we get it all written down.  I hate to say it, but we decided we may
actually need an "R" structure as well in some cases.

Renaming the KSE to "S" gives us a royal flush ("Proc", "Queue/Quantum",
"Runnable" and "Sleepable": P, Q, R, S; ok, I'm pushing it a bit).

The structures are as follows:


1) Proc.. holds the VM references and credentials.
2) Thread class.. holds the priority information for a group of kernel
   threads.
3) Thread CPU designator.  This has a 1:1 mapping with an upcall handler.
   All KSEs attached to a particular one of these will come back to the
   same UTS stack.
4) Thread kernel context container (ex-KSE).. Terry's BCC.  A new name is
   needed.
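
To make the hierarchy concrete, something like the following rough C sketch
is what we have in mind.  The structure and field names here are invented
purely for discussion and are not proposed interfaces.

#include <sys/queue.h>

struct vmspace;                         /* opaque here */
struct ucred;
struct thread_class;
struct cpu_desig;

struct proc {                           /* (1) VM references and credentials */
        struct vmspace  *p_vmspace;
        struct ucred    *p_ucred;
        TAILQ_HEAD(, thread_class) p_classes;
};

struct thread_class {                   /* (2) priorities for a group of threads */
        int             tc_priority;
        TAILQ_HEAD(, cpu_desig) tc_desigs;      /* its type (3) members */
};

struct cpu_desig {                      /* (3) per-CPU, 1:1 with an upcall handler */
        struct thread_class *cd_class;  /* back pointer to its type (2) */
        void            *cd_uts_stack;  /* upcalls always arrive on this stack */
        TAILQ_ENTRY(cpu_desig) cd_classq;       /* membership in the type (2) */
        TAILQ_ENTRY(cpu_desig) cd_runq;         /* per-CPU run queue linkage */
};

struct kctx {                           /* (4) ex-KSE, Terry's BCC */
        void            *kc_kstack;     /* kernel stack for the blocked call */
        struct cpu_desig *kc_desig;     /* the type (3) it will upcall through */
        TAILQ_ENTRY(kctx) kc_sleepq;    /* sleep queue linkage */
};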


The two middle structures may not be implemented immediately.
Now, "Why do I want yet another structure?" I hear you ask!

It allows us to handle some interesting concurrency issues, while still
allowing us to be sure that a particular upcall handler is inherently
safe from being clobbered by upcalls from multiple CPUs at the same time.

It also gives a natural target for gang scheduling (should it be required)
and thread migration. We are writing a set of man pages for discussion
purposes.

With this particular structure layout the code almost writes itself,
which is always a pleasing feeling.

Structure (3) is the one that goes on the run queues.
Structure (4) is the one that goes on the sleep queues.
The UTS implicitly groups "type (3)"s into a group described by a
"type (2)".  (Runnable type (4)s in the kernel may switch between any type
(3) structs in the same group, in order to get the maximum concurrency
needed.)  A process that cannot handle this would only ever place one
type (3) in a type (2); in fact this would be the default behaviour, and,
for early implementations, the only possible model.
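
Using the structures sketched above, the wakeup path for a type (4) might
look roughly like this (cd_busy() and kctx_hand_to_desig() are hypothetical
helpers, shown only to illustrate the grouping rule):

#include <sys/queue.h>

static int  cd_busy(struct cpu_desig *cd);                              /* hypothetical */
static void kctx_hand_to_desig(struct cpu_desig *cd, struct kctx *kc); /* hypothetical */

static void
kctx_setrunnable(struct kctx *kc)
{
        struct cpu_desig *cd;

        /*
         * A type (4) that becomes runnable may be handed to any type (3)
         * in the same type (2) group; in the default (and, early on, the
         * only) model that group has exactly one member.
         */
        TAILQ_FOREACH(cd, &kc->kc_desig->cd_class->tc_desigs, cd_classq) {
                if (!cd_busy(cd)) {
                        kctx_hand_to_desig(cd, kc);
                        return;
                }
        }
        /* All of them busy: queue behind the one it last used. */
        kctx_hand_to_desig(kc->kc_desig, kc);
}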

> 
> In terms of the scheduler (also deferred work), the main idea
> is to have a per-CPU scheduler that grabs a per-CPU scheduler
> lock each time it wants to manipulate the data.
> 
> When migrating "Q"s between CPUs, the CPU migrating will have
> to grab both its own scheduler lock ("migrate from") and the
> target CPUs scheduler lock ("migrate to").
> 
> Common use would be grabbing its own lock, not in the same
> contention domain as the other CPUs, so there would be no
> stalling, as there is with The Big Giant Lock(tm).

The type (3) structs are per-CPU and are on different per-CPU scheduler lists.
The type (4) structs can migrate between any type (3)s that are in the same
type (2) :-)
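
A minimal sketch of that two-lock migration, using the cpu_desig from the
earlier sketch and assuming a hypothetical per-CPU scheduler structure with
its own mutex (the fixed lock ordering is just the usual way to avoid
deadlock, not something decided here):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

struct sched_cpu {                      /* hypothetical per-CPU scheduler state */
        struct mtx      sc_lock;
        TAILQ_HEAD(, cpu_desig) sc_runq;
};

static void
sched_migrate(struct sched_cpu *from, struct sched_cpu *to, struct cpu_desig *cd)
{
        /* Lock both run queues in a fixed order so two migrations can't deadlock. */
        if (from < to) {
                mtx_lock(&from->sc_lock);
                mtx_lock(&to->sc_lock);
        } else {
                mtx_lock(&to->sc_lock);
                mtx_lock(&from->sc_lock);
        }
        TAILQ_REMOVE(&from->sc_runq, cd, cd_runq);
        TAILQ_INSERT_TAIL(&to->sc_runq, cd, cd_runq);
        mtx_unlock(&from->sc_lock);
        mtx_unlock(&to->sc_lock);
}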


> 
> Like the per processor KSE pools, this enables great concurrency:
> a migration of a "Q" from one CPU to another on an 8 CPU system
> will utilize 1 CPU, potentially stalling another CPU, and
> leaving 6 CPUs unimpacted.
> 
> 
> 
> There are only three cases where a "Q" is created:
> 
> o	On fork(2) (obviously).
> 
> o	When asking to be scheduled on additional CPUs in an
> 	SMP system.  It is likely that this particular case
> 	will not require privileges, or that soft limits
> 	requiring privileges to violate will be implemented
> 	(e.g. an administrative limit of reservations on 4
> 	CPUs in a 16 CPU system, controlled by the users
> 	login class).

In the system Archie and I have drawn up, a "Q" is broken into two parts:
this one is the 'type (3)'.  The one below is a 'type (2)'.

> 
> o	When asking for an additional "Q" in order to effect
> 	scheduling priorities of PTHREAD_SCOPE_SYSTEM.  It is
> 	likely that dropping priority will be unprivileged,
> 	and merely result in a UTS designation change, since
> 	creating an additional "Q" would permit the process
> 	to compete as multiple processes against other processes
> 	for quantum (in other words, it would pretend to run the
> 	affected threads at a lower system priority, but in
> 	reality, would not).  Raising priority is already a
> 	privileged operation; it is assumed that raising
> 	priority, by virtue of the fact that it requires
> 	privilege, will in fact create another "Q".  To be
> 	able to effectively manage multiple priority scheduler
> 	reservations, except in the face of priority lending
> 	to deal with inversion, there is a new requirement for
> 	an explicit yield(2) call to give unused higher priority
> 	quantum back to the system.

Yes, we definitely need a 'yield'.. a 'sleep' that just destroys the KSE
and doesn't return (kinda).

> 
> Because CPU migration is an exceptional event for a process, and
> is a matter of kernel scheduler policy, there is natural CPU
> affinity for each "Q" in this approach.  This is many times
> better than the current situation in both FreeBSD and Linux,
> as I think some experiments by Mike Smith would support.
> 
> As a simple algorithm for achieving initial balancing, adding:
> 
> 	u_int32_t	p_cpumask;	/* CPUs with scheduler reservations*/
> 	u_int32_t	p_Q_count;	/* number of "Q"s in process*/
> 
> To the proc structure, and then setting a bit for each CPU that
> there is currently a "Q" outstanding on (and moving the bit if
> it gets migrated) will ensure that each new "Q" can be created
> on a different CPU.  Counting the bits and keeping a running "Q"
> count is a cheap way of knowing if multiple "Q"s are on the same
> CPU as a result of migration; you would need to traverse the list
> in order to identify and move these around between CPUs, assuming
> a load balancing vs. concurrency algorithm in the kernel scheduler.

You can, however, find that your threads on one processor are trapped behind
a constantly higher-priority process, and need to get 'rescued' by being
brought over to another processor in order to let them get out from under
it.
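
As a sketch of the balancing idea in that snippet, picking a CPU for a new
"Q" could be as simple as taking the lowest-numbered CPU whose bit is not
yet set in p_cpumask (the helper name and the ncpus argument are assumptions
for illustration):

#include <sys/types.h>
#include <strings.h>                    /* ffs() */

/*
 * Prefer a CPU this process has no reservation on yet; fall back to CPU 0.
 * Assumes ncpus <= 32, matching the u_int32_t p_cpumask suggested above.
 */
static int
q_pick_cpu(u_int32_t p_cpumask, int ncpus)
{
        u_int32_t all  = (ncpus >= 32) ? ~0u : ((1u << ncpus) - 1);
        u_int32_t free = ~p_cpumask & all;

        return (free != 0 ? ffs((int)free) - 1 : 0);
}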


> 
> 
> > > It was also agreed that it doesn't make sense to do everything at once,
> > > but do it in some sensible order, without precluding elements of the
> > > design. Some of the tasks - break the proc structure into a proc and KSE,
> > > per CPU scheduling queues, implementing SA.
> 
> My own take on this was that implementing the scheduler changes
> would be a heck of a lot easier than the other work.  8-).  On
> the other hand, the general consensus was that it would be easier
> to do the other work than the scheduler changes, so the scheduler
> changes are going to need to wait.
> 

[whether to do a linuxthreads-type version first or not]

> 
> This is what looked right to me, and I think that Jason Evans
> agreed.  The intent was to not break out "Q" until later, as
> part of the scheduler work.
> 
> For one thing, we already have the Linux threads implemented
> (it can be loaded as a kernel module).  For another, kernel
> threads backing user space threads is a horrible implementation,
> and has all sorts of problems with thread group (process) affinity
> when deciding who to run next on a partially consumed quantum.
> Adding thread group affinity to a kernel scheduler is a sure way
> to shoot your leg off.

Actually the linuxthreads PORT (as opposed to running a Linux threads
program under Linux emulation mode) requires NO kernel changes, as it's a
native FreeBSD binary library.

> 
> I think that this is at least natural, if not the order I would
> choose.  Much of the hardest work is allowing the process to be
> on a multiple run queues (or the same run queue, multiple times).
> This work could even be debugged on a UP machine, where you
> stuff the same process into a single scheduler several times
> (indeed, this is necessary for PTHREAD_SCOPE_SYSTEM thread
> priorities on a uniprocessor machine).
> 
> This seems to me to most resemble the shared library issues,
> where it's better to do it right the first time than it is to
> take a bogus implementation (e.g. the System V shared library
> style implementation, which ate a chunk of address space for
> each version of a single library, and the same for each library,
> as they had to be non-PIC code mapped to the same location in
> all processes).  Waiting on Jeffrey Hsu's patches for PIC in
> GCC (actually, released in support of LKMs, which needed PIC)
> was The Right Thing To Do(tm).
> 
> 
> > > If we stick to the POSIX threads interface strictly, we shouldn't be
> > > afraid of getting stuck with one of the implementations.
> > > 
> > > Other concerns that were expressed - crossing protection boundaries too
> > > often to pass messages about blocking and rescheduling. Matt Dillon
> > > responded that these messages could be batched and made more efficient.
> > 
> > I agree.  It seemed like the SA paper went to extremes in notifying
> > the UTS of unblocked threads.  I think we can do this at more opportune
> > times, like when a Q is resumed or on a user->kernel crossing.  I don't
> > know that we want to be interrupting a Q while it is running a thread
> > just to notify the UTS that another thread has unblocked in the kernel.
> 
> Julian had a nice discussion with Jason (and later Matt and me) on
> this topic.  The basic idea was to run the threads up to the user/kernel
> boundary, but not farther.  At that time, the call has effectively
> returned all but the CPU on which it is running to user space, and
> a nice optimization is that the kernel stack associated with the KSE
> can be released.  This would allow hundreds of thousands of the
> things, a possibility on a heavily loaded system.
> 
> Matt suggested a shared memory segment, rather than a message,
> where the status could be reflected to user space.  This strikes
> me as perhaps premature optimization, but would give us a place
> to put information about the "Q" of a process returning to user
> space, which is something that the UTS needs in order to allow
> it to separate quanta of different priority for assignment to a
> new user space thread (or an explicit yield(2), if the high priority
> threads are all blocked -- though a priority lend to the lower
> priority thread holding the resource would probably be a better
> idea, it will require additional work in user space to implement).

In our mental implementation, Archie and I have made the call that creates
a new 'Q' (actually a type (3)) supply as arguments the place where
the kernel can find a pointer to the register storage for the thread being
run, and a place to find a pointer to a cookie for the thread.
Given these two bits of information (permanently stored in the 'type (3)'),
the kernel and the UTS can communicate in an unambiguous manner.
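
A hedged sketch of what that call might look like; the name, the argument
types, and the mailbox layout are all guesses for discussion, not a settled
interface:

struct trapframe;                       /* machine-dependent register save area */

struct thread_mailbox {                 /* stored permanently in the type (3) */
        struct trapframe **tm_context;  /* where the kernel finds a pointer to
                                           the register storage of the thread
                                           currently being run */
        void             **tm_cookie;   /* where the kernel finds a pointer to
                                           the UTS's cookie for that thread */
};

/*
 * Hypothetical system call: create a new type (3) whose upcalls arrive on
 * 'uts_stack' and which communicates through 'tm'.
 */
int     thread_new_desig(struct thread_mailbox *tm, void *uts_stack);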









