Date:      Wed, 26 Mar 2003 01:47:33 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Jeff Roberson <jroberson@chesapeake.net>
Cc:        Julian Elischer <julian@elischer.org>
Subject:   Re: 1:1 Threading implementation.
Message-ID:  <3E817735.A388A41C@mindspring.com>
References:  <20030326031245.O64602-100000@mail.chesapeake.net>

Jeff Roberson wrote:
> Well, I wasn't doing userland stuff until three days ago.  I think mini
> has just been very busy with work.  I suspect that you're going to need
> to start doing userland work or find someone to do it if you want to get
> it done soon.

In theory, the library API will be identical to the pthreads
standard, and will require no changes to programs written to
that standard, which is how most threaded programs are written
these days.  Some threaded programs make invalid assumptions
about rescheduling following an involuntary context switch, or
about the ability to make particular blocking calls.

The first will still be a problem (e.g. Netscape's Java/JavaScript
GIF rendering engine is probably still serializing requests
because the rendering code is not thread-reentrant).
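
The standard workaround, for reference, is to funnel every call
into the non-reentrant code through a single mutex; that is
exactly the serialization described above.  A sketch, with
made-up names (render_gif() stands in for any non-thread-safe
entry point):

    #include <pthread.h>

    static pthread_mutex_t render_lock = PTHREAD_MUTEX_INITIALIZER;

    extern void render_gif(const char *path);   /* not thread-safe */

    void
    render_gif_locked(const char *path)
    {
            /* Only one thread may be in the renderer at a time. */
            pthread_mutex_lock(&render_lock);
            render_gif(path);
            pthread_mutex_unlock(&render_lock);
    }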

The second should not be an issue, either with your implementation
of the 1:1 model, or with Jon Mini's implementation of the N:M model.



> > Wouldn't it have been easier to have one KSEGRP+KSE+thread per user
> > thread? Having one ksegrp and many KSEs requires changing the kernel
> > code where doing it the other way you could do it without making any
> > changes.
> 
> I don't understand?  There are relatively minor changes to the kernel to
> support this.  Since nice is a property of the process, it makes sense
> that there is only one ksegrp per process.  I'm starting to think that the
> ksegrp was overkill in general.

The KSEGRP is, effectively, a virtual processor interface, and
was/is intended for use by the scheduler to ensure CPU affinity
for individual threads, and CPU negaffinity for multiple threads
within a process.  In other words, according to the published
design documents, it's a scheduler artifact.

Personally, I've never seen the need for virtual processors, but
then I've always advocated "intentional start/intentional migration"
for the scheduler model (one of my arguments for a per-CPU
migration queue, and a push-model rather than a pull-model
for redistribution of an unbalanced load).

In a scheduler model where the scheduler *pulls* work, either from
another CPU ready-to-run queue, or from a single ready-to-run
queue that is global to the system (in either case, requiring
locks in the scheduler path, potentially highly contended locks),
the idea of a KSEGRP/"virtual processor" is necessary for globally
migratable and contendable "bookkeeping" objects.

So in the current scheduler implementations, KSEGRP is necessary;
in the 1:1 model, it's necessary, if only to ensure negaffinity
(4 CPU system, process with 4 threads, ensure each thread gets its
own CPU, and does not migrate away from it).

It's also marginally useful to distinguish PTHREAD_SCOPE_SYSTEM
priority groups, when running multiple threads on a single CPU
(either in the common single CPU case, or in the less common SMP
case), as a means of avoiding priority inversion deadlocks.  I
would like to see this done differently, which would get rid of
KSEGRP, but would add a scheduler architecture dependency, which
I think can't be gotten rid of easily.  It's a tradeoff (as usual).
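
To make the negaffinity point concrete, here is a minimal sketch
(hypothetical names and structures; the real bookkeeping lives in
the scheduler): place each new thread on the CPU currently running
the fewest of the process's threads, and keep it pinned there.

    #define NCPU    4

    struct proc_sched {
            int     threads_on_cpu[NCPU];   /* per-process counts */
    };

    /* Pick the CPU running the fewest of this process's threads,
     * so a 4-thread process on a 4-CPU box ends up with one
     * thread per CPU, each staying put.
     */
    int
    pick_cpu(struct proc_sched *ps)
    {
            int cpu, best = 0;

            for (cpu = 1; cpu < NCPU; cpu++)
                    if (ps->threads_on_cpu[cpu] < ps->threads_on_cpu[best])
                            best = cpu;
            ps->threads_on_cpu[best]++;     /* thread is pinned here */
            return (best);
    }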


> > i.e. on creation of a new process, sched_newproc() is called
> > and a KSE is added in there if the scheduler in question wants to use
> > KSEs. If it doesn't, no KSE would be added, but it's still possible that
> 
> Yes, I think we need more sched hooks here as well.  Having only
> sched_fork() makes things sort of gross.  We'll have to hook this all up
> later.


You could also take this idea much further.  Specifically, SVR4
flags system calls as "non-blocking", "blocking", and "potentially
blocking".  By doing this, they can lazy-bind context creation for
blocking operations on "blocking" and "potentially blocking" calls,
and avoid it altogether on "non-blocking" and sometimes avoid it on
"potentially blocking" calls.

This can result in significant overhead savings if the kernel
implementation evolves while the user space implementation
remains fixed.

It's good to decouple these things from each other (IMO).
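
Roughly, in code (the names here are made up; SVR4's actual
tables and flags differ): tag each system call with a blocking
class, and only pay for blocking-context setup on paths that can
actually sleep.

    enum sy_block { SY_NONBLOCK, SY_MAYBLOCK, SY_BLOCK };

    struct sc_entry {
            int             (*sc_call)(void *args);
            enum sy_block   sc_class;
    };

    extern void prepare_blocking_context(void);     /* placeholder */

    int
    syscall_enter(struct sc_entry *sc, void *args)
    {
            if (sc->sc_class == SY_BLOCK)
                    prepare_blocking_context();     /* known to sleep */
            /*
             * SY_MAYBLOCK: the sleep path calls
             * prepare_blocking_context() lazily, only if the call
             * really blocks.  SY_NONBLOCK never pays for it at all.
             */
            return (sc->sc_call(args));
    }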


> > This was discussed recently as being the highlight of someone's
> > threading model (I think Linux but I am not sure whose).
> 
> Yes, Linux was discussing this.  It's a pretty common trick.  Even NT does
> it but apparently NT allocates kernel resources for user locks.  I was
> pretty pleased that I got away without any per lock allocations.

Everyone does this.  Novell did it back in 1993.  Sun's turnstiles
are based on the tradeoff between spinning and waiting: how many
times you can afford to spin before it's worth crossing the
protection domain and blocking.

When we did this in 1993 (Novell's implementation was primarily
by Dave Hefner, who now works for Microsoft, I believe), we ended
up with 20,000 times the transactions-per-second performance of
Tuxedo, which was the commercial record holder up to that point.
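
The tradeoff reads about like this (a sketch only: C11 atomics
postdate this mail, and kernel_wait()/kernel_wake() are
placeholders for whatever sleep primitive the port provides).
The lock word itself is the lock, so no kernel state is
allocated per lock, and the kernel is entered only when
spinning loses:

    #include <stdatomic.h>

    #define SPIN_LIMIT      100             /* tuning knob */

    /* Placeholders; kernel_wait() sleeps only while *lk reads 2. */
    extern void kernel_wait(atomic_int *lk);
    extern void kernel_wake(atomic_int *lk);

    /* Lock word states: 0 = free, 1 = held, 2 = held w/ waiters. */
    void
    ulock_acquire(atomic_int *lk)
    {
            int i, c;

            for (i = 0; i < SPIN_LIMIT; i++) {
                    c = 0;
                    if (atomic_compare_exchange_weak(lk, &c, 1))
                            return;         /* uncontended: no syscall */
            }
            /* Spin budget exhausted: mark contended, then sleep. */
            while (atomic_exchange(lk, 2) != 0)
                    kernel_wait(lk);
    }

    void
    ulock_release(atomic_int *lk)
    {
            if (atomic_exchange(lk, 0) == 2)
                    kernel_wake(lk);        /* syscall only if waiters */
    }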


> > > The reason we're doing this in parallel with the M:N effort is so that we
> > > can have reasonable threading sooner.  As I stated before, this project is
> > > complementary to KSE and does not prohibit it from working.  I also think
> > > that the performance will be better or comparable in the majority of real
> > > applications.
> >
> > My only comment is that since mini is supposed to be doing the
> > M:N library, isn't this a bit of a distraction?
> 
> I'll let him comment on this.

I'll stick my nose in: I think it's a good idea, since TPTB have
recently made noises on a couple of FreeBSD lists about "rapidly
approaching deadlines for the KSE work".

Consider it insurance on your investment, people.


> > You should be creating a new KSEGRP (subproc) per thread.
> > I think you will find that if you do, things will fall out easier
> > and you won't break the next KSE changes.
> 
> I don't understand what I may break?

See above for KSEGRP reasoning.  I think it's representative,
but, if you have time, you may want to read the documentation
for the KSE project.  If other people want to comment or correct
my own comments in this regard (I have been largely an observer,
since after the second threads meeting where my async call gate
idea was brutally .. uh, "laid to rest" ;^)), they should feel
free to do so.


> > I'm not against having a separate 1:1 thread capability, but
> > all this work could have been well spent getting M:N threads
> > better supported and even getting it to
> > be able to run in 1:1 mode as a byproduct..
> 
> I don't think M:N is the way to go.  After looking things over and
> considering where it is theoretically faster I do not think it is a
> worthwhile pursuit.
> 
> First off, it is many months away from being even beta quality.  I think
> the UTS is far more complicated than you may realize.  There are all sorts
> of synchronization issues that it was able to avoid before since only one
> thread could run at any time and there essentially was no preemption.  It
> now also has to deal with efficient scheduling decisions in an M:N model
> that it didn't have to worry about before.

I would not recommend abandoning the idea, personally.  There is a
huge -- and I mean *huge* -- amount of literature that likes the
N:M model.

There is also the fact that affinity and quantum are very hard to
maintain on a system with a heterogeneous load.  In other words,
1:1 looks good if the only thing you are running is a single
multithreaded process, but looks a *lot* less good when you start
running real-world code instead of fictitious benchmarks that
try to make your threading look good (e.g. measuring only thread
context switches, with no process context switch stall barriers,
etc.).


> I feel that this is an overwhelming amount of complexity.  Because of this
> it will be buggy.  Sun claims that they still have open tickets on their
> M:N while their new 1:1 implementation is totally bug free.  How long have
> they been doing m:n?  I don't think that with our limited resources we're
> going to be able to do better.

You can't schedule volunteer resources.  They will work on what
they want to, and let anything they don't like just sit there
and "rot".

The Sun claims are really specious, IMO.  They have it working,
but how does it compare to, say, multiple processes that are
sharing descriptor tables, and not much else, in a work-to-do
model?

I can tell you from personal experience with such a model, that
it *VASTLY* outperforms a 1:1 kernel threading model, even if you
end up running multiple state-machine instances on multiple CPUs.
We got more than a 120X increase in NetWare for UNIX, simply by
changing the client dispatch streams MUX to dispatch to worker
processes instead of threads, in LIFO instead of FIFO order;
this ensured that the process pages you cared about were more
likely to be in core.
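
The LIFO trick itself is tiny (hypothetical structures, and this
assumes a single dispatcher, as the streams MUX was): keep the
idle workers on a stack, so the next request goes to the most
recently parked worker, the one whose pages are most likely
still resident.

    struct worker {
            struct worker   *w_next;
            int             w_pid;          /* worker process */
    };

    static struct worker *idle_stack;       /* LIFO, not a FIFO queue */

    void
    worker_park(struct worker *w)
    {
            w->w_next = idle_stack;
            idle_stack = w;
    }

    struct worker *
    worker_dispatch(void)
    {
            struct worker *w = idle_stack;

            if (w != NULL)
                    idle_stack = w->w_next; /* hottest worker first */
            return (w);
    }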

1:1 threading is useful for one thing, and one thing only: SMP
scalability of single image processes.  And it's not the best at
doing that.

> Furthermore, m:n's basic advantage is less overhead from staying out of
> the kernel.

No, actually, it's the ability to fully utilize a quantum, and
to avoid having to choose between one of your own threads and
some other process off the run queue when the scheduler decides
what to run next.

If you favor the threads in your own process, then you potentially
starve other processes.

If you favor neither, and treat them like processes, you get none
of the supposed context switch benefits that were supposedly going
to result from using threads instead of processes in the first place.
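
The quantum point is easy to demonstrate with the standard
ucontext(3) primitives (a self-contained toy, not the KSE code):
handing the CPU from one user thread to another is just
swapcontext(), entirely within the process, so the kernel never
has to pick between your threads and someone else's process.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, thr_ctx;
    static char thr_stack[64 * 1024];

    static void
    thr_func(void)
    {
            printf("user thread: running\n");
            /* Hand the CPU back; no kernel scheduling decision. */
            swapcontext(&thr_ctx, &main_ctx);
            printf("user thread: resumed\n");
    }

    int
    main(void)
    {
            getcontext(&thr_ctx);
            thr_ctx.uc_stack.ss_sp = thr_stack;
            thr_ctx.uc_stack.ss_size = sizeof(thr_stack);
            thr_ctx.uc_link = &main_ctx;    /* return here when done */
            makecontext(&thr_ctx, thr_func, 0);

            swapcontext(&main_ctx, &thr_ctx);   /* run the user thread */
            printf("main: back in main\n");
            swapcontext(&main_ctx, &thr_ctx);   /* resume it */
            printf("main: done\n");
            return (0);
    }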


> First, if your application has more threads than cpus it is written
> incorrectly.

This depends on what those threads are doing.  If they are all doing
the same work, then yes, you are right.  If they are doing different
jobs, then you are wrong; even if most of them are doing the same job,
and a few of them are doing different jobs, you are still wrong, since
job-loading is unlikely to be the same between threads.


> For people who are doing thread pools instead of event driven IO
> models they will encounter the same overhead with M:N as 1:1.

This is actually false.  In 1:1, your thread competes with all
other processes in order to be the next at the top of the run
queue.  Statistically, you are doing more TLB flushes and
shootdowns, and more L1 and L2 cache shootdowns, than you
would otherwise.

Solving this problem without intentional scheduling has been
proven to be NP-complete: it is not a problem which is
solvable in polynomial time.


> I'm not sure what applications are entirely compute-bound and have more threads
> than cpus.  These are the only ones which really theoretically benefit.  I
> don't think our threading model should be designed to optimize poorly
> thought out applications.

By that argument, threads should not be supported at all... 8-) 8-).


> This means that the constraints are different from when
> this architecture started to come about many (10 or so?) years ago.
> Trying to optimize context switches between threads just doesn't make
> sense when you do so much work per slice.

5/6 years ago, depending on who you ask.

But by your same arguments, CPU clock multipliers have grown
to the point that memory bus and I/O bus stalls are so
expensive that SMP makes no sense.


> Then if you look at the number of system calls and shenanigans a UTS must
> do to make proper scheduling decisions it doesn't look like such an
> advantage.

I agree with this one; my original model avoided the problem
entirely by making the POSIX blocking call behaviour a library
on top of an async kernel interface.  By doing this, kernel
boundary crossings could be minimized automatically.

The pthreads code as it has existed so far also does a lot of
unnecessary kernel boundary crossings in order to handle signal
masking.  In fact, you could establish an intermediate handler
for all signals at the user threads scheduler level, and never
have to worry about most of that crap.
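
A sketch of that interposition (a toy, not the actual libpthread
code): install one real handler for every catchable signal, keep
each user thread's mask in user space, and defer delivery rather
than issuing a sigprocmask() call on every mask change.

    #include <signal.h>

    struct uthread {
            sigset_t        ut_sigmask;     /* user-space signal mask */
            sigset_t        ut_pending;     /* deferred signals */
    };

    extern struct uthread *curthread;       /* set by the UTS */
    extern void deliver_to_thread(struct uthread *, int);

    /* The only handler the kernel ever sees.  The per-thread mask
     * is consulted in user space; a blocked signal is deferred,
     * with no kernel boundary crossing.
     */
    static void
    uts_sig_handler(int sig)
    {
            if (sigismember(&curthread->ut_sigmask, sig))
                    sigaddset(&curthread->ut_pending, sig);
            else
                    deliver_to_thread(curthread, sig);
    }

    void
    uts_sig_init(void)
    {
            struct sigaction sa;
            int sig;

            sa.sa_handler = uts_sig_handler;
            sa.sa_flags = 0;
            sigfillset(&sa.sa_mask);        /* serialize handler entry */
            for (sig = 1; sig < NSIG; sig++)
                    (void)sigaction(sig, &sa, NULL);  /* SIGKILL etc. fail */
    }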

I think the kernel boundary crossing overhead, and the fact
that in crossing you tend to relinquish a significant fraction
of your remaining quantum (by your own arguments), says that
protection domain crossings are to be avoided at all costs.


> In short, even if it is marginally faster, it doesn't seem like it is
> worth the effort and risk.  I don't want to discourage you from trying but
> this is why I stopped working on KSE proper and pursued the 1:1 model.

I'm glad you pursued it, even though I do not agree with your
reasoning on the value of N:M vs. 1:1.  I view it as "life
insurance" for the KSE code, which some people might be
otherwise tempted to rip out over some arbitrary deadline.

Thank you for your work here, and thank everyone else for
their work, too.

-- Terry


