Date:      Wed, 16 Dec 1998 03:04:31 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        vanmaren@fast.cs.utah.edu (Kevin Van Maren)
Cc:        smp@FreeBSD.ORG
Subject:   Re: Pthreads and SMP
Message-ID:  <199812160304.UAA13431@usr05.primenet.com>
In-Reply-To: <199812151632.JAA26636@fast.cs.utah.edu> from "Kevin Van Maren" at Dec 15, 98 09:32:51 am

> That aside, it is most certainly desirable to be able to run
> multiple threads in parallel.  The extent to which user threads
> are mapped onto processors is best controlled by some provided
> mechanism (such as pthread_setconcurrency and pthread_getconcurrency)
> rather than an inflexible policy such as "I believe it may be
> slow to run multiple threads at the same time".

These two interfaces are optional in a conforming pthreads implementation.
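
If you do want them, the calls themselves are trivial; a minimal
sketch, assuming the library provides the XSI functions at all (and
even where they exist, the value is only a hint the implementation
may ignore):

#include <pthread.h>
#include <stdio.h>

int
main(void)
{
        int error;

        /* Ask for four kernel execution contexts; the library may ignore it. */
        error = pthread_setconcurrency(4);
        if (error != 0)
                fprintf(stderr, "pthread_setconcurrency: error %d\n", error);

        /*
         * Returns the last level set, or 0 if the implementation is
         * managing the concurrency level itself.
         */
        printf("concurrency hint is now %d\n", pthread_getconcurrency());
        return (0);
}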


> As for Terry's beef about the page table, I don't know how often
> a typical app gets its page table updated, but I wouldn't think
> that would be common except when a) you are paging and other
> performance penalties are likely to be in the noise or b) more
> memory is being allocated/accessed by the process.  It is only
> necessary to do a TLB-shootdown when restricting the mappings.
> It isn't a problem if a processor takes a trap because its TLB
> was out of date and the page is really valid; it simply loads
> was out of date and the page is really valid; it simply loads
> the new info and continues.

The problem is write faults on pages whose translations both
processors have cached, because there is only one page table for
all of the threads in a single process.

This means that you need to invalidate or update cache contents
on CPUs that are, in fact, not *using* the cache contents and
couldn't care less.

This typically happens on copy-on-write faults and on stack growth
when you hit a guard page, especially if you are passing the
addresses of auto variables between threads.
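
For concreteness, a minimal sketch (not from any posted code) of the
"address of an auto variable" case: the worker writes through a
pointer into the creating thread's stack, so any mapping change on
that stack page has to be propagated to a CPU that otherwise has no
interest in it.

#include <pthread.h>
#include <stdio.h>

static void *
worker(void *arg)
{
        int *counter = arg;     /* points into main()'s stack */

        /*
         * If this write faults (copy-on-write after a fork, a guard
         * page, a protection change), the stale TLB entry on the
         * *other* CPU is the one that has to be shot down.
         */
        *counter += 1;
        return (NULL);
}

int
main(void)
{
        pthread_t tid;
        int counter = 0;        /* auto variable on main()'s stack */

        pthread_create(&tid, NULL, worker, &counter);
        pthread_join(tid, NULL);
        printf("counter = %d\n", counter);
        return (0);
}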

It's my personal experience that people use threads because they
don't know how to program effectively, and using threads efficiently
requires a lot of safety harness code in the OS for these people to
actually gain the benefit they think they will gain.


That said, all I was trying to point out is that there are constraints
on the efficiency of kernel threads that no one has addressed up to
this point except Sun Microsystems, and even then, I think they
screwed up the quantum model pretty badly (if the scheduler gives
me a quantum, it's *my* damn quantum, and if the scheduler will take
it away from me for making a blocking system call, then *screw* the
scheduler, I won't make blocking calls).  The name of the game is
to minimize context switch overhead.
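
By "call conversion" I mean roughly this shape of thing; a sketch
only, with uthread_wait_fd() as a made-up name standing in for the
user space scheduler's reschedule point (the stand-in below just
select()s, which is roughly how libc_r waits on behalf of its
threads):

#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Stand-in for the user space scheduler's reschedule point.  A real
 * call conversion scheduler runs another user thread here instead of
 * blocking; this placeholder just waits for the fd.
 */
static void
uthread_wait_fd(int fd)
{
        fd_set rset;

        FD_ZERO(&rset);
        FD_SET(fd, &rset);
        select(fd + 1, &rset, NULL, NULL, NULL);
}

/*
 * A blocking read() converted into a non-blocking attempt plus a
 * user space reschedule, so the kernel never takes the quantum away
 * from us for making a blocking call.
 */
ssize_t
converted_read(int fd, void *buf, size_t len)
{
        ssize_t n;

        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
        for (;;) {
                n = read(fd, buf, len);
                if (n >= 0 || errno != EAGAIN)
                        return (n);
                uthread_wait_fd(fd);    /* the quantum stays ours */
        }
}

int
main(void)
{
        char buf[128];
        ssize_t n;

        n = converted_read(STDIN_FILENO, buf, sizeof(buf));
        return (n >= 0 ? 0 : 1);
}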



> I believe we should add the necessary mechanisms to run threads
> in parallel and THEN look at the actual performance problems
> and address them.

These mechanisms already exist, as has been pointed out countless
times on this list.  They just aren't packaged up with their glue code
in a nice "pthread_create" routine some place, because doing that
without further kernel support would result in abysmal performance,
and would, in fact, be counter-productive.

The main reasons for the poor performance are the issues I've outlined
here.

You can very easily go to the -current list archives and search for
"John Dyson" and get a copy of the glue code.  Or you could directly
ask jmb@freebsd.org for the code.
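
To be concrete about what kind of mechanism we are talking about: a
new kernel-schedulable process that shares the address space, which
rfork(2) with RFMEM expresses directly.  A sketch of that shape,
using an rfork_thread() wrapper to do the stack hand-off; treat the
wrapper as an assumption about your libc, and none of this is John's
code:

#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int
thread_main(void *arg)
{
        printf("kernel-scheduled, sharing the address space: %s\n",
            (char *)arg);
        return (0);
}

int
main(void)
{
        const size_t stacksize = 64 * 1024;
        char *stack;
        pid_t pid;

        stack = malloc(stacksize);
        if (stack == NULL)
                return (1);

        /* RFPROC: create a new process; RFMEM: share the VM space. */
        pid = rfork_thread(RFPROC | RFMEM, stack + stacksize,
            thread_main, "hello");
        if (pid == -1) {
                perror("rfork_thread");
                return (1);
        }
        sleep(1);               /* crude: let the new entity run */
        return (0);
}

The whole game is in handing the new entity a stack of its own; raw
rfork(RFPROC|RFMEM) leaves the child on the parent's stack, which is
exactly the glue that isn't packaged up nicely anywhere.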


> If that means the scheduler needs to be improved, fine, we improve
> the scheduler.  If some applications run slower on multiple
> processors, we just have them call pthread_setconcurrency(1).  Shoot,
> Terry can make the default to be 1 on his machines.  Personally, I
> would like to be able to use pthread_create() instead of fork() to
> handle computation-bound requests.


Then feel free to integrate John's vfork based kernel threading into
your libc_r, and to add the appropriate pthread_setconcurrency()
functions to bring the implementation up to some documented standard,
instead of the limbo between Draft 4 and Draft 10 where it currently
lives.

There's really nothing stopping you from using the code; it was
posted to the list.  It's just that it would be real silly to
abandon a working Draft 10 (standard) pthreads to chase after
what some people in this thread are claiming is a computational
holy grail, and which others in this thread have already had
experience with on Solaris 2.3 and below and SVR4.0.2 and
UnixWare 2.x.  I can tell you: it's not even a grail-shaped
beacon.



> Terry has a point about wanting to design the system to be
> fast from the beginning.  That is almost certainly better
> than to design something, realize it is way too slow, and then
> hack on it forever.  However, having this working in the
> short term and rewriting it for the long term doesn't upset
> me too much -- I just want it working, and it will certainly
> be good enough for a large range of applications (even if
> it isn't large enough or good enough for Terry).

It won't be better than a user space call conversion scheduler
for the vast majority of threaded applications.  I've been through
the benchmarks on the code that was posted to -current and on the
similar SVR4 N:N kernel threading model and the SVR4 M:N, M>N
"lets starve all the user space threads from getting quantum"
Solaris 2.3 and UnixWare 2.x model.

It's not a question of "not good enough", it's a question of "if
you aren't going to follow the implementation through to
completion, there's no reason to bother starting down the road
at all".

I guess what I'm trying to point out is that there is a crisis of
commitment; merely having kernel threads won't make the code go
faster.  SMP scalability is not merely the ability to block between
threads waiting on each other's resources in the Big Giant Lock(tm)
in the kernel.  You can block on the User Space Call Conversion
Scheduler(tm) instead, and achieve exactly the same (lack of) effect.



If you are truly interested in pursuing SMP scalability via kernel
threads, the way to do it is to take the Dyson vfork() code, run
it on your own machine, and work up from there.

Insisting that FreeBSD switch from a non-SMP-scalable call conversion
model to a non-SMP-scalable, context-switch- and cache-busting
kernel threading model is not the way to go.

To get anywhere with that argument, you are going to have to be able
to beat the user space threads with your kernel space threads on a
uniprocessor system, and show improvement, or at least no degradation,
on an SMP system relative to the user space scheduler.

Kernel thread context switches are *not* lighter weight than process
context switches.  The cost is about equal, unless you have some way
of assuring CPU <-> thread group affinity.  Even then, you are talking
about starving other processes in favor of the thread group unless you
are very, very careful implementing your code.  The only place this
won't be true is a rigged benchmark on an otherwise idle machine,
such that your benchmark process never has to compete with any
other process, and so the page table is never pushed out to make
way for the page table for "init" or "syncd" or "nfsiod", etc.,
etc.

This is not a trivial problem to solve, and hitting the currently
limping-but-functional user space threads over the head with a
shovel and dragging the body away to make room for a different and
limping-but-even-less-functional new body won't cut it.


Ugh.  I need to drag my IEEE SMP and parallel processing literature
out of my back bedroom when I get home tonight, I guess.  Then I'll
be able to quote you chapter and verse.  8-(.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.



