Date:      Thu, 17 Dec 1998 00:10:12 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        james@westongold.com (James Mansion)
Cc:        tlambert@primenet.com, james@westongold.com, mal@algonet.se, alk@pobox.com, peter@netplex.com.au, gpalmer@FreeBSD.ORG, marcelk@stack.nl, smp@FreeBSD.ORG
Subject:   Re: Pthreads and SMP
Message-ID:  <199812170010.RAA07668@usr09.primenet.com>
In-Reply-To: <32BABEF63EAED111B2C5204C4F4F5020183C@WGP01> from "James Mansion" at Dec 16, 98 06:11:20 am

> > For a shared context server, the viability of the server is based
> > on how *little* context you actually have to contend for between
> > processors.
> 
> Look, I don't give a fig for irrelevant petty tasks like file serving.

Don't think about it like this.  The train of thought is too limiting.
Think of CPU cycles as if they were a resource.


> I'm sorry that you can't understand that actually the whole point of
> these silly computer things is to process data rather than just move
> it.
> 
> My personal interest is twofold:
>  - I want to compute while IO is under way

Use prefetch notification (via madvise) and use John Birrell's AIO
patches to -current.
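
As a rough sketch of what that combination buys you (my illustration,
not code from the tree; it assumes a system providing madvise(2) with
MADV_WILLNEED and the POSIX aio_read(2)/aio_error(2)/aio_return(2)
interface from the AIO patches), the idea is to queue the read, do
your computation, and only synchronize when you actually need the
data:

/*
 * Rough sketch: overlap computation with disk I/O.  madvise(MADV_WILLNEED)
 * hints the VM system to start faulting the file in; aio_read() queues an
 * asynchronous read that we only wait on after doing useful work.
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int fd;
	struct stat sb;
	void *map;
	struct aiocb cb;
	static char buf[65536];

	if (argc < 2) {
		fprintf(stderr, "usage: %s file\n", argv[0]);
		return (1);
	}
	if ((fd = open(argv[1], O_RDONLY)) == -1 || fstat(fd, &sb) == -1)
		return (1);

	/* Prefetch hint: ask the VM system to start paging the file in. */
	map = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map != MAP_FAILED)
		(void)madvise(map, (size_t)sb.st_size, MADV_WILLNEED);

	/* Queue an asynchronous read of the first chunk. */
	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = sizeof(buf);
	cb.aio_offset = 0;
	if (aio_read(&cb) == -1)
		return (1);

	/* ... do the real computation here while the read is in flight ... */

	/* Synchronize only when the data is actually needed. */
	while (aio_error(&cb) == EINPROGRESS)
		;	/* or sleep in aio_suspend() instead of spinning */
	printf("read %ld bytes while computing\n", (long)aio_return(&cb));

	if (map != MAP_FAILED)
		munmap(map, (size_t)sb.st_size);
	close(fd);
	return (0);
}

The computation proceeds while the read is in flight; you only pay
for the wait at the point where you consume the bytes.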


>  - I want to speed computations (especially monte-carlo calcs) that
>    can be split, on SMP boxes

I have some familiarity with this problem on SMP SPARC hardware,
since I have worked on SMP-capable software for simulating
relativistically invariant particle collisions via monte-carlo,
with the resulting collisions being constrained to "reasonableness"
by the physics being modelled (a solution of 12 Feynman-Dyson
diagrams).

From personal experience with the LBL code for relativistically
invariant collisions on a dual P90 FreeBSD box (starting on June
19th, 1995, using the October 27th, 1995 snapshot of FreeBSD, when
I first got a FreeBSD box to run SMP using Jack Vogel's initial
code), I can tell you that, even today, SMP scalability is very
poor using kernel threads vs. separate processes.


> In the former case, even a good async IO implementation will not
> help without all the open and sync functions being available through
> aio - and they aren't.  And select/poll isn't even close, given the
> behaviour with 'local' file fds.

Right.  This is my main complaint with async I/O, especially as it
has been codified by POSIX.

My personal preference?  Use async call gates, and return context
references for calls which block.  Divorce the kernel stack from
the user space process entirely.

A putative kernel threads implementation won't help you much in
this case, either, since it only buys you call contexts on which
to block.  Big deal: you get to be descheduled in favor of "syncd"
or "swapper" or some other process that doesn't have your page
table in its heart of hearts.  You might as well be running in
a separate process, since the context switch overhead is going to
be the same, and, at least with a separate process, you don't have
to pay for TLB shootdown.

I guess the I/O interleave assumptions you are making presume that
you are using SCSI, right?  ...since IDE devices will
serialize the I/O at the IDE interfaces, whereas the SCSI devices
can interleave I/O requests to the device using a tagged command
queue.  Otherwise, you might as well be running in a select loop.


> In the latter, I have a non-trivial data structure (possibly large)
> that will be read during computation.  Its stupid to build it from
> scratch in multiple processes, and threads serve well.

So does a single mmap'ed file or a SYSV SHM segment loaded by the
initial process.  What you seem to be complaining about here is
page table setup and teardown, since once the pages are in core,
the pages are shared (FreeBSD has a unified VM and buffer cache,
and the vnode *directly* backs the object, unlike SVR4, where you
would pay a buffer-cache-to-VM copy overhead for each process).
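
To illustrate (a minimal sketch, assumptions mine, not anyone's
production code): the initial process builds the structure once in a
shared mapping, then fork()s the workers, which all reference the
same physical pages, so nothing is rebuilt per process.  A
file-backed MAP_SHARED mmap or a SYSV shmget()/shmat() segment
behaves the same way as the anonymous mapping used here:

/*
 * Rough sketch: build the read-mostly data structure exactly once in the
 * initial process, in a shared mapping, then fork() the workers.  All of
 * the workers see the same physical pages; nothing is rebuilt per process.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define	NWORKERS	4
#define	TABLE_BYTES	(16UL * 1024 * 1024)

int
main(void)
{
	double *table;
	size_t i, n = TABLE_BYTES / sizeof(double);
	int w;

	/*
	 * Anonymous shared mapping; a file-backed MAP_SHARED mmap or a
	 * SYSV shmget()/shmat() segment works the same way here.
	 */
	table = mmap(NULL, TABLE_BYTES, PROT_READ | PROT_WRITE,
	    MAP_SHARED | MAP_ANON, -1, 0);
	if (table == MAP_FAILED)
		return (1);

	/* The expensive build happens once, in the parent. */
	for (i = 0; i < n; i++)
		table[i] = (double)i;

	for (w = 0; w < NWORKERS; w++) {
		if (fork() == 0) {
			/* Each worker reads its slice of the shared pages. */
			double sum = 0.0;

			for (i = w * (n / NWORKERS);
			    i < (w + 1) * (n / NWORKERS); i++)
				sum += table[i];
			printf("worker %d: partial sum %g\n", w, sum);
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	munmap(table, TABLE_BYTES);
	return (0);
}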


> In practice, DBMS systems are also compute bound, and also have
> significant shared state.  Its not just investment banking or
> engineering weenies who have to deal with this.

Actually, this is an architecture issue that has so far been
ignored when people go to implement their code.

I think the correct solution for that *particular* problem
would be to export a dependency registration mechanism from the
Soft Updates code, such that user transactions could be marked
to occur in dependency order as well.  This would be a general
win, and would ensure that the DBMS was *not* I/O bottlenecked
to the disk.  This type of thing is why the DBMS weenies all get
so hot and bothered about log structured and journalling FS
technology.

The "compute binding" of DBMS technology is in the transaction
latency, not in actual computational latency.


> > It's more an artifact of having a common page table between all
> > of the processors, such that the same invalidations affect all
> > of the processors instead of one processor.
> 
> And you have evidence that, in the case where primary cache lines
> are not being invalidated in a contended region, that this has a
> major effect?

Yes.  Empirical evidence from physics simulations, which is what I
assume you are planning on using the thing for, given your reference
to monte-carlo.

Specifically, the fact that IPI's are generated at L2 cache speed
instead of the actual processor speed.


> You don't think that the evidence of scalability on real, cheap
> Intel-based kit with SMP DBMS shows that the effect need not be
> a major headache, and that useful gains can be made, even with
> the 8th CPU and even with NT?

No, I don't.  Most SMP DBMS scalability has been implemented using
what SVR4 calls "scheduling classes" -- and they've addressed these
issues architecturally.

For example, if you were going to do a monte-carlo run of the
type discussed above, you would have to separate the work into
discrete components.  DBMS's do this intrinsically, implementing
"work to do" engines that separate context from connection, and
that's why they actually can get performance wins on SMP: they
are designed and tuned for the limitations of the SMP hardware
on which they are expected to run.
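
Here is a minimal sketch of what "separating the work into discrete
components" means for a monte-carlo run (my example, not how any
particular DBMS engine is implemented): each worker gets an
independent slice of the trials and hands back exactly one result,
so there is no shared mutable state for the processors to contend
over:

/*
 * Rough sketch: the monte-carlo run split into discrete, independent
 * components.  Each worker process handles its own slice of the trials
 * and reports a single result back over a pipe, so there is no shared
 * mutable state for the processors to fight over.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define	NWORKERS	4
#define	TRIALS		1000000UL

int
main(void)
{
	int fds[2], w;
	unsigned long hits, total = 0;

	if (pipe(fds) == -1)
		return (1);

	for (w = 0; w < NWORKERS; w++) {
		if (fork() == 0) {
			/* Discrete work item: TRIALS/NWORKERS samples. */
			unsigned long i, mine = 0;

			srand48((long)getpid());
			for (i = 0; i < TRIALS / NWORKERS; i++) {
				double x = drand48(), y = drand48();

				if (x * x + y * y <= 1.0)
					mine++;
			}
			/* One small, atomic pipe write per worker. */
			write(fds[1], &mine, sizeof(mine));
			_exit(0);
		}
	}
	for (w = 0; w < NWORKERS; w++) {
		if (read(fds[0], &hits, sizeof(hits)) == (ssize_t)sizeof(hits))
			total += hits;
	}
	while (wait(NULL) > 0)
		;
	printf("pi is approximately %f\n",
	    4.0 * (double)total / (double)TRIALS);
	return (0);
}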


> > The problem space mappable using SMP is not trivially small, but
> > it's a hell of a lot smaller than the problem space that can be
> > mapped with 32 times the computations per clock cycle on a linear
> > uniprocessor.
> 
> And this observation is practical given Intel's attack on 4- and
> 8-way volume servers?  I don't think so.

Traditionally, the only good Intel SMP boxes are the ones that come
out of their server products group, and Intel doesn't sell them to
people; they only kick them out as reference implementations.
It's a strange world.

The idea behind the referenced "attack" is the same idea behind
the claim that you can add a bunch of PC's together and get a
supercomputer.


> > It's my personal experience that people use threads because they
> > don't know how to program effectively, and using threads efficiently
> > requires a lot of safety harness code in the OS for these people to
> > actually gain the benefit they think they will gain.
> 
> Its MY personal experience that you should get a life.  The OS is
> only there to make our lives easier, so we can do our real jobs,
> no?

Yes, it is.  And you can probably complain loudly enough that
people will implement the APIs you want for kernel threading
in -current, instead of making you search the -current list
archives yourself for the kernel threading glue code.  But there
is going to be no advantage to you in having done this, and
FreeBSD is going to look bad as a result, since it'll be nothing
more than a facade.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.



