Date: Thu, 17 Dec 1998 00:10:12 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: james@westongold.com (James Mansion)
Cc: tlambert@primenet.com, james@westongold.com, mal@algonet.se,
    alk@pobox.com, peter@netplex.com.au, gpalmer@FreeBSD.ORG,
    marcelk@stack.nl, smp@FreeBSD.ORG
Subject: Re: Pthreads and SMP
Message-ID: <199812170010.RAA07668@usr09.primenet.com>
In-Reply-To: <32BABEF63EAED111B2C5204C4F4F5020183C@WGP01> from "James Mansion" at Dec 16, 98 06:11:20 am
> > For a shared context server, the viability of the server is based
> > on how *little* context you actually have to contend for between
> > processors.
>
> Look, I don't give a fig for irrelevant petty tasks like file serving.

Don't think about it like this.  The train of thought is too limiting.
Think of CPU cycles as if they were a resource.

> I'm sorry that you can't understand that actually the whole point of
> these silly computer things is to process data rather than just move
> it.
>
> My personal interest is twofold:
> - I want to compute while IO is under way

Use prefetch notification (via madvise) and use John Birrell's AIO
patches to -current.

> - I want to speed computations (especially monte-carlo calcs) that
>   can be split, on SMP boxes

I have some familiarity with this problem, on SMP SPARC hardware, since
I have worked on SMP capable software for simulating relativistically
invariant particle collisions via monte-carlo, with the resulting
collisions being constrained to "reasonableness" by the physics being
modelled (a solution of 12 Feynman-Dyson diagrams).

From personal experience with the LBL code for relativistically
invariant collisions on a dual P90 FreeBSD box, starting on June 19th,
1995, using the October 27th, 1995 snapshot of FreeBSD, when I first
got a FreeBSD box to run SMP using Jack Vogel's initial code, I can
tell you that SMP scalability is still very poor, even today, using
kernel threads vs. separate processes.

> In the former case, even a good async IO implementation will not
> help without all the open and sync functions being available through
> aio - and they aren't.  And select/poll isn't even close, given the
> behaviour with 'local' file fds.

Right.  This is my main complaint with async I/O, especially as it has
been codified by POSIX.

My personal preference?  Use async call gates, and return context
references for calls which block.  Divorce the kernel stack from the
user space process entirely.
A putative kernel threads implementation won't help you much in this
case, either, since it only buys you call contexts on which to block.
Big deal: you get to be descheduled in favor of "syncd" or "swapper"
or some other process that doesn't have your page table in its heart
of hearts.  You might as well be running in a separate process, since
the context switch overhead is going to be the same, and, at least
with a separate process, you don't have to pay for TLB shootdown.

I guess the assumptions you are making about I/O interleave assume you
are using SCSI, right?  ...since IDE devices will serialize the I/O at
the IDE interface, whereas SCSI devices can interleave I/O requests to
the device using a tagged command queue.  Otherwise, you might as well
be running in a select loop.

> In the latter, I have a non-trivial data structure (possibly large)
> that will be read during computation.  It's stupid to build it from
> scratch in multiple processes, and threads serve well.

So does a single mmap'ed file or a SYSV SHM segment loaded by the
initial process.

What you seem to be complaining about here is page table setup and
teardown, since once the pages are in core, the pages are shared
(FreeBSD has a unified VM and buffer cache, and the vnode *directly*
backs the object, unlike SVR4, where you would pay a buffer-cache to
VM copy overhead for each process).

> In practice, DBMS systems are also compute bound, and also have
> significant shared state.  It's not just investment banking or
> engineering weenies who have to deal with this.

Actually, this is an architecture issue that has so far been ignored
when people go to implement their code.

I think the correct solution for that *particular* problem would be to
export a dependency registration mechanism from the Soft Updates code,
such that user transactions could be marked to occur in dependency
order as well.  This would be a general win, and would ensure that the
DBMS I/O was *not* I/O bottlenecked to the disk.
This type of thing is why the DBMS weenies all get so hot and bothered
about log structured and journalling FS technology.  The "compute
binding" of DBMS technology is in the transaction latency, not in
actual computational latency.

> > It's more an artifact of having a common page table between all
> > of the processors, such that the same invalidations affect all
> > of the processors instead of one processor.
>
> And you have evidence that, in the case where primary cache lines
> are not being invalidated in a contended region, that this has a
> major effect?

Yes.  Empirical evidence from physics simulations, which is what I
assume you are planning on using the thing for, given your reference
to monte-carlo.  Specifically, the fact that IPIs are generated at L2
cache speed instead of the actual processor speed.

> You don't think that the evidence of scalability on real, cheap
> Intel-based kit with SMP DBMS shows that the effect need not be
> a major headache, and that useful gains can be made, even with
> the 8th CPU and even with NT?

No, I don't.  Most SMP DBMS scalability has been implemented using
what SVR4 calls "scheduling classes" -- they've addressed these issues
architecturally.

For example, if you were going to do a monte-carlo run of the type
discussed above, you would have to separate the work into discrete
components.  DBMS's do this intrinsically, implementing "work to do"
engines that separate context from connection, and that's why they
actually can get performance wins on SMP: they are designed and tuned
for the limitations of the SMP hardware on which they are expected to
run.

> > The problem space mappable using SMP is not trivially small, but
> > it's a hell of a lot smaller than the problem space that can be
> > mapped with 32 times the computations per clock cycle on a linear
> > uniprocessor.
>
> And this observation is practical given Intel's attack on 4- and
> 8-way volume servers?  I don't think so.
The only good Intel SMP boxes are the ones that come out of their
server products group, traditionally, and Intel doesn't sell them to
people; they only kick them out as reference implementations.  It's a
strange world.

The idea behind the referenced "attack" is the same idea behind the
claim that you can add a bunch of PC's together and get a
supercomputer.

> > It's my personal experience that people use threads because they
> > don't know how to program effectively, and using threads efficiently
> > requires a lot of safety harness code in the OS for these people to
> > actually gain the benefit they think they will gain.
>
> It's MY personal experience that you should get a life.  The OS is
> only there to make our lives easier, so we can do our real jobs,
> no?

Yes, it is.  And you can probably complain loudly enough that people
will implement the APIs you want for kernel threading in -current,
instead of making you type a list search pattern into the -current
list archives to find the kernel threading glue code on your own.

But there is going to be no advantage to you in having done this, and
FreeBSD is going to look bad as a result, since it'll be nothing more
than a facade.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-smp" in the body of the message