From owner-freebsd-hackers Mon Dec  4 21:45:47 2000
From owner-freebsd-hackers@FreeBSD.ORG Mon Dec  4 21:45:38 2000
Return-Path:
Delivered-To: freebsd-hackers@freebsd.org
Received: from crotchety.newsbastards.org (netcop.newsbastards.org [193.162.153.124])
	by hub.freebsd.org (Postfix) with ESMTP id 2652537B400
	for ; Mon, 4 Dec 2000 21:45:37 -0800 (PST)
Received: (from news@localhost)
	by crotchety.newsbastards.org (8.11.1/8.11.1) id eB55jL453889;
	Tue, 5 Dec 2000 06:45:22 +0100 (CET)
	(envelope-from newsuser@free-pr0n.netscum.dk)
Date: Tue, 5 Dec 2000 06:45:22 +0100 (CET)
Message-Id: <200012050545.eB55jL453889@crotchety.newsbastards.org>
X-Authentication-Warning: crotchety.newsbastards.org: news set sender to newsuser@free-pr0n.netscum.dk using -f
Reply-To: freebsd-user@netscum.dk
From: News History File User
To: Matt Dillon
Cc: hackers@freebsd.org, usenet@tdk.net
Subject: Re: vm_pageout_scan badness
References: <200012011918.eB1JIol53670@earth.backplane.com>
	<200012020525.eB25PPQ92768@newsmangler.inet.tele.dk>
	<200012021904.eB2J4An63970@earth.backplane.com>
	<200012030700.eB370XJ22476@newsmangler.inet.tele.dk>
	<200012040053.eB40rnm69425@earth.backplane.com>
In-Reply-To: <200012040053.eB40rnm69425@earth.backplane.com>
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> ok, since I got about 6 requests in four hours to be Cc'd, I'm
> throwing this back onto the list.  Sorry for the double-response that
> some people are going to get!

Ah, good, since I've been deliberately avoiding reading mail in an
attempt to get something useful done in my last days in the country,
and probably wouldn't have gotten around to reading it until I'm
without Net access in a couple of weeks...
(Also, because your mailer seems to be ignoring the `Reply-To:' header
I've been using, but I'd get a copy through the Cc: list, in case you
puzzled over why your previous messages bounced.)

> I am going to include some additional thoughts in the front, then break
> to my originally private email response.

I'll mention that I've discovered the miracle of man pages, and found
the interesting `madvise' capability of `MADV_WILLNEED' that, from the
description, looks very promising.  Pity the results I'm seeing still
don't match my expectations.

Also, in case the amount of system memory on this machine might be
insufficient to do what I want with the size of the
history.hash/.index files, I've just gotten an upgrade to a full gig.
Unfortunately, performance is now worse than it had been, so it looks
like I'll be butchering the k0deZ to see if I can get my way.

Now, for `madvise' -- this is already used in the INN source in
lib/dbz.c (where one would add MAP_NOSYNC to the MAP__FLAGS) as
MADV_RANDOM -- this matches the random access pattern of the history
hash table.  Supposedly, MADV_WILLNEED will tell the system to avoid
freeing these pages, which looks to be my holy grail of this week,
plus the immediate mapping certainly can't hurt.

There's only a single madvise call in the INN source, but I see that
the Diablo code does make two calls to it (although both WILLNEED and,
unlike INN, SEQUENTIAL access -- this could be part of the cause of
the apparent misunderstanding of the INN history file that I see
below).  Since it looks to my non-programmer eyes like I can't combine
the behaviours in a single call, I followed Diablo's example and
specified both RANDOM and the WILLNEED that I thought would improve
things.
The machine is, of course, as you can see from the timings, not
optimized at all, since I've just thrown something together as a proof
of concept after running into a brick wall with the codes under test
with Slowaris.  And because a departmental edict has come down that I
must migrate all services off Free/NetBSD and onto Slowaris, I can't
expect to get the needed hardware to beef up the system -- even though
the MAP_NOSYNC option on the transit machine enabled it to whup the
pants off a far more expensive chunk of Sun hardware.  So I'm trying
to be able to say `Look, see?  See what you can do with FreeBSD' as
I'm shown out the door.

> I ran a couple of tests with MAP_NOSYNC to make sure that the
> fragmentation issue is real.  It definitely is.  If you create a
> file by ftruncate()ing it to a large size, then mmap() it SHARED +
> NOSYNC, then modify the file via the mmap, massive fragmentation occurs

I've heard it confirmed that even the newer INN does not mmap() the
newly-created files for makehistory or expire.  As reported to the
INN-workers mailing list:

: From: rmtodd@servalan.servalan.com (Richard Todd)
: Newsgroups: mailing.unix.inn-workers
: Subject: Re: expire/makehistory and mmap/madvise'd dbz filez
: Date: 4 Dec 2000 06:30:47 +0800
: Message-ID: <90ehin$1ndk$1@FreeBSD.csie.NCTU.edu.tw>
:
: In servalan.mailinglist.inn-workers you write:
:
: >Moin moin
:
: >I'm engaged in a discussion on one of the FreeBSD developer lists
: >and I thought I'd verify the present source against my memory of how
: >INN 1.5 runs, to see if I might be having problems...
:
: >Anyway, the Makefile in the 1.5 expire directory has the following bit,
: >that seems to be absent in present source, and I didn't see any
: >obvious indication in the makedbz source as to how it's initializing
: >the new files, which, if done wrong, could trigger some bugs, at least
: >when `expire' is run.
:
: ># Build our own version of dbz.o for expire and makehistory, to avoid
: ># any -DMMAP in DBZCFLAGS - using mmap() for dbz in expire can slow it
: ># down really bad, and has no benefits as it pertains to the *new* .pag.
: >dbz.o: ../lib/dbz.c
: >	$(CC) $(CFLAGS) -c ../lib/dbz.c
:
: >Is this functionality in the newest expire, or do I need to go a hackin'?
:
: Whether dbz uses mmap or not on a given invocation is controlled by the
: dbzsetoptions() call; look for that call and the setting of the INCORE_MEM
: option in expire/expire.c and expire/makedbz.c.  Neither expire nor
: makedbz mmaps the new dbz indices it creates.

The remaining condition I'm not positive about is the overflow case,
which ideally would never arise, and is not arising on this machine
now.

> on the file.  This is easily demonstrated by issuing a sequential read
> on the file and noting that the system is not able to do any clustering
> whatsoever and gets a measly 0.6MB/sec of throughput (on a disk
> that can do 12-15MB/sec).  (and the disk seeks wildly during the read).

This is the sort of rate that I see when sync-like updates are being
done to the on-disk history database files, as though the data being
written to them is not well-sorted beforehand, and it's seeking all
over the place...

> When you create a large file and fill it with zero's, THEN mmap() it
> SHARED + NOSYNC and write to it randomly via the mmap(), the file
> remains laid out on disk optimally.

However, I noticed something interesting!  For what it's worth, in
case this might be a problem: before starting INN after adding the
memory, I did a `cp -p' on both the history.hash and history.index
files, just to start fresh and clean.  It didn't seem to help much, if
at all.

> large file == at least 2 x main memory.

Okay, but what about small files?
The particular files where I have performance problems are each about
1/5 or less the size of main memory (108000000 and 72000000 bytes) on
this system, comparable to the sizes on the other well-running system
with less memory (400+MB) you quoted below...

> -- original response --
>
> Ok, let's concentrate on your hishave, artclean, artctrl, and overview
> numbers.

Hey!  HEY!@!  I'm not ready to do a complete tuning yet!  ;-)

> :-rw-rw-r--  1 news  news  436206889 Dec  3 05:22 history
> :-rw-rw-r--  1 news  news         67 Dec  3 05:22 history.dir
> :-rw-rw-r--  1 news  news   81000000 Dec  1 01:55 history.hash
> :-rw-rw-r--  1 news  news   54000000 Nov 30 22:49 history.index
> :
> :More observations that may or may not mean anything -- before rebooting,
> :I timed the `fsync' commands on the 108MB and 72MB history files, as
>
>     note: the fsync command will not flush MAP_NOSYNC pages.

Ummm...  Uh...  Really?  It seems that it did...  After I gave the
userland `fsync' command on these files just before rebooting, I did
not see the lengthy disk activity I've normally associated with using
NOSYNC at times of reboot.  Normally I'll see three or four minutes of
disk grinding following the `syncing disks -- done' messages, but this
time, the history drive light flashed for a fraction of a second
before continuing with the reboot.  Also, the timestamps on the files
got updated, just as they do after shutdown at reboot time...

> :The time taken to do the `fsync' was around one minute for the two
> :history files.  And around 1 second for the BerkeleyDB file...
>
>     This is an indication of file fragmentation, probably due to holes
>     in the history file being filled via the mmap() instead of filled via
>     write().

But...
If the message I quoted is accurate (I don't trust myself to look at
the source), or if I initialize the history database files with `cp'
prior to starting, and no expire or other process that would extend
the files has run, wouldn't the time needed to `fsync' the files
indicate that the in-memory data to be dumped isn't well-sorted before
being written to disk?  The data is accessed and updated randomly in
the INN history mechanism...

> :are a bit, well, wrong.  The way I see it, by giving the MAP_NOSYNC
> :flag, I'm sort of asking for preferential treatment, kinda like mlock,
>
>     The pages are treated the way any VM page is treated... they'll
>     be cached based on use.  I don't think this is the problem.

I think it is a problem -- at least, it's enough of a performance hit
that I'm willing to sacrifice a couple hundred megs of RAM to avoid
hitting the disk -- our NetBSD machines, in spite of the bug, achieve
stellar performance levels by leaving the history files alone for
days.

The problem is that the pages are accessed randomly, so that prior use
is no indication of future use.  Also, what seems to be happening is
that a random selection of pages is being updated on-disk now, which
takes, oh, looks like four or five seconds every so often.  During
this update time (which MAP_NOSYNC helps to avoid, sometimes), all
other history access blocks, including the lookups I'm doing as a
reader.  Then grabbing a random page that needs to be consulted but
isn't still in memory gets expensive.  Eeep, that last slowdown took
8 seconds...

It's true that I haven't rebuilt things without MAP_NOSYNC to see how
much time is spent on periodic sync activity, but what I'm seeing now,
even with a gig of RAM and an ever-increasing list of null-routed
networks to keep abusers away, is starting to resemble that level of
poor performance.
> Ok, let's look at a summary of your timing results:
>
>     hishave         overv          artclean      artctrl
>     38857(26474)    112176(6077)   12264(6930)   2297(308)
>     22114(28196)    136855(6402)   12757(7295)   1257(322)
>     13614(24312)    156723(6071)   13232(6800)    324(244)
>      9944(25198)    164223(6620)   13441(7753)    255(160)
>      2777(50732)     24979(3788)   29821(4017)    131(51)
>     31975(11904)     21593(3320)   25148(3567)   5935(340)
>
> Specifically, look at the last one where it blew up on you.  hishave
> and artctrl are much worse, overview and artclean are about the same.

Artctrl, at six seconds out of five minutes, isn't of too much concern
to me now.  This would be (almost exclusively) cancel messages -- I
haven't looked to see just how the CNFS cyclical-buffer
multiple-articles-in-one-large-file method of storage handles
cancels...  Certainly not by unlinking an inode as with the old news
spool we all grew up to know and loathe.

Artclean is an addition I've made that wasn't present in Joe Greco's
timer (I don't remember if you had much experience with this, or if
you were busy writing Diablo at the time, so if I'm explaining
something that's obvious, just assume it's for the benefit of the
audience who may appreciate a bit of background).  It's an in-memory
operation that one of the developers recently realized is pretty
expensive, and it is dependent on the size of the article being
processed.

The top timings are about twice the normal average rate of ten
articles/second (though I have seen well in excess of 35/sec sustained
for a one-hour period, arrrg) and are that way because I'm bringing in
backlogs.  For whatever reason, a backlogged feed gives priority to
taking in smaller articles first, and as a result, even though more
articles are processed, the time taken overall is less than when
handling a smaller number of articles with an order of magnitude more
volume.  But this explanation is all pretty irrelevant to the
immediate discussion...  Anyway...

> This is an indication of excessive seeking on the history disk.
> I believe that this seeking may be due to file fragmentation.

To me, sitting on the floor in front of the machine, it looks like it
indicates there is activity on the history disk.  The ideal situation
(from my experience with both NetBSD and the transit machines) is that
the history drive light stays off.  Which it does, mostly, during the
Good Timings, after the bulk of the text data is read from the disk.

The problem is that right now, rather than remaining idle and letting
the other disks take a pounding, I'm seeing non-stop flickering of the
history drive, which I don't see on the transit machines, and during
the 5 to 10 seconds' worth of blocked history lookups, the light is
constantly lit as it seeks -- which, since I've done a `cp' on the two
files, I believe is a result of the randomness of the pages being
written.

> There is an easy way to test file fragmentation.  Kill off everything
> and do a 'dd if=history of=/dev/null bs=32k'.  Do the same for
> history.hash and history.index.  Look at the iostat on the history
> drive.  Specifically, do an 'iostat 1' and look at the KB/t (kilobytes
> per transfer).  You should see 32-64KB/t.  If you see 8K/t the file
> is severely fragmented.  Go through the entire history file(s) w/ dd...
> the fragmentation may occur near the end.

Okay, I'm doing this: the two hash-type files give me between 9 and
10KB/t; the history text file gives me more like 60KB/t.  Hmmm.

It's not the best drive in the world, I know that, but I had set up a
NetBSD transit machine with five-year-old drives and saw average
history lookup times of less than 10 microseconds each, just because
it wasn't hitting the disk much.  On an old PPro 200.

> If the file turns out to be fragmented, the only way to fix it is to
> fix the code that extends the file.
> Instead of ftruncate()ing the file
> and then appending to it via the mmap(), you should modify the
> ftruncate() code to fill in the hole with write()'s before returning,
> so the modifications via mmap() are modifying pages that already have
> file-backing store rather than filling in holes.
>
> Then rewrite the history file (e.g. 'cp'), and restart innd.

But...  But...  But...  but...  but but but but...  but
butbutbut*BANG*but but...  (Sorry, running a bit low on fuel.)

The thing is, before I started this particular instance of INN, I had
just made a copy of the two hash files, the ones that give the
less-than-ideal throughput as shown.  I didn't bother with the text
file, as it's never even consulted in the recent history k0deZ (unless
one uses what's now called `tagged hash', which uses the text file as
a reference after hashes are looked up in the .pag file), and I've
been sure that hasn't been a problem -- all accesses there are
sequential.

What I did note as interesting was that Diablo's k0deZ use the madvise
MADV_WILLNEED, which seems reasonable, but unlike the random access of
INN's hash files, it hints at SEQUENTIAL access.  Now, I'll be the
first to confess to not having looked at the history k0deZ you wrote
for Diablo and just how it works, but whatever is happening when
handling the INN random history is definitely not very optimal.

Now, there appear to be several things conspiring against me.  It
looks like the MAP_NOSYNC option to mmap is best suited for mmap'ed
file-based IPC, where the contents of what's on the disk is less
critical.  When I use it with a history file, I run the risk of a
crash (read: I get my leg wrapped in the power cord as I thrash about
in my sleep) where most of the updates to the .index and .hash files
go to the Big Blue Bitbucket, to join most of my work over the last
years.  This isn't fatal, as the two files can be recreated from the
text file when one is aware of a crash.
I can live with it, as, at the moment, it seems to be the only way to
get decent history performance without a significant hardware
investment, or without accepting the will of my boss and using a
different OS.

Now, about madvise: I'm not quite sure how specifying `MADV_RANDOM'
changes the behaviour, or whether it has any effect on the flushing of
the data to the disk when needed -- it appears to my addled brane that
with a truly random set of data like the .index/.hash files, you can't
attempt to predict which pages will be best to flush due to
inactivity, since any random page should be equally likely to be
accessed as any other.  If anything, you'd want to write out
sequential pages for speed.  Now, if that's what is happening, then
somehow things look to be getting messed up from the time I made the
`cp' some hours ago that should have created virginally pure,
unfragmented database files...

I had suggested that `MAP_NOSYNC' be a hint that the pages should get
preferential treatment to be locked in memory.  Maybe overkill for
your typical IPC file, but perhaps the combination of MAP/MADV_NOSYNC
and MADV_RANDOM should be a hint to prioritize these pages.

Okay, now to MADV_WILLNEED -- I hoped that would be the needed kick to
effectively give me the `mlock'-like performance, but even with a full
gig of memory and a dozen relatively-behaved readers, there's still
way too much history disk activity that really drags down overall
performance.  Okay, so maybe the combination of MAP/MADV_NOSYNC and
MADV_WILLNEED should be the hint needed to avoid paging out those
pages, even if they haven't been accessed in quite some time.
Especially if it's in combination with MADV_RANDOM...

The numbers you've seen aren't very impressive -- I'm quite sure that
I was able to keep up easily with NetBSD, although the corruption that
resulted there caused me to abandon ship quickly.
As it is, I can't keep up with the overnight binaries peak in a full
feed, as the history times are taking up two minutes out of every five
now.  Bleeecccchhh.

The actual history lookups and updates that matter are all done within
the memory taken up by the .index and .hash files.  So, by keeping
them in memory, one doesn't need to do any disk activity at all for
lookups; and updates, well, so long as you commit them to the disk at
shutdown, all should be okay.  That's what I'm attempting to achieve.
These lookups and updates are bleedin' expensive when disk activity
rears its ugly head.

Not to worry, I'm going to keep plugging to see if there is a way for
me to lock these two files into memory so that they *stay* there, just
to prove whether or not that's a significant performance improvement.
I may have to break something, but hey...  (sheesh, two minutes... :~-)

thanks,
barry bouwsma, obstinate stubborn newsbast'rd


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message