From owner-freebsd-hackers Mon Dec  4 21:45:47 2000
From owner-freebsd-hackers@FreeBSD.ORG Mon Dec  4 21:45:38 2000
Return-Path:
Delivered-To: freebsd-hackers@freebsd.org
Received: from crotchety.newsbastards.org (netcop.newsbastards.org [193.162.153.124])
	by hub.freebsd.org (Postfix) with ESMTP id 2652537B400
	for ; Mon, 4 Dec 2000 21:45:37 -0800 (PST)
Received: (from news@localhost)
	by crotchety.newsbastards.org (8.11.1/8.11.1) id eB55jL453889;
	Tue, 5 Dec 2000 06:45:22 +0100 (CET)
	(envelope-from newsuser@free-pr0n.netscum.dk)
Date: Tue, 5 Dec 2000 06:45:22 +0100 (CET)
Message-Id: <200012050545.eB55jL453889@crotchety.newsbastards.org>
X-Authentication-Warning: crotchety.newsbastards.org: news set sender to newsuser@free-pr0n.netscum.dk using -f
Reply-To: freebsd-user@netscum.dk
From: News History File User
To: Matt Dillon
Cc: hackers@freebsd.org, usenet@tdk.net
Subject: Re: vm_pageout_scan badness
References: <200012011918.eB1JIol53670@earth.backplane.com>
	<200012020525.eB25PPQ92768@newsmangler.inet.tele.dk>
	<200012021904.eB2J4An63970@earth.backplane.com>
	<200012030700.eB370XJ22476@newsmangler.inet.tele.dk>
	<200012040053.eB40rnm69425@earth.backplane.com>
In-Reply-To: <200012040053.eB40rnm69425@earth.backplane.com>
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> ok, since I got about 6 requests in four hours to be Cc'd, I'm
> throwing this back onto the list.  Sorry for the double-response that
> some people are going to get!

Ah, good, since I've been deliberately avoiding reading mail in an
attempt to get something useful done in my last days in the country,
and probably wouldn't have gotten around to reading it until I'm
without Net access in a couple of weeks...
(Also, because your mailer seems to be ignoring the `Reply-To:' header
I've been using, but I'd get a copy through the Cc: list, in case you
puzzled over why your previous messages bounced.)

> I am going to include some additional thoughts in the front, then break
> to my originally private email response.

I'll mention that I've discovered the miracle of man pages, and found
the interesting `madvise' capability of `MADV_WILLNEED' that, from the
description, looks very promising.  Pity the results I'm seeing still
don't match my expectations.

Also, in case the amount of system memory on this machine might be
insufficient to do what I want with the size of the
history.hash/.index files, I've just gotten an upgrade to a full gig.
Unfortunately, performance is now worse than it had been, so it looks
like I'll be butchering the k0deZ to see if I can get my way.

Now, for `madvise' -- this is already used in the INN source in
lib/dbz.c (where one would add MAP_NOSYNC to the MAP__FLAGS) as
MADV_RANDOM -- this matches the random access pattern of the history
hash table.  Supposedly, MADV_WILLNEED will tell the system to avoid
freeing these pages, which looks to be my holy grail of this week,
plus the immediate mapping certainly can't hurt.

There's only a single madvise call in the INN source, but I see that
the Diablo code does make two calls to it (although both WILLNEED and,
unlike INN, SEQUENTIAL access -- this could be part of the cause of
the apparent misunderstanding of the INN history file that I see
below).  Since it looks to my non-programmer eyes like I can't combine
the behaviours in a single call, I followed Diablo's example and
specified both RANDOM and the WILLNEED that I thought would improve
things.
The machine is, of course, as you can see from the timings, not
optimized at all, since I've just thrown something together as a proof
of concept after running into a brick wall with the codes under test
with Slowaris.  And because a departmental edict has come down that I
must migrate all services off Free/NetBSD and onto Slowaris, I can't
expect to get the needed hardware to beef up the system -- even though
the MAP_NOSYNC option on the transit machine enabled it to whup the
pants off a far more expensive chunk of Sun hardware.  So I'm trying
to be able to say `Look, see?  See what you can do with FreeBSD' as
I'm shown out the door.

> I ran a couple of tests with MAP_NOSYNC to make sure that the
> fragmentation issue is real.  It definitely is.  If you create a
> file by ftruncate()ing it to a large size, then mmap() it SHARED +
> NOSYNC, then modify the file via the mmap, massive fragmentation occurs

I've heard it confirmed that even the newer INN does not mmap() the
newly-created files for makehistory or expire.  As reported to the
INN-workers mailing list:

: From: rmtodd@servalan.servalan.com (Richard Todd)
: Newsgroups: mailing.unix.inn-workers
: Subject: Re: expire/makehistory and mmap/madvise'd dbz filez
: Date: 4 Dec 2000 06:30:47 +0800
: Message-ID: <90ehin$1ndk$1@FreeBSD.csie.NCTU.edu.tw>
:
: In servalan.mailinglist.inn-workers you write:
:
: >Moin moin
:
: >I'm engaged in a discussion on one of the FreeBSD developer lists
: >and I thought I'd verify the present source against my memory of how
: >INN 1.5 runs, to see if I might be having problems...
:
: >Anyway, the Makefile in the 1.5 expire directory has the following bit,
: >that seems to be absent in present source, and I didn't see any
: >obvious indication in the makedbz source as to how it's initializing
: >the new files, which, if done wrong, could trigger some bugs, at least
: >when `expire' is run.
:
: ># Build our own version of dbz.o for expire and makehistory, to avoid
: ># any -DMMAP in DBZCFLAGS - using mmap() for dbz in expire can slow it
: ># down really bad, and has no benefits as it pertains to the *new* .pag.
: >dbz.o: ../lib/dbz.c
: >	$(CC) $(CFLAGS) -c ../lib/dbz.c
:
: >Is this functionality in the newest expire, or do I need to go a hackin'?
:
: Whether dbz uses mmap or not on a given invocation is controlled by the
: dbzsetoptions() call; look for that call and the setting of the INCORE_MEM
: option in expire/expire.c and expire/makedbz.c.  Neither expire nor
: makedbz mmaps the new dbz indices it creates.

The remaining condition I'm not positive about is the overflow case,
which ideally would never arise, and is not arising on this machine
now.

> on the file.  This is easily demonstrated by issuing a sequential read
> on the file and noting that the system is not able to do any clustering
> whatsoever and gets a measly 0.6MB/sec of throughput (on a disk
> that can do 12-15MB/sec).  (and the disk seeks wildly during the read).

This is the sort of rate that I see when sync-like updates are being
done to the on-disk history database files, as though the data being
written to them is not well-sorted beforehand, and it's seeking all
over the place...

> When you create a large file and fill it with zero's, THEN mmap() it
> SHARED + NOSYNC and write to it randomly via the mmap(), the file
> remains laid out on disk optimally.

However, I noticed something interesting!  For what it's worth, in
case this might be a problem: before starting INN after adding the
memory, I did a `cp -p' on both the history.hash and history.index
files, just to start fresh and clean.  It didn't seem to help much, if
at all.

> large file == at least 2 x main memory.

Okay, but what about small files?
The particular files where I have performance problems are each about
1/5 or less the size of main memory (108000000 and 72000000 bytes) on
this system, comparable to the sizes on the other well-running system
with less memory (400+MB) you quoted below...

> -- original response --
>
> Ok, let's concentrate on your hishave, artclean, artctrl, and overview
> numbers.

Hey!  HEY!@!  I'm not ready to do a complete tuning yet!  ;-)

> :-rw-rw-r--  1 news  news  436206889 Dec  3 05:22 history
> :-rw-rw-r--  1 news  news         67 Dec  3 05:22 history.dir
> :-rw-rw-r--  1 news  news   81000000 Dec  1 01:55 history.hash
> :-rw-rw-r--  1 news  news   54000000 Nov 30 22:49 history.index
> :
> :More observations that may or may not mean anything -- before rebooting,
> :I timed the `fsync' commands on the 108MB and 72MB history files, as
>
>     note: the fsync command will not flush MAP_NOSYNC pages.

Ummm...  Uh...  Really?  It seems that it did...  After I gave the
userland `fsync' command on these files just before rebooting, I did
not see the lengthy disk activity I've normally associated with using
NOSYNC at times of reboot.  Normally I'll see three or four minutes of
disk grinding following the `syncing disks -- done' messages, but this
time, the history drive light flashed for a fraction of a second
before continuing with the reboot.  Also, the timestamps on the files
got updated, just as they do after shutdown at reboot time...

> :The time taken to do the `fsync' was around one minute for the two
> :history files.  And around 1 second for the BerkeleyDB file...
>
>     This is an indication of file fragmentation, probably due to holes
>     in the history file being filled via the mmap() instead of filled via
>     write().

But...
If the message I quoted is accurate (I don't trust myself to look at
the source), or if I initialize the history database files with `cp'
prior to starting, and no expire or other process that would extend
the files has run, wouldn't the time needed to `fsync' the files
indicate that the in-memory data to be dumped isn't well-sorted before
being written to disk?  The data is accessed and updated randomly in
the INN history mechanism...

> :are a bit, well, wrong.  The way I see it, by giving the MAP_NOSYNC
> :flag, I'm sort of asking for preferential treatment, kinda like mlock,
>
>     The pages are treated the way any VM page is treated... they'll
>     be cached based on use.  I don't think this is the problem.

I think it is a problem -- at least, it's enough of a performance hit
that I'm willing to sacrifice a couple hundred megs of RAM to avoid
hitting the disk -- our NetBSD machines, in spite of the bug, achieve
stellar performance levels by leaving the history files alone for
days.

The problem is that the pages are accessed randomly, so that prior use
is no indication of future use.  Also, what seems to be happening is
that a random selection of pages is being updated on-disk now, which
takes, oh, looks like four or five seconds every so often.  During
this update time (which MAP_NOSYNC helps to avoid, sometimes), all
other history access blocks, including the lookups I'm doing as a
reader.  Then grabbing a random page that needs to be consulted but
isn't still in memory gets expensive.  Eeep, that last slowdown took
8 seconds...

It's true that I haven't rebuilt things without MAP_NOSYNC to see how
much time is spent on periodic sync activity, but what I'm seeing now,
even with a gig of RAM and an ever-increasing list of null-routed
networks to keep abusers away, is starting to resemble that level of
poor performance.
> Ok, let's look at a summary of your timing results:
>
>     hishave         overv          artclean      artctrl
>     38857(26474)    112176(6077)   12264(6930)   2297(308)
>     22114(28196)    136855(6402)   12757(7295)   1257(322)
>     13614(24312)    156723(6071)   13232(6800)    324(244)
>      9944(25198)    164223(6620)   13441(7753)    255(160)
>      2777(50732)     24979(3788)   29821(4017)    131(51)
>     31975(11904)     21593(3320)   25148(3567)   5935(340)
>
> Specifically, look at the last one where it blew up on you.  hishave
> and artctrl are much worse, overview and artclean are about the same.

Artctrl, at six seconds out of five minutes, isn't of too much concern
to me now.  This would be (almost exclusively) cancel messages -- I
haven't looked to see just how the CNFS cyclical-buffer
multiple-articles-in-one-large-file method of storage handles
cancels...  Certainly not by unlinking an inode as with the old news
spool we all grew up to know and loathe.

Artclean is an addition I've made that wasn't present in Joe Greco's
timer (I don't remember if you had much experience with this, or if
you were busy writing Diablo at the time, so if I'm explaining
something that's obvious, just assume it's for the benefit of the
audience who may appreciate a bit of background).  It's an in-memory
operation that one of the developers recently realized is pretty
expensive, and it is dependent on the size of the article being
processed.

The top timings are about twice the normal average rate of ten
articles/second (though I have seen well in excess of 35/sec sustained
for a one-hour period, arrrg) and are that way because I'm bringing in
backlogs.  For whatever reason, a backlogged feed gives priority to
taking in smaller articles first, and as a result, even though more
articles are processed, the time taken overall is less than when
handling a smaller number of articles with an order of magnitude more
volume.  But this explanation is all pretty irrelevant to the
immediate discussion...  Anyway...

> This is an indication of excessive seeking on the history disk.
> I believe that this seeking may be due to file fragmentation.

To me, sitting on the floor in front of the machine, it looks like it
indicates there is activity on the history disk.  The ideal situation
(from my experience with both NetBSD and the transit machines) is that
the history drive light stays off.  Which it does, mostly, during the
Good Timings, after the bulk of the text data is read from the disk.

The problem is that right now, rather than remaining idle and letting
the other disks take a pounding, I'm seeing non-stop flickering of the
history drive, which I don't see on the transit machines, and during
the 5 to 10 seconds' worth of blocked history lookups, the light is
constantly lit as it seeks -- which, since I've done a `cp' on the two
files, I believe is a result of the randomness of the pages being
written.

> There is an easy way to test file fragmentation.  Kill off everything
> and do a 'dd if=history of=/dev/null bs=32k'.  Do the same for
> history.hash and history.index.  Look at the iostat on the history
> drive.  Specifically, do an 'iostat 1' and look at the KB/t (kilobytes
> per transfer).  You should see 32-64KB/t.  If you see 8K/t the file
> is severely fragmented.  Go through the entire history file(s) w/ dd...
> the fragmentation may occur near the end.

Okay, I'm doing this: the two hash-type files give me between 9 and
10KB/t; the history text file gives me more like 60KB/t.  Hmmm.

It's not the best drive in the world, I know that, but I had set up a
NetBSD transit machine with five-year-old drives and saw average
history lookup times of less than 10 microseconds each, just because
it wasn't hitting the disk much.  On an old PPro 200.

> If the file turns out to be fragmented, the only way to fix it is to
> fix the code that extends the file.
> Instead of ftruncate()ing the file
> and then appending to it via the mmap(), you should modify the
> ftruncate() code to fill in the hole with write()'s before returning,
> so the modifications via mmap() are modifying pages that already have
> file-backing store rather than filling in holes.
>
> Then rewrite the history file (e.g. 'cp'), and restart innd.

But...  But...  But...  but...  but but but but...  but
butbutbut*BANG*but but...  (Sorry, running a bit low on fuel.)

The thing is, before I started this particular instance of INN, I had
just made a copy of the two hash files, the ones that give the
less-than-ideal throughput as shown.  I didn't bother with the text
file, as it's never even consulted in the recent history k0deZ (unless
one uses what's now called `tagged hash', which uses the text file as
a reference after hashes are looked up in the .pag file), and I've
been sure that hasn't been a problem -- all accesses there are
sequential.

What I did note as interesting was that Diablo's k0deZ use the madvise
MADV_WILLNEED, which seems reasonable, but unlike the random access of
INN's hash files, it hints at SEQUENTIAL access.  Now, I'll be the
first to confess to not having looked at the history k0deZ you wrote
for Diablo and just how it works, but whatever is happening when
handling the INN random history is definitely not very optimal.

Now, there appear to be several things conspiring against me.  It
looks like the MAP_NOSYNC option to mmap is best suited for mmap'ed
file-based IPC, where the contents of what's on the disk is less
critical.  When I use it with a history file, I run the risk of a
crash (read: I get my leg wrapped in the power cord as I thrash about
in my sleep) where most of the updates to the .index and .hash files
go to the Big Blue Bitbucket, to join most of my work over the last
years.  This isn't fatal, as the two files can be recreated from the
text file when one is aware of a crash.
I can live with it, as, at the moment, it seems to be the only way to
get decent history performance without a significant hardware
investment, or without accepting the will of my boss and using a
different OS.

Now, about madvise: I'm not quite sure how specifying `MADV_RANDOM'
changes the behaviour, or whether it has any effect on the flushing of
the data to the disk when needed -- it appears to my addled brane that
with a truly random set of data like the .index/.hash files, you can't
attempt to predict which pages will be best to flush due to
inactivity, since any random page should be equally likely to be
accessed as any other.  If anything, you'd want to write out
sequential pages for speed.  Now, if that's what is happening, then
somehow things look to be getting messed up from the time I made the
`cp' some hours ago that should have created virginally pure,
unfragmented database files...

I had suggested that `MAP_NOSYNC' be a hint that the pages should get
preferential treatment to be locked in memory.  Maybe overkill for
your typical IPC file, but perhaps the combination of MAP/MADV_NOSYNC
and MADV_RANDOM should be a hint to prioritize these pages.

Okay, now to MADV_WILLNEED -- I hoped that would be the needed kick to
effectively give me the `mlock'-like performance, but even with a full
gig of memory and a dozen relatively-behaved readers, there's still
way too much history disk activity that really drags down overall
performance.  Okay, so maybe the combination of MAP/MADV_NOSYNC and
MADV_WILLNEED should be the hint needed to avoid paging out those
pages, even if they haven't been accessed in quite some time.
Especially if it's in combination with MADV_RANDOM...

The numbers you've seen aren't very impressive -- I'm quite sure that
I was able to keep up easily with NetBSD, although the corruption that
resulted there caused me to abandon ship quickly.
As it is, I can't keep up with the overnight binaries peak in a full
feed, as the history times are taking up two minutes out of every five
now.  Bleeecccchhh.

The actual history lookups and updates that matter are all done within
the memory taken up by the .index and .hash files.  So, by keeping
them in memory, one doesn't need to do any disk activity at all for
lookups; and updates, well, so long as you commit them to the disk at
shutdown, all should be okay.  That's what I'm attempting to achieve.
These lookups and updates are bleedin' expensive when disk activity
rears its ugly head.

Not to worry, I'm going to keep plugging to see if there is a way for
me to lock these two files into memory so that they *stay* there, just
to prove whether or not that's a significant performance improvement.
I may have to break something, but hey...  (sheesh, two minutes... :~-)

thanks,
barry bouwsma, obstinate stubborn newsbast'rd


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message