Date:      Sun, 3 Dec 2000 16:53:49 -0800 (PST)
From:      Matt Dillon <dillon@earth.backplane.com>
To:        News History File User <newsuser@free-pr0n.netscum.dk>
Cc:        hackers@FreeBSD.ORG, usenet@tdk.net
Subject:   Re: vm_pageout_scan badness
Message-ID:  <200012040053.eB40rnm69425@earth.backplane.com>
References:  <200012011918.eB1JIol53670@earth.backplane.com> <200012020525.eB25PPQ92768@newsmangler.inet.tele.dk> <200012021904.eB2J4An63970@earth.backplane.com> <200012030700.eB370XJ22476@newsmangler.inet.tele.dk>

    ok, since I got about 6 requests in four hours to be Cc'd, I'm 
    throwing this back onto the list.  Sorry for the double-response that
    some people are going to get!

    I am going to include some additional thoughts in the front, then break
    to my originally private email response.

    I ran a couple of tests with MAP_NOSYNC to make sure that the
    fragmentation issue is real.  It definitely is.  If you create a
    file by ftruncate()ing it to a large size, then mmap() it SHARED +
    NOSYNC, then modify the file via the mmap, massive fragmentation occurs
    on the file.  This is easily demonstrated by issuing a sequential read
    on the file and noting that the system is not able to do any clustering
    whatsoever and gets a measly 0.6MB/sec of throughput (on a disk
    that can do 12-15MB/sec), with the disk seeking wildly during the read.
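
    A minimal sketch of that test setup, for anyone who wants to
    reproduce it.  The function names, the 0xAA marker byte and the use
    of rand() are my own illustration, not code from innd.  MAP_NOSYNC
    is FreeBSD-specific, so the sketch assumes it can be defined to 0
    elsewhere just to compile; the fragmentation itself only shows up
    with the real flag and a file of at least 2 x main memory.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0    /* assumption: no-op fallback on non-FreeBSD systems */
#endif

/*
 * Create (or truncate) path as a sparse file of len bytes via ftruncate()
 * and map it SHARED + NOSYNC.  Returns the mapping or MAP_FAILED; on
 * success *fdp receives the open descriptor.
 */
void *
map_sparse_nosync(const char *path, size_t len, int *fdp)
{
    void *p;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd == -1)
        return MAP_FAILED;
    if (ftruncate(fd, (off_t)len) == -1) {
        close(fd);
        return MAP_FAILED;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
        MAP_SHARED | MAP_NOSYNC, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return MAP_FAILED;
    }
    *fdp = fd;
    return p;
}

/*
 * Dirty nwrites randomly chosen pages (one byte each), so the holes in
 * the file get filled in random order.  Returns the number of writes.
 */
size_t
dirty_random_pages(unsigned char *map, size_t len, size_t nwrites,
    unsigned seed)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = len / pagesz;
    size_t i;

    assert(npages > 0);
    srand(seed);
    for (i = 0; i < nwrites; i++) {
        size_t page = (size_t)rand() % npages;

        map[page * pagesz] = 0xAA;
    }
    return nwrites;
}
```

    After dirtying the pages, let the system flush them and then time
    a sequential read of the file, as described above.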

    When you create a large file and fill it with zeros, THEN mmap() it
    SHARED + NOSYNC and write to it randomly via the mmap(), the file
    remains laid out on disk optimally.  However, I noticed something
    interesting!  When I run 'dd if=file of=/dev/null bs=32k' on the file
    for the first time after randomly writing it and then fsync()ing it,
    I only get 4MB/sec of throughput.  If I dd the file a second time I
    get around 8MB/sec.  If I dd it a third time I get the platter speed,
    12-15MB/sec.  The issue here is that the file is partially cached in
    the first two dd runs.

    The partially cached file shortcuts the I/O clustering code, preventing
    it from issuing read-aheads once it hits a buffer that is already
    in the cache.  So if you have a spattering of cached blocks and then
    read a file sequentially, you actually get lower throughput than if
    you don't have *any* cached blocks and then read the file sequentially.
    Verrry interesting!  I think it may be beneficial to the clustering code
    to issue the full read-ahead even if some of the blocks in the middle
    are already cached.  The clustering code only operates when sequential
    operation is detected, so I don't think it can make things worse.

    Note: "large file" above means at least 2 x main memory.


    -- original response --

    Ok, let's concentrate on your hishave, artclean, artctrl, and overview
    numbers.

:-rw-rw-r--  1 news  news  436206889 Dec  3 05:22 history
:-rw-rw-r--  1 news  news         67 Dec  3 05:22 history.dir
:-rw-rw-r--  1 news  news   81000000 Dec  1 01:55 history.hash
:-rw-rw-r--  1 news  news   54000000 Nov 30 22:49 history.index
:
:More observations that may or may not mean anything -- before rebooting,
:I timed the `fsync' commands on the 108MB and 72MB history files, as

    note: the fsync command will not flush MAP_NOSYNC pages.

:The time taken to do the `fsync' was around one minute for the two
:history files.  And around 1 second for the BerkeleyDB file...

    This is an indication of file fragmentation, probably due to holes
    in the history file being filled via the mmap() instead of filled via
    write().

    In order for MAP_NOSYNC to be reasonable, you have to fix the code
    that extends a file via ftruncate() so that it write()s zeros into
    the extended portion.

:data getting flushed to disk, then it seems like someone's priorities
:are a bit, well, wrong.  The way I see it, by giving the MAP_NOSYNC
:flag, I'm sort of asking for preferential treatment, kinda like mlock,
:even though that's not available to me as `news' user.

    The pages are treated the way any VM page is treated... they'll
    be cached based on use.  I don't think this is the problem.

    Ok, let's look at a summary of your timing results:
    
    hishave         overview        artclean        artctrl

    38857(26474)    112176(6077)    12264(6930)     2297(308)
    22114(28196)    136855(6402)    12757(7295)     1257(322)
    13614(24312)    156723(6071)    13232(6800)     324(244)
    9944(25198)     164223(6620)    13441(7753)     255(160)
    2777(50732)     24979(3788)     29821(4017)     131(51)
    31975(11904)    21593(3320)     25148(3567)     5935(340)

    Specifically, look at the last one where it blew up on you.  hishave
    and artctrl are much worse, overview and artclean are about the same.

    This is an indication of excessive seeking on the history disk.  I
    believe that this seeking may be due to file fragmentation.

    There is an easy way to test file fragmentation.  Kill off everything
    and do a 'dd if=history of=/dev/null bs=32k'.  Do the same for 
    history.hash and history.index.  Look at the iostat on the history
    drive.  Specifically, do an 'iostat 1' and look at the KB/t (kilobytes
    per transfer).  You should see 32-64KB/t.  If you see 8KB/t the file
    is severely fragmented.  Go through the entire history file(s) w/ dd...
    the fragmentation may occur near the end.

    If the file turns out to be fragmented, the only way to fix it is to 
    fix the code that extends the file.  Instead of ftruncate()ing the file
    and then appending to it via the mmap(), you should modify the
    ftruncate() code to fill in the hole with write()'s before returning,
    so the modifications via mmap() are modifying pages that already have
    file-backing store rather than filling in holes.

    Then rewrite the history file (e.g. 'cp'), and restart innd.
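
    The zero-fill extension described above could look roughly like
    this.  It is a sketch under my own assumptions, not innd code:
    extend_with_zeros() and open_and_extend() are hypothetical names,
    and the 64KB chunk size is arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: grow fd to new_size by write()ing zeros from the
 * current end of file, instead of leaving a hole behind ftruncate().
 * Does nothing if the file is already at least new_size bytes long.
 * Returns 0 on success, -1 on error (errno is set).
 */
int
extend_with_zeros(int fd, off_t new_size)
{
    static const char zeros[65536];     /* zero-initialized */
    struct stat st;
    off_t off;

    assert(new_size >= 0);
    if (fstat(fd, &st) == -1)
        return -1;
    for (off = st.st_size; off < new_size; ) {
        size_t want = sizeof(zeros);
        ssize_t n;

        if ((off_t)want > new_size - off)
            want = (size_t)(new_size - off);
        n = pwrite(fd, zeros, want, off);
        if (n == -1 && errno == EINTR)
            continue;                   /* interrupted, retry */
        if (n <= 0) {
            if (n == 0)
                errno = EIO;            /* no progress, give up */
            return -1;
        }
        off += n;
    }
    return 0;
}

/*
 * Hypothetical wrapper: open path (creating it if needed) and
 * zero-extend it to new_size.  Returns the descriptor, or -1 on error.
 */
int
open_and_extend(const char *path, off_t new_size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd == -1)
        return -1;
    if (extend_with_zeros(fd, new_size) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

    The extension has to finish before the new region is mmap()ed and
    written, so that the mmap() writes land on blocks that already
    exist on disk.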

						    -Matt







