Date:      Mon, 10 Jun 2013 01:10:00 GMT
From:      Bruce Evans <brde@optusnet.com.au>
To:        freebsd-bugs@FreeBSD.org
Subject:   Re: kern/178997: Heavy disk I/O may hang system
Message-ID:  <201306100110.r5A1A0FM076378@freefall.freebsd.org>

The following reply was made to PR kern/178997; it has been noted by GNATS.

From: Bruce Evans <brde@optusnet.com.au>
To: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Mon, 10 Jun 2013 11:00:29 +1000 (EST)

 On Mon, 10 Jun 2013, Klaus Weber wrote:
 
 > On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
 >> On Fri, 31 May 2013, Klaus Weber wrote:
 
 This thread is getting very long, and I will only summarize a couple
 of things that I found last week here.  Maybe more later.
 
 o Everything seems to be working as well as intended (not very well)
    except in bufdaemon and friends.  Perhaps it is already fixed there.
    I forgot to check which version of FreeBSD you are using.  You may
    be missing some important fixes.  There were some by kib@ a few
    months ago, and some by jeff@ after this thread started.  I don't
    run any version of FreeBSD new enough to have these, and the version
    that I run also doesn't seem to have any serious bugs in bufdaemon.
    It just works mediocrely.
 
 o Writing in blocks of size less than the fs block size, as bonnie
    normally does, gives much the same rewriting effect as bonnie does
    explicitly, because the system is forced to read each block before
    doing a partial write to it.  This at best doubles the amount of
    i/o and halves the throughput of the writes.
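 The effect can be modeled with a toy simulation (not FreeBSD code; the
 16K fs block size, 8K application writes and 1MB file size are
 assumptions for illustration):

```python
# Toy model of read-before-write: when an application write covers only
# part of an fs block, the block must be read in before the partial
# write can be applied.  All sizes here are assumed, not measured.
FS_BLOCK = 16 * 1024             # assumed fs block size
APP_BLOCK = 8 * 1024             # assumed application write size
FILE_SIZE = 1024 * 1024          # rewrite a 1MB file

reads = writes = 0
cached = set()                   # fs blocks already brought into the cache
for off in range(0, FILE_SIZE, APP_BLOCK):
    blk = off // FS_BLOCK
    if APP_BLOCK < FS_BLOCK and blk not in cached:
        reads += FS_BLOCK        # read the whole block before modifying it
        cached.add(blk)
    writes += APP_BLOCK          # the dirty data is eventually written back

print(reads, writes)             # 1048576 1048576
```

 With these numbers the device reads as many bytes as it writes back,
 so the total i/o is doubled and write throughput is roughly halved.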
 
 o There are some minor bugs in the read-before-write system code, and
    some differences in the "much the same" that may be important in
    some cases:
    - when bonnie does the read-before-write, the system normally uses
      cluster_read() on the read descriptor and thus uses the system's
      idea of sequentiality on the read descriptor.  For both ffs and
      msdosfs, this normally results in reading a small cluster (up to
      the end of the current read) followed by async read-ahead of a
      larger cluster or 2.  The separate clusters improve latency but
      reduce performance.  scottl@ recently committed a sysctl
      vfs.read_min for avoiding the earlier splitting.  Using it made
      some interesting but ultimately unimportant differences to the
      2-bonnie problem.
    - when the system does the read-before-write, ffs normally uses
      cluster_read() on the write descriptor and thus uses the system's
      idea of sequentiality on the write descriptor.  ffs doesn't know
      the correct amount to read in this case, and it always asks
      for MAXBSIZE, which is both too small and too large.  This value
      is the amount that should be read synchronously.  It is too large
      since the normal amount is the application's block size which is
      normally smaller, and it is too small since MAXBSIZE is only
      half of the max cluster size.  The correct tradeoff of latency
      vs throughput is even less clear than for a user read, and further
      off from being dynamic.  msdosfs doesn't even use cluster_read()
      for this (my bad).  It uses plain bread().  This gives very low
      performance when the block size is small.  So msdosfs worked much
      better in the 2-bonnie benchmark than for rewrites generated by
      dd just writing with a small block size and conv=notrunc.  After
      fixing this, msdosfs worked slightly better than ffs in all cases.
    - whoever does the read-before-write, cluster reading tends to generate
      a bad i/o pattern.  I saw patterns like the following (on ~5.2 where
      the max cluster size is only 64K, after arranging to mostly use this
      size):
        file1: read       64K offset 0
        file1: read ahead 64K offset 64K
        file2: read       64K offset 0
        file1: read ahead 64K offset 64K
        file1: write      64K offset 0
        file1: read       64K offset 128K
        file1: read ahead 64K offset 192K
        file2: write      64K offset 0
      The 2 files make the disk seek a lot, and the read-and-read-ahead
      gives even more seeks to get back to the write position.  My drives
      are old and have only about 2MB of cache.  Seeks with patterns like
      the above are apparently just large enough to break the drives'
      caching.  OTOH, if I use your trick of mounting with -noclusterw,
      the seeks are reduced significantly and my throughput increases by
      almost a factor of 2, even though this gives writes of only 16K.
      Apparently the seeks are reduced just enough for the drives' caches
      to work well.  I think the same happens for you.  Your i/o system
      is better, but it only takes a couple of bonnies and perhaps the
      read pointers getting even further ahead of the write pointers
      to defeat the drive's caching.  Small timing differences probably
      allow the difference to build up.  Mounting with -noclusterw also
      gives some synchronization that will prevent this buildup.
    - when the system does the read-before-write, the sequential heuristic
      isn't necessarily clobbered, but it turns out that the clobbering
      gives the best possible behaviour, except for limitations and bugs
      in bufdaemon!...
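 The cost of a pattern like the one traced above can be made concrete
 by counting seeks (a rough sketch; the on-disk placement of the two
 files is an assumption, the offsets are from the trace):

```python
# Count the seeks implied by the traced i/o pattern, assuming file1's
# blocks start at LBA 0 and file2's blocks start 1GB away (assumed
# layout).  A seek is charged whenever a transfer does not start where
# the previous one ended.
K = 1024
FILE2_BASE = 1024 * 1024 * K     # assumed placement of file2

trace = [                        # (base, offset, length) per operation
    (0, 0, 64 * K),                      # file1: read       64K offset 0
    (0, 64 * K, 64 * K),                 # file1: read ahead 64K offset 64K
    (FILE2_BASE, 0, 64 * K),             # file2: read       64K offset 0
    (0, 64 * K, 64 * K),                 # file1: read ahead 64K offset 64K
    (0, 0, 64 * K),                      # file1: write      64K offset 0
    (0, 128 * K, 64 * K),                # file1: read       64K offset 128K
    (0, 192 * K, 64 * K),                # file1: read ahead 64K offset 192K
    (FILE2_BASE, 0, 64 * K),             # file2: write      64K offset 0
]

seeks = 0
pos = 0
for base, off, length in trace:
    start = base + off
    if start != pos:             # head must reposition: not sequential
        seeks += 1
    pos = start + length

print(seeks)                     # 5
```

 Five of the eight transfers require repositioning the head, which is
 consistent with the drives becoming seek-bound rather than
 bandwidth-bound.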
 
 > [... good stuff clipped]
 > So it really seems that clustering does provide performance benefits,
 > but the RAID controller seems to be able to make up for the lack
 > of clustering (either because clustering is disabled, or because it
 > does not work effectively due to interspersed reads and seeks on the
 > same file descriptor).
 
 Yes, the seek pattern caused by async-but-not-long-delayed writes
 (whether done by cluster_write() a bit later or bawrite() directly)
 combined with reading far ahead (whether done explicitly or implicitly)
 is very bad even for 1 file, but can often be compensated for by caching
 in the drives.  With 2 files or random writes on 1 file it is much worse,
 but apparently mounting with -noclusterw limits it enough for the
 drives to compensate in the case of 2 bonnies.  I think the best we
 can do in general is delay writes as long as possible and then
 schedule them perfectly.  But scheduling them perfectly is difficult
 and only happens accidentally.
 
 >>> I am now looking at vfs_cluster.c to see whether I can find which part
 >>> is responsible for letting numdirtybuffers raise without bounds and
 >>> why only *re* writing a file causes problems, not the initial
 >>> writing. Any suggestions on where to start looking are very welcome.
 >>
 >> It is very complicated, but it was easy to find its comments saying that
 >> it tries not to force out the writes for non-sequential accesses.  I
 >> am currently trying the following workarounds:
 >
 > I have decided to start testing with only a single change from the
 > list of changes you provided:
 >
 >> % diff -u2 vfs_cluster.c~ vfs_cluster.c
 >> % @@ -726,8 +890,13 @@
 >> %  		 * are operating sequentially, otherwise let the buf or
 >> %  		 * update daemon handle it.
 >> % +		 *
 >> % +		 * Algorithm changeback: ignore seqcount here, at least for
 >> % +		 * now, to work around readers breaking it for writers.  It
 >> % +		 * is too late to start ignoring after write pressure builds
 >> % +		 * up, since not writing out here is the main source of the
 >> % +		 * buildup.
 >> %  		 */
 >> %  		bdwrite(bp);
 >> % -		if (seqcount > 1)
 >> % -			cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
 >> % +		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
 >> %  		vp->v_clen = 0;
 >> %  		vp->v_cstart = lbn + 1;
 >
 > And sure enough, a kernel with only this one line change[*] is able to
 > handle reading and re-writing a file through a single file descriptor
 > just fine, good performance, no hangs, vfs.numdirtybuffers remains
 > low:
 
 After more testing, I found that this was almost perfectly backwards for
 my hardware!  I think for your hardware it allows the drives to
 compensate, much like with -noclusterw but with a slightly improved
 throughput due to the larger writes.  But with my drives, it mostly
 just gives more seeks.  After changing this back and being more careful
 with the comparisons, I found that best results are obtained (in ~5.2)
 by letting numdirtybuffers build up.  The breakage of the sequential
 heuristic causes the above to never force out the cluster immediately
 for the 2-bonnie case.  I get similar behaviour by always using delayed
 writes in ffs_write().  This might depend on setting B_CLUSTEROK in more
 cases, so that the clustering always gets done later.
 
 Typical throughputs for me:
 - my drives can do 55MB/sec max and get 48 for writing 1 file with large
    blocks using dd
 - 48 drops to half of 20-24 with read-before-write for 1 file.  That's
    a 4-fold reduction.  One factor of 2 is for the doubled i/o and the
    other is for the seeks.
 - half of 20-24 drops to half of 10-12 with 2 files and read-before-write
    of each, in the best case.  That's an 8-fold reduction.  Another factor
    of 2 is apparently lost to more seeks.
 - half of 10-12 drops to half of 5-6, as in the previous point but in the
    worst case.  That's a 16-fold reduction.  The worst case is with my
    modification above.  It maximizes the seeks.  My original idea for a
    fix (in the above diff) gave this case.  It gave almost perfect
    clustering and almost no buildup of numdirtybuffers, but throughput
    was still worst.  (My drives can do 16K blocks at full speed provided
    the blocks are contiguous, so they don't benefit much from clustering
    except for its side effect of reducing seeks to other blocks in between
    accessing the contiguous ones.)
 Some of this typical behaviour is not very dependent on block sizes.  The
 drives become seek-bound, and anything that doubles the number of seeks
 halves the throughput.
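 The factor-of-2 chain works out as follows (a back-of-envelope
 restatement of the numbers above, starting from the 48MB/sec peak):

```python
# Each effect roughly halves throughput, starting from the 48MB/sec
# that 1 file achieves with large blocks and no read-before-write.
peak = 48

one_file_rw = peak / (2 * 2)          # doubled i/o + seeks: "half of 20-24"
two_files   = peak / (2 * 2 * 2)      # 2nd file doubles seeks: "half of 10-12"
worst_case  = peak / (2 * 2 * 2 * 2)  # maximal seeking: "half of 5-6"

print(one_file_rw, two_files, worst_case)   # 12.0 6.0 3.0
```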
 
 Bruce
