Date:      Fri, 31 May 2013 18:31:50 +0200
From:      Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-bugs@FreeBSD.org, freebsd-gnats-submit@FreeBSD.org, Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Subject:   Re: kern/178997: Heavy disk I/O may hang system
Message-ID:  <20130531163150.GA21070@unix-admin.de>
In-Reply-To: <20130528211950.V2606@besplex.bde.org>
References:  <201305261951.r4QJpn9Z071712@oldred.FreeBSD.org> <20130527135103.H919@besplex.bde.org> <20130527185732.GA95744@unix-admin.de> <20130528211950.V2606@besplex.bde.org>

Sorry for the late reply; testing took longer than expected.

(I have combined your replies from separate mails into one, and
reordered some of the text.)

On Tue, May 28, 2013 at 10:03:10PM +1000, Bruce Evans wrote:
> On Mon, 27 May 2013, Klaus Weber wrote:
> >On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
> >>On Sun, 26 May 2013, Klaus Weber wrote:

> However, I have never been able to reproduce serious fragmentation problems
> from using too-large-block sizes, or demonstrate significant improvements
> from avoiding the known fragmentation problem by increasing BKVASIZE.
> Perhaps my systems are too small, or have tuning or local changes that
> accidentally avoid the problem.
> 
> Apparently you found a way to reproduce the serious fragmentation
> problems.  Try using a block size that doesn't ask for the problem.
> (...)
> The reduced fsck time and perhaps the reduced number of cylinder groups
> are the main advantages of large clusters.  vfs-level clustering turns
> most physical i/o's into 128K-blocks (especially for large files) so
> there is little difference between the i/o speed for all fs block sizes
> unless the fs block size is very small.

I have now repeated the tests with several variations of block- and
fragment sizes. In all cases, I did two tests:

1) dd if=/dev/zero of=/mnt/t1/100GB-1.bin bs=100m count=1000
2) bonnie++ -s 64g -n 0 -f -D -d /mnt/t1
    bonnie++ -s 64g -n 0 -f -D -d /mnt/t2

The dd is simply to give a rough idea of the performance impact of the
fs parameters; with the two bonnie++ processes, I was mainly interested
in performance and hangs once both of them are in their "Rewriting"
phase. I have also tested variations where the block:fragment ratio
does not follow the 8:1 recommendation.
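For reference, the test file systems were created roughly like this,
varying only the block and fragment sizes between runs (the device name
below is only a placeholder, not my actual partition):

  newfs -b 65536 -f 8192 /dev/daXp1    # 64/8k
  newfs -b 32768 -f 4096 /dev/daXp1    # 32/4k
  newfs -b 16384 -f 2048 /dev/daXp1    # 16/2k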

64/8k, kernel unpatched:
dd: 1218155712 bytes/sec
bonnie++: around 300 MB/sec, then drops to 0 and system hangs

32/4k, kernel unpatched:
dd: 844188881 bytes/sec
bonnie++: jumps between 25 and 900 MB/sec, no hang

16/2k, kernel unpatched:
dd: 517996142 bytes/sec
bonnie++: mostly 20-50 MB/sec, with 3-10 second "bursts" of 
                  400-650 MB/sec, no hang

64/4k, kernel unpatched:
dd: 1156041159 bytes/sec
bonnie++: hangs system quickly once both processes are in rewriting

32/8k, kernel unpatched:
dd: 938072430 bytes/sec
bonnie++: 29-50 MB/sec, with 3-10 second "bursts" of 
                  up to 650 MB/sec, no hang (but I canceled the test
		  after an hour or so).

So while a file system created with the current (32/4k) or old (16/2k)
defaults does prevent the hangs, it also reduces the sequential write
performance to 70% and 43%, respectively, of a 64/8k fs.

The problem seems to be the 64k block size, not the 8k fragment size.

In all cases, the vfs.numdirtybuffers count remained fairly small as
long as the bonnie++ processes were writing the test files. It rose to
vfs.hidirtybuffers once rewriting started (more slowly with only one
process in "Rewriting", much faster when both processes are rewriting).
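(The vfs.numdirtybuffers values were obtained by simply polling the
sysctl every few seconds during the runs, with something like
"while true; do sysctl vfs.numdirtybuffers; sleep 3; done".)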

> >00-04-57.log:vfs.numdirtybuffers: 52098
> >00-05-00.log:vfs.numdirtybuffers: 52096
> >[ etc. ]
> 
> This is a rather large buildup and may indicate a problem.  Try reducing
> the dirty buffer watermarks. 

I have tried to tune system parameters as per your advice, in an
attempt to get a 64/8k fs running stably and with reasonable write
performance. (dd results omitted for brevity; all normal for a 64/8k
fs.)
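These are all plain read/write sysctls, so they can be changed at
runtime; e.g. for the first combination below:

  sysctl vfs.lodirtybuffers=250
  sysctl vfs.hidirtybuffers=1000
  sysctl vfs.dirtybufthresh=800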

all with 64/8k, kernel unpatched:
vfs.lodirtybuffers=250
vfs.hidirtybuffers=1000
vfs.dirtybufthresh=800

bonnie++: 40-150 MB/sec, no hang

vfs.numdirtybuffers rises to 1000 when both processes are rewriting.


vfs.lodirtybuffers=1000
vfs.hidirtybuffers=4000
vfs.dirtybufthresh=3000

bonnie++: 380-50 MB/sec, no hang

For the next tests, I kept lo/hidirtybuffers at 1000/4000, and only
varied dirtybufthresh:
1200: bonnie++: 80-370 MB/sec
1750: bonnie++: around 600 MB/sec
1900: bonnie++: around 580 MB/sec. vfs.numdirtybuffers=3800 (i.e. it no
          longer reaches vfs.hidirtybuffers!)
(no hangs in any of the tests).

I then re-tested with lo/hidirtybuffers at their defaults, and only
dirtybufthresh set to slightly less than half of hidirtybuffers:

vfs.lodirtybuffers=26069
vfs.hidirtybuffers=52139
vfs.dirtybufthresh=26000

dd: 1199121549 bytes/sec
bonnie++: 180-650 MB/sec, mostly around 500, no hang


By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
I found that
(vfs.dirtybufthresh) * (number of bonnie++ processes)
must be slightly less than vfs.hidirtybuffers for reasonable
performance without hangs.

vfs.numdirtybuffers rises to
(vfs.dirtybufthresh) * (number of bonnie++ processes),
and as long as this stays below vfs.hidirtybuffers, the system will not
hang.
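For example, with vfs.hidirtybuffers at its default of 52139, two
bonnie++ processes only stay below that limit if vfs.dirtybufthresh is
kept below roughly 52139 / 2 = 26069, which matches the 26000 setting
that worked above; with three processes it would have to stay below
about 17380, and so on.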


> >>I found that
> >>the problem could be fixed by killing cluster_write() by turning it into
> >>bdwrite() (by editing the running kernel using ddb, since this is easier
> >>than rebuilding the kernel).  I was trying many similar things since I
> >>had a theory that cluster_write() is useless.  [...]
> >
> >If that would provide a useful datapoint, I could try if that make a
> >difference on my system. What changes would be required to test this?
> >
> >Surely its not as easy as replacing the function body of
> >cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?
> 
> That should work for testing, but it is safer to edit ffs_write()
> and remove the block where it calls cluster_write() (or bawrite()),
> so that it falls through to call bdwrite() in most cases.

I was not sure whether to disable the "bawrite(bp);" in the else part
as well. Here is what I used for the next test (in ffs_write):

  } else if (vm_page_count_severe() ||
              buf_dirty_count_severe() ||
              (ioflag & IO_ASYNC)) {
          bp->b_flags |= B_CLUSTEROK;
          bawrite(bp);
          /* KWKWKW       } else if (xfersize + blkoffset == fs->fs_bsize) {
          if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) {
                  bp->b_flags |= B_CLUSTEROK;
                  cluster_write(vp, bp, ip->i_size, seqcount);
          } else {
                  bawrite(bp);
                  } KWKWKW */
  } else if (ioflag & IO_DIRECT) {
          bp->b_flags |= B_CLUSTEROK;
          bawrite(bp);
  } else {
          bp->b_flags |= B_CLUSTEROK;
          bdwrite(bp);
  }

dd: 746804775 bytes/sec

During the dd tests, iostat shows a weird, sawtooth-like behavior:
 64.00 26730 1670.61   0  0 12  3 85
 64.00 13308 831.73   0  0  4  1 95
 64.00 5534 345.85   0  0 10  1 89
 64.00  12  0.75   0  0 16  0 84
 64.00 26544 1658.99   0  0 10  2 87
 64.00 12172 760.74   0  0  3  1 95
 64.00 8190 511.87   0  0  8  1 91
 64.00  10  0.62   0  0 14  0 86
 64.00 22578 1411.11   0  0 14  3 83
 64.00 12634 789.63   0  0  3  1 95
 64.00 11695 730.96   0  0  6  2 92
 48.00   7  0.33   0  0 13  0 87
 64.00 11801 737.58   0  0 17  1 82
 64.00 19113 1194.59   0  0  6  2 92
 64.00 15996 999.77   0  0  4  2 94
 64.00   3  0.19   0  0 13  0 87
 64.00 10202 637.63   0  0 16  1 83
 64.00 20443 1277.71   0  0  8  2 90
 64.00 15586 974.10   0  0  4  1 95
 64.00 682 42.64   0  0 13  0 87

With two bonnie++ processes in the "Writing intelligently" phase,
iostat jumped between 9 and 350 MB/sec. I cancelled the test before the
first bonnie++ process reached the "Rewriting..." phase, due to the
dismal performance.

Already during the "Writing intelligently" phase, vfs.numdirtybuffers
reached vfs.hidirtybuffers (in the previous tests, vfs.numdirtybuffers
only rose to high numbers in the "Rewriting..." phase).


After reverting the source change, I decided to try mounting the file
system with "-o noclusterr,noclusterw" and re-test. This is equivalent
to disabling only the if-part of the expression in the source snippet
above.
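(This can be done on the already-mounted file system with something
like "mount -u -o noclusterr,noclusterw /mnt/t1", or by unmounting and
re-mounting with those options.)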

dd: 1206199580 bytes/sec
bonnie++: 550-700 MB/sec, no hang

During the tests, vfs.numdirtybuffers remained low; lo/hidirtybuffers
and dirtybufthresh were at their defaults:
vfs.dirtybufthresh: 46925
vfs.hidirtybuffers: 52139
vfs.lodirtybuffers: 26069
vfs.numdirtybuffers: 15

So it looks like you were spot-on in suspecting cluster_write().
Further tests confirmed that "-o noclusterw" alone is sufficient to
prevent the hangs and provide good performance. "-o noclusterr" on its
own makes no difference; the system still hangs.


I have also tested with write-clustering enabled, but with
vfs.write_behind=0 and vfs.write_behind=2, respectively. In both
cases, the system hangs with two bonnie++ processes in "Rewriting...".
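(Again just a runtime sysctl, e.g. "sysctl vfs.write_behind=0" before
the respective run.)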


I have also tested with BKVASIZE set to 65536. As you explained, this
reduced the number of buffers:
vfs.dirtybufthresh: 11753
vfs.hidirtybuffers: 13059
vfs.lodirtybuffers: 6529

dd results remain unchanged from a BKVASIZE of 16k. With two bonnie++
processes in "Rewriting...", iostat jumps between 70 and 800 MB/sec and
numdirtybuffers reaches its maximum:

vfs.numdirtybuffers: 13059

Even though numdirtybuffers reaches hidirtybuffers, the system does
not hang, but performance is not very good.

With BKVASIZE set to 65536 _and_ the fs mounted "-o noclusterw",
performance is the same as with a BKVASIZE of 16k, and the system does
not hang.

I have now reverted BKVASIZE to its default, as the main factor for a
stable and fast system seems to be the noclusterw mount option.


> >>Apparently you found a way to reproduce the serious fragmentation
> >>problems.
> >
> >A key factor seems to be the "Rewriting" operation. I see no problem
> >during the "normal" writing, nor could I reproduce it with concurrent
> >dd runs.
> 
> I don't know exactly what bonnie rewrite mode does.  Is it just read/
> [modify]/write of sequential blocks with a fairly small block size?
> Old bonnie docs say that the block size is always 8K.  One reason I
> don't like bonnie.  Clustering should work fairly normally with that.
> Anything with random seeks would break clustering.

Here is the relevant part from bonnie++'s source (in C++):
---------------
bufindex = 0;
for(words = 0; words < num_chunks; words++)
{ // for each chunk in the file
  dur.start();
  if (file.read_block(PVOID(buf)) == -1)
    return 1;
  bufindex = bufindex % globals.io_chunk_size();
  buf[bufindex]++;
  bufindex++;
  if (file.seek(-1, SEEK_CUR) == -1)
    return 1;
  if (file.write_block(PVOID(buf)) == -1)
    return io_error("re write(2)");
-----------
globals.io_chunk_size() is 8k (by default and in all of my tests), and
bonnie++ makes sure that buf is page-aligned. 

So you are correct: bonnie++ re-reads the file that was created
previously in the "Writing intelligently..." phase in blocks, modifies
one byte in the block, and writes the block back.

Something in this specific workload is triggering the huge buildup of
numdirtybuffers when write-clustering is enabled.


I am now looking at vfs_cluster.c to see whether I can find which part
is responsible for letting numdirtybuffers rise without bounds, and
why only *re*-writing a file causes problems, not the initial
writing. Any suggestions on where to start looking are very welcome.

Klaus


