Date:      Sun, 9 Jun 2013 23:20:01 GMT
From:      Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To:        freebsd-bugs@FreeBSD.org
Subject:   Re: kern/178997: Heavy disk I/O may hang system
Message-ID:  <201306092320.r59NK1X8046787@freefall.freebsd.org>

The following reply was made to PR kern/178997; it has been noted by GNATS.

From: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>,
	freebsd-gnats-submit@freebsd.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Mon, 10 Jun 2013 01:07:21 +0200

 Mime-Version: 1.0
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline
 In-Reply-To: <20130604052658.K1039@besplex.bde.org>
 User-Agent: Mutt/1.4.2.3i
 
 On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
 > On Fri, 31 May 2013, Klaus Weber wrote:
 
 Again, I have reordered some sections of your mail in my reply. I have
 also removed freebsd-bugs from Cc: - it seems that replies to PRs get
 forwarded there automatically, and judging from the web interface to
 the mailing list archives, the previous mails were all duplicated.
 
 > >So you are correct: bonnie++ re-reads the file that was created
 > >previously in the "Writing intelligently..." phase in blocks, modifies
 > >one byte in the block, and writes the block back.
 > >
 > >Something in this specific workload is triggering the huge buildup of
 > >numdirtybuffers when write-clustering is enabled.
 > 
 > I can explain this.  Readers and writers share the offset for the
 > sequential heuristic (it is per-open-file), so cluster_write() cannot
 > tell that bonnie's writes are sequential.  The sequential_heuristic()
 > function sees the offset moving back and forth, which normally means
 > random access, so it tells cluster_write() that the access is random,
 > although bonnie seeks back after each read() so that all the
 > writes() are sequential.
 
 Indeed. Using the test program you provided, I can confirm that
 reading and writing to the same file via _different_ file descriptors
 works without hangs and with good performance (similar to "-o
 noclusterw", see below for details). The vfs.numdirtybuffers count
 remained low, and write rates stayed pretty stable.
 
 Since your program is more flexible than bonnie++, I have done all
 tests in this mail with it.
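 
 (To make the access pattern explicit, here is a rough sketch of what
 such a test does; this is not your actual program, and the file name
 and block size are just made up for illustration. The point is that
 each descriptor's offset only ever moves forward. The bonnie++-style
 variant uses a single descriptor and lseek()s back before each write,
 which is what makes the shared offset move back and forth.)
 
   #include <sys/types.h>
   #include <err.h>
   #include <fcntl.h>
   #include <unistd.h>
 
   #define BLKSZ   65536   /* 64k, like the fs block size in my tests */
 
   int
   main(void)
   {
           static char buf[BLKSZ];
           ssize_t n;
           int rfd, wfd;
 
           /*
            * Same file, two descriptors: the read stream and the write
            * stream each look sequential to the heuristic.
            */
           rfd = open("testfile", O_RDONLY);
           wfd = open("testfile", O_WRONLY);
           if (rfd == -1 || wfd == -1)
                   err(1, "open");
 
           while ((n = read(rfd, buf, sizeof(buf))) > 0) {
                   buf[0] ^= 1;    /* "modify one byte in the block" */
                   if (write(wfd, buf, n) != n)
                           err(1, "write");
           }
           return (0);
   }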
 
 > cluster_write() rarely if ever pushes out
 > random writes, so the write pressure builds up to a saturation value
 > almost instantly.  Then all subsequent writes work even more poorly
 > due to the pressure.  Truly random writes would give similar buildup
 > of pressure.
 
 I have briefly tested and confirmed this as well (see below). Mounting
 a file system with "-o async" shows the "quickly rising
 vfs.numdirtybuffers" symptom already while writing (not re-writing) a
 file, e.g. with dd. I could not get the system to hang, but I did not
 test this very thoroughly.
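 
 (For reproducing this, the counter can simply be watched with
 "sysctl vfs.numdirtybuffers" in a loop; a tiny C equivalent, just as a
 sketch with minimal error handling, would be:)
 
   #include <sys/types.h>
   #include <sys/sysctl.h>
   #include <stdio.h>
   #include <unistd.h>
 
   int
   main(void)
   {
           int ndirty;
           size_t len;
 
           /* Poll vfs.numdirtybuffers once per second. */
           for (;;) {
                   len = sizeof(ndirty);
                   if (sysctlbyname("vfs.numdirtybuffers", &ndirty, &len,
                       NULL, 0) == -1)
                           return (1);
                   printf("vfs.numdirtybuffers: %d\n", ndirty);
                   sleep(1);
           }
   }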
 
 
 > >So while a file system created with the current (32/4k) or old (16/2k)
 > >defaults  does prevent the hangs, it also reduces the sequential write
 > >performance to 70% and 43% of an 64/8k fs.
 > XXX
 > I think bufdaemon and other write-pressure handling code just don't work
 > as well as upper layers for clustering and ordering the writes.  They
 > are currently broken, as shown by the hangs.  Clustering shouldn't be
 > generating so much pressure.  I can explain why it does (see below).
 > But random i/o can easily generate the same pressure, so bufdaemon etc.
 > need to handle it.  Also, pressure should be the usual case for async
 > mounts, and bufdaemon etc. should be able to do clustering and ordering
 > _better_ than upper layers, since they have more blocks to work with.
 > cluster_write() intentionally leaves as much as possible to bufdaemon
 > in order to potentially optimize it.
 > 
 > >In all cases, the vfs.numdirtybuffers count remained fairly small as
 > >long as the bonnie++ processes were writing the testfiles. It rose to
 > >vfs.hidirtybuffers (slower with only one process in "Rewriting", much
 > >faster when both processes are rewriting).
 > 
 > The slow buildup is caused by fragmentation.  If the files happen to
 > be laid out contiguously, then cluster_write() acts like bawrite(),
 > but better.  With your 64K-blocks and FreeBSD's 128K max cluster size,
 > it normally combines just 2 64K-blocks to create a 128K-cluster, and
 > does 1 write instead of 2.  After this write, both methods have the
 > same number of dirty buffers (none for data, but a couple for metadata).
 >
 > [Analysis of bonnie++'s write pattern]
 >
 > However, with fragmentation, cluster_write() "leaks" a dirty buffer
 > although it is full, while bawrite() just writes all full buffers.  The
 > leakage is because cluster_write() wants physical contiguity.  This is
 > too much to ask for.  The leak may be fast if the file system is very
 > fragmented.
 
 In all of my (older) tests with bonnie++, the file system was newfs'ed
 before mounting it, so there was no fragmentation for me.
 
 
 > >I have tried to tune system parameters as per your advice, in an attempt
 > >to get a 64/8k fs running stable and with reasonable write performance.
 > >(dd: results omitted for brevity, all normal for 64/8k fs)
 > 
 > I got mostly worse behaviour by tuning things related to dirtybuffers.
 
 Me too, performance-wise. The only exception was reducing
 vfs.dirtybufthresh so that vfs.numdirtybuffers no longer reaches
 vfs.hidirtybuffers; this is needed to prevent the hangs. The exact
 value does not seem to matter much.
 
 
 > >[... You managed to fix the hangs, at a cost of too much performance. ]
 > 
 > >By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
 > >I found that
 > >(vfs.dirtybufthresh) * (number of bonnie++ process) must be slightly
 > >less than vfs.hidirtybuffers for reasonable performance without hangs.
 > >
 > >vfs.numdirtybuffers rises to
 > >(vfs.dirtybufthresh) * (number of bonnie++ process)
 > >and as long as this is below vfs.hidirtybuffers, the system will not
 > >hang.
 
 Actually, further testing revealed that it is not linear in the
 number of processes, but in the number of files being
 rewritten. With bonnie++, you cannot make this distinction; two
 bonnie++ processes cannot re-write the same file.
 
 However, with the test program you provided you can actually re-write
 the same file from two (or more) processes. Testing this shows that
 the correct formula is actually 
 
 "vfs.numdirtybuffers rises to
 (vfs.dirtybufthresh) * (number of files being re-written)
 and as long as this is below vfs.hidirtybuffers, the system will not
 hang."
 
 So, any number of processes can re-write the same file, and the number
 of processes has no effect on vfs.numdirtybuffers.
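 
 (As a made-up example just to illustrate the formula: with
 vfs.dirtybufthresh at 6000 and vfs.hidirtybuffers at 10000, any number
 of processes re-writing a single file levels off at around 6000 dirty
 buffers, which is harmless, while re-writing two different files heads
 towards 12000, runs into vfs.hidirtybuffers, and that is when the
 hangs start.)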
 
 
 > I wonder why it is linear in the number of processes.  I saw indications
 > of similar behaviour (didn't test extensively).  1 bonnie created
 > about 2/3 of vfs.hidirtybuffers and 2 bonnies saturated at
 > vfs.hidirtybuffers.  This is with vfs.hidirtybuffers much smaller than
 > yours.
 
 Does this correlate with vfs.dirtybufthresh on your system as well?
 (i.e. is vfs.dirtybufthresh about 2/3 of your vfs.hidirtybuffers?)
 
 
 > I tried a bdwrite() for the async mount case.  Async mounts should give
 > delayed writes always, but it is a bugfeature that they give delayed
 > writes for critical metadata but normally async() writes for data (via
 > cluster_write()).  Using bdwrite here let me control the behaviour
 > using a combination of mount flags.  It became clearer that the
 > bugfeature is not just a bug.  Without cluster_write() forcing out
 > writes for async mounts, the write pressure is much the same as from
 > bonnie rewrite.
 
 I have tested this as well, and I can confirm that with a file system
 mounted with "-o async" I see the problematic behavior with "normal"
 writes (not re-writes) as well: while writing a large file (e.g. via
 dd), vfs.numdirtybuffers rises quickly to vfs.dirtybufthresh. On
 non-async file systems, I could only provoke this via re-writing.
 
 
 
 > >After reverting the source change, I have decided to try mounting the
 > >file system with "-o noclusterr,noclusterw", and re-test.
 > > [...]
 > >Further tests confirmed that "-o noclusterw" is sufficient to prevent
 > >the hangs and provide good performance.
 > 
 > I thought -noclusterw would be worse until now...
 
 I wanted to check the performance effects of write clustering more
 thoroughly, so I have tested 3 variants (with the test program you
 provided, using a 64 GB file (2x RAM)):
 
 1) Re-writing a single file with two file descriptors, file system
    mounted with default options.
 
 2) Re-writing a single file with two file descriptors, file system
    mounted with "-o noclusterw".
 
 3) Re-writing a single file via a single file descriptor, file system
    mounted with  "-o noclusterw" (required in this case to prevent the
    system hanging).
 
 First, I have tested on a logical volume on the RAID array (3 test
 runs each):
 
 1) (2 desc, w/ clustering)
 0.669u 97.119s 3:23.34 48.0%	5+187k 524441+524311io 0pf+0w
 0.472u 95.501s 3:22.40 47.4%	5+188k 523095+524311io 0pf+0w
 0.606u 93.263s 3:21.13 46.6%	5+188k 522999+524311io 0pf+0w
 
 2) (2 desc, -o noclusterw)
 0.524u 94.180s 3:19.76 47.4%	5+189k 524442+1048576io 0pf+0w
 0.622u 95.200s 3:23.30 47.1%	5+187k 523090+1048576io 0pf+0w
 0.475u 94.232s 3:19.96 47.3%	5+186k 522767+1048576io 0pf+0w
 
 3) (1 desc, -o noclusterw)
 0.922u 95.917s 3:33.74 45.3%	5+187k 524442+1048576io 0pf+0w
 0.679u 95.392s 3:32.62 45.1%	5+187k 522256+1048576io 0pf+0w
 0.976u 93.639s 3:33.68 44.2%	5+187k 521902+1048576io 0pf+0w
 
 As you can see, the performance differences are very small. If you
 consider the differences significant at all, then "-o noclusterw" is
 actually a tiny bit faster than clustering. You can also see
 clustering at work: the number of writes doubles for cases 2) and 3).
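 
 (For reference, the write counts match the file size: a 64 GB file in
 64 KB blocks is exactly 1048576 block writes, which is what the
 noclusterw cases show; the roughly 524k writes in case 1 are what you
 get when two 64 KB blocks are combined into one 128 KB cluster.)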
 
 I have then repeated the tests on a single disk, connected directly to
 mainboard SATA port, to take the RAID controller out of the
 equation. (With its CPU and on-board RAM, it certainly has the ability
 to cluster and re-order writes after the kernel sends them.) Since the
 tests take a long time to complete, I have only done a single test run
 for each case.
 
 
 1) (2 desc, w/ clustering)
 3.497u 575.616s 24:39.49 39.1%	5+187k 524445+524314io 0pf+0w
 
 2) (2 desc, -o noclusterw)
 5.960u 876.735s 33:35.75 43.7%	5+187k 524445+1048576io 0pf+0w
 
 3) (1 desc, -o noclusterw)
 7.014u 741.382s 29:56.98 41.6%	5+188k 524445+1048576io 0pf+0w
 
 Here, clustering does seem to have a positive effect, and the
 differences are more pronounced.
 
 So it really seems that clustering does provide performance benefits,
 but the RAID controller seems to be able to make up for the lack of
 clustering (either because clustering is disabled, or because it does
 not work effectively due to interspersed reads and seeks on the same
 file descriptor).
 
 
 > >I am now looking at vfs_cluster.c to see whether I can find which part
 > >is responsible for letting numdirtybuffers rise without bounds and
 > >why only *re* writing a file causes problems, not the initial
 > >writing. Any suggestions on where to start looking are very welcome.
 > 
 > It is very complicated, but it was easy to find its comments saying that
 > it tries not to force out the writes for non-sequential accesses.  I
 > am currently trying the following workarounds:
 
 I have decided to start testing with only a single change from the
 list of changes you provided:
  
 > % diff -u2 vfs_cluster.c~ vfs_cluster.c
 > % @@ -726,8 +890,13 @@
 > %  		 * are operating sequentially, otherwise let the buf or
 > %  		 * update daemon handle it.
 > % +		 *
 > % +		 * Algorithm changeback: ignore seqcount here, at least for
 > % +		 * now, to work around readers breaking it for writers.  It
 > % +		 * is too late to start ignoring after write pressure builds
 > % +		 * up, since not writing out here is the main source of the
 > % +		 * buildup.
 > %  		 */
 > %  		bdwrite(bp);
 > % -		if (seqcount > 1)
 > % -			cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
 > % +		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
 > %  		vp->v_clen = 0;
 > %  		vp->v_cstart = lbn + 1;
 
 And sure enough, a kernel with only this one-line change[*] is able to
 handle reading and re-writing a file through a single file descriptor
 just fine, with good performance, no hangs, and vfs.numdirtybuffers
 remaining low:
 
 4) (1 desc, cluster_wbuild_wb unconditionally)
 0.864u 96.145s 3:32.34 45.6%	5+188k 524441+524265io 0pf+0w
 0.976u 94.081s 3:29.70 45.3%	5+187k 489756+524265io 0pf+0w
 0.903u 95.676s 3:33.38 45.2%	5+188k 523124+524265io 0pf+0w
 
 Same test on a single disk, as above:
 
 4) (1 desc, cluster_wbuild_wb unconditionally)
 8.399u 822.055s 33:01.01 41.9%	5+187k 524445+524262io 0pf+0w
 
 Unfortunately, for the single disk case, the performance is the same
 as for the "-o noclusterw" case (even though the writes _are_
 clustered now; see the number of write operations).
 
 
 [*] I still need to confirm that this is really the only change in the
 kernel, but I'm running out of time now. I will test this next weekend.
 
 > This works fairly well.  Changing the algorithm back in all cases reduces
 > performance, but here we have usually finished with the buffer.  Even if
 > seqcount is fixed so that it isn't clobbered by reads, it might be right
 > to ignore it here, so as to reduce write pressure, until write pressure is
 > handled better.
 
 For me, this change makes the difference between a working system with
 good performance and a system that limps along and occasionally hangs
 completely, at least for this specific workload.
 
 Thank you for this!
 
 Next weekend, I want to test some scenarios with and without this
 change. Most importantly, I want to investigate
 
 a) behavior with random I/O (any suggestions on a benchmark or test
 program that can generate "representative" random I/O? IOzone?)
 
 b) behavior with sequential writes
 
 c) behavior with re-writing files, both via one and two file
 descriptors
 
 d) maybe test the other changes to vfs_cluster.c you provided, to see
 whether they make a difference.
 
 e) Investigate the role of vfs.dirtybufthresh. When vfs.numdirtybuffers
 reaches this number, the system switches to a mode where
 vfs.numdirtybuffers no longer increases and seems to handle the load
 just fine. I want to understand why the system sees no need to limit
 numdirtybuffers earlier. Maybe this can be used to bring re-write
 performance up to the "2 desc, with clustering" case.
 
 f) I have experienced system freezes when unmounting my test file
 system after running the respective tests. I'm not sure whether this
 is related to the current problem, or whether it is caused by some test
 patch that I accidentally left in the code. I need to check whether
 this is reproducible, and whether there is a specific workload that
 causes it.
 
 Klaus


