Date: Sun, 9 Jun 2013 23:20:01 GMT
From: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To: freebsd-bugs@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Message-ID: <201306092320.r59NK1X8046787@freefall.freebsd.org>
The following reply was made to PR kern/178997; it has been noted by GNATS.

From: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>, freebsd-gnats-submit@freebsd.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Mon, 10 Jun 2013 01:07:21 +0200
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130604052658.K1039@besplex.bde.org>
User-Agent: Mutt/1.4.2.3i

On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
> On Fri, 31 May 2013, Klaus Weber wrote:

Again, I have reordered some sections of your mail in my reply. (I have
also removed freebsd-bugs from Cc: - it seems that replies to PRs get
forwarded there automatically, and judging from the web interface to the
mailing list archives, the previous mails were all duplicated.)

> > So you are correct: bonnie++ re-reads the file that was created
> > previously in the "Writing intelligently..." phase in blocks, modifies
> > one byte in the block, and writes the block back.
> >
> > Something in this specific workload is triggering the huge buildup of
> > numdirtybuffers when write-clustering is enabled.
>
> I can explain this.  Readers and writers share the offset for the
> sequential heuristic (it is per-open-file), so cluster_write() cannot
> tell that bonnie's writes are sequential.  The sequential_heuristic()
> function sees the offset moving back and forth, which normally means
> random access, so it tells cluster_write() that the access is random,
> although bonnie seeks back after each read() so that all the
> writes() are sequential.

Indeed. Using the test program you provided, I can confirm that reading
and writing to the same file via _different_ file descriptors works
without hangs and with good performance (similar to "-o noclusterw", see
below for details). The vfs.numdirtybuffers count remained low, and write
rates were pretty stable.
Since your program is more flexible than bonnie++, I have done all tests
in this mail with it.

> cluster_write() rarely if ever pushes out
> random writes, so the write pressure builds up to a saturation value
> almost instantly.  Then all subsequent writes work even more poorly
> due to the pressure.  Truly random writes would give similar buildup
> of pressure.

I have briefly tested and confirmed this as well (see below). Mounting a
file system with "-o async" shows the "quickly rising vfs.numdirtybuffers"
symptom already during writing (not re-writing) a file, e.g. with dd. I
could not get the system to hang, but I did not test this very thoroughly.

> > So while a file system created with the current (32/4k) or old (16/2k)
> > defaults does prevent the hangs, it also reduces the sequential write
> > performance to 70% and 43% of a 64/8k fs.
>
> XXX
> I think bufdaemon and other write-pressure handling code just don't work
> as well as upper layers for clustering and ordering the writes.  They
> are currently broken, as shown by the hangs.  Clustering shouldn't be
> generating so much pressure.  I can explain why it does (see below).
> But random i/o can easily generate the same pressure, so bufdaemon etc.
> need to handle it.  Also, pressure should be the usual case for async
> mounts, and bufdaemon etc. should be able to do clustering and ordering
> _better_ than upper layers, since they have more blocks to work with.
> cluster_write() intentionally leaves as much as possible to bufdaemon
> in order to potentially optimize it.

> > In all cases, the vfs.numdirtybuffers count remained fairly small as
> > long as the bonnie++ processes were writing the testfiles. It rose to
> > vfs.hidirtybuffers (slower with only one process in "Rewriting", much
> > faster when both processes are rewriting).
>
> The slow buildup is caused by fragmentation.  If the files happen to
> be laid out contiguously, then cluster_write() acts like bawrite(),
> but better.
> With your 64K-blocks and FreeBSD's 128K max cluster size,
> it normally combines just 2 64K-blocks to create a 128K-cluster, and
> does 1 write instead of 2.  After this write, both methods have the
> same number of dirty buffers (none for data, but a couple for metadata).

> [Analysis of bonnie++'s write pattern]

> However, with fragmentation, cluster_write() "leaks" a dirty buffer
> although it is full, while bawrite() just writes all full buffers.  The
> leakage is because cluster_write() wants physical contiguity.  This is
> too much to ask for.  The leak may be fast if the file system is very
> fragmented.

In all of my (older) tests with bonnie++, the file system was newfs'ed
before mounting it, so there was no fragmentation for me.

> > I have tried to tune system parameters as per your advice, in an attempt
> > to get a 64/8k fs running stable and with reasonable write performance.
> > (dd: results omitted for brevity, all normal for 64/8k fs)
>
> I got mostly worse behaviour by tuning things related to dirtybuffers.

Me too, performance-wise. The only exception was reducing
vfs.dirtybufthresh so that vfs.numdirtybuffers no longer reaches
vfs.hidirtybuffers; this is needed to prevent the hangs. The exact value
does not seem to matter much.

> > [... You managed to fix the hangs, at a cost of too much performance.]
>
> > By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
> > I found that
> > (vfs.dirtybufthresh) * (number of bonnie++ processes) must be slightly
> > less than vfs.hidirtybuffers for reasonable performance without hangs.
> >
> > vfs.numdirtybuffers rises to
> > (vfs.dirtybufthresh) * (number of bonnie++ processes)
> > and as long as this is below vfs.hidirtybuffers, the system will not
> > hang.

Actually, further testing revealed that it is not linear in the number of
processes, but in the number of files being rewritten. With bonnie++, you
cannot make this distinction; two bonnie++ processes cannot re-write the
same file.
However, with the test program you provided, you can actually re-write
the same file from two (or more) processes. Testing this shows that the
correct formula is actually "vfs.numdirtybuffers rises to
(vfs.dirtybufthresh) * (number of files being re-written), and as long as
this is below vfs.hidirtybuffers, the system will not hang." So, any
number of processes can re-write the same file, and the number of
processes has no effect on vfs.numdirtybuffers.

> I wonder why it is linear in the number of processes.  I saw indications
> of similar behaviour (didn't test extensively).  1 bonnie created
> about 2/3 of vfs.hidirtybuffers and 2 bonnies saturated at
> vfs.hidirtybuffers.  This is with vfs.hidirtybuffers much smaller than
> yours.

Does this correlate with vfs.dirtybufthresh on your system as well? (i.e.
is vfs.dirtybufthresh about 2/3 of your vfs.hidirtybuffers?)

> I tried a bdwrite() for the async mount case.  Async mounts should give
> delayed writes always, but it is a bugfeature that they give delayed
> writes for critical metadata but normally async() writes for data (via
> cluster_write()).  Using bdwrite here let me control the behaviour
> using a combination of mount flags.  It became clearer that the
> bugfeature is not just a bug.  Without cluster_write() forcing out
> writes for async mounts, the write pressure is much the same as from
> bonnie rewrite.

I have tested this as well, and I can confirm that with a file system
mounted with "-o async" I see the problematic behavior with "normal"
writes (not re-writes) as well: while writing a large file (e.g. via dd),
vfs.numdirtybuffers rises quickly to vfs.dirtybufthresh. On non-async
file systems, I could only provoke this via re-writing.

> > After reverting the source change, I have decided to try mounting the
> > file system with "-o noclusterr,noclusterw", and re-test.
> [...]
> > Further tests confirmed that "-o noclusterw" is sufficient to prevent
> > the hangs and provide good performance.
> I thought -noclusterw would be worse until now...

I wanted to check the performance effects of write clustering more
thoroughly, so I have tested 3 variants (with the test program you
provided, using a 64 GB file (2x RAM)):

1) Re-writing a single file with two file descriptors, file system
   mounted with default options.

2) Re-writing a single file with two file descriptors, file system
   mounted with "-o noclusterw".

3) Re-writing a single file via a single file descriptor, file system
   mounted with "-o noclusterw" (required in this case to prevent the
   system hanging).

First, I have tested on a logical volume on the RAID array (3 test runs
each):

1) (2 desc, w/ clustering)
0.669u 97.119s 3:23.34 48.0% 5+187k 524441+524311io 0pf+0w
0.472u 95.501s 3:22.40 47.4% 5+188k 523095+524311io 0pf+0w
0.606u 93.263s 3:21.13 46.6% 5+188k 522999+524311io 0pf+0w

2) (2 desc, -o noclusterw)
0.524u 94.180s 3:19.76 47.4% 5+189k 524442+1048576io 0pf+0w
0.622u 95.200s 3:23.30 47.1% 5+187k 523090+1048576io 0pf+0w
0.475u 94.232s 3:19.96 47.3% 5+186k 522767+1048576io 0pf+0w

3) (1 desc, -o noclusterw)
0.922u 95.917s 3:33.74 45.3% 5+187k 524442+1048576io 0pf+0w
0.679u 95.392s 3:32.62 45.1% 5+187k 522256+1048576io 0pf+0w
0.976u 93.639s 3:33.68 44.2% 5+187k 521902+1048576io 0pf+0w

As you can see, the performance differences are very small. If you
consider the differences significant, then "-o noclusterw" is actually a
tiny bit faster than clustering. You can also see clustering at work: the
number of writes doubles for cases 2) and 3).

I have then repeated the tests on a single disk, connected directly to a
mainboard SATA port, to take the RAID controller out of the equation.
(With its CPU and on-board RAM, it certainly has the ability to cluster
and re-order writes after the kernel sends them.) Since the tests take a
long time to complete, I have only done a single test run for each case.
1) (2 desc, w/ clustering)
3.497u 575.616s 24:39.49 39.1% 5+187k 524445+524314io 0pf+0w

2) (2 desc, -o noclusterw)
5.960u 876.735s 33:35.75 43.7% 5+187k 524445+1048576io 0pf+0w

3) (1 desc, -o noclusterw)
7.014u 741.382s 29:56.98 41.6% 5+188k 524445+1048576io 0pf+0w

Here, clustering does seem to have a positive effect, and the differences
are more pronounced. So it really seems that clustering does provide
performance benefits, but the RAID controller seems to be able to make up
for the lack of clustering (either because clustering is disabled, or
because it does not work effectively due to interspersed reads and seeks
on the same file descriptor).

> > I am now looking at vfs_cluster.c to see whether I can find which part
> > is responsible for letting numdirtybuffers rise without bounds and
> > why only *re*-writing a file causes problems, not the initial
> > writing. Any suggestions on where to start looking are very welcome.
>
> It is very complicated, but it was easy to find its comments saying that
> it tries not to force out the writes for non-sequential accesses.  I
> am currently trying the following workarounds:

I have decided to start testing with only a single change from the list
of changes you provided:

> % diff -u2 vfs_cluster.c~ vfs_cluster.c
> % @@ -726,8 +890,13 @@
> %  	 * are operating sequentially, otherwise let the buf or
> %  	 * update daemon handle it.
> % +	 *
> % +	 * Algorithm changeback: ignore seqcount here, at least for
> % +	 * now, to work around readers breaking it for writers.  It
> % +	 * is too late to start ignoring after write pressure builds
> % +	 * up, since not writing out here is the main source of the
> % +	 * buildup.
> %  	 */
> %  	bdwrite(bp);
> % -	if (seqcount > 1)
> % -		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
> % +	cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
> %  	vp->v_clen = 0;
> %  	vp->v_cstart = lbn + 1;

And sure enough, a kernel with only this one-line change[*] is able to
handle reading and re-writing a file through a single file descriptor
just fine: good performance, no hangs, vfs.numdirtybuffers remains low:

4) (1 desc, cluster_wbuild_wb unconditionally)
0.864u 96.145s 3:32.34 45.6% 5+188k 524441+524265io 0pf+0w
0.976u 94.081s 3:29.70 45.3% 5+187k 489756+524265io 0pf+0w
0.903u 95.676s 3:33.38 45.2% 5+188k 523124+524265io 0pf+0w

Same test on a single disk, as above:

4) (1 desc, cluster_wbuild_wb unconditionally)
8.399u 822.055s 33:01.01 41.9% 5+187k 524445+524262io 0pf+0w

Unfortunately, for the single-disk case, the performance is the same as
for the "-o noclusterw" case (even though the writes _are_ clustered now,
see the number of write operations).

[*] I still need to confirm that this is really the only change in the
kernel, but I'm running out of time now. I will test this next weekend.

> This works fairly well.  Changing the algorithm back in all cases reduces
> performance, but here we have usually finished with the buffer.  Even if
> seqcount is fixed so that it isn't clobbered by reads, it might be right
> to ignore it here, so as to reduce write pressure, until write pressure is
> handled better.

For me, this change makes the difference between a working system with
good performance and a system that limps along and occasionally hangs
completely, for this specific workload. Thank you for this!

Next weekend, I want to test some scenarios with and without this change.
Most importantly, I want to investigate:

a) behavior with random I/O (any suggestions on a benchmark or test
   program that can generate "representative" random I/O? IOzone?)
b) behavior with sequential writes

c) behavior with re-writing files, both via one and two file descriptors

d) maybe test the other changes to vfs_cluster.c you provided, to see
   whether they make a difference.

e) Investigate the role of vfs.dirtybufthresh. When vfs.numdirtybuffers
   reaches this number, the system switches to a mode where
   vfs.numdirtybuffers no longer increases and seems to handle the load
   just fine. I want to understand why the system sees no need to limit
   numdirtybuffers earlier. Maybe this can be used to improve re-write
   performance to match the "2 desc, with clustering" case.

f) I have experienced system freezes when unmounting my test file system
   after running the respective tests. I'm not sure whether this is
   related to the current problem, or whether it is caused by some test
   patch that I accidentally left in the code. I need to check whether
   this is reproducible, and whether there is a specific workload that
   causes this.

Klaus