Date:      Mon, 27 May 2013 06:00:01 GMT
From:      Bruce Evans <brde@optusnet.com.au>
To:        freebsd-bugs@FreeBSD.org
Subject:   Re: kern/178997: Heavy disk I/O may hang system
Message-ID:  <201305270600.r4R601ab055177@freefall.freebsd.org>

The following reply was made to PR kern/178997; it has been noted by GNATS.

From: Bruce Evans <brde@optusnet.com.au>
To: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Cc: freebsd-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Mon, 27 May 2013 15:57:56 +1000 (EST)

 On Sun, 26 May 2013, Klaus Weber wrote:
 
 >> Environment:
 > FreeBSD filepile 9.1-STABLE FreeBSD 9.1-STABLE #5 r250475: Sun May 12 19:14:21 CEST 2013     root@filepile:/usr/obj/usr/src/sys/FILEPILE  amd64
 > (Kernel config has some drivers removed and debugging options added. Tested with GENERIC as well)
 >> Description:
 > Heavy disk I/O (two bonnie++ processes working on the same disk simultaneously) causes an extreme degradation in disk throughput (combined throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows when both bonnie++ processes are in the "Rewriting..." phase.
 
 Please use the unix newline character in mail.
 
 A huge drop in disk throughput is normal with multiple processes,
 unfortunately.  Perhaps not quite this much.
 
 > Sometimes, the systems "hangs" (throughput reaches 0 MB/sec); in that case, the system does not recover from this state. "Hang" means that the system responds to ping etc., processes that are already running will continue, but every command that requires disk accesses (even on disks other than the disk being tested) just hangs. If shutdown can be started without disk access (i.e. still in cache), the system will begin the shutdown, but fails to sync all buffers.
 
 Hangs are bugs though.
 
 I have been arguing with kib@ about some methods of handling heavy disk
 i/o being nonsense since they either make the problem worse (by switching
 to direct unclustered writes, so that slow i/o goes even slower) or have
 no effect except to complicate the analysis of the problem (because they
 are old hackish methods, and newer better methods make the cases handled
 by the old methods unreachable).  But the analysis is complicated, and
 we couldn't agree on anything.
 
 One thing I noticed more recently than my last mail to kib about this
 is that even a single heavy disk write (say dd bs=256m count=2
 </dev/zero >foo) can block another non-write-bound process (say iostat
 1 or systat -v 1 to watch the progress) from starting up (sometimes
 even when it has just been run to load it into cache(s)) until the dd
 completes 10 seconds later.  This was on a 2-core system running a
 fairly old version of FreeBSD (8.0-CURRENT from ~2008).  I found that
 the problem could be fixed by killing cluster_write() by turning it into
 bdwrite() (by editing the running kernel using ddb, since this is easier
 than rebuilding the kernel).  I was trying many similar things since I
 had a theory that cluster_write() is useless.  Actually, cluster_write()
 gives some minor optimizations, but this misbehaviour indicates that it
 has a resource starvation bug.  My theory is that cluster_write()
 and cluster_read() share the limited resource of pbufs.
 pbufs are not managed as carefully as normal buffers.  In particular,
 there is nothing to limit write pressure from pbufs like there is for
 normal buffers.  However, this behaviour doesn't happen on the much older
 version of FreeBSD that I normally use (5.2-CURRENT from 2004 with many
 local modifications).  It starts iostat to watch dd instantly, even running
 on a single core system.
 
 > gpart create -s gpt da0
 > gpart add -s 500G -t freebsd-ufs -l test0 -a 1M da0
 > newfs -b 64k -f 8k /dev/da0p1
 > mount /dev/da0p1 /mnt && mkdir /mnt/t{1,2} && chmod g+w /mnt/t{1,2}
 
 The default for newfs is -b 32k.  This asks for buffer cache fragmentation.
 Someone increased the default from 16k to 32k without changing the buffer
 cache's preferred size (BKVASIZE = 16K).  BKVASIZE has always been too
 small, but on 32-bit arches kernel virtual memory is too limited to have
 a larger BKVASIZE by default.  BKVASIZE is still 16K on all arches
 although this problem doesn't affect 64-bit arches.
 
 -b 64k is worse.
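If you retest with a block size that doesn't fight BKVASIZE, the setup from the PR would become something like this (a sketch only; da0p1 as in the report, -f chosen to keep the usual 8:1 block:fragment ratio; this destroys the existing file system):

```shell
# Recreate the test file system with 16K blocks, matching the buffer
# cache's preferred buffer size (BKVASIZE = 16K), instead of -b 64k.
newfs -b 16k -f 2k /dev/da0p1
mount /dev/da0p1 /mnt
```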
 
 > When both bonnie++ processes are in their "Rewriting" phase, the system hangs within a few seconds. Both bonnie++ processes are in state "nbufkv". bufdaemon takes about 40% CPU time and is in state "qsleep" when not active.
 
 You got the buffer cache fragmentation that you asked for.
 
 Buffer cache fragmentation caused serious problems 10-20 years ago, but the
 ones causing hangs were supposed to be fixed about 12 years ago and the ones
 causing extreme slowness for just the software parts of handling
 fragmentation were fixed a little later.  But apparently there are
 still some.
 
 However, I have never been able to reproduce serious fragmentation problems
 from using too-large-block sizes, or demonstrate significant improvements
 from avoiding the known fragmentation problem by increasing BKVASIZE.
 Perhaps my systems are too small, or have tuning or local changes that
 accidentally avoid the problem.
 
 Apparently you found a way to reproduce the serious fragmentation
 problems.  Try using a block size that doesn't ask for the problem.
 Increasing BKVASIZE would take more work than this, since although it
 was intended to be a system parameter which could be changed to reduce
 the fragmentation problem, one of the many bugs in it is that it was
 never converted into a "new" kernel option.  Another of the bugs in
 it is that doubling it halves the number of buffers, so doubling it
 does more than use twice as much kva.  This severely limited the number
 of buffers back when memory sizes were 64MB.  It is not a very
 significant limitation if the memory size is 1GB or larger.
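The halving is simple arithmetic (the 2GB of buffer kva here is an illustrative figure, in line with the maxbufspace example later in this mail):

```shell
# nbuf is roughly (buffer map kva) / BKVASIZE, so doubling BKVASIZE
# halves the number of buffers while using the same amount of kva.
KVA=$((2 * 1024 * 1024 * 1024))                     # ~2GB of buffer kva
echo "BKVASIZE=16K: nbuf ~ $((KVA / (16 * 1024)))"  # 131072 buffers
echo "BKVASIZE=32K: nbuf ~ $((KVA / (32 * 1024)))"  # 65536 buffers
```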
 
 > I have also briefly tested a single disk instead of a RAID volume. In that case, I could not get the system to hang, but performance drops from ~150 MB/sec to ~1-3 MB/sec. I do not know if that indicates a problem with the arcmsr driver, or whether I cannot create "enough" I/O load with a single disk:
 
 I get ~39MB/sec with 1 "dd bs=128m count=2 </dev/zero >foo" writing to
 a nearly full old ffs file system on an old PATA disk, and 2 times
 20MB/sec with 2 dd's.  This is almost as good as possible.  The
 configuration was somewhat flawed:
 - the writes all fitted in memory because my disks are too full to hold
    more.  This shouldn't be a problem for write benchmarks, because write
    pressure is (too) limited by writing out dirty buffers (too) early to
    limit write pressure, so writes are mostly physical.
 - the fs block size was 16K.  This is not a flaw, just the right size,
    but other sizes should be tested.
 - the file system was mounted async.  This should have little effect,
    since for large writes all data is written async anyway (the write
    pressure handling causes async writes instead of delayed writes even
    for not-so-large writes).
 - ffs was modified to remove its bogus write pressure handling.  This
    should have no effect, but I didn't verify it.
 
 > I have, however, logged the sysctl vfs subtree every second while the problem occurs:
 > As long as both bonnie++ are in "Writing intelligently..." phase, vfs.numdirtybuffers stays mostly between 80-100. When the first bonnie++ goes to "Rewriting..." (at 3:40 below), vfs.numdirtybuffers starts to increase (about 150-350 new dirty buffers per second). When the second bonnie++ goes to "Rewriting..." as well (at 4:24), this speeds up a lot (about 3000-5000 new dirty buffers/sec). A few seconds later, vfs.numdirtybuffers reaches its limit, and remains there until the machine is rebooted:
 >
 > 00-03-28.log:vfs.numdirtybuffers: 107
 > 00-03-29.log:vfs.numdirtybuffers: 106
 > 00-03-30.log:vfs.numdirtybuffers: 105
 > ...
 > [first bonnie++ starts Rewriting]
 > 00-03-41.log:vfs.numdirtybuffers: 294
 > 00-03-42.log:vfs.numdirtybuffers: 404
 > 00-03-43.log:vfs.numdirtybuffers: 959
 > 00-03-44.log:vfs.numdirtybuffers: 1071
 
 There are various write pressure handling mechanisms that have different
 thresholds and get in each other's way a bit.
 
 The dirty buffer watermarks are presumably much higher than 105 or even
 1071.  Other mechanisms apparently work well and limit the number of
 dirty buffers to only 107 with 1 writer.  Writes to memory can go much
 faster than writes to the disk even when the latter go at full speed, but
 some mechanism blocks writers (too) early so the 1 writer soon reaches a
 steady state with not many dirty buffers.
 
 > 00-03-45.log:vfs.numdirtybuffers: 1155
 > 00-03-46.log:vfs.numdirtybuffers: 1253
 > 00-03-47.log:vfs.numdirtybuffers: 1366
 > 00-03-48.log:vfs.numdirtybuffers: 1457
 > ...
 > 00-04-21.log:vfs.numdirtybuffers: 8156
 > 00-04-22.log:vfs.numdirtybuffers: 8362
 > 00-04-23.log:vfs.numdirtybuffers: 8558
 > [second bonnie goes Rewriting as well]
 > 00-04-24.log:vfs.numdirtybuffers: 11586
 > 00-04-25.log:vfs.numdirtybuffers: 16325
 > 00-04-26.log:vfs.numdirtybuffers: 24333
 > ...
 > 00-04-54.log:vfs.numdirtybuffers: 52096
 > 00-04-57.log:vfs.numdirtybuffers: 52098
 > 00-05-00.log:vfs.numdirtybuffers: 52096
 > [ etc. ]
 
 This is a rather large buildup and may indicate a problem.  Try reducing
 the dirty buffer watermarks.  Their default values are mostly historical
 nonsense.  They are just some fixed fraction (about 1/4) of the number
 of buffers.  With multiple GB of memory, the number of buffers may be
 huge.  On a system with 24GB, vfs.nbuf is 157800, and vfs.dirtybufthresh
 is 35523.  But the size of main memory has nothing to do with the
 correct limit on the number of dirty buffers (unless the latter is not
 being used to limit write pressure, but only to limit starvation of
 buffers for readers.  This is perhaps all it is supposed to do, but see
 below).  We want buildups of dirty buffers to be limited so that the
 system never gets too far behind on physical writes.  The correct limit
 depends on the physical write capabilities of the machine and on which
 write channels have pressure on them.  Default global limits don't work
 at all for this.  You can fix the defaults but you can't distribute the
 limits properly across channels, or easily keep the limits up to date
 as channels are added.  It is easy to add slow USB channels.  These
 won't increase the write capability much, but it is very easy for a
 slow disk or channel to it to grab many dirty buffers and keep them for
 a long time so that faster disks are starved.
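For experimenting, the watermarks can be lowered at runtime (the sysctl names are the ones discussed in this mail; the values are only illustrative guesses, not recommendations):

```shell
# Inspect the current defaults (~nbuf/4 for the high watermark).
sysctl vfs.nbuf vfs.lodirtybuffers vfs.hidirtybuffers vfs.dirtybufthresh
# Try much lower limits, e.g. a couple of thousand dirty buffers.
sysctl vfs.lodirtybuffers=1000
sysctl vfs.hidirtybuffers=2000
sysctl vfs.dirtybufthresh=1800
```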
 
 Other mechanisms handle write pressure better than the simple dirty
 buffer count one can (others count the amount of i/o, which is more
 relevant to how long it will take).  So assume that any effect of
 the dirty buffer count mechanism on write pressure is unintentional.
 There are still the following problems with it:
 - as above, slow channels can starve fast channels, thus giving a
    severe side effect on the write pressure (at best the users of the
    fast channels block so that they are slowed down to the speed of
    the slow channels)
 - buffer starvation for readers can happen anyway.  vfs.nbuf is the
    number of buffers that are allocated, but not all of these can be
    used when some file systems use buffer sizes larger than about
    BKVASIZE.  The worst case is when all file systems use a block size
    of 64KB.  Then only about 1/4 of the allocated buffers can be used.
    64KB.  Then only about 1/4 of the allocated buffers can be used.
    The number that can be used is limited by vfs.maxbufspace and
    related variables.  The default for vfs.maxbufspace is controlled
    indirectly by BKVASIZE (e.g., maxbufspace = 2GB goes with nbuf =
    128K; the ratio is 16K = BKVASIZE).
 
 So when you ask for buffer cache fragmentation with -b 64k, you get
 even more.  The effective nbuf is reduced by a factor of 4.  This
 means that all the dirty buffer count watermarks except one are
 slightly above the effective nbuf: the higher watermarks cannot
 quite be reached, and when they are nearly reached, readers are
 starved of buffers.
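The factor of 4 is just maxbufspace divided by the buffer size actually in use (numbers as in the maxbufspace example above):

```shell
# With ~2GB of maxbufspace, 64K buffers leave only a quarter as many
# usable buffers as the 16K (BKVASIZE-sized) case.
MAXBUFSPACE=$((2 * 1024 * 1024 * 1024))
echo "16K buffers: $((MAXBUFSPACE / (16 * 1024)))"   # 131072 (full nbuf)
echo "64K buffers: $((MAXBUFSPACE / (64 * 1024)))"   # 32768 (nbuf / 4)
```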
 
 > vfs.hidirtybuffers: 52139
 > vfs.lodirtybuffers: 26069
 > (the machine has 32 GB RAM)
 
 Fairly consistent with my examples of a machine with 24GB RAM.
 The upper watermark is 33% higher, etc.  Your nbuf will be more like
 200K than 150K.  You reached numdirtybuffers = 52096.  At 64K each,
 that's about 3GB.  maxbufspace will be about the same, and bufspace
 (the amount used) will be not much smaller.  There will be little
 space for buffers for readers, and the 2 writers apparently manage
 to starve even each other, perhaps for similar reasons.  The backlog
 of 3GB would take 20 seconds to clear even at 150MB/sec, and at 1MB/sec
 it would take almost an hour, so it is not much better than a hang.
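The backlog arithmetic, spelled out (52096 dirty buffers at 64K each):

```shell
# Size of the dirty-buffer backlog and how long it takes to drain.
BUFS=52096
BACKLOG_MB=$((BUFS * 64 * 1024 / 1024 / 1024))    # 64K per buffer
echo "backlog: ${BACKLOG_MB} MB"                  # 3256 MB, about 3GB
echo "at 150 MB/sec: $((BACKLOG_MB / 150)) sec"   # ~21 seconds
echo "at 1 MB/sec:   $((BACKLOG_MB / 60)) min"    # ~54 minutes
```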
 
 > I have found http://freebsd.1045724.n5.nabble.com/7-1-RELEASE-I-O-hang-td3966945.html , which seems to describe the same issue (also in combination with a RAID controller, albeit a different one).
 
 I think the hardware is usually not the problem (at least after kib's fix
 for the above), except that when it is large it can stress the software
 more.
 
 Bruce
