Date:      Mon, 27 May 2013 20:57:32 +0200
From:      Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-bugs@FreeBSD.org, freebsd-gnats-submit@FreeBSD.org, Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Subject:   Re: kern/178997: Heavy disk I/O may hang system
Message-ID:  <20130527185732.GA95744@unix-admin.de>
In-Reply-To: <20130527135103.H919@besplex.bde.org>
References:  <201305261951.r4QJpn9Z071712@oldred.FreeBSD.org> <20130527135103.H919@besplex.bde.org>


First of all, thank you very much for looking into this, and for your
detailed explanations. Much appreciated.

On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
> On Sun, 26 May 2013, Klaus Weber wrote:
> >>Description:
> >Heavy disk I/O (two bonnie++ processes working on the same disk
> >simultaneously) causes an extreme degradation in disk throughput (combined
> >throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows
> >up when both bonnie++ processes are in the "Rewriting..." phase.

 
> Please use the unix newline character in mail.

My apologies. I submitted the report via the web-interface and did not
realize that it would come out this way.

> A huge drop in disk throughput is normal with multiple processes,
> unfortunately.  Perhaps not quite this much.

Yes, the comparatively modest performance drop from ~1 GB/sec to ~600
or even ~400 MB/sec does not concern me (this might even be an issue
with the arcmsr driver or the controller's firmware; I found some
hints pointing in that direction).

However, a performance drop of three orders of magnitude for just two
concurrent processes seemed to indicate a problem.
 
> Hangs are bugs though.
> 
> I have been arguing with kib@ about some methods of handling heavy disk
> i/o being nonsense since they either make the problem worse (by switching
> to direct unclustered writes, so that slow i/o goes even slower) or have
> no effect except to complicate the analysis of the problem (because they
> are old hackish methods, and newer better methods make the cases handled
> by the old methods unreachable).  But the analysis is complicated, and
> we couldn't agree on anything.

The machine where I experience this problem is not in production
yet. I can reboot it at any time, test patches, etc. Just let me know
if I can do anything helpful. The only limitation is that I usually
have access to the server only on weekends.

> [single process with heavy write access can block other
> non-write-bound process]

I have not experienced this so far, but I can test this when I have
access to the server next time.

> I found that
> the problem could be fixed by killing cluster_write() by turning it into
> bdwrite() (by editing the running kernel using ddb, since this is easier
> than rebuilding the kernel).  I was trying many similar things since I
> had a theory that cluster_write() is useless.  [...]

If that would provide a useful data point, I could test whether it
makes a difference on my system. What changes would be required to
test this?

Surely it's not as easy as replacing the function body of
cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?
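
In other words, something like this? (Just a sketch on my part; I
would copy the exact prototype from the local sys/kern/vfs_cluster.c
rather than trust the argument list below.)

  /*
   * Hypothetical test change, for this experiment only: make
   * cluster_write() fall back to a plain delayed write instead of
   * doing any clustering.
   */
  void
  cluster_write(struct vnode *vp, struct buf *bp, u_quad_t filesize,
      int seqcount)
  {

          bdwrite(bp);
  }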

> My theory for what the bug is is that
> cluster_write() and cluster_read() share the limited resource of pbufs.
> pbufs are not managed as carefully as normal buffers.  In particular,
> there is nothing to limit write pressure from pbufs like there is for
> normal buffers.  

Is there anything I can do to confirm or rebut this? Is the number of
pbufs in use visible via a sysctl, or could I add debug printfs that
are triggered when certain limits are reached?
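
If it is not exported anywhere yet, would something along these lines
be a sensible way to watch it? (Again only a sketch; "pbuf_inuse" is a
name I made up, and the increment/decrement would have to go wherever
pbufs are actually handed out and returned, e.g. getpbuf()/relpbuf().)

  /*
   * Made-up debug counter: bump it when a pbuf is taken, drop it when
   * one is returned, and export it read-only so it can be watched
   * with "sysctl vfs.pbuf_inuse" while the test runs.
   */
  static int pbuf_inuse;
  SYSCTL_INT(_vfs, OID_AUTO, pbuf_inuse, CTLFLAG_RD,
      &pbuf_inuse, 0, "pbufs currently in use (debug)");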

> >newfs -b 64k -f 8k /dev/da0p1
> 
> The default for newfs is -b 32k.  This asks for buffer cache fragmentation.
> Someone increased the default from 16k to 32k without changing the buffer
> cache's preferred size (BKVASIZE = 16K).  BKVASIZE has always been too
> small, but on 32-bit arches kernel virtual memory is too limited to have
> a larger BKVASIZE by default.  BKVASIZE is still 16K on all arches
> although this problem doesn't affect 64-bit arches.
> 
> -b 64k is worse.

Thank you for this explanation. I was not aware that -b 64k (or even
the default values for newfs) would have this effect. I will repeat
the tests with 32/4k and 16/2k, although I seem to remember that 64/8k
provided a significant performance boost over the defaults. This, and
the reduced fsck times, were my original motivation for choosing the
larger values.
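
Concretely, I plan to recreate the file system with

  newfs -b 32k -f 4k /dev/da0p1     (the current newfs default sizes)
  newfs -b 16k -f 2k /dev/da0p1     (block size matching BKVASIZE)

and rerun the same two-process bonnie++ test on each.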

Given the potentially drastic effects of block sizes other than
16/2k, maybe a warning should be added to the newfs manpage? I only
found the strong advice to maintain an 8:1 block:fragment ratio.


> >When both bonnie++ processes are in their "Rewriting" phase, the system 
> >hangs within a few seconds. Both bonnie++ processes are in state "nbufkv". 
> >bufdaemon takes about 40% CPU time and is in state "qsleep" when not 
> >active.
> 
> You got the buffer cache fragmentation that you asked for.

Looking at vfs_bio.c, I see that it contains defragmentation code.
Should I try adding some debug output to this code to get some insight
into why it does not work, or is not as effective as it should be?
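
For example, something like this near the point where getnewbuf()
decides to defragment (placement and variable names are guesses on my
part, to be adapted to the actual code in vfs_bio.c):

  /*
   * Rate-limited debug output when the defrag path is entered, to see
   * how often defragmentation is attempted and what the counters look
   * like at that moment.
   */
  if (defrag) {
          static int ratelimit;

          if (ratelimit++ % 100 == 0)
                  printf("getnewbuf: defragging, numdirtybuffers=%d "
                      "bufspace=%ld\n", numdirtybuffers, bufspace);
  }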

> Apparently you found a way to reproduce the serious fragmentation
> problems.  

A key factor seems to be the "Rewriting" operation. I see no problem
during the "normal" writing, nor could I reproduce it with concurrent
dd runs.

> Try using a block size that doesn't ask for the problem.

Will do, although for production use I would really prefer a 64/8k
system, due to the higher performance.

> Increasing BKVASIZE would take more work than this, since although it
> was intended to be a system parameter which could be changed to reduce
> the fragmentation problem, one of the many bugs in it is that it was
> never converted into a "new" kernel option.  Another of the bugs in
> it is that doubling it halves the number of buffers, so doubling it
> does more than use twice as much kva.  This severely limited the number
> of buffers back when memory sizes were 64MB.  It is not a very
> significant limitation if the memory size is 1GB or larger.

Should I try to experiment with a BKVASIZE of 65536? If so, can I
somehow increase the number of buffers again? Also, after modifying
BKVASIZE, is it sufficient to compile and install a new kernel, or do
I have to build and install the entire world?
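
My understanding is that the change itself would just be the one line
in sys/sys/param.h,

  #define BKVASIZE        65536           /* experiment; default 16384 */

and that, if needed, nbuf could be forced back up via the kern.nbuf
loader tunable in /boot/loader.conf - but please correct me if I have
that wrong.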


> I get ~39MB/sec with 1 "dd bs=128k count=2 /dev/zero >foo" writing to
> a nearly full old ffs file system on an old PATA disk, and 2 times
> 20MB/sec with 2 dd's.  This is almost as good as possible.

I agree. Performance with dd, or during bonnie++'s "Writing
intelligently" phase, is very reasonable, both with a single process
and with two processes simultaneously.

Something specific to the "Rewriting..." workload is triggering the
problem.
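
If it would help to take bonnie++ out of the picture, I could also try
a minimal read-modify-write loop over an existing large file, which is
what I understand the rewrite phase essentially does (read a block,
dirty it, seek back, write it in place). A sketch of what I have in
mind:

  #include <fcntl.h>
  #include <unistd.h>

  int
  main(void)
  {
          char buf[65536];
          ssize_t n;
          int fd;

          fd = open("testfile", O_RDWR);  /* existing, large file */
          if (fd == -1)
                  return (1);
          while ((n = read(fd, buf, sizeof(buf))) > 0) {
                  buf[0]++;               /* dirty the block */
                  if (lseek(fd, -n, SEEK_CUR) == -1)
                          break;
                  if (write(fd, buf, (size_t)n) != n)
                          break;
          }
          close(fd);
          return (0);
  }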


> >[second bonnie goes Rewriting as well]
> >00-04-24.log:vfs.numdirtybuffers: 11586
> >00-04-25.log:vfs.numdirtybuffers: 16325
> >00-04-26.log:vfs.numdirtybuffers: 24333
> >...
> >00-04-54.log:vfs.numdirtybuffers: 52096
> >00-04-57.log:vfs.numdirtybuffers: 52098
> >00-05-00.log:vfs.numdirtybuffers: 52096
> >[ etc. ]
> 
> This is a rather large buildup and may indicate a problem.  Try reducing
> the dirty buffer watermarks.  Their default values are mostly historical
> nonsense. 

You mean the vfs.(hi|lo)dirtybuffers? Will do. What would be
reasonable starting values for experimenting? 800/200?
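
I.e., on the running system, before starting the bonnie++ processes:

  sysctl vfs.lodirtybuffers=200
  sysctl vfs.hidirtybuffers=800

(lowering the low watermark first, so that it never exceeds the high
one while I change them).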

> [Helpful explanation snipped]  

> - buffer starvation for readers can happen anyway.  vfs.nbuf is the
>   number of buffers that are allocated, but not all of these can be
>   used when some file systems use buffer sizes larger than about
>   BKVASIZE.  The worst case is when all file systems use a block size of
>   64KB.  Then only about 1/4 of the allocated buffers can be used.
>   The number that can be used is limited by vfs.maxbufspace and
>   related variables.  The default for vfs.maxbufspace is controlled
>   indirectly by BKVASIZE (e.g., maxbufspace = 2GB goes with nbuf =
>   128K; the ratio is 16K = BKVASIZE).
> 
> So when you ask for buffer cache fragmentation with -b 64k, you get
> even more.  The effective nbuf is reduced by a factor of 4.  This
> means that all the dirty buffer count watermarks except one are
> slightly above the effective nbuf, so if the higher watermarks
> cannot quite be reached and when they are nearly reached readers
> are starved of buffers.

That makes sense - and bonnie++ is both reader and writer in the
rewriting phase, and thus may even "starve itself"?


> >vfs.hidirtybuffers: 52139
> >vfs.lodirtybuffers: 26069
> >(the machine has 32 GB RAM)
> 
> Fairly consistent with my examples of a machine with 24GB RAM.
> The upper watermark is 33% higher, etc.  Your nbuf will be more like
> 200K than 150K.  You reached numdirtybuffers = 52096.  At 64K each,
> that's about 3GB.  maxbufspace will be about the same, and bufspace
> (the amount used) will be not much smaller.  There will be little
> space for buffers for readers, and the 2 writers apparently manage
> to starve even each other, perhaps for similar reasons.  The backlog
> of 3GB would take 20 seconds to clear even at 150MB/sec, and at 1MB/sec
> it would take almost an hour, so it is not much better than a hang.

OK, I will experiment with these parameters when I have access to
the system again (which won't be before Thursday, unfortunately).
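
As before, I will capture the relevant counters once per second during
the runs, roughly like this:

  while :; do
          date "+%H-%M-%S"
          sysctl vfs.numdirtybuffers vfs.hidirtybuffers \
              vfs.lodirtybuffers vfs.bufspace vfs.maxbufspace
          sleep 1
  done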

Thanks again so far,

  Klaus


