Date:      Mon, 12 Apr 2010 19:02:10 +0300
From:      Andriy Gapon <avg@freebsd.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        arch@freebsd.org, Rick Macklem <rmacklem@uoguelph.ca>
Subject:   Re: (in)appropriate uses for MAXBSIZE
Message-ID:  <4BC34402.1050509@freebsd.org>
In-Reply-To: <20100411114405.L10562@delplex.bde.org>
References:  <4BBEE2DD.3090409@freebsd.org>	<Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca>	<4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org>

on 11/04/2010 05:56 Bruce Evans said the following:
> On Fri, 9 Apr 2010, Andriy Gapon wrote:
[snip]
>> I have lightly tested this under qemu.
>> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s.
>> I removed size > MAXBSIZE check in getblk (see a parallel thread
>> "panic: getblk: size(%d) > MAXBSIZE(%d)").
> 
> Did you change the other known things that depend on this?  There is the
> b_pages limit of MAXPHYS bytes which should be checked for in another
> way

I changed the check the way I described in the parallel thread.

> and the soft limits for hibufspace and lobufspace which only matter
> under load conditions.

And what should these be?
hibufspace and lobufspace seem to be auto-calculated.  One thing that I noticed,
and that was a direct cause of the problem described below, is that the difference
between hibufspace and lobufspace should be at least the maximum block size
allowed in getblk() (perhaps it should be exactly equal to that value?).
So in my case I had to make that difference MAXPHYS.
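
To make the constraint above concrete, here is a minimal userland sketch of the
check.  It is not kernel code: the helper names and the SKETCH_MAXPHYS constant
are hypothetical, and the only thing taken from the discussion is the rule that
the gap between the two watermarks must be able to hold one block of the
largest size that getblk() accepts.

```c
#include <stdbool.h>

/* Illustrative value only: 1MB, matching the MAXPHYS used in the experiment. */
#define SKETCH_MAXPHYS	(1024L * 1024L)

/*
 * Hypothetical helper: true when the gap between the high and low
 * watermarks can hold at least one block of the largest allowed size.
 */
static bool
bufspace_gap_ok(long hibufspace, long lobufspace, long maxblksize)
{
	return (hibufspace - lobufspace >= maxblksize);
}

/*
 * Hypothetical helper: derive a low watermark that leaves a gap of
 * exactly one maximum-sized block below the high watermark.
 */
static long
lobufspace_for(long hibufspace, long maxblksize)
{
	return (hibufspace - maxblksize);
}
```

With a gap sized for 64K blocks and a 1MB maximum block, bufspace_gap_ok()
returns false, which is the situation that had to be fixed here.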

>> And I bumped MAXPHYS to 1MB.
>>
>> Some results.
>> I got no panics, data was read correctly and system remained stable,
>> which is good.
>> But I observed reading process (dd bs=1m on avgfs) spending a lot of
>> time sleeping on needsbuffer in getnewbuf.  needsbuffer value was
>> VFS_BIO_NEED_ANY.
>> Apparently there was some shortage of free buffers.
>> Perhaps some limits/counts were incorrectly auto-tuned.
> 
> This is not surprising, since even 64K is 4 times too large to work
> well.  Buffer sizes of larger than BKVASIZE (16K) always cause
> fragmentation of buffer kva.  Recovering from fragmentation always
> takes a lot of CPU, and if you are unlucky it will also take a lot of
> real time (stalling waiting for free buffer kva).  Buffer sizes larger
> than BKVASIZE also reduce the number of available buffers significantly
> below the number of buffers configured.  This mainly takes a lot of
> CPU to reconstitute buffers.  BKVASIZE being less than MAXBSIZE is a
> hack to reduce the amount of kva statically allocated for buffers for
> systems that cannot support enough kva to work right (mainly i386's).
> It only works well when it is not actually used (when all buffers have
> size <= BKVASIZE = 16K, as would be enforced by reducing MAXBSIZE to
> BKVASIZE).  This hack and the complications to support it are bogus on
> systems that support enough kva to work right.

So, BKVASIZE is the best read size from the point of view of buffer space usage?
E.g. a single MAXBSIZE=64K read results in a single 64K GEOM read request, but
leads to buffer space map fragmentation, because size > BKVASIZE.
On the other hand, four sequential reads of BKVASIZE=16K bytes are perfect from
the buffer space point of view (no fragmentation potential), but they result in
4 GEOM I/O requests.
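
The trade-off can be written down as simple arithmetic.  The sketch below is
userland C with hypothetical helper names; the 16K and 64K values are the
BKVASIZE and MAXBSIZE figures quoted in this thread, hard-coded here only for
illustration.

```c
#include <stddef.h>

/* Values as quoted in this thread; hard-coded for illustration only. */
#define SKETCH_BKVASIZE	(16 * 1024)
#define SKETCH_MAXBSIZE	(64 * 1024)

/* Number of I/O requests needed to read 'total' bytes in 'bsize' chunks. */
static size_t
io_requests(size_t total, size_t bsize)
{
	return ((total + bsize - 1) / bsize);
}

/* Non-zero when a buffer of this size is larger than one BKVASIZE slot. */
static int
exceeds_kva_slot(size_t bsize)
{
	return (bsize > SKETCH_BKVASIZE);
}
```

For a 64K transfer, a 64K block size gives one request but exceeds the kva
slot, while a 16K block size fits the slot but needs four requests.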
The thing is that a single read requires a single contiguous chunk of virtual
address space.  Would it be possible to get the best of both worlds by somehow
allowing a single large I/O request to work with several buffers (with
b_kvasize == BKVASIZE) in an iovec-like style?
Have I just reinvented the bicycle? :)
Probably not, because the answer to my question is probably 'no (without lots of
work in lots of places)' as well.
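
As a userland analogy only (this is not how the buffer cache works today), the
POSIX readv() call shows what "iovec-like" means: one read request fills
several separately allocated 16K buffers.  The scatter_read() helper and the
SKETCH_* constants are hypothetical names introduced for this sketch.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define SKETCH_BKVASIZE	(16 * 1024)
#define SKETCH_NBUFS	4

/*
 * Issue ONE read request that scatters its data into SKETCH_NBUFS
 * independent buffers of SKETCH_BKVASIZE bytes each.  The buffers do
 * not need to be adjacent in the address space.
 * Returns the number of bytes read, or -1 on error (as readv() does).
 */
static ssize_t
scatter_read(int fd, char *bufs[SKETCH_NBUFS])
{
	struct iovec iov[SKETCH_NBUFS];
	int i;

	for (i = 0; i < SKETCH_NBUFS; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = SKETCH_BKVASIZE;
	}
	return (readv(fd, iov, SKETCH_NBUFS));
}
```

The caller sees a single 64K transfer, yet no buffer is larger than 16K, which
is the property asked about above.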

I see that breadn() certainly doesn't work that way.  As I understand it, it works
like bread() for one block, plus it starts something like asynchronous bread()s
for a given number of other blocks.

I am not sure about the details of how cluster_read() works, though.
Could you please explain the essence of it?
Thank you!

Perhaps there are other approaches to the fragmentation issue, for example,
using some sort of zones for different block sizes.  But all of that adds
complications and takes performance away from the easy cases.
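
Just to illustrate what "zones for different block sizes" could mean, here is a
hypothetical size-class mapping in userland C.  Nothing like size_zone() exists
in the kernel as far as this message is concerned; the power-of-two zones
between 16K and 64K are an assumption made only for the sketch.

```c
#include <stddef.h>

#define SKETCH_BKVASIZE	(16 * 1024)
#define SKETCH_MAXBSIZE	(64 * 1024)

/*
 * Map a block size to a hypothetical zone index:
 *   zone 0: up to 16K, zone 1: up to 32K, zone 2: up to 64K.
 * Returns -1 for a zero size or a size above SKETCH_MAXBSIZE.
 */
static int
size_zone(size_t bsize)
{
	size_t zsize = SKETCH_BKVASIZE;
	int zone = 0;

	if (bsize == 0 || bsize > SKETCH_MAXBSIZE)
		return (-1);
	while (zsize < bsize) {
		zsize *= 2;
		zone++;
	}
	return (zone);
}
```

Buffers within one zone would all have the same kva size, so freeing one always
leaves a hole that the next buffer of that zone can reuse.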
-- 
Andriy Gapon


