Date:      Thu, 26 May 2005 09:02:59 -0400
From:      Sven Willenberger <sven@dmv.com>
To:        Bruce Evans <bde@zeta.org.au>
Cc:        freebsd-amd64@FreeBSD.org
Subject:   Re: BKVASIZE for large block-size filesystems
Message-ID:  <1117112579.15065.30.camel@lanshark.dmv.com>
In-Reply-To: <20050526090743.S75084@delplex.bde.org>
References:  <1117055183.13183.57.camel@lanshark.dmv.com> <20050526090743.S75084@delplex.bde.org>

On Thu, 2005-05-26 at 10:38 +1000, Bruce Evans wrote:
> On Wed, 25 May 2005, Sven Willenberger wrote:
> 
> > [originally posted to freebsd-stable, realized that some amd64-specific
> > info may be needed here too]
> 
> It's not very amd64-specific due to bugs.  BKVASIZE and algorithms that
> use it are tuned for i386's.  This gives mistuning for arches that have
> more kernel virtual address space.
> 
> > FreeBSD5.4-Stable amd64 on a dual-opteron system with LSI-Megaraid 400G+
> > partition. The filesystem was created with: newfs -b 65536 -f 8192 -e
> > 15835 /dev/amrd2s1d
> >
> > This is the data filesystem for a PostgreSQL database; as the default
> > page size (files) is 8k, the above newfs scheme has 8k fragments which
> > should fit nicely with the PostgreSQL page size. Now by default param.h
> 
> Fragments don't work very well.  It might be better to fit files to the
> block size.  If all files had size 8K, then -b 8192 -f 8192 would work
> best (slightly better than -b 8192 -f 1024, which in turn is slightly
> better than the current defaults, and all much better than -b 65536
> -f 8192).
> 
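For anyone else following along, the manpage constraint I was worried
about is just that the block:fragment ratio be a power of two no greater
than 8; a quick sh sanity check of the combinations discussed in this
thread (nothing newfs-specific is actually run here):

```shell
# newfs(8) requires bsize/fsize to be 1, 2, 4, or 8.
check_ratio() {
    bsize=$1; fsize=$2
    ratio=$((bsize / fsize))
    case $ratio in
        1|2|4|8) echo "-b $bsize -f $fsize: ratio $ratio (valid)";;
        *)       echo "-b $bsize -f $fsize: ratio $ratio (invalid)";;
    esac
}

check_ratio 8192 8192     # Bruce's suggestion: ratio 1 is allowed
check_ratio 8192 1024     # the classic 8:1 layout
check_ratio 65536 8192    # what I used: ratio 8, also allowed
check_ratio 65536 1024    # ratio 64: newfs would reject this
```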

Oh, how I wish I had known that prior to creating the filesystem. I
wanted to avoid -b 8192 -f 1024 because of the small fragment size; I
had assumed that a fragment size matching the page size used by the
database would be ideal. Since the manpages seem to imply that anything
other than an 8:1 ratio of block size to fragment size would be
detrimental, I stayed away from -b 8192 -f 8192. I am curious as to what
the concept behind fragments is then (versus my picture) and why they
"don't work very well" ...

> > defines BKVASIZE as 16384 (which has been pointed out in other posts as
> > being *not* twice the default blocksize of 16k). I have modified it to
> > be set at 32768 but still see a high and increasing value of
> > vfs.bufdefragcnt which makes sense given the blocksize of the major
> > filesystem in use.
> 
> Yes, a block size larger than BKVASIZE will cause lots of fragmentation.
> I'm not sure if this is still a large pessimization.
> 
> > My question is are there any caveats about increasing BKVASIZE to 65536?
> > The system has 8G of RAM and I understand that nbufs decreases with
> > increasing BKVASIZE;
> 
> The decrease in nbufs is a bug.  It defeats half of the point of increasing
> BKVASIZE: if most buffers have size 64K, then increasing BKVASIZE from 16K
> to 64K gives approximately nbuf/4 buffers all of size 64K instead of nbuf
> buffers, with nbuf/4 of them of size 64K and 3*nbuf/4 of them unusable.
> Thus it avoids some resource wastage at a cost of possibly not using enough
> resources for effective caching.  However, little is lost if most buffers
> have size 64K.  Then the reduced nbuf consumes all of the kva resources that
> we are willing to allocate.  The problem is when file systems are mixed and
> ones with a block size of 64K are not used much or at all.  The worst case
> is when all blocks have size 512, which can happen for msdosfs.  Then up
> to (BKVASIZE - 512) / BKVASIZE of the kva resource is wasted (> 99% for
> BKVASIZE = 65536 but only 97% for BKVASIZE = 16384).
> 
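Those worst-case numbers are easy to re-derive with a little shell
arithmetic (integer math, so the percentages round down):

```shell
# Fraction of buffer KVA wasted when every buffer holds a single
# 512-byte block but still occupies a full BKVASIZE window.
waste_pct() {
    bkvasize=$1
    echo $(( (bkvasize - 512) * 100 / bkvasize ))
}

waste_pct 65536   # just over 99% wasted
waste_pct 16384   # just under 97% wasted
```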
> To fix the bug, change BKVASIZE in kern_vfs_bio_buffer_alloc() to 16384
> and consider adjusting the bcache tunable (see below).
> 

Ahh, so this is a literal replacement of the word "BKVASIZE" in that
function with "16384". I am assuming that I can leave the other
instances of BKVASIZE and BKVAMASK in that file (vfs_bio.c) alone then?

> > how can I either determine if the resulting nbufs
> > will be sufficient or calculate what is needed based on RAM and system
> > usage?
> 
> nbuf is not directly visible except using a debugger, but vfs.maxbufspace
> gives it indirectly -- divide the latter by BKVASIZE to get nbuf.  A few
> thousand for it is plenty.
> 
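So, as a sketch (the maxbufspace value below is a made-up sample chosen
to reproduce Bruce's ~7000; on a live system substitute the output of
`sysctl -n vfs.maxbufspace`):

```shell
# nbuf is roughly vfs.maxbufspace / BKVASIZE.
maxbufspace=114688000        # sample value; use sysctl -n vfs.maxbufspace
bkvasize=16384               # BKVASIZE from sys/param.h
nbuf=$((maxbufspace / bkvasize))
echo "approximate nbuf: $nbuf"
```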
> I used to use BKVASIZE = 65536, and fixed the bug as above, and also doubled
> nbuf in kern_vfs_bio_buffer_alloc(), and also configured VM_BCACHE_SIZE_MAX
> to 512M so that the elevated nbuf was actually used, but the need for
> significantly increasing the default nbuf (at least with BKVASIZE = 16384)
> went away many years ago when memory sizes started exceeding 256M or so.
> My doubling of nbuf broke a few years later when memory sizes started
> exceeding 1GB.  i386's just don't have enough virtual address space to use
> a really large nbuf, so when there is enough physical memory the default
> nbuf is as large as possible.  I was only tuning BKVASIZE and
> VM_BCACHE_SIZE_MAX to benchmark file systems with large block sizes, but
> the performance with large block sizes was poor even with this tuning so
> I lost interest in it.  Now I just use the defaults and the bug fix
> reduces to a spelling change.  nbuf defaults to about 7000 on my machines
> with 1GB of memory.  This is plenty.  With BKVASIZE = 64K and without the
> fix, it would be 1/4 as much, which seems a little low.
> 
> nbuf is also limited by kernel virtual memory.  amd64's have more (I'm not
> sure how much), and they should have so much more that the bcache part
> is effectively infinity, but it is or was actually only twice as much
> as on i386's (default VM_BCACHE_SIZE_MAX = 200MB on i386's and 400MB
> on amd64's).  Even i386's can spare more provided the memory is not
> needed for other things, e.g., networking.  The default of 400MB on
> amd64's combined with BKVASIZE = 64K gives a limit on nbuf of 400MB/64K =
> 6400, which is plenty, so you shouldn't need to change the bcache tunable.
> 

I shall leave that tunable alone then.

> > Also, will increasing BKVASIZE require a complete make buildworld or, if
> > not, how can I remake the portions of system affected by BKVASIZE?
> 
> It's not a properly supported option, so the way to change it is to
> edit it in the sys/param.h source file.  After changing it there,
> then everything will be rebuilt as necessary by makeworld and/or
> rebuilding kernels.  Unfortunately, almost everything will be rebuilt
> because too many things depend on sys/param.h.  When testing
> changes to BKVASIZE, I used to cheat by preserving the timestamp of
> sys/param.h and manually recompiling only the necessary things.  Very
> little depends on BKVASIZE.  IIRC, there used to be 2 object files
> per kernel, but now there is only 1 (vfs_bio.o).
> 
> Bruce
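For the archives, the timestamp cheat boils down to touch -r; here is a
self-contained demonstration on scratch files (the #define line is just
a stand-in for the real edit to sys/param.h):

```shell
# The trick: save a file's timestamp, edit the file, then restore
# the timestamp so make(1) never notices the change.
f=$(mktemp)
stamp=$(mktemp)
touch -r "$f" "$stamp"                  # record the original mtime
sleep 1
echo '#define BKVASIZE 65536' >> "$f"   # stand-in for the real edit
touch -r "$stamp" "$f"                  # put the old mtime back

preserved=no
[ "$f" -nt "$stamp" ] || preserved=yes  # -nt is false when mtimes match
echo "timestamp preserved: $preserved"
rm -f "$f" "$stamp"
```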

Sounds good; I appreciate the input and the explanations -- really
cleared up a good bit of stuff for me. Thanks,

Sven



