Date:      Sat, 25 Oct 2008 22:09:17 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Thierry Herbelot <thierry@herbelot.com>
Cc:        freebsd-fs@FreeBSD.org, hackers@FreeBSD.org
Subject:   Re: question about sb->st_blksize in src/sys/kern/vfs_vnops.c
Message-ID:  <20081025203549.C76165@delplex.bde.org>
In-Reply-To: <200810241818.37262.thierry@herbelot.com>
References:  <200810241818.37262.thierry@herbelot.com>

On Fri, 24 Oct 2008, Thierry Herbelot wrote:

> the [SUBJ] file contains the following extract (around line 705) :
>
>     * Default to PAGE_SIZE after much discussion.
>     * XXX: min(PAGE_SIZE, vp->v_bufobj.bo_bsize) may be more correct.
>     */
>
>    sb->st_blksize = PAGE_SIZE;
>
> which arrived around four years ago, with revision 1.211 (see
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_vnops.c.diff?r1=1.210;r2=1.211;f=h)

Indeed, this was completely broken long ago (in 1.211).  Before then, and
after 1.128, some cases worked as intended if not perfectly:
- regular files: file systems still set va_blksize to their idea of the
   best i/o size (normally to the file system block size, which is
   normally larger than PAGE_SIZE and probably better in all cases) and
   this was used here.  However, for regular files, the fs block size
   and the application's i/o size are almost irrelevant in most cases
   due to vfs clustering.  Most large i/o's are done physically with
   the cluster size (which due to a related bug suite ends up being
   hard-coded to MAXPHYS (128K) at a minor cost when this is different
   from the best size).
- disk files: non-broken device drivers set si_iosize_best to their idea
   of the best i/o size (normally to the max i/o size, which is normally
   better than PAGE_SIZE) and this was used here.  The bogus default
   of BLKDEV_IOSIZE was used for broken drivers (this is bogus because it
   was for the buffer cache implementation for block devices which no
   longer exist and was too small for them anyway).
- non-disk character-special files: the default of PAGE_SIZE was used.
   The comment about defaulting to PAGE_SIZE was added in 1.128 and is
   mainly for this case.  Now the comment is nonsense since the value is
   fixed, not a default.
- other file types (fifos, pipes, sockets, ...): these got the default of
   PAGE_SIZE too.
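
As a rough user-space sketch of the per-file-type selection described in the
list above (this is not the kernel code): the enum, the function name and the
SKETCH_* constants are hypothetical, 4K pages are assumed, 2K is the
BLKDEV_IOSIZE figure quoted in this message, and the PAGE_SIZE fallback for a
regular file with no va_blksize is my assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only; the real ones depend on platform and version. */
#define SKETCH_PAGE_SIZE	4096L	/* assumed 4K pages */
#define SKETCH_BLKDEV_IOSIZE	2048L	/* the 2K figure quoted in this message */

/* Hypothetical file-type tags standing in for the vnode types. */
enum sketch_ftype { SK_REG, SK_DISK, SK_CHR, SK_OTHER };

/*
 * Model of the behaviour between 1.128 and 1.210 as described above.
 * va_blksize:  the file system's preferred i/o size, 0 if not set.
 * iosize_best: the driver's si_iosize_best, 0 if the driver did not set it.
 */
long
sketch_blksize(enum sketch_ftype type, long va_blksize, long iosize_best)
{
	assert(va_blksize >= 0 && iosize_best >= 0);
	switch (type) {
	case SK_REG:
		/* Falling back to PAGE_SIZE here is an assumption. */
		return (va_blksize > 0 ? va_blksize : SKETCH_PAGE_SIZE);
	case SK_DISK:
		/* Drivers that set nothing got the bogus BLKDEV_IOSIZE. */
		return (iosize_best > 0 ? iosize_best : SKETCH_BLKDEV_IOSIZE);
	case SK_CHR:
	case SK_OTHER:
	default:
		return (SKETCH_PAGE_SIZE);
	}
}

/* Model of 1.211 and later: the same fixed value for every file type. */
long
sketch_blksize_fixed(void)
{
	return (SKETCH_PAGE_SIZE);
}
```

With these assumptions, a regular file on a 16K-block file system yields 16384
from sketch_blksize() but 4096 from sketch_blksize_fixed().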

In rev.1.1, st_blksize was set to va_blksize in all cases.  So file systems
were supposed to set va_blksize reasonably in all cases, but this is not
easy and they did nothing good except for regular files.

Versions between 1.2 and 1.127 did weird things like defaulting to DFLTPHYS
(64K) for most cdevs but using a small size like BLKDEV_IOSIZE (2K) for disks.
This gave nonsense like 64K buffers for slow tty devices (keyboards) and
2K buffers for fast disks, at least for programs that trust st_blksize
to be reasonable.  Fortunately, st_blksize is rarely used...

> the net effect of this change is to decrease the block buffer size used in
> libc/stdio from 16 kbytes (derived from the underlying ufs partition) to
> PAGE_SIZE ==4 kbytes (fixed value), and consequently the I/O bandwidth is
> lowered (this is on a slow Flash).

... except it is used by stdio.  (Another mess here is that stdio mostly
doesn't use its own BUFSIZ.  It trusts st_blksize if the fstat() call used
to obtain it succeeds.  Of course, the existence of BUFSIZ is a related
historical mistake -- no fixed size can work best for all cases.  But
when BUFSIZ is used, it is an even worse default than PAGE_SIZE.)
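
An application that cares can sidestep the kernel's choice.  The following is
a minimal sketch, not stdio's actual code: pick_bufsize() and use_bufsize()
are hypothetical helper names.  The first mimics the "trust st_blksize, else
BUFSIZ" policy described above; the second overrides it with setvbuf().

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: the buffer size that a "trust st_blksize if fstat()
 * works, else BUFSIZ" policy would choose for the descriptor fd.
 */
size_t
pick_bufsize(int fd)
{
	struct stat sb;

	if (fstat(fd, &sb) == 0 && sb.st_blksize > 0)
		return ((size_t)sb.st_blksize);
	return (BUFSIZ);
}

/*
 * Hypothetical helper: give the stream a caller-supplied, fully buffered
 * buffer.  Call it before any i/o on fp; buf must stay valid until fp is
 * closed.  Returns 0 on success, like setvbuf().
 */
int
use_bufsize(FILE *fp, char *buf, size_t size)
{
	assert(fp != NULL && buf != NULL && size > 0);
	return (setvbuf(fp, buf, _IOFBF, size));
}
```

For example, a program writing to slow flash could pass a static 16K array to
use_bufsize() immediately after fopen() to get the old buffer size back
without patching the kernel.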

It's interesting that you can see the difference.  Clustering is especially
good for hiding slowness on slow devices.  Maybe you are using a configuration
that makes clustering ineffective.  Mounting the file system with -o sync
or equivalently, doing a sync after every (too-small) write would do it.
Otherwise, writes are normally delayed until the next cluster boundary.
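
To put numbers on the too-small-write case, here is a small sketch of the
arithmetic.  It assumes the simplest model, in which a fully buffered stream
issues one write() per full buffer plus one for any remainder; writes_needed()
is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of write() calls needed to push `total` bytes through a buffer of
 * `bufsize` bytes, i.e. total / bufsize rounded up.
 */
size_t
writes_needed(size_t total, size_t bufsize)
{
	assert(bufsize > 0);
	return (total / bufsize + (total % bufsize != 0));
}
```

Under this model a 1 MB (1048576-byte) file takes 256 writes with a 4K buffer
and 64 with a 16K buffer.  When each write reaches the device separately
because clustering is defeated, that factor of 4 is visible; with clustering
it is mostly hidden.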

> I have patched the kernel with a larger, fixed value (simply 4*PAGE_SIZE, to
> revert to the block size previously used), and the kernel and world seem to
> be running fine.
>
> Seeing the XXX comment above, I'm a bit worried about keeping this new
> st_blksize value.
>
> are there any drawbacks with running with this bigger buffer size value ?

Mostly it doesn't matter, since buffering (clustering) hides the differences.
Without clustering, 16K is a much better default for disks than 4K, though
not as good as the non-default va_blksize for regular files.  Newer disks
might prefer 32K or 64K, but then the fs block size should also be increased
from 16K.  Otherwise, increasing the block size usually reduces performance,
by thrashing caches or increasing latencies.  With modern cache sizes and disk
speeds, you won't see these effects for a block size of 64K, so defaulting to
64K would be reasonable for disks.  It would be silly for keyboards, but with
modern memory sizes you would notice this even less than you did when 64K was
the default in old versions.

Bruce


