Date: Sat, 25 Oct 2008 22:09:17 +1100 (EST)
From: Bruce Evans
To: Thierry Herbelot
Cc: freebsd-fs@FreeBSD.org, hackers@FreeBSD.org
Subject: Re: question about sb->st_blksize in src/sys/kern/vfs_vnops.c
Message-ID: <20081025203549.C76165@delplex.bde.org>
In-Reply-To: <200810241818.37262.thierry@herbelot.com>

On Fri, 24 Oct 2008, Thierry Herbelot wrote:

> the [SUBJ] file contains the following extract (around line 705) :
>
>          * Default to PAGE_SIZE after much discussion.
>          * XXX: min(PAGE_SIZE, vp->v_bufobj.bo_bsize) may be more correct.
>          */
>
>         sb->st_blksize = PAGE_SIZE;
>
> which arrived around four years ago, with revision 1.211 (see
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_vnops.c.diff?r1=1.210;r2=1.211;f=h)

Indeed, this was completely broken long ago (in 1.211).  Before then, and
after 1.128, some cases worked as intended if not perfectly:
- regular files: file systems still set va_blksize to their idea of the
   best i/o size (normally to the file system block size, which is
   normally larger than PAGE_SIZE and probably better in all cases) and
   this was used here.  However, for regular files, the fs block size
   and the application's i/o size are almost irrelevant in most cases
   due to vfs clustering.  Most large i/o's are done physically with the
   cluster size (which due to a related bug suite ends up being
   hard-coded to MAXPHYS (128K) at a minor cost when this is different
   from the best size).
- disk files: non-broken device drivers set si_iosize_best to their idea
   of the best i/o size (normally to the max i/o size, which is normally
   better than PAGE_SIZE) and this was used here.  The bogus default of
   BLKDEV_IOSIZE was used for broken drivers (this is bogus because it
   was for the buffer cache implementation for block devices which no
   longer exist and was too small for them anyway).
- non-disk character-special files: the default of PAGE_SIZE was used.
   The comment about defaulting to PAGE_SIZE was added in 1.128 and is
   mainly for this case.  Now the comment is nonsense since the value is
   fixed, not a default.
- other file types (fifos, pipes, sockets, ...): these got the default of
   PAGE_SIZE too.

In rev.1.1, st_blksize was set to va_blksize in all cases.  So file
systems were supposed to set va_blksize reasonably in all cases, but
this is not easy and they did nothing good except for regular files.

Versions between 1.2 and 1.127 did weird things like defaulting to
DFLTPHYS (64K) for most cdevs but using a small size like BLKDEV_IOSIZE
(2K) for disks.
This gave nonsense like 64K buffers for slow tty devices (keyboards) and
2K buffers for fast disks, at least for programs that trust st_blksize
to be reasonable.  Fortunately, st_blksize is rarely used...

> the net effect of this change is to decrease the block buffer size used in
> libc/stdio from 16 kbytes (derived from the underlying ufs partition) to
> PAGE_SIZE ==4 kbytes (fixed value), and consequently the I/O bandwidth is
> lowered (this is on a slow Flash).

... except it is used by stdio.  (Another mess here is that stdio mostly
doesn't use its own BUFSIZ.  It trusts st_blksize if fstat() to determine
st_blksize works.  Of course, the existence of BUFSIZ is a related
historical mistake -- no fixed size can work best for all cases.  But
when BUFSIZ is used, it is an even worse default than PAGE_SIZE.)

It's interesting that you can see the difference.  Clustering is
especially good for hiding slowness on slow devices.  Maybe you are using
a configuration that makes clustering ineffective.  Mounting the file
system with -o sync or, equivalently, doing a sync after every
(too-small) write would do it.  Otherwise, writes are normally delayed
until the next cluster boundary.

> I have patched the kernel with a larger, fixed value (simply 4*PAGE_SIZE, to
> revert to the block size previoulsly used), and the kernel and world seem to
> be running fine.
>
> Seeing the XXX coment above, I'm a bit worried about keeping this new
> st_blksize value.
>
> are there any drawbacks with running with this bigger buffer size value ?

Mostly it doesn't matter, since buffering (clustering) hides the
differences.  Without clustering, 16K is a much better default for disks
than 4K, though not as good as the non-default va_blksize for regular
files.  Newer disks might prefer 32K or 64K, but then the fs block size
should also be increased from 16K.  Otherwise, increasing the block size
usually reduces performance, by thrashing caches or increasing latencies.
With modern cache sizes and disk speeds, you won't see these effects for
a block size of 64K, so defaulting to 64K would be reasonable for disks.
It would be silly for keyboards, but with modern memory sizes you would
notice this even less than when it was that size in old versions.

Bruce
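
The stdio behaviour described above can be checked from userland.  What
follows is only a minimal sketch in C, assuming a POSIX system;
pick_bufsize() and force_bufsize() are hypothetical helper names, not
stdio internals.  The first reports the i/o size that fstat() gives for
a stream (the value stdio trusts in preference to BUFSIZ), and the second
uses the standard setvbuf() to override it for one stream, which is a
per-program alternative to changing the kernel's st_blksize.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Return the i/o size that fstat() reports for an open stream, or
 * 'fallback' if fstat() fails or reports a nonpositive st_blksize.
 * Hypothetical helper for illustration; this is not stdio's own code.
 */
static size_t
pick_bufsize(FILE *fp, size_t fallback)
{
	struct stat sb;

	if (fstat(fileno(fp), &sb) != 0 || sb.st_blksize <= 0)
		return (fallback);
	return ((size_t)sb.st_blksize);
}

/*
 * Give 'fp' a fully buffered stdio buffer of 'size' bytes, overriding
 * whatever size stdio would otherwise choose.  Must be called before
 * the first i/o on the stream.  Returns 0 on success.
 */
static int
force_bufsize(FILE *fp, size_t size)
{
	return (setvbuf(fp, NULL, _IOFBF, size));
}
```

Calling force_bufsize(fp, 16 * 1024) right after fopen() would give that
stream a 16K buffer whatever st_blksize says.  Whether that helps depends
on how effective clustering is for the file system in question.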