From: Bruce Evans
To: Ulrich Spoerlein
Cc: current@freebsd.org, net@freebsd.org
Date: Thu, 29 Mar 2007 09:51:32 +1000 (EST)
Subject: Re: NFS write() calls lead to read() calls?
Message-ID: <20070329080917.B3626@besplex.bde.org>
In-Reply-To: <7ad7ddd90703280238r5dd3f30ftc1641926ecdf44a8@mail.gmail.com>
List-Id: Discussions about the use of FreeBSD-current

On Wed, 28 Mar 2007, Ulrich Spoerlein wrote:

> hostA # scp 500MB hostB:/net/share/
> ...
> If I run the scp again, I can see X MB/s going out from HostA, 2*X
> MB/s coming in on HostB and X MB/s out plus X MB/s in on HostC.
> What's happening is, that HostB issues one NFS READ call for every
> WRITE call. The traffic flows like this:
>
>     ----->   ----->
>   A        B        C
>            <-----
>
> If I rm(1) the file on the NFS share, then the first scp(1) will not
> show this behaviour. It is only when overwriting files, that this
> happens.

At least under FreeBSD-~5.2 with an old version of scp, this is caused
by blocksize bugs in the kernel and/or scp, and an open mode bug or
feature in scp. The blocksize used by scp is 4K. This is smaller than
the nfs block size of 8K, so nfs has to read-ahead 1 8K block for each
pair of 4K-blocks written so as to have non-garbage in the top half of
each 8K-block after writing 4K to the bottom half. It only has to
read-ahead if there is something there, but repeated scp's ensure this
by not truncating the file on open (open mode (O_WRONLY | O_CREAT)
without O_TRUNC according to truss(1)).

> The real weirdness comes into play, when I simply cp(1) from HostB
> itself like this:
>
> hostB # cp 500MB /net/share/
>
> I can do this over and over again, and _never_ get any noteworthy
> amount of NFS READ calls, only WRITE. The network traffic is also, as
> you would expect.
>
> Then I tested using ssh(1) instead of scp(1), like this:
>
> hostA # cat 500MB | ssh hostB "cat >/net/share/500MB"
>
> This works, too. Probably, because sh(1) is truncating the file?

cp truncates the file on open (open mode (O_WRONLY | O_TRUNC) without
O_CREAT according to truss(1)). cp also uses a block size of 64K, so
it wouldn't cause read-ahead even if it didn't truncate.

There are many possible wrong block sizes:
- on my server, the block size according to st_blksize is 16K (ffs
  default).
- on my client, the block size according to st_blksize is 512 due to
  bugs in nfs. There is an open PR or two about this.
  In nfs2, the file system's block size on the server is passed to the
  client for each file and used for st_blksize but nothing else, but in
  nfs3, the block size that is put in st_blksize by the client is
  hard-coded to the arbitrary (usually bad) value NFS_FABLKSIZE = 512.
  The correct block size to put in st_blksize in both cases seems to be
  the least common multiple of the nfs buffer size and the server block
  size, since if the application's i/o size is smaller than the nfs
  buffer size then there will be excessive block size conversions in
  the nfs client, and if the i/o size is smaller than the server's
  block size then there will be excessive block size conversions in the
  server's file system. nfs's buffer size is the maximum of the read
  size, the write size and the page size. This is usually 8K, so it is
  mismatched with the usual ffs server block size of 16K. The
  inefficiencies from this are less noticeable than the inefficiencies
  from a mismatch with the nfs buffer size.
- scp for some reason doesn't use the advertised best blocksize of
  st_blksize = 512. It uses 4K, which is almost as bad since it is
  smaller than the nfs buffer size.
- cp doesn't use the advertised best block size. It uses mmap() for
  regular files smaller than 8M and a hard-coded block size of
  MAXBSIZE = 64K for large regular files and all non-regular files.
- the above is in FreeBSD-~5.2 (and FreeBSD-[1-4]). st_blksize is much
  more bogus and broken in -current. In -current, the value in
  va_blocksize that is carefully initialized for regular files by ffs
  and not so carefully initialized by nfs or for non-regular files, is
  not actually used even for regular files. vn_stat() now uses the
  hard-coded (usually bad) value of PAGE_SIZE. Thus st_blksize is
  useless, and ignoring it and using a larger hard-coded value in cp is
  a feature -- MAXBSIZE is too large in many cases, but a too-large
  value normally only wastes a little space while a too-small value
  normally wastes a lot of time.
  MAXBSIZE is a good value for large files (e.g., large regular files
  and raw disks). OTOH, even PAGE_SIZE is a waste of space for slow
  devices like keyboards.
- stdio is the main thing that is naive enough to believe that
  st_blksize is still useful. The block size of BUFSIZ = 1024 in
  stdio.h is another way to get a pessimal block size, but stdio itself
  mainly uses it for strings and for what it thinks are ttys. It
  misclassifies all cdevs as ttys and thus uses a better block size
  than st_blksize = for cdevs that are actually ttys and a slightly
  worse block size than st_blksize = and a much worse block size than
  cp's MAXBSIZE for cdevs that are actually disks.

Not truncating the file in scp might be a feature for avoiding
clobbering the whole file when the copying fails early, but it doesn't
seem to be documented.

Bruce
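
The block size arithmetic in this message (4K writes into 8K nfs
buffers over an existing file, versus a truncated file or 64K writes)
can be sketched with a toy model. This is only an illustration and not
the FreeBSD nfs client code: the function names reads_needed and
suggested_blksize are hypothetical, and the model assumes sequential
writes, a cache that never evicts, and a read only when a write
partially covers a buffer that still holds old file data.

```python
from math import gcd


def reads_needed(new_size, old_size, write_size, buf_size):
    """Count buffer reads in a toy read-before-write model.

    new_size   -- bytes written sequentially from offset 0
    old_size   -- bytes already in the file (0 if truncated on open)
    write_size -- application i/o size (e.g. 4K for scp, 64K for cp)
    buf_size   -- client buffer size (8K in the message)
    """
    reads = 0
    cached = set()  # indices of buffers already held by the client
    offset = 0
    while offset < new_size:
        end = min(offset + write_size, new_size)
        for b in range(offset // buf_size, (end - 1) // buf_size + 1):
            b_start, b_end = b * buf_size, (b + 1) * buf_size
            fully_covered = offset <= b_start and end >= b_end
            has_old_data = b_start < old_size
            # A partial write over old data needs the rest of the buffer.
            if b not in cached and not fully_covered and has_old_data:
                reads += 1
            cached.add(b)
        offset = end
    return reads


def suggested_blksize(nfs_buf_size, server_blk_size):
    """Least common multiple of the two sizes, as the message proposes."""
    return nfs_buf_size * server_blk_size // gcd(nfs_buf_size, server_blk_size)


if __name__ == "__main__":
    K = 1024
    print("scp-like, no truncate:", reads_needed(64 * K, 64 * K, 4 * K, 8 * K))
    print("scp-like, truncated:  ", reads_needed(64 * K, 0, 4 * K, 8 * K))
    print("cp-like, 64K writes:  ", reads_needed(64 * K, 64 * K, 64 * K, 8 * K))
    print("suggested st_blksize: ", suggested_blksize(8 * K, 16 * K))
```

In this model, overwriting a 64K file with 4K writes costs one read per
8K buffer, i.e. one read for each pair of 4K writes, while truncating
first or writing in 64K chunks costs none.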