From owner-freebsd-current@FreeBSD.ORG  Wed Mar 28 23:51:37 2007
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
X-Original-To: current@freebsd.org
Delivered-To: freebsd-current@FreeBSD.ORG
Date: Thu, 29 Mar 2007 09:51:32 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@besplex.bde.org
To: Ulrich Spoerlein <uspoerlein@gmail.com>
In-Reply-To: <7ad7ddd90703280238r5dd3f30ftc1641926ecdf44a8@mail.gmail.com>
Message-ID: <20070329080917.B3626@besplex.bde.org>
References: <7ad7ddd90703280238r5dd3f30ftc1641926ecdf44a8@mail.gmail.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Mailman-Approved-At: Thu, 29 Mar 2007 00:52:48 +0000
Cc: current@freebsd.org, net@freebsd.org
Subject: Re: NFS write() calls lead to read() calls?

On Wed, 28 Mar 2007, Ulrich Spoerlein wrote:

> hostA # scp 500MB hostB:/net/share/
> ...
> If I run the scp again, I can see X MB/s going out from HostA, 2*X
> MB/s coming in on HostB and X MB/s out plus X MB/s in on HostC. What's
> happening is, that HostB issues one NFS READ call for every WRITE
> call. The traffic flows like this:
>
> ----->   ----->
> A        B        C
>          <-----
>
> If I rm(1) the file on the NFS share, then the first scp(1) will not
> show this behaviour. It is only when overwriting files that this
> happens.

At least under FreeBSD-~5.2 with an old version of scp, this is caused
by blocksize bugs in the kernel and/or scp, and an open mode bug or
feature in scp.  The blocksize used by scp is 4K.  This is smaller
than the nfs block size of 8K, so nfs has to read-ahead 1 8K block for
each pair of 4K blocks written so as to have non-garbage in the top
half of each 8K block after writing 4K to the bottom half.  It only
has to read-ahead if there is something there, but repeated scp's
ensure this by not truncating the file on open (open mode (O_WRONLY |
O_CREAT) without O_TRUNC according to truss(1)).
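
The arithmetic can be sketched with a toy model.  This is not the nfs
client code; count_prereads() and its parameters are made up for
illustration, and it assumes a strictly sequential overwrite of a file
of unchanged size, with only the most recently touched buffer staying
cached:

```c
/*
 * Toy model: count the buffers that would have to be fetched before a
 * sequential overwrite can fill them only partially.
 */
static unsigned long
count_prereads(unsigned long file_size, unsigned long buf_size,
    unsigned long write_size, int truncated)
{
	unsigned long off, end, blk, first, last, bs, be;
	unsigned long reads = 0;
	unsigned long cached = (unsigned long)-1;	/* nothing cached yet */

	/* A truncated file has no old data to preserve. */
	if (truncated || buf_size == 0 || write_size == 0)
		return (0);
	for (off = 0; off < file_size; off += write_size) {
		end = off + write_size;
		if (end > file_size)
			end = file_size;
		first = off / buf_size;
		last = (end - 1) / buf_size;
		for (blk = first; blk <= last; blk++) {
			bs = blk * buf_size;
			be = bs + buf_size;
			if (be > file_size)
				be = file_size;
			/* Old data survives in this buffer, so fetch it first. */
			if ((off > bs || end < be) && blk != cached)
				reads++;
			cached = blk;
		}
	}
	return (reads);
}
```

In this model, count_prereads(524288000, 8192, 4096, 0) gives 64000,
i.e., one 8K read per pair of 4K writes for a 500MB file, while a write
size of 8K or more that is a multiple of 8K, or truncated != 0, gives 0.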

> The real weirdness comes into play, when I simply cp(1) from HostB
> itself like this:
>
> hostB # cp 500MB /net/share/
>
> I can do this over and over again, and _never_ get any noteworthy
> amount of NFS READ calls, only WRITE. The network traffic is also, as
> you would expect.
>
> Then I tested using ssh(1) instead of scp(1), like this:
>
> hostA # cat 500MB | ssh hostB "cat >/net/share/500MB"
>
> This works, too. Probably, because sh(1) is truncating the file?

cp truncates the file on open (open mode (O_WRONLY | O_TRUNC) without
O_CREAT according to truss(1)).  cp also uses a block size of 64K, so
it wouldn't cause read-ahead even if it didn't truncate.
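
The difference between the two open modes is easy to see in isolation.
A minimal sketch, using only the flags reported by truss(1) above; the
helper names are invented here and are not taken from the scp or cp
sources:

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Open mode seen for scp: old contents are kept until overwritten. */
static int
open_like_scp(const char *path)
{
	return (open(path, O_WRONLY | O_CREAT, 0644));
}

/* Open mode seen for cp on an existing file: old contents are discarded. */
static int
open_like_cp(const char *path)
{
	return (open(path, O_WRONLY | O_TRUNC));
}

/* Size of the file behind fd, or -1 on error. */
static off_t
size_of_fd(int fd)
{
	struct stat sb;

	if (fstat(fd, &sb) == -1)
		return (-1);
	return (sb.st_size);
}
```

On an existing 500MB file, size_of_fd() right after open_like_scp()
should still report the old size, so every partially written buffer has
old data behind it.  After open_like_cp() it should report 0, so there
is nothing to read back.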

There are many possible wrong block sizes:

- on my server, the block size according to st_blksize is 16K (ffs default).

- on my client, the block size according to st_blksize is 512 due to bugs
   in nfs.  There is an open PR or two about this.  In nfs2, the file
   system's block size on the server is passed to the client for each file
   and used for st_blksize but nothing else, but in nfs3, the block
   size that is put in st_blksize by the client is hard-coded to the
   arbitrary (usually bad) value NFS_FABLKSIZE = 512.  The correct block
   size to put in st_blksize in both cases seems to be the least common
   multiple of the nfs buffer size and the server block size, since if
   the application's i/o size is smaller than the nfs buffer size then
   there will be excessive block size conversions in the nfs client,
   and if the i/o size is smaller than the server's block size then
   there will be excessive block size conversions in the server's file
   system.  nfs's buffer size is the maximum of the read size, the write
   size and the page size.  This is usually 8K, so it is mismatched
   with the usual ffs server block size of 16K.  The inefficiencies
   from this are less noticeable than the inefficiencies from a mismatch
   with the nfs buffer size.

- scp for some reason doesn't use the advertised best blocksize of
   st_blksize = 512.  It uses 4K, which is almost as bad since it is
   smaller than the nfs buffer size.

- cp doesn't use the advertised best block size.  It uses mmap() for
   regular files smaller than 8M and a hard-coded block size of
   MAXBSIZE = 64K for large regular files and all non-regular files.

- the above is in FreeBSD-~5.2 (and FreeBSD-[1-4]).  st_blksize is
   much more bogus and broken in -current.  In -current, the value in
   va_blocksize that is carefully initialized for regular files by ffs
   and not so carefully initialized by nfs or for non-regular files,
   is not actually used even for regular files.  vn_stat() now uses the
   hard-coded (usually bad) value of PAGE_SIZE.  Thus st_blksize is
   useless, and ignoring it and using a larger hard-coded value in cp
   is a feature -- MAXBSIZE is too large in many cases, but a too-large
   value normally only wastes a little space while a too-small value
   normally wastes a lot of time.  MAXBSIZE is a good value for large
   files (e.g., large regular files and raw disks).  OTOH, even PAGE_SIZE
   is a waste of space for slow devices like keyboards.

- stdio is the main thing that is naive enough to believe that st_blksize
   is still useful.  The block size of BUFSIZ = 1024 in stdio.h is another
   way to get a pessimal block size, but stdio itself mainly uses it for
   strings and for what it thinks are ttys.  It misclassifies all cdevs as ttys
   and thus uses a better block size than st_blksize = <kernel nonsense>
   for cdevs that are actually ttys and a slightly worse block size than
   st_blksize = <kernel nonsense> and a much worse block size than cp's
   MAXBSIZE for cdevs that are actually disks.
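
The st_blksize value suggested above (the least common multiple of the
nfs buffer size and the server block size) could be computed roughly as
follows.  This is only a sketch of the arithmetic, not kernel code; the
function names are hypothetical and the sizes are assumed to be passed
in by the caller:

```c
/* Greatest common divisor by Euclid's algorithm. */
static unsigned long
gcd_ul(unsigned long a, unsigned long b)
{
	unsigned long t;

	while (b != 0) {
		t = a % b;
		a = b;
		b = t;
	}
	return (a);
}

/* nfs buffer size: the maximum of the read size, write size and page size. */
static unsigned long
nfs_buf_size(unsigned long rsize, unsigned long wsize,
    unsigned long page_size)
{
	unsigned long m = rsize;

	if (wsize > m)
		m = wsize;
	if (page_size > m)
		m = page_size;
	return (m);
}

/* Least common multiple of the nfs buffer size and the server block size. */
static unsigned long
suggested_blksize(unsigned long nfs_bufsize, unsigned long server_bsize)
{
	if (nfs_bufsize == 0 || server_bsize == 0)
		return (0);
	return (nfs_bufsize / gcd_ul(nfs_bufsize, server_bsize) * server_bsize);
}
```

With the usual values above, nfs_buf_size(8192, 8192, 4096) is 8K and
suggested_blksize(8192, 16384) is 16K, so an application that honored
st_blksize would avoid block size conversions on both the client and
the server.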

Not truncating the file in scp might be a feature for avoiding clobbering
the whole file when the copying fails early, but it doesn't seem to be
documented.

Bruce