From owner-freebsd-fs@freebsd.org Mon Mar 28 23:59:35 2016
Date: Tue, 29 Mar 2016 10:59:26 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Bruce Evans
Cc: fs@freebsd.org, rmacklem@freebsd.org, glebius@freebsd.org
Subject: Re: nfs pessimized by vnode pages changes
In-Reply-To: <20160327144755.Y4269@besplex.bde.org>
Message-ID: <20160329090209.Q1020@besplex.bde.org>
References: <20160327144755.Y4269@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
On Sun, 27 Mar 2016, Bruce Evans wrote:

> I debugged another pessimization of nfs.
>
> ncl_getpages() is now almost always called with a count of 1 page, due
> to r292373 changing the count from faultcount to 1 in vm_fault().  The
> only exception seems to be the initial pagein for exec -- this is
> still normally the nfs rsize.  ncl_getpages() doesn't do any readahead
> like vnode_pager_generic_getpages() does, so it normally does read
> RPCs of size 1 page instead of the nfs rsize.  This gives the
> following increases in read RPCs for makeworld of an old world:
> - with rsize = 16K, from 24k to 39k (the worst case would be 4 times
>   as many)
> - with rsize = 8K, from 39k to 44k (the worst case would be 2 times
>   as many).
>
> Also, nfs_getpages() has buggy logic which works accidentally if the
> count is 1:
>
> X diff -c2 ./fs/nfsclient/nfs_clbio.c~ ./fs/nfsclient/nfs_clbio.c
> X *** ./fs/nfsclient/nfs_clbio.c~	Sun Mar 27 01:31:38 2016
> X --- ./fs/nfsclient/nfs_clbio.c	Sun Mar 27 02:35:32 2016
> X ***************
> X *** 135,140 ****
> X   	 */
> X   	VM_OBJECT_WLOCK(object);
> X ! 	if (pages[npages - 1]->valid != 0 && --npages == 0)
> X   		goto out;
> X   	VM_OBJECT_WUNLOCK(object);
> X
> X --- 135,155 ----
> X   	 */
> X   	VM_OBJECT_WLOCK(object);
> X ! #if 0
> X ! 	/* This matches the comment, but doesn't work (has little effect). */
> X ! 	if (pages[0]->valid != 0)
> X   		goto out;
>
> The comment still says that the code checks the requested page, but
> that is no longer passed to the function in a_reqpage.  The first page
> is a better guess of the requested page than the last one, but when
> npages is 1 these pages are the same.
>
> X + #else
> X + 	if (pages[0]->valid != 0)
> X + 		printf("ncl_getpages: page 0 valid; npages %d\n", npages);
> X + 	for (i = 0; i < npages; i++)
> X + 		if (pages[i]->valid != 0)
> X + 			printf("ncl_getpages: page %d valid; npages %d\n",
> X + 			    i, npages);
> X + 	for (i = 0; i < npages; i++)
> X + 		if (pages[i]->valid != 0)
> X + 			npages = i;
> X + 	if (npages == 0)
> X + 		goto out;
> X + #endif
>
> Debugging and more forceful guessing code.  This makes little
> difference except of course to spam the console.
>
> X   	VM_OBJECT_WUNLOCK(object);
> X
> X ***************
> X *** 199,202 ****
> X --- 214,220 ----
> X   		KASSERT(m->dirty == 0,
> X   		    ("nfs_getpages: page %p is dirty", m));
> X + 		printf("ncl_getpages: partial page %d of %d %s\n",
> X + 		    i, npages,
> X + 		    pages[i]->valid != 0 ? "valid" : "invalid");
> X   	} else {
> X   		/*
> X ***************
> X *** 210,215 ****
> X --- 228,239 ----
> X   		 */
> X   		;
> X + 		printf("ncl_getpages: short page %d of %d %s\n",
> X + 		    i, npages,
> X + 		    pages[i]->valid != 0 ? "valid" : "invalid");
> X   	}
> X   }
> X + 	for (i = 0; i < npages; i++)
> X + 		printf("ncl_getpages: page %d of %d %s\n",
> X + 		    i, npages, pages[i]->valid != 0 ? "valid" : "invalid");
> X   out:
> X   	VM_OBJECT_WUNLOCK(object);
>
> Further debugging code.  Similar debugging code in the old working
> version shows that normal operation for paging in a 15K file with an
> rsize of 16K is:
> - call here with npages = 4
> - page in 3 full pages and 1 partial page using 1 RPC
> - call here again with npages = 1 for the partial page
> - use the optimization of returning early for this page -- don't do
>   another RPC
>
> The buggy version does:
> - call here with npages = 1; page in 1 full page using 1 RPC
> - call here with npages = 1; page in 1 full page using 1 RPC
> - call here with npages = 1; page in 1 full page using 1 RPC
> - call here with npages = 1; page in 1 partial page using 1 RPC
> - call here again with npages = 1 for the partial page; the
>   optimization works as before.
>
> The partial page isn't handled very well, but at least there is no
> extra physical i/o for it, at least if it is at EOF.  vfs clustering
> handles partial pages even worse than this.
>
> Bruce

This seems to be very difficult to fix.  Even
vnode_pager_generic_getpages() does input 1 page at a time unless the fs
supports bmap, and nfs doesn't support bmap.

Memory mapped input was already much slower than read().  I get the
following speeds for reading an 8MB file with a cold cache on the client
and a warm cache on the server (8MB is the limit below which cp(1) does
"optimized" i/o using mmap):
- read(): 64MB/sec (doesn't depend much on rsize; the network can't go
  much faster than 70MB/sec; probably does depend on low network latency
  (~100 usec))
- FreeBSD-10 mmap(), rsize=16K: 40MB/sec
- FreeBSD-10 mmap(), rsize=8K: 32MB/sec
- FreeBSD-9 mmap(), rsize=8K: 16MB/sec

vnode_pager_generic_getpages() has always had an even better
pessimization if the fs does support bmap but the fs block size is
smaller than PAGE_SIZE.  Then it calls vnode_pager_input_smlfs() once
for each page to reduce the input size to the fs block size, which can
be as low as 512.  However, if the fs doesn't support bmap, it calls
vnode_pager_input_old() once for each page.  vnode_pager_input_old()
uses VOP_READ(), so it is not much slower than read().

vnode_pager_input_old() has a chance of being better than the
specialized pager methods in all cases, since read() should do vfs
clustering, which does more than read-ahead.  vfs clustering doesn't do
read-behind AFAIR, but perhaps its better read-ahead is more important.
All methods have problems guessing how much read-ahead is good.  The
read-ahead and read-behind should certainly be enough to fill out an fs
block.

I couldn't get full-fs-block input to work right in ncl_getpages().  In
the old version, vm_fault_hold() almost always calls it with the pages
for exactly 1 full block, because vm_fault_additional_pages() somehow
finds this many pages.
nfs doesn't use vfs clustering and wouldn't benefit from it, since its
block size is the same as its maximum i/o size.  The block size can now
be MAXBCACHEBUF, which can be larger than MAXBSIZE, but I think using
that is a wrong way to do things -- a default small block size plus
clustering works better by not using large blocks for everything.

ncl_getpages() could probably expand the read size to MAXPHYS (or
whatever the pbuf size limit is) just as easily as to the rsize.

Bruce