Date:      Wed, 9 Oct 1996 10:32:13 +0100 (BST)
From:      Doug Rabson <dfr@render.com>
To:        Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp>
Cc:        freebsd-bugs@freebsd.org, freebsd-current@freebsd.org
Subject:   Re: NFS from Solaris Server
Message-ID:  <Pine.BSF.3.95.961009101127.10204Z-100000@minnow.render.com>
In-Reply-To: <15438.844844779@sat.t.u-tokyo.ac.jp>

[hackers removed from Cc list]

On Wed, 9 Oct 1996, Hidetoshi Shimokawa wrote:

> [snip]
> 
> I have been looking into this problem since yesterday, adding some debugging
> code to the kernel. Here are some results.
> 
> The system is,
> 
>     PentiumPro              100M Ether        Sun Ultra   Wide SCSI
>     FreeBSD-current        <-------------->  Solaris 2.5  --------- Disk
>     NFSv2 client                             NFSv2 server   6MB/s(iozone)
>                                                               
> and I measured performance with iozone.
> With the default setting (with 4 nfsiods), I can get only 300KB/s.
> (- With NFSv3, I got around 500-600KB/s.
> - SS20 <-> Sun Ultra gets around 800KB/s.)
> 
> 1) A faster client(pentium class) gets less performance than a slower
> client(486 class).
> 
> 2) After I killed all the async daemons (nfsiod), I got 400KB/s, which
> seems funny :-).
> 
> 3) By adding some debugging code, I found that the performance
> reduction happens when the nfsiods are all busy and the buffer
> is marked as B_DELWRI (delayed write) in nfs_asyncio() (/sys/nfs/nfs_bio.c).
> This explains 1).
> 
> 4) I changed the code so that nfs_asyncio returns EIO before
> marking B_DELWRI, and then I got 800-900KB/s.
> I think this algorithm is essentially the same as in 2.1.5 or BSDI 2.1.
> 
> 5) It is interesting that the change above doesn't improve v3 performance.
> 
> 6) I don't know how efficient the delayed write scheme is, but at this
> point it is a bottleneck. This is because, after one nfsbiod starts
> processing the delayed write buffer in nfssvc_iod() (nfs_syscalls.c),
> the other nfsbiods stop working.  I confirmed this with debugging code
> in the kernel, but it can also be easily observed by

I was just looking at the code and returning EIO from nfs_asyncio would
work.  The effect it would have is to perform the rpc synchronously in the
context of the writer instead of asynchronously in the context of one of
the iods.  If the server were able to avoid the 'write to stable storage'
requirement of NFSv2 (either because of NVRAM or because it is cheating)
then this would not be too inefficient.
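
To make the fallback concrete, here is a toy user-space model of it.  This
is only a sketch: the toy_* names and the struct are made up for
illustration and are not the kernel code.  It just shows the writer doing
the work itself whenever the async hand-off fails with EIO.

```c
#include <errno.h>

/* Hypothetical toy model, not the FreeBSD kernel data structures. */
struct toy_iod_pool {
    int idle_iods;     /* workers free to take a request */
    int async_writes;  /* writes handed off to a worker */
    int sync_writes;   /* writes performed by the caller itself */
};

/* Stand-in for nfs_asyncio(): hand off to an idle worker or fail with EIO. */
static int toy_asyncio(struct toy_iod_pool *pool)
{
    if (pool->idle_iods == 0)
        return EIO;            /* nobody is free: the caller must do the write */
    pool->idle_iods--;
    pool->async_writes++;
    return 0;
}

/* Stand-in for the write path: try async first, fall back to sync on error. */
static void toy_write(struct toy_iod_pool *pool)
{
    if (toy_asyncio(pool) != 0)
        pool->sync_writes++;   /* rpc issued in the writer's own context */
}
```

With two idle workers and three calls to toy_write(), the first two writes
go async and the third is done synchronously by the caller.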

I am not sure why the delayed write handling would lock out the other
iods.  Maybe it is caused by the loop in nfssvc_iod which tries to flush
out all the delayed write buffers in the vnode.  It may be that the
writing process is adding buffers at the same rate as the iod is writing
them and the race is locking out the other iods.  It might be worth
experimenting with this code to limit the number of buffers it processes
in the loop.
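
As a rough illustration of what limiting the loop could look like, here is
a toy user-space sketch.  The toy_buf list and toy_flush_some() are
hypothetical names, not anything in nfssvc_iod(); the point is only that
each pass writes at most `limit` buffers and then gives up control, even if
the writer keeps adding more.

```c
#include <stddef.h>

/* Hypothetical toy model: a singly linked list of dirty buffers. */
struct toy_buf {
    struct toy_buf *next;
    int written;
};

/*
 * Write out at most `limit` buffers from the head of the list and return
 * the new head.  *nwritten reports how many buffers were processed.
 */
static struct toy_buf *toy_flush_some(struct toy_buf *head, int limit,
                                      int *nwritten)
{
    int n = 0;

    while (head != NULL && n < limit) {
        head->written = 1;     /* stands in for the actual write rpc */
        head = head->next;
        n++;
    }
    *nwritten = n;
    return head;               /* remaining work, left for a later pass */
}
```

The caller would pick the limit by experiment; the sketch does not suggest
a value.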

> 
> % iozone & sleep 2;  cat /proc/"nfsiod's pid"/status
> 
> nfsiod 15057 1 15056 0 -1,-1 noflags 844842823,10788 0,0 0,78906 sbwait 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15058 1 15056 0 -1,-1 noflags 844842823,10874 0,0 0,30899 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15059 1 15056 0 -1,-1 noflags 844842823,10939 0,0 0,3486 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15060 1 15056 0 -1,-1 noflags 844842823,11001 0,0 0,1694 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15061 1 15056 0 -1,-1 noflags 844842823,11061 0,0 0,1395 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15062 1 15056 0 -1,-1 noflags 844842823,11119 0,0 0,1601 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15063 1 15056 0 -1,-1 noflags 844842823,11176 0,0 0,1339 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15064 1 15056 0 -1,-1 noflags 844842823,11233 0,0 0,1277 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
> 
> All nfsiods except one are blocked in nfs_rcvlock (nfs_socket.c) for a
> long time.  tcpdump shows that the server did reply, but the nfsiod cannot
> process the reply because of the lock.
> I'm not familiar with NFS and kernel programming, but
> it seems that the current code has a locking problem.
> 
> I don't know how to fix it, please help, NFS and kernel experts!
> 
> (Doesn't the code between nfs_send and nfs_rcvlock in nfs_request() and
> nfs_reply() need to be non-interruptible?)
> 
> I would also like to know why the NFSv3 client is so slow.

NFSv3 in current has a bug where uncommitted unstably written buffers can
later be rewritten to the server synchronously.  This patch fixes this bug
as well as improving the code which sends commit rpcs to the server to
reduce the number of rpcs needed.  It also marks buffers of uncommitted
data so that they can be cluster-committed automatically by the bio
system.
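
The rpc reduction comes from covering all the waiting buffers with one
offset range and committing that range once.  A minimal user-space sketch
of the range calculation follows; toy_range and toy_commit_span() are
made-up names for illustration, and the sketch assumes that committing the
gaps between buffers is harmless.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the dirty region of one buffer: [off, end). */
struct toy_range {
    uint64_t off;
    uint64_t end;
};

/*
 * Compute the smallest single range covering all n ranges, so that one
 * commit can replace n separate ones.  Returns 0 if there is nothing to
 * commit, 1 otherwise.
 */
static int toy_commit_span(const struct toy_range *r, size_t n,
                           struct toy_range *span)
{
    uint64_t off = UINT64_MAX;   /* same idea as off = (u_quad_t)-1 */
    uint64_t end = 0;            /* same idea as endoff = 0 */
    size_t i;

    for (i = 0; i < n; i++) {
        if (r[i].off < off)
            off = r[i].off;
        if (r[i].end > end)
            end = r[i].end;
    }
    if (n == 0)
        return 0;
    span->off = off;
    span->end = end;
    return 1;
}
```

The single commit would then be issued for offset span.off and length
span.end - span.off.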

Index: nfs_bio.c
===================================================================
RCS file: /home/ncvs/src/sys/nfs/nfs_bio.c,v
retrieving revision 1.25
diff -u -r1.25 nfs_bio.c
--- nfs_bio.c	1996/09/19 18:20:54	1.25
+++ nfs_bio.c	1996/10/09 09:10:07
@@ -905,9 +905,11 @@
 		    iomode = NFSV3WRITE_FILESYNC;
 		bp->b_flags |= B_WRITEINPROG;
 		error = nfs_writerpc(vp, uiop, cr, &iomode, &must_commit);
-		if (!error && iomode == NFSV3WRITE_UNSTABLE)
+		if (!error && iomode == NFSV3WRITE_UNSTABLE) {
 		    bp->b_flags |= B_NEEDCOMMIT;
-		else
+		    if (bp->b_dirtyoff == 0 && bp->b_dirtyend == bp->b_bufsize)
+			bp->b_flags |= B_CLUSTEROK;
+		} else
 		    bp->b_flags &= ~B_NEEDCOMMIT;
 		bp->b_flags &= ~B_WRITEINPROG;
 
Index: nfs_vnops.c
===================================================================
RCS file: /home/ncvs/src/sys/nfs/nfs_vnops.c,v
retrieving revision 1.35
diff -u -r1.35 nfs_vnops.c
--- nfs_vnops.c	1996/09/19 18:21:01	1.35
+++ nfs_vnops.c	1996/10/09 09:10:15
@@ -1210,7 +1210,10 @@
 		tsiz -= len;
 	}
 nfsmout:
-	*iomode = committed;
+	if (vp->v_mount->mnt_flag & MNT_ASYNC)
+		*iomode = NFSV3WRITE_FILESYNC;
+	else
+		*iomode = committed;
 	if (error)
 		uiop->uio_resid = tsiz;
 	return (error);
@@ -2607,6 +2610,9 @@
 	int error = 0, wccflag = NFSV3_WCCRATTR;
 	struct mbuf *mreq, *mrep, *md, *mb, *mb2;
 	
+#ifdef NFS_DEBUG
+	printf("nfs_commit(%x, %d, %d, %x, %x)\n", vp, (int) offset, cnt, cred, procp);
+#endif
 	if ((nmp->nm_flag & NFSMNT_HASWRITEVERF) == 0)
 		return (0);
 	nfsstats.rpccnt[NFSPROC_COMMIT]++;
@@ -2757,13 +2763,14 @@
 	struct nfsmount *nmp = VFSTONFS(vp->v_mount);
 	int s, error = 0, slptimeo = 0, slpflag = 0, retv, bvecpos;
 	int passone = 1;
-	u_quad_t off = (u_quad_t)-1, endoff = 0, toff;
+	u_quad_t off, endoff, toff;
 	struct ucred* wcred = NULL;
-#ifndef NFS_COMMITBVECSIZ
-#define NFS_COMMITBVECSIZ	20
-#endif
-	struct buf *bvec[NFS_COMMITBVECSIZ];
+	struct buf **bvec = NULL;
+	int bvecsize = 0, bveccount;
 
+#ifdef NFS_DEBUG
+	printf("nfs_flush(%x, %x, %d, %x, %d)\n", vp, cred, waitfor, p, commit);
+#endif
 	if (nmp->nm_flag & NFSMNT_INT)
 		slpflag = PCATCH;
 	if (!commit)
@@ -2776,12 +2783,41 @@
 	 * job.
 	 */
 again:
+	off = (u_quad_t)-1;
+	endoff = 0;
 	bvecpos = 0;
 	if (NFS_ISV3(vp) && commit) {
 		s = splbio();
+		/*
+		 * Count up how many buffers waiting for a commit.
+		 */
+		bveccount = 0;
+		for (bp = vp->v_dirtyblkhd.lh_first; bp; bp = nbp) {
+			nbp = bp->b_vnbufs.le_next;
+			if ((bp->b_flags & (B_BUSY | B_DELWRI | B_NEEDCOMMIT))
+			    == (B_DELWRI | B_NEEDCOMMIT))
+				bveccount++;
+		}
+		/*
+		 * Allocate space to remember the list of bufs to commit.  It is
+		 * important to use M_NOWAIT here to avoid a race with nfs_write.
+		 * If we can't get memory (for whatever reason), we will end up
+		 * committing the buffers one-by-one in the loop below.
+		 */
+		if (bveccount > bvecsize) {
+			if (bvec != NULL)
+				free(bvec, M_TEMP);
+			bvec = (struct buf **)
+				malloc(bveccount * sizeof(struct buf *),
+				       M_TEMP, M_NOWAIT);
+			if (bvec == NULL)
+				bvecsize = 0;
+			else
+				bvecsize = bveccount;
+		}
 		for (bp = vp->v_dirtyblkhd.lh_first; bp; bp = nbp) {
 			nbp = bp->b_vnbufs.le_next;
-			if (bvecpos >= NFS_COMMITBVECSIZ)
+			if (bvecpos >= bvecsize)
 				break;
 			if ((bp->b_flags & (B_BUSY | B_DELWRI | B_NEEDCOMMIT))
 				!= (B_DELWRI | B_NEEDCOMMIT))
@@ -2822,10 +2858,14 @@
 		 * one call for all of them, otherwise commit each one
 		 * separately.
 		 */
-		if (wcred != NOCRED)
+		if (wcred != NOCRED) {
+#ifdef NFS_DEBUG
+printf("nfs_flush: calling nfs_commit(%x, %d, %d, %x, %x)\n",
+	vp, (int) off, (int) (endoff - off), wcred, p);
+#endif
 			retv = nfs_commit(vp, off, (int)(endoff - off),
 					  wcred, p);
-		else {
+		} else {
 			retv = 0;
 			for (i = 0; i < bvecpos; i++) {
 				off_t off, size;
@@ -2879,8 +2919,10 @@
 				"nfsfsync", slptimeo);
 			splx(s);
 			if (error) {
-			    if (nfs_sigintr(nmp, (struct nfsreq *)0, p))
-				return (EINTR);
+			    if (nfs_sigintr(nmp, (struct nfsreq *)0, p)) {
+				error = EINTR;
+				goto done;
+			    }
 			    if (slpflag == PCATCH) {
 				slpflag = 0;
 				slptimeo = 2 * hz;
@@ -2892,6 +2934,9 @@
 			panic("nfs_fsync: not dirty");
 		if ((passone || !commit) && (bp->b_flags & B_NEEDCOMMIT))
 			continue;
+#ifdef NFS_DEBUG
+printf("nfs_flush: writing bp=%x, bp->b_flags=%x\n", bp, bp->b_flags);
+#endif
 		bremfree(bp);
 		if (passone || !commit)
 		    bp->b_flags |= (B_BUSY|B_ASYNC);
@@ -2912,8 +2957,10 @@
 			error = tsleep((caddr_t)&vp->v_numoutput,
 				slpflag | (PRIBIO + 1), "nfsfsync", slptimeo);
 			if (error) {
-			    if (nfs_sigintr(nmp, (struct nfsreq *)0, p))
-				return (EINTR);
+			    if (nfs_sigintr(nmp, (struct nfsreq *)0, p)) {
+				error = EINTR;
+				goto done;
+			    }
 			    if (slpflag == PCATCH) {
 				slpflag = 0;
 				slptimeo = 2 * hz;
@@ -2928,6 +2975,9 @@
 		error = np->n_error;
 		np->n_flag &= ~NWRITEERR;
 	}
+done:
+	if (bvec)
+		free(bvec, M_TEMP);
 	return (error);
 }
 
@@ -3129,8 +3179,9 @@
 	 * an actual write will have to be scheduled via. VOP_STRATEGY().
 	 * If B_WRITEINPROG is already set, then push it with a write anyhow.
 	 */
-	if (oldflags & (B_NEEDCOMMIT | B_WRITEINPROG) == B_NEEDCOMMIT) {
+	if ((oldflags & (B_NEEDCOMMIT | B_WRITEINPROG)) == B_NEEDCOMMIT) {
 		off = ((u_quad_t)bp->b_blkno) * DEV_BSIZE + bp->b_dirtyoff;
+		vfs_busy_pages(bp, 1);
 		bp->b_flags |= B_WRITEINPROG;
 		retv = nfs_commit(bp->b_vp, off, bp->b_dirtyend-bp->b_dirtyoff,
 			bp->b_wcred, bp->b_proc);
@@ -3139,8 +3190,10 @@
 			bp->b_dirtyoff = bp->b_dirtyend = 0;
 			bp->b_flags &= ~B_NEEDCOMMIT;
 			biodone(bp);
-		} else if (retv == NFSERR_STALEWRITEVERF)
+		} else if (retv == NFSERR_STALEWRITEVERF) {
 			nfs_clearcommit(bp->b_vp->v_mount);
+			vfs_unbusy_pages(bp);
+		}
 	}
 	if (retv) {
 		if (force)
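
A note on the parenthesization change in the last hunk: in C, == binds
more tightly than &, so the old test compared the two flag masks first and
then ANDed the result (0) with oldflags.  The sketch below shows the
difference with made-up flag values (the TOY_* constants are hypothetical,
not the kernel's definitions).

```c
/* Hypothetical flag values chosen for illustration; not the kernel's. */
#define TOY_NEEDCOMMIT  0x1
#define TOY_WRITEINPROG 0x2

/*
 * The old expression, written out the way the compiler parses it:
 * (A | B) == A is 0 whenever B is non-zero, so the result is always 0.
 */
static int toy_test_as_parsed(int flags)
{
    return flags & ((TOY_NEEDCOMMIT | TOY_WRITEINPROG) == TOY_NEEDCOMMIT);
}

/* The intended test: NEEDCOMMIT set and WRITEINPROG clear. */
static int toy_test_fixed(int flags)
{
    return (flags & (TOY_NEEDCOMMIT | TOY_WRITEINPROG)) == TOY_NEEDCOMMIT;
}
```

With these values the old form is false for every input, so the branch
guarded by it could never be taken.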


--
Doug Rabson, Microsoft RenderMorphics Ltd.	Mail:  dfr@render.com
						Phone: +44 171 734 3761
						FAX:   +44 171 734 6426



