From owner-freebsd-bugs Wed Oct 9 03:16:13 1996
Return-Path: owner-bugs
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id DAA11963 for bugs-outgoing; Wed, 9 Oct 1996 03:16:13 -0700 (PDT)
Received: from minnow.render.com (render.demon.co.uk [158.152.30.118]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id DAA11902; Wed, 9 Oct 1996 03:15:51 -0700 (PDT)
Received: from minnow.render.com (minnow.render.com [193.195.178.1]) by minnow.render.com (8.6.12/8.6.9) with SMTP id KAA16977; Wed, 9 Oct 1996 10:32:14 +0100
Date: Wed, 9 Oct 1996 10:32:13 +0100 (BST)
From: Doug Rabson
To: Hidetoshi Shimokawa
cc: freebsd-bugs@freebsd.org, freebsd-current@freebsd.org
Subject: Re: NFS from Solaris Server
In-Reply-To: <15438.844844779@sat.t.u-tokyo.ac.jp>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-bugs@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

[hackers removed from Cc list]

On Wed, 9 Oct 1996, Hidetoshi Shimokawa wrote:

> [snip]
>
> I am looking into this problem since yesterday, adding some debugging
> code into kernel. The following is some results around this.
>
> The system is,
>
>     PentiumPro       100M Ether       Sun Ultra     Wide SCSI
>     FreeBSD-current <--------------> Solaris 2.5 --------- Disk
>     NFSv2 client                     NFSv2 server   6MB/s(iozone)
>
> and I measured performance by iozone.
> With the default setting (with 4 nfsiods), I can get only 300KB/s.
> - With NFSv3, I got around 500KB-600KB.
> - SS20 <-> Sun Ultra gets around 800KB/s.)
>
> 1) A faster client(pentium class) gets less performance than a slower
>    client(486 class).
>
> 2) After I killed all async daemon (nfsiod), I got 400KB/s, this seems
>    funny :-).
>
> 3) By added some debugging code, I found that the performance
>    reduction happens when the nfsiods are all busy and the buffer
>    is marked as B_DELWRI(delayed write) in nfs_asyncio() (/sys/nfs/nfs_bio.c).
>    This explains 1).
>
> 4) I changed the code so that nfs_asyncio returns with EIO before
>    marks B_DELWRI, then I got 800-900KB/s.
>    I think this algorithm is essentially same as 2.1.5 or BSDI 2.1.
>
> 5) It is interesting that the change above doesn't improve v3 performance.
>
> 6) I don't know how delayed write scheme is efficient, but at this
>    point, it is a bottleneck. It is because, after a nfsbiod starts
>    processing the delayed write buffer in nfssvc_iod() (nfs_syscalls.c),
>    other nfsbiods stop its work. I confirmed by debugging code in
>    kernel, but it can be also easily observed by

I was just looking at the code, and returning EIO from nfs_asyncio would
work.  The effect is to perform the rpc synchronously in the context of
the writer instead of asynchronously in the context of one of the iods.
If the server is able to avoid the 'write to stable storage' requirement
of NFSv2 (either because of NVRAM or because it is cheating) then this
would not be too inefficient.

I am not sure why the delayed write handling would lock out the other
iods.  Maybe it is caused by the loop in nfssvc_iod which tries to flush
out all the delayed write buffers in the vnode.  It may be that the
writing process is adding buffers at the same rate as the iod is writing
them, and the race is locking out the other iods.  It might be worth
experimenting with this code to limit the number of buffers it processes
in the loop.
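
To make the last suggestion concrete, here is a minimal userland sketch of
a flush loop with an upper bound on the work done per pass.  It is not the
nfssvc_iod code: `struct dw_queue`, `flush_bounded` and the `limit`
parameter are hypothetical names invented for illustration, and the
"write" is just a counter decrement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vnode's queue of delayed-write buffers. */
struct dw_queue {
	int pending;		/* buffers still waiting to be written */
};

/*
 * Write out at most `limit` buffers and then return, so that the caller
 * (standing in for an iod) can go back to servicing other requests even
 * if a writer keeps refilling the queue.  Returns the number of buffers
 * written in this pass.
 */
int
flush_bounded(struct dw_queue *q, int limit)
{
	int done = 0;

	assert(q != NULL && limit >= 0);
	while (q->pending > 0 && done < limit) {
		q->pending--;	/* stands in for one write rpc */
		done++;
	}
	return (done);
}
```

With a bound like this, a writer that adds buffers as fast as they are
flushed can no longer keep a single pass of the loop running forever;
what value of the bound is sensible would have to be found by experiment.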
>
> % iozone & sleep 2; cat /proc/"nfsiod's pid"/status
>
> nfsiod 15057 1 15056 0 -1,-1 noflags 844842823,10788 0,0 0,78906 sbwait 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15058 1 15056 0 -1,-1 noflags 844842823,10874 0,0 0,30899 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15059 1 15056 0 -1,-1 noflags 844842823,10939 0,0 0,3486 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15060 1 15056 0 -1,-1 noflags 844842823,11001 0,0 0,1694 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15061 1 15056 0 -1,-1 noflags 844842823,11061 0,0 0,1395 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15062 1 15056 0 -1,-1 noflags 844842823,11119 0,0 0,1601 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15063 1 15056 0 -1,-1 noflags 844842823,11176 0,0 0,1339 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
> nfsiod 15064 1 15056 0 -1,-1 noflags 844842823,11233 0,0 0,1277 nfsrcvlk 0 0 0,0,0,5,2,3,4,20,31
>
> All nfsiods except one are locked at nfs_rcvlock (nfs_socket.c) for long time.
> By tcpdump, server did reply but nfsiod can not process it for locking.
> I'm not familiar with NFS and kernel programming but
> it seems that current code has locking problem.
>
> I don't know how to fix it, please help, NFS and kernel experts!
>
> (Isn't it needed to be non-interruptable for the code between nfs_send and
> nfs_rcvlock in nfs_request() and nfs_reply()?)
>
> I also like to know why NFSv3 client is so slow.

NFSv3 in current has a bug where uncommitted unstably written buffers can
later be rewritten to the server synchronously.  This patch fixes this bug
as well as improving the code which sends commit rpcs to the server to
reduce the number of rpcs needed.  It also marks buffers of uncommitted
data so that they can be cluster-committed automatically by the bio
system.
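
As a rough illustration of how one commit rpc can cover several buffers,
the sketch below computes the smallest byte range spanning a set of
uncommitted ranges, using the same initialisation as the patch
(off = (u_quad_t)-1, endoff = 0).  It is a standalone userland example;
`struct dirty_range` and `commit_span` are hypothetical names, not part
of the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one buffer awaiting a commit. */
struct dirty_range {
	uint64_t start;		/* offset of the first uncommitted byte */
	uint64_t end;		/* offset just past the last uncommitted byte */
};

/*
 * Compute the smallest range [*off, *endoff) that covers all n ranges,
 * so that a single commit of (*endoff - *off) bytes at *off covers
 * every buffer.  Returns 0 when there is nothing to commit.
 */
int
commit_span(const struct dirty_range *r, size_t n,
    uint64_t *off, uint64_t *endoff)
{
	size_t i;

	assert(off != NULL && endoff != NULL);
	*off = (uint64_t)-1;	/* larger than any real offset */
	*endoff = 0;
	for (i = 0; i < n; i++) {
		if (r[i].start < *off)
			*off = r[i].start;
		if (r[i].end > *endoff)
			*endoff = r[i].end;
	}
	return (n > 0);
}
```

The span may include bytes between buffers that were never written
unstably; this sketch assumes that asking the server to commit a wider
range than strictly necessary is acceptable.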

Index: nfs_bio.c
===================================================================
RCS file: /home/ncvs/src/sys/nfs/nfs_bio.c,v
retrieving revision 1.25
diff -u -r1.25 nfs_bio.c
--- nfs_bio.c	1996/09/19 18:20:54	1.25
+++ nfs_bio.c	1996/10/09 09:10:07
@@ -905,9 +905,11 @@
 		    iomode = NFSV3WRITE_FILESYNC;
 		bp->b_flags |= B_WRITEINPROG;
 		error = nfs_writerpc(vp, uiop, cr, &iomode, &must_commit);
-		if (!error && iomode == NFSV3WRITE_UNSTABLE)
+		if (!error && iomode == NFSV3WRITE_UNSTABLE) {
 		    bp->b_flags |= B_NEEDCOMMIT;
-		else
+		    if (bp->b_dirtyoff == 0 && bp->b_dirtyend == bp->b_bufsize)
+			bp->b_flags |= B_CLUSTEROK;
+		} else
 		    bp->b_flags &= ~B_NEEDCOMMIT;
 		bp->b_flags &= ~B_WRITEINPROG;
 
Index: nfs_vnops.c
===================================================================
RCS file: /home/ncvs/src/sys/nfs/nfs_vnops.c,v
retrieving revision 1.35
diff -u -r1.35 nfs_vnops.c
--- nfs_vnops.c	1996/09/19 18:21:01	1.35
+++ nfs_vnops.c	1996/10/09 09:10:15
@@ -1210,7 +1210,10 @@
 		tsiz -= len;
 	}
 nfsmout:
-	*iomode = committed;
+	if (vp->v_mount->mnt_flag & MNT_ASYNC)
+		*iomode = NFSV3WRITE_FILESYNC;
+	else
+		*iomode = committed;
 	if (error)
 		uiop->uio_resid = tsiz;
 	return (error);
@@ -2607,6 +2610,9 @@
 	int error = 0, wccflag = NFSV3_WCCRATTR;
 	struct mbuf *mreq, *mrep, *md, *mb, *mb2;
 
+#ifdef NFS_DEBUG
+	printf("nfs_commit(%x, %d, %d, %x, %x)\n", vp, (int) offset, cnt, cred, procp);
+#endif
 	if ((nmp->nm_flag & NFSMNT_HASWRITEVERF) == 0)
 		return (0);
 	nfsstats.rpccnt[NFSPROC_COMMIT]++;
@@ -2757,13 +2763,14 @@
 	struct nfsmount *nmp = VFSTONFS(vp->v_mount);
 	int s, error = 0, slptimeo = 0, slpflag = 0, retv, bvecpos;
 	int passone = 1;
-	u_quad_t off = (u_quad_t)-1, endoff = 0, toff;
+	u_quad_t off, endoff, toff;
 	struct ucred* wcred = NULL;
-#ifndef NFS_COMMITBVECSIZ
-#define NFS_COMMITBVECSIZ	20
-#endif
-	struct buf *bvec[NFS_COMMITBVECSIZ];
+	struct buf **bvec = NULL;
+	int bvecsize = 0, bveccount;
 
+#ifdef NFS_DEBUG
+	printf("nfs_flush(%x, %x, %d, %x, %d)\n", vp, cred, waitfor, p, commit);
+#endif
 	if (nmp->nm_flag & NFSMNT_INT)
 		slpflag = PCATCH;
 	if (!commit)
@@ -2776,12 +2783,41 @@
 	 * job.
 	 */
 again:
+	off = (u_quad_t)-1;
+	endoff = 0;
 	bvecpos = 0;
 	if (NFS_ISV3(vp) && commit) {
 		s = splbio();
+		/*
+		 * Count up how many buffers waiting for a commit.
+		 */
+		bveccount = 0;
+		for (bp = vp->v_dirtyblkhd.lh_first; bp; bp = nbp) {
+			nbp = bp->b_vnbufs.le_next;
+			if ((bp->b_flags & (B_BUSY | B_DELWRI | B_NEEDCOMMIT))
+			    == (B_DELWRI | B_NEEDCOMMIT))
+				bveccount++;
+		}
+		/*
+		 * Allocate space to remember the list of bufs to commit.  It is
+		 * important to use M_NOWAIT here to avoid a race with nfs_write.
+		 * If we can't get memory (for whatever reason), we will end up
+		 * committing the buffers one-by-one in the loop below.
+		 */
+		if (bveccount > bvecsize) {
+			if (bvec != NULL)
+				free(bvec, M_TEMP);
+			bvec = (struct buf **)
+				malloc(bveccount * sizeof(struct buf *),
+				       M_TEMP, M_NOWAIT);
+			if (bvec == NULL)
+				bvecsize = 0;
+			else
+				bvecsize = bveccount;
+		}
 		for (bp = vp->v_dirtyblkhd.lh_first; bp; bp = nbp) {
 			nbp = bp->b_vnbufs.le_next;
-			if (bvecpos >= NFS_COMMITBVECSIZ)
+			if (bvecpos >= bvecsize)
 				break;
 			if ((bp->b_flags & (B_BUSY | B_DELWRI | B_NEEDCOMMIT))
 				!= (B_DELWRI | B_NEEDCOMMIT))
@@ -2822,10 +2858,14 @@
 		 * one call for all of them, otherwise commit each one
 		 * separately.
 		 */
-		if (wcred != NOCRED)
+		if (wcred != NOCRED) {
+#ifdef NFS_DEBUG
+printf("nfs_flush: calling nfs_commit(%x, %d, %d, %x, %x)\n",
+       vp, (int) off, (int) (endoff - off), wcred, p);
+#endif
 			retv = nfs_commit(vp, off, (int)(endoff - off),
 					  wcred, p);
-		else {
+		} else {
 			retv = 0;
 			for (i = 0; i < bvecpos; i++) {
 				off_t off, size;
@@ -2879,8 +2919,10 @@
 					"nfsfsync", slptimeo);
 				splx(s);
 				if (error) {
-					if (nfs_sigintr(nmp, (struct nfsreq *)0, p))
-						return (EINTR);
+					if (nfs_sigintr(nmp, (struct nfsreq *)0, p)) {
+						error = EINTR;
+						goto done;
+					}
 					if (slpflag == PCATCH) {
 						slpflag = 0;
 						slptimeo = 2 * hz;
@@ -2892,6 +2934,9 @@
 				panic("nfs_fsync: not dirty");
 			if ((passone || !commit) && (bp->b_flags & B_NEEDCOMMIT))
 				continue;
+#ifdef NFS_DEBUG
+printf("nfs_flush: writing bp=%x, bp->b_flags=%x\n", bp, bp->b_flags);
+#endif
 			bremfree(bp);
 			if (passone || !commit)
 				bp->b_flags |= (B_BUSY|B_ASYNC);
@@ -2912,8 +2957,10 @@
 			error = tsleep((caddr_t)&vp->v_numoutput,
 				slpflag | (PRIBIO + 1), "nfsfsync", slptimeo);
 			if (error) {
-				if (nfs_sigintr(nmp, (struct nfsreq *)0, p))
-					return (EINTR);
+				if (nfs_sigintr(nmp, (struct nfsreq *)0, p)) {
+					error = EINTR;
+					goto done;
+				}
 				if (slpflag == PCATCH) {
 					slpflag = 0;
 					slptimeo = 2 * hz;
@@ -2928,6 +2975,9 @@
 		error = np->n_error;
 		np->n_flag &= ~NWRITEERR;
 	}
+done:
+	if (bvec)
+		free(bvec, M_TEMP);
 	return (error);
 }
 
@@ -3129,8 +3179,9 @@
 	 * an actual write will have to be scheduled via. VOP_STRATEGY().
 	 * If B_WRITEINPROG is already set, then push it with a write anyhow.
 	 */
-	if (oldflags & (B_NEEDCOMMIT | B_WRITEINPROG) == B_NEEDCOMMIT) {
+	if ((oldflags & (B_NEEDCOMMIT | B_WRITEINPROG)) == B_NEEDCOMMIT) {
 		off = ((u_quad_t)bp->b_blkno) * DEV_BSIZE + bp->b_dirtyoff;
+		vfs_busy_pages(bp, 1);
 		bp->b_flags |= B_WRITEINPROG;
 		retv = nfs_commit(bp->b_vp, off, bp->b_dirtyend-bp->b_dirtyoff,
 			bp->b_wcred, bp->b_proc);
@@ -3139,8 +3190,10 @@
 			bp->b_dirtyoff = bp->b_dirtyend = 0;
 			bp->b_flags &= ~B_NEEDCOMMIT;
 			biodone(bp);
-		} else if (retv == NFSERR_STALEWRITEVERF)
+		} else if (retv == NFSERR_STALEWRITEVERF) {
 			nfs_clearcommit(bp->b_vp->v_mount);
+			vfs_unbusy_pages(bp);
+		}
 	}
 	if (retv) {
 		if (force)

--
Doug Rabson, Microsoft RenderMorphics Ltd.
Mail:  dfr@render.com
Phone: +44 171 734 3761
FAX:   +44 171 734 6426