From owner-freebsd-bugs@FreeBSD.ORG Mon Nov 19 14:20:02 2007
Return-Path:
Delivered-To: freebsd-bugs@hub.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E733D16A46B
	for ; Mon, 19 Nov 2007 14:20:02 +0000 (UTC)
	(envelope-from gnats@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id D072013C4B8
	for ; Mon, 19 Nov 2007 14:20:02 +0000 (UTC)
	(envelope-from gnats@FreeBSD.org)
Received: from freefall.freebsd.org (gnats@localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.1/8.14.1) with ESMTP id lAJEK21O051679
	for ; Mon, 19 Nov 2007 14:20:02 GMT
	(envelope-from gnats@freefall.freebsd.org)
Received: (from gnats@localhost)
	by freefall.freebsd.org (8.14.1/8.14.1/Submit) id lAJEK2hl051678;
	Mon, 19 Nov 2007 14:20:02 GMT
	(envelope-from gnats)
Resent-Date: Mon, 19 Nov 2007 14:20:02 GMT
Resent-Message-Id: <200711191420.lAJEK2hl051678@freefall.freebsd.org>
Resent-From: FreeBSD-gnats-submit@FreeBSD.org (GNATS Filer)
Resent-To: freebsd-bugs@FreeBSD.org
Resent-Reply-To: FreeBSD-gnats-submit@FreeBSD.org, bg@sics.se
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0CA8616A4CF
	for ; Mon, 19 Nov 2007 14:19:19 +0000 (UTC)
	(envelope-from root@duvel.sics.se)
Received: from duvel.sics.se (duvel.sics.se [193.10.67.84])
	by mx1.freebsd.org (Postfix) with ESMTP id 5326813C442
	for ; Mon, 19 Nov 2007 14:19:18 +0000 (UTC)
	(envelope-from root@duvel.sics.se)
Received: from duvel.sics.se (localhost [127.0.0.1])
	by duvel.sics.se (8.14.2/8.14.2) with ESMTP id lAJE4InT001477;
	Mon, 19 Nov 2007 14:04:18 GMT
	(envelope-from root@duvel.sics.se)
Received: (from root@localhost)
	by duvel.sics.se (8.14.2/8.14.2/Submit) id lAJE4IAm001476;
	Mon, 19 Nov 2007 14:04:18 GMT
	(envelope-from root)
Message-Id: <200711191404.lAJE4IAm001476@duvel.sics.se>
Date: Mon, 19 Nov 2007 14:04:18 GMT
From: bg@sics.se
To: FreeBSD-gnats-submit@FreeBSD.org
X-Send-Pr-Version: 3.113
Cc: Kip Macy
Subject: kern/118126: Poor NFS server write performance
X-BeenThere: freebsd-bugs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: bg@sics.se
List-Id: Bug reports
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 19 Nov 2007 14:20:03 -0000

>Number:         118126
>Category:       kern
>Synopsis:       Poor NFS server write performance
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          update
>Submitter-Id:   current-users
>Arrival-Date:   Mon Nov 19 14:20:02 UTC 2007
>Closed-Date:
>Last-Modified:
>Originator:     Bjoern Groenvall
>Release:        FreeBSD 8.0-CURRENT amd64
>Organization:
SICS
>Environment:
System: FreeBSD duvel.sics.se 8.0-CURRENT FreeBSD 8.0-CURRENT #2: Sun Nov 18 20:14:16 UTC 2007 root@duvel.sics.se:/usr/obj/usr/src/sys/GEN-NFS amd64
DELL PC, amd64, gigabit Ethernet, SATA disk
>Description:
A problem with the current NFS server is that it does not cluster writes,
which in turn leads to very poor sequential-write performance. With write
clustering enabled, NFS write performance goes from 26.6 Mbyte/s to
54.3 Mbyte/s, an increase by a factor of two. This is on a SATA disk with
write caching enabled (hw.ata.wc=1). If write caching is disabled,
performance still goes up from 1.6 Mbyte/s to 5.8 Mbyte/s (a factor of 3.6).

The attached patch (relative to current) makes the following changes:

1/ Rearrange the code so that the same code can be used to detect both
   sequential read and sequential write access.

2/ Merge in updates from vfs_vnops.c::sequential_heuristic.

3/ Use double hashing in order to avoid hash clustering in the nfsheur
   table. This change also makes it possible to reduce "try" from 32 to 8.

4/ Pack the nfsheur table more efficiently.

5/ Tolerate reordered RPCs to some small extent (initially suggested by
   Ellard and Seltzer).
6/ Back off from sequential access gradually rather than immediately
   switching to random access (Ellard and Seltzer).

7/ To avoid starvation of the buffer pool, call bwillwrite(). The call is
   issued after the VOP_WRITE in order to avoid additional reordering of
   write operations.

8/ Add sysctl variables vfs.nfsrv.cluster_writes and vfs.nfsrv.cluster_reads
   to enable or disable clustering. vfs.nfsrv.reordered_io counts the
   number of reordered RPCs.

9/ In nfsrv_commit, check for write errors and report them back to the
   client. Also check whether the RPC argument count is zero, which means
   that we must flush to the end of file according to the RFC.

10/ Two earlier commits broke the write gathering support:

    nfs_syscalls.c:1.71
	This change removed the NQNFS code but left the NQNFS variable
	"notstarted" behind. As a result, NFS write gathering was
	effectively disabled permanently (regardless of NFSv2 or NFSv3).

    nfs_syscalls.c:1.103
	This change disabled write gathering (again) for NFSv3, although
	this should be controlled by vfs.nfs.nfsrvw_procrastinate_v3 != 0.
	Write gathering may still be useful with NFSv3 to put reordered
	write RPCs back into order, perhaps also for other reasons. This
	is now possible again.

The attached patch is for current, but you will observe similar
improvements with earlier FreeBSD versions.

>How-To-Repeat:
Apply the patch; on the server, toggle vfs.nfsrv.cluster_writes=0,1.
On the client run (for example):

	dd if=/dev/zero of=/xxx/x.rm bs=32k count=10000

>Fix:
Apply this patch:

--- nfs_syscalls.c.orig	2007-10-12 03:56:27.000000000 +0000
+++ nfs_syscalls.c	2007-11-16 21:09:46.000000000 +0000
@@ -85,7 +85,6 @@
 
 int nfsd_waiting = 0;
 int nfsrv_numnfsd = 0;
-static int notstarted = 1;
 static int nfs_privport = 0;
 SYSCTL_INT(_vfs_nfsrv, NFS_NFSPRIVPORT, nfs_privport, CTLFLAG_RW,
 
@@ -452,9 +451,8 @@
 		else
 			procrastinate = nfsrvw_procrastinate;
 		NFSD_UNLOCK();
-		if (writes_todo || (!(nd->nd_flag & ND_NFSV3) &&
-		    nd->nd_procnum == NFSPROC_WRITE &&
-		    procrastinate > 0 && !notstarted))
+		if (writes_todo || (nd->nd_procnum == NFSPROC_WRITE &&
+		    procrastinate > 0))
 			error = nfsrv_writegather(&nd, slp,
 			    nfsd->nfsd_td, &mreq);
 		else
--- nfs_serv.c.orig	2007-10-18 16:38:07.000000000 +0000
+++ nfs_serv.c	2007-11-16 21:09:46.000000000 +0000
@@ -106,16 +106,20 @@
 
 #define MAX_COMMIT_COUNT (1024 * 1024)
 
-#define NUM_HEURISTIC 1017
+#define NUM_HEURISTIC 1031	/* Must be prime! */
+#define MAX_REORDERED_RPC 4
+#define HASH_MAXSTEP 0x3ff
 #define NHUSE_INIT 64
 #define NHUSE_INC 16
 #define NHUSE_MAX 2048
+#define NH_TAG(vp) ((uint32_t)((uintptr_t)vp / sizeof(struct vnode)))
+CTASSERT(NUM_HEURISTIC > (HASH_MAXSTEP + 1));
 
 static struct nfsheur {
-	struct vnode *nh_vp;	/* vp to match (unreferenced pointer) */
-	off_t nh_nextr;		/* next offset for sequential detection */
-	int nh_use;		/* use count for selection */
-	int nh_seqcount;	/* heuristic */
+	off_t nh_nextoff;	/* next offset for sequential detection */
+	uint32_t nh_tag;	/* vp tag to match */
+	uint16_t nh_use;	/* use count for selection */
+	uint16_t nh_seqcount;	/* in units of logical blocks */
 } nfsheur[NUM_HEURISTIC];
 
 /* Global vars */
@@ -130,10 +134,16 @@
 static int nfs_async;
 static int nfs_commit_blks;
 static int nfs_commit_miss;
+static int nfsrv_cluster_writes = 1;
+static int nfsrv_cluster_reads = 1;
+static int nfsrv_reordered_io;
 SYSCTL_INT(_vfs_nfsrv, OID_AUTO, async, CTLFLAG_RW, &nfs_async, 0, "");
 SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_blks, CTLFLAG_RW, &nfs_commit_blks, 0, "");
 SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_miss, CTLFLAG_RW, &nfs_commit_miss, 0, "");
+SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_writes, CTLFLAG_RW, &nfsrv_cluster_writes, 0, "");
+SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_reads, CTLFLAG_RW, &nfsrv_cluster_reads, 0, "");
+SYSCTL_INT(_vfs_nfsrv, OID_AUTO, reordered_io, CTLFLAG_RW, &nfsrv_reordered_io, 0, "");
 
 struct nfsrvstats nfsrvstats;
 SYSCTL_STRUCT(_vfs_nfsrv, NFS_NFSRVSTATS, nfsrvstats, CTLFLAG_RW,
 	&nfsrvstats, nfsrvstats, "S,nfsrvstats");
@@ -144,6 +154,75 @@
 	struct nfsrv_descript *);
 
 /*
+ * Detect sequential access so that we are able to hint the underlying
+ * file system to use clustered I/O when appropriate.
+ */
+static int
+nfsrv_sequential_access(const struct uio *uio, const struct vnode *vp)
+{
+	struct nfsheur *nh;
+	unsigned hi, step;
+	int try = 8;
+	int nblocks, lblocksize;
+
+	/*
+	 * Locate best nfsheur[] candidate using double hashing.
+	 */
+
+	hi = NH_TAG(vp) % NUM_HEURISTIC;
+	step = NH_TAG(vp) & HASH_MAXSTEP;
+	step++;			/* Step must not be zero. */
+	nh = &nfsheur[hi];
+
+	while (try--) {
+		if (nfsheur[hi].nh_tag == NH_TAG(vp)) {
+			nh = &nfsheur[hi];
+			break;
+		}
+		if (nfsheur[hi].nh_use > 0)
+			--nfsheur[hi].nh_use;
+		hi = hi + step;
+		if (hi >= NUM_HEURISTIC)
+			hi -= NUM_HEURISTIC;
+		if (nfsheur[hi].nh_use < nh->nh_use)
+			nh = &nfsheur[hi];
+	}
+
+	if (nh->nh_tag != NH_TAG(vp)) {	/* New entry. */
+		nh->nh_tag = NH_TAG(vp);
+		nh->nh_nextoff = uio->uio_offset;
+		nh->nh_use = NHUSE_INIT;
+		nh->nh_seqcount = 1;	/* Initially assume sequential access. */
+	} else {
+		nh->nh_use += NHUSE_INC;
+		if (nh->nh_use > NHUSE_MAX)
+			nh->nh_use = NHUSE_MAX;
+	}
+
+	/*
+	 * Calculate heuristic
+	 */
+
+	lblocksize = vp->v_mount->mnt_stat.f_iosize;
+	nblocks = howmany(uio->uio_resid, lblocksize);
+	if (uio->uio_offset == nh->nh_nextoff) {
+		nh->nh_seqcount += nblocks;
+		if (nh->nh_seqcount > IO_SEQMAX)
+			nh->nh_seqcount = IO_SEQMAX;
+	} else if (uio->uio_offset == 0) {
+		/* Seek to beginning of file, ignored. */
+	} else if (qabs(uio->uio_offset - nh->nh_nextoff) <=
+	    MAX_REORDERED_RPC*imax(lblocksize, uio->uio_resid)) {
+		nfsrv_reordered_io++;	/* Probably reordered RPC, do nothing. */
+	} else
+		nh->nh_seqcount /= 4;	/* Not sequential access. */
+
+	nh->nh_nextoff = uio->uio_offset + uio->uio_resid;
+
+	return (nh->nh_seqcount << IO_SEQSHIFT);
+}
+
+/*
  * Clear nameidata fields that are tested in nsfmout cleanup code prior
  * to using first nfsm macro (that might jump to the cleanup code).
  */
@@ -783,7 +862,6 @@
 	fhandle_t *fhp;
 	struct uio io, *uiop = &io;
 	struct vattr va, *vap = &va;
-	struct nfsheur *nh;
 	off_t off;
 	int ioflag = 0;
 	int vfslocked;
@@ -855,61 +933,6 @@
 	else
 		cnt = reqlen;
 
-	/*
-	 * Calculate seqcount for heuristic
-	 */
-
-	{
-		int hi;
-		int try = 32;
-
-		/*
-		 * Locate best candidate
-		 */
-
-		hi = ((int)(vm_offset_t)vp / sizeof(struct vnode)) % NUM_HEURISTIC;
-		nh = &nfsheur[hi];
-
-		while (try--) {
-			if (nfsheur[hi].nh_vp == vp) {
-				nh = &nfsheur[hi];
-				break;
-			}
-			if (nfsheur[hi].nh_use > 0)
-				--nfsheur[hi].nh_use;
-			hi = (hi + 1) % NUM_HEURISTIC;
-			if (nfsheur[hi].nh_use < nh->nh_use)
-				nh = &nfsheur[hi];
-		}
-
-		if (nh->nh_vp != vp) {
-			nh->nh_vp = vp;
-			nh->nh_nextr = off;
-			nh->nh_use = NHUSE_INIT;
-			if (off == 0)
-				nh->nh_seqcount = 4;
-			else
-				nh->nh_seqcount = 1;
-		}
-
-		/*
-		 * Calculate heuristic
-		 */
-
-		if ((off == 0 && nh->nh_seqcount > 0) || off == nh->nh_nextr) {
-			if (++nh->nh_seqcount > IO_SEQMAX)
-				nh->nh_seqcount = IO_SEQMAX;
-		} else if (nh->nh_seqcount > 1) {
-			nh->nh_seqcount = 1;
-		} else {
-			nh->nh_seqcount = 0;
-		}
-		nh->nh_use += NHUSE_INC;
-		if (nh->nh_use > NHUSE_MAX)
-			nh->nh_use = NHUSE_MAX;
-		ioflag |= nh->nh_seqcount << IO_SEQSHIFT;
-	}
-
 	nfsm_reply(NFSX_POSTOPORFATTR(v3) + 3 * NFSX_UNSIGNED+nfsm_rndup(cnt));
 	if (v3) {
 		tl = nfsm_build(u_int32_t *, NFSX_V3FATTR + 4 * NFSX_UNSIGNED);
@@ -967,9 +990,10 @@
 	uiop->uio_resid = len;
 	uiop->uio_rw = UIO_READ;
 	uiop->uio_segflg = UIO_SYSSPACE;
+	if (nfsrv_cluster_reads)
+		ioflag |= nfsrv_sequential_access(uiop, vp);
 	error = VOP_READ(vp, uiop, IO_NODELOCKED | ioflag, cred);
 	off = uiop->uio_offset;
-	nh->nh_nextr = off;
 	FREE((caddr_t)iv2, M_TEMP);
 	if (error || (getret = VOP_GETATTR(vp, vap, cred, td))) {
 		if (!error)
@@ -1175,6 +1199,8 @@
 	uiop->uio_segflg = UIO_SYSSPACE;
 	uiop->uio_td = NULL;
 	uiop->uio_offset = off;
+	if (nfsrv_cluster_writes)
+		ioflags |= nfsrv_sequential_access(uiop, vp);
 	error = VOP_WRITE(vp, uiop, ioflags, cred);
 	/* XXXRW: unlocked write. */
 	nfsrvstats.srvvop_writes++;
@@ -1223,6 +1249,7 @@
 	vput(vp);
 	vn_finished_write(mntp);
 	VFS_UNLOCK_GIANT(vfslocked);
+	bwillwrite();	/* After VOP_WRITE to avoid reordering. */
 	return(error);
 }
 
@@ -1489,6 +1516,8 @@
 			mvfslocked = VFS_LOCK_GIANT(mntp);
 		}
 		if (!error) {
+			if (nfsrv_cluster_writes)
+				ioflags |= nfsrv_sequential_access(uiop, vp);
 			error = VOP_WRITE(vp, uiop, ioflags, cred);
 			/* XXXRW: unlocked write. */
 			nfsrvstats.srvvop_writes++;
@@ -1579,6 +1608,7 @@
 		break;
 	}
 	splx(s);
+	bwillwrite();	/* After VOP_WRITE to avoid reordering. */
 	return (0);
 }
 
@@ -3824,7 +3854,11 @@
 	}
 	for_ret = VOP_GETATTR(vp, &bfor, cred, td);
 
-	if (cnt > MAX_COMMIT_COUNT) {
+	/*
+	 * If count is 0, a flush from offset to the end of file
+	 * should be performed according to RFC 1813.
+	 */
+	if (cnt == 0 || cnt > MAX_COMMIT_COUNT) {
 		/*
 		 * Give up and do the whole thing
 		 */
@@ -3868,7 +3902,7 @@
 
 	s = splbio();
 	VI_LOCK(vp);
-	while (cnt > 0) {
+	while (!error && cnt > 0) {
 		struct buf *bp;
 
 		/*
@@ -3891,7 +3925,7 @@
 			    B_DELWRI) {
 				bremfree(bp);
 				bp->b_flags &= ~B_ASYNC;
-				bwrite(bp);
+				error = bwrite(bp);
 				++nfs_commit_miss;
 			} else
 				BUF_UNLOCK(bp);

>Release-Note:
>Audit-Trail:
>Unformatted: