From: Bruce Evans <brde@optusnet.com.au>
Date: Fri, 7 Jun 2013 05:28:11 +1000 (EST)
To: fs@freebsd.org
Subject: missed clustering for small block sizes in cluster_wbuild()
Message-ID: <20130607044845.O24441@besplex.bde.org>

Would someone please fix this fairly small bug in cluster_wbuild().
When the fs block size is 512 bytes and everything is contiguous,
cluster_write() prepares a complete cluster, but cluster_wbuild()
splits it into two, with a runt of size PAGE_SIZE - 512 bytes for the
second part.

From vfs_cluster.c:

@ 	/*
@ 	 * From this location in the file, scan forward to see
@ 	 * if there are buffers with adjacent data that need to
@ 	 * be written as well.
@ 	 */
@ 	for (i = 0; i < len; ++i, ++start_lbn) {
@ 		if (i != 0) { /* If not the first buffer */
@ 			s = splbio();
@ 			/*
@ 			 * If the adjacent data is not even in core it
@ 			 * can't need to be written.
@ 			 */
@ 			VI_LOCK(vp);
@ 			if ((tbp = gbincore(vp, start_lbn)) == NULL ||
@ 			    (tbp->b_vflags & BV_BKGRDINPROG)) {
@ 				VI_UNLOCK(vp);
@ 				splx(s);
@ 				break;
@ 			}
@
@ 			/*
@ 			 * If it IS in core, but has different
@ 			 * characteristics, or is locked (which
@ 			 * means it could be undergoing a background
@ 			 * I/O or be in a weird state), then don't
@ 			 * cluster with it.
@ 			 */
@ 			if (BUF_LOCK(tbp,
@ 			    LK_EXCLUSIVE | LK_NOWAIT | LK_INTERLOCK,
@ 			    VI_MTX(vp))) {
@ 				splx(s);
@ 				break;
@ 			}
@
@ 			if ((tbp->b_flags & (B_VMIO | B_CLUSTEROK |
@ 			    B_INVAL | B_DELWRI | B_NEEDCOMMIT))
@ 			    != (B_DELWRI | B_CLUSTEROK |
@ 			    (bp->b_flags & (B_VMIO | B_NEEDCOMMIT))) ||
@ 			    tbp->b_wcred != bp->b_wcred) {
@ 				BUF_UNLOCK(tbp);
@ 				splx(s);
@ 				break;
@ 			}
@
@ 			/*
@ 			 * Check that the combined cluster
@ 			 * would make sense with regard to pages
@ 			 * and would not be too large
@ 			 */
@ 			if ((tbp->b_bcount != size) ||
@ 			    ((bp->b_blkno + (dbsize * i)) !=
@ 			    tbp->b_blkno) ||
@ 			    ((tbp->b_npages + bp->b_npages) >
@ 			    (vp->v_mount->mnt_iosize_max / PAGE_SIZE))) {

This page count check is wrong.  Usually mnt_iosize_max is 128K and
PAGE_SIZE is 4K.  This gives a limit of 32 pages, which is normally
enough to hold 256 512-blocks (very sparsely and wastefully mapped,
but that is another, much larger bug).  The pages up to the 31st are
built up right and hold 248 512-blocks.  Then bp->b_npages is 31 and
tbp->b_npages is 1.  The sum is 32, so the 32nd page is started right
and the 249th 512-block is allocated in it.  For the next 512-block,
bp->b_npages is 32, so the sum is 33 and we break out of the loop
prematurely, although that block belongs in the already-counted 32nd
page and needs no new page at all.  The buffer containing 249
512-blocks is written out, then cluster_wbuild() continues and builds
and writes out a runt buffer containing the remaining 7 512-blocks.
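A corrected check would have to count only the pages that tbp really
adds: tbp's first page may be the very page that bp ends with, which
is exactly the case that the merge loop further down detects when it
skips the duplicate.  A rough, untested sketch along those lines
(new_pages is a new local, not in the current code):

	int new_pages;		/* new local; declare with the others */

	/*
	 * Untested sketch: tbp's first page is not a new page if it
	 * is the same as bp's last page, as it is for 7 of every 8
	 * 512-blocks.
	 */
	new_pages = tbp->b_npages;
	if (bp->b_npages != 0 && tbp->b_npages != 0 &&
	    bp->b_pages[bp->b_npages - 1] == tbp->b_pages[0])
		new_pages--;
	if ((tbp->b_bcount != size) ||
	    ((bp->b_blkno + (dbsize * i)) != tbp->b_blkno) ||
	    ((bp->b_npages + new_pages) >
	    (vp->v_mount->mnt_iosize_max / PAGE_SIZE))) {

With that, the 250th 512-block gives new_pages == 0 and a sum of 32,
so the loop can go on to fill all 32 pages with 256 512-blocks
instead of breaking at 249.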
For a quick fix, I just removed this check.  cluster_write() should
never prepare a cluster too large to write out in 1 step.  It uses
mnt_stat.f_iosize instead of mnt_iosize_max for the limit, and that
should not be larger.  But there may be a special case where the
buffer contents are not aligned; then an extra page might be needed.

@ 				BUF_UNLOCK(tbp);
@ 				splx(s);
@ 				break;
@ 			}
@ 			/*
@ 			 * Ok, it's passed all the tests,
@ 			 * so remove it from the free list
@ 			 * and mark it busy. We will use it.
@ 			 */
@ 			bremfree(tbp);
@ 			tbp->b_flags &= ~B_DONE;
@ 			splx(s);
@ 		} /* end of code for non-first buffers only */
@ 		/* check for latent dependencies to be handled */
@ 		if ((LIST_FIRST(&tbp->b_dep)) != NULL) {
@ 			tbp->b_iocmd = BIO_WRITE;
@ 			buf_start(tbp);
@ 		}
@ 		/*
@ 		 * If the IO is via the VM then we do some
@ 		 * special VM hackery (yuck). Since the buffer's
@ 		 * block size may not be page-aligned it is possible
@ 		 * for a page to be shared between two buffers. We
@ 		 * have to get rid of the duplication when building
@ 		 * the cluster.
@ 		 */
@ 		if (tbp->b_flags & B_VMIO) {
@ 			vm_page_t m;
@
@ 			if (i != 0) { /* if not first buffer */
@ 				for (j = 0; j < tbp->b_npages; j += 1) {
@ 					m = tbp->b_pages[j];
@ 					if (m->flags & PG_BUSY) {
@ 						bqrelse(tbp);
@ 						goto finishcluster;
@ 					}
@ 				}
@ 			}
@ 			if (tbp->b_object != NULL)
@ 				VM_OBJECT_LOCK(tbp->b_object);
@ 			vm_page_lock_queues();
@ 			for (j = 0; j < tbp->b_npages; j += 1) {
@ 				m = tbp->b_pages[j];
@ 				vm_page_io_start(m);
@ 				vm_object_pip_add(m->object, 1);
@ 				if ((bp->b_npages == 0) ||
@ 				    (bp->b_pages[bp->b_npages - 1] != m)) {
@ 					bp->b_pages[bp->b_npages] = m;
@ 					bp->b_npages++;

The number of new pages needed is the number of increments here,
which I think is 1 less than tbp->b_npages in most cases where the
runt buffer is built.  I think this is best fixed by removing the
check above and checking here, then backing out of the changes if
the limit would be exceeded.  I don't know this code well enough to
write the backing out easily, but a backout-free variant is sketched
at the end of this mail.

@ 				}
@ 			}
@ 			vm_page_unlock_queues();
@ 			if (tbp->b_object != NULL)
@ 				VM_OBJECT_UNLOCK(tbp->b_object);
@ 		}
@ 		bp->b_bcount += size;
@ 		bp->b_bufsize += size;
@
@ 		s = splbio();
@ 		bundirty(tbp);
@ 		tbp->b_flags &= ~B_DONE;
@ 		tbp->b_ioflags &= ~BIO_ERROR;
@ 		tbp->b_flags |= B_ASYNC;
@ 		tbp->b_iocmd = BIO_WRITE;
@ 		reassignbuf(tbp, tbp->b_vp);	/* put on clean list */
@ 		VI_LOCK(tbp->b_vp);
@ 		++tbp->b_vp->v_numoutput;
@ 		VI_UNLOCK(tbp->b_vp);
@ 		splx(s);
@ 		BUF_KERNPROC(tbp);
@ 		TAILQ_INSERT_TAIL(&bp->b_cluster.cluster_head,
@ 		    tbp, b_cluster.cluster_entry);
@ 	}

The loss of i/o performance from this wasn't too bad on my disks.
The bug mainly made it harder to see the size of other clustering
bugs by watching iostat output.

Bruce
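PS: a rough, untested sketch of the check-at-merge-time variant.  To
avoid having to back anything out, count the genuinely new pages in
a pre-pass over tbp's pages, before any page state is changed, and
take the same bqrelse()/finishcluster exit that the PG_BUSY pre-check
uses (new_pages is again a new local; m and j are the existing ones):

	int new_pages;		/* new local; declare with the others */

	/*
	 * Untested sketch, to go in the "if (i != 0)" branch next to
	 * the PG_BUSY pre-check: count the pages of tbp that are not
	 * already at the tail of bp.  The pages are in order, so only
	 * tbp's first page can duplicate bp's last one, and new_pages
	 * is tbp->b_npages or tbp->b_npages - 1.
	 */
	new_pages = 0;
	for (j = 0; j < tbp->b_npages; j++) {
		m = tbp->b_pages[j];
		if (bp->b_npages == 0 ||
		    bp->b_pages[bp->b_npages - 1] != m)
			new_pages++;
	}
	if (bp->b_npages + new_pages >
	    vp->v_mount->mnt_iosize_max / PAGE_SIZE) {
		bqrelse(tbp);
		goto finishcluster;
	}

Since nothing has been merged into bp when this runs, there is
nothing to undo; rejecting tbp here just writes out the cluster built
so far, exactly like the PG_BUSY case.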