From: Bruce Evans <brde@optusnet.com.au>
Date: Fri, 7 Jun 2013 05:28:11 +1000 (EST)
To: fs@freebsd.org
Subject: missed clustering for small block sizes in cluster_wbuild()
Message-ID: <20130607044845.O24441@besplex.bde.org>

Would someone please fix this fairly small bug in cluster_wbuild().
When the fs block size is 512 bytes and everything is contiguous,
cluster_write() prepares a complete cluster, but cluster_wbuild()
splits it into two, with a runt of size PAGE_SIZE - 512 bytes for the
second part.

From vfs_cluster.c:

@ 	/*
@ 	 * From this location in the file, scan forward to see
@ 	 * if there are buffers with adjacent data that need to
@ 	 * be written as well.
@ 	 */
@ 	for (i = 0; i < len; ++i, ++start_lbn) {
@ 		if (i != 0) { /* If not the first buffer */
@ 			s = splbio();
@ 			/*
@ 			 * If the adjacent data is not even in core it
@ 			 * can't need to be written.
@ 			 */
@ 			VI_LOCK(vp);
@ 			if ((tbp = gbincore(vp, start_lbn)) == NULL ||
@ 			    (tbp->b_vflags & BV_BKGRDINPROG)) {
@ 				VI_UNLOCK(vp);
@ 				splx(s);
@ 				break;
@ 			}
@
@ 			/*
@ 			 * If it IS in core, but has different
@ 			 * characteristics, or is locked (which
@ 			 * means it could be undergoing a background
@ 			 * I/O or be in a weird state), then don't
@ 			 * cluster with it.
@ 			 */
@ 			if (BUF_LOCK(tbp,
@ 			    LK_EXCLUSIVE | LK_NOWAIT | LK_INTERLOCK,
@ 			    VI_MTX(vp))) {
@ 				splx(s);
@ 				break;
@ 			}
@
@ 			if ((tbp->b_flags & (B_VMIO | B_CLUSTEROK |
@ 			    B_INVAL | B_DELWRI | B_NEEDCOMMIT))
@ 			    != (B_DELWRI | B_CLUSTEROK |
@ 			    (bp->b_flags & (B_VMIO | B_NEEDCOMMIT))) ||
@ 			    tbp->b_wcred != bp->b_wcred) {
@ 				BUF_UNLOCK(tbp);
@ 				splx(s);
@ 				break;
@ 			}
@
@ 			/*
@ 			 * Check that the combined cluster
@ 			 * would make sense with regard to pages
@ 			 * and would not be too large
@ 			 */
@ 			if ((tbp->b_bcount != size) ||
@ 			    ((bp->b_blkno + (dbsize * i)) !=
@ 			    tbp->b_blkno) ||
@ 			    ((tbp->b_npages + bp->b_npages) >
@ 			    (vp->v_mount->mnt_iosize_max / PAGE_SIZE))) {

This page count check is wrong.  Usually mnt_iosize_max is 128K and
PAGE_SIZE is 4K.  This gives a limit of 32 pages, which is normally
enough to hold 256 512-blocks (very sparsely and wastefully mapped,
but that is another, much larger bug).  The pages up to the 31st are
built up right and hold 248 512-blocks.  Then bp->b_npages is 31 and
tbp->b_npages is 1.  The sum is 32, so the 32nd page is started right
and the 249th 512-block is allocated in it.  For the next 512-block,
bp->b_npages is 32, so the sum is 33 and we break out of the loop
prematurely, although that block belongs in the already-counted 32nd
page and needs no new page at all.  The buffer containing 249
512-blocks is written out, then cluster_wbuild() continues and builds
and writes out a runt buffer containing the remaining 7 512-blocks.
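A corrected check would have to count only the pages that tbp really
adds: tbp's first page may be the very page that bp ends with, which
is exactly the case that the merge loop further down detects when it
skips the duplicate.  A rough, untested sketch along those lines
(new_pages is a new local, not in the current code):

	int new_pages;		/* new local; declare with the others */

	/*
	 * Untested sketch: tbp's first page is not a new page if it
	 * is the same as bp's last page, as it is for 7 of every 8
	 * 512-blocks.
	 */
	new_pages = tbp->b_npages;
	if (bp->b_npages != 0 && tbp->b_npages != 0 &&
	    bp->b_pages[bp->b_npages - 1] == tbp->b_pages[0])
		new_pages--;
	if ((tbp->b_bcount != size) ||
	    ((bp->b_blkno + (dbsize * i)) != tbp->b_blkno) ||
	    ((bp->b_npages + new_pages) >
	    (vp->v_mount->mnt_iosize_max / PAGE_SIZE))) {

With that, the 250th 512-block gives new_pages == 0 and a sum of 32,
so the loop can go on to fill all 32 pages with 256 512-blocks
instead of breaking at 249.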
For a quick fix, I just removed this check.  cluster_write() should
never prepare a cluster too large to write out in 1 step.  It uses
mnt_stat.f_iosize instead of mnt_iosize_max for the limit, and that
should not be larger.  But there may be a special case where the
buffer contents are not aligned; then an extra page might be needed.

@ 				BUF_UNLOCK(tbp);
@ 				splx(s);
@ 				break;
@ 			}
@ 			/*
@ 			 * Ok, it's passed all the tests,
@ 			 * so remove it from the free list
@ 			 * and mark it busy. We will use it.
@ 			 */
@ 			bremfree(tbp);
@ 			tbp->b_flags &= ~B_DONE;
@ 			splx(s);
@ 		} /* end of code for non-first buffers only */
@ 		/* check for latent dependencies to be handled */
@ 		if ((LIST_FIRST(&tbp->b_dep)) != NULL) {
@ 			tbp->b_iocmd = BIO_WRITE;
@ 			buf_start(tbp);
@ 		}
@ 		/*
@ 		 * If the IO is via the VM then we do some
@ 		 * special VM hackery (yuck). Since the buffer's
@ 		 * block size may not be page-aligned it is possible
@ 		 * for a page to be shared between two buffers. We
@ 		 * have to get rid of the duplication when building
@ 		 * the cluster.
@ 		 */
@ 		if (tbp->b_flags & B_VMIO) {
@ 			vm_page_t m;
@
@ 			if (i != 0) { /* if not first buffer */
@ 				for (j = 0; j < tbp->b_npages; j += 1) {
@ 					m = tbp->b_pages[j];
@ 					if (m->flags & PG_BUSY) {
@ 						bqrelse(tbp);
@ 						goto finishcluster;
@ 					}
@ 				}
@ 			}
@ 			if (tbp->b_object != NULL)
@ 				VM_OBJECT_LOCK(tbp->b_object);
@ 			vm_page_lock_queues();
@ 			for (j = 0; j < tbp->b_npages; j += 1) {
@ 				m = tbp->b_pages[j];
@ 				vm_page_io_start(m);
@ 				vm_object_pip_add(m->object, 1);
@ 				if ((bp->b_npages == 0) ||
@ 				    (bp->b_pages[bp->b_npages - 1] != m)) {
@ 					bp->b_pages[bp->b_npages] = m;
@ 					bp->b_npages++;

The number of new pages needed is the number of increments here,
which I think is 1 less than tbp->b_npages in most cases where the
runt buffer is built.  I think this is best fixed by removing the
check above and checking here, then backing out of the changes if
the limit would be exceeded.  I don't know this code well enough to
write the backing out easily, but a backout-free variant is sketched
at the end of this mail.

@ 				}
@ 			}
@ 			vm_page_unlock_queues();
@ 			if (tbp->b_object != NULL)
@ 				VM_OBJECT_UNLOCK(tbp->b_object);
@ 		}
@ 		bp->b_bcount += size;
@ 		bp->b_bufsize += size;
@
@ 		s = splbio();
@ 		bundirty(tbp);
@ 		tbp->b_flags &= ~B_DONE;
@ 		tbp->b_ioflags &= ~BIO_ERROR;
@ 		tbp->b_flags |= B_ASYNC;
@ 		tbp->b_iocmd = BIO_WRITE;
@ 		reassignbuf(tbp, tbp->b_vp);	/* put on clean list */
@ 		VI_LOCK(tbp->b_vp);
@ 		++tbp->b_vp->v_numoutput;
@ 		VI_UNLOCK(tbp->b_vp);
@ 		splx(s);
@ 		BUF_KERNPROC(tbp);
@ 		TAILQ_INSERT_TAIL(&bp->b_cluster.cluster_head,
@ 		    tbp, b_cluster.cluster_entry);
@ 	}

The loss of i/o performance from this wasn't too bad on my disks.
The bug mainly made it harder to see the size of other clustering
bugs by watching iostat output.

Bruce
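PS: a rough, untested sketch of the check-at-merge-time variant.  To
avoid having to back anything out, count the genuinely new pages in
a pre-pass over tbp's pages, before any page state is changed, and
take the same bqrelse()/finishcluster exit that the PG_BUSY pre-check
uses (new_pages is again a new local; m and j are the existing ones):

	int new_pages;		/* new local; declare with the others */

	/*
	 * Untested sketch, to go in the "if (i != 0)" branch next to
	 * the PG_BUSY pre-check: count the pages of tbp that are not
	 * already at the tail of bp.  The pages are in order, so only
	 * tbp's first page can duplicate bp's last one, and new_pages
	 * is tbp->b_npages or tbp->b_npages - 1.
	 */
	new_pages = 0;
	for (j = 0; j < tbp->b_npages; j++) {
		m = tbp->b_pages[j];
		if (bp->b_npages == 0 ||
		    bp->b_pages[bp->b_npages - 1] != m)
			new_pages++;
	}
	if (bp->b_npages + new_pages >
	    vp->v_mount->mnt_iosize_max / PAGE_SIZE) {
		bqrelse(tbp);
		goto finishcluster;
	}

Since nothing has been merged into bp when this runs, there is
nothing to undo; rejecting tbp here just writes out the cluster built
so far, exactly like the PG_BUSY case.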