Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 12 Dec 2001 22:34:17 -0800 (PST)
From:      Geoff Mohler <gemohler@www.speedtoys.com>
To:        Matthew Dillon <dillon@apollo.backplane.com>
Cc:        Jordan Hubbard <jkh@winston.freebsd.org>, Peter Wemm <peter@wemm.org>, Mike Smith <msmith@FreeBSD.ORG>, hackers@FreeBSD.ORG, msmith@mass.dis.org
Subject:   Re: Found NFS data corruption bug... (was Re: NFS: How to make FreeBSD fall on its face in one easy step )
Message-ID:  <Pine.BSF.4.10.10112122233540.11838-100000@speedracer.speedtoys.com>
In-Reply-To: <200112130608.fBD689K49906@apollo.backplane.com>

next in thread | previous in thread | raw e-mail | index | archive | help
Are any of these client-side performance upgrades as well as bug fixes?

On Wed, 12 Dec 2001, Matthew Dillon wrote:

>     Ok, here is the latest patch for -stable.  Note that Kirk comitted a
>     slightly modified version of the softupdates fix to -current already
>     (the VOP_FSYNC stuff), which I will be MFCing in 3 days.
> 
>     This still doesn't fix all the problems the nfstest program that Jordan
>     posted finds, but it sure runs a hellofalot longer now before reporting
>     an error.  10,000+ tests now before failing (NFSv2 and NFSv3).
> 
>     Bugs fixed:
> 
> 	* Possible SMP database corruption due to vm_pager_unmap_page()
> 	  not clearing the TLB for the other cpu's.
> 
> 	* When flusing a dirty buffer due to B_CACHE getting cleared,
> 	  we were accidently setting B_CACHE again (that is, bwrite() sets
> 	  B_CACHE), when we really want it to stay clear after the write
> 	  is complete.  This resulted in a corrupt buffer.
> 
> 	* We have to call vtruncbuf() when ftruncate()ing to remove 
> 	  any buffer cache buffers.  This is still tentitive, I may
> 	  be able to remove it due to the second bug fix.
> 
> 	* vnode_pager_setsize() race against nfs_vinvalbuf()... we have
> 	  to set n_size before calling nfs_vinvalbuf or the NFS code
> 	  may recursively vnode_pager_setsize() to the original value
> 	  before the truncate.  This is what was causing the user mmap
> 	  bus faults in the nfs tester program.
> 
> 	* Fix to softupdates (old version)
> 
>     There are some general comments in there too.  After I do more tests
>     and cleanups (maybe remove the vtruncbuf()) I will port it all to
>     -current, test, and commit.  So far the stuff is simple enough that
>     a 3-day MFC will probably suffice.
> 
>     All I can say is... holy shit!
> 
> 						-Matt
> 
> 
> Index: kern/vfs_bio.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
> retrieving revision 1.242.2.13
> diff -u -r1.242.2.13 vfs_bio.c
> --- kern/vfs_bio.c	2001/11/18 07:10:59	1.242.2.13
> +++ kern/vfs_bio.c	2001/12/13 05:52:55
> @@ -2189,9 +2189,22 @@
>  		 * to softupdates re-dirtying the buffer.  In the latter
>  		 * case, B_CACHE is set after the first write completes,
>  		 * preventing further loops.
> +		 *
> +		 * NOTE!  b*write() sets B_CACHE.  If we cleared B_CACHE
> +		 * above while extending the buffer, we cannot allow the
> +		 * buffer to remain with B_CACHE set after the write
> +		 * completes or it will represent a corrupt state.  To
> +		 * deal with this we set B_NOCACHE to scrap the buffer
> +		 * after the write.
> +		 *
> +		 * We might be able to do something fancy, like setting
> +		 * B_CACHE in bwrite() except if B_DELWRI is already set,
> +		 * so the below call doesn't set B_CACHE, but that gets real
> +		 * confusing.  This is much easier.
>  		 */
>  
>  		if ((bp->b_flags & (B_CACHE|B_DELWRI)) == B_DELWRI) {
> +			bp->b_flags |= B_NOCACHE;
>  			VOP_BWRITE(bp->b_vp, bp);
>  			goto loop;
>  		}
> Index: nfs/nfs_bio.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/nfs/Attic/nfs_bio.c,v
> retrieving revision 1.83.2.1
> diff -u -r1.83.2.1 nfs_bio.c
> --- nfs/nfs_bio.c	2001/11/18 07:10:59	1.83.2.1
> +++ nfs/nfs_bio.c	2001/12/13 05:52:10
> @@ -193,8 +193,14 @@
>  			vm_page_set_validclean(m, 0, size - toff);
>  			/* handled by vm_fault now	  */
>  			/* vm_page_zero_invalid(m, TRUE); */
> +		} else {
> +			/*
> +			 * Read operation was short.  If no error occured
> +			 * we may have hit a zero-fill section.   We simply
> +			 * leave valid set to 0.
> +			 */
> +			;
>  		}
> -		
>  		if (i != ap->a_reqpage) {
>  			/*
>  			 * Whether or not to leave the page activated is up in
> @@ -896,14 +902,12 @@
>  				else
>  					bcount = np->n_size - (off_t)lbn * biosize;
>  			}
> -
> -			bp = nfs_getcacheblk(vp, lbn, bcount, p);
> -
>  			if (uio->uio_offset + n > np->n_size) {
>  				np->n_size = uio->uio_offset + n;
>  				np->n_flag |= NMODIFIED;
>  				vnode_pager_setsize(vp, np->n_size);
>  			}
> +			bp = nfs_getcacheblk(vp, lbn, bcount, p);
>  		}
>  
>  		if (!bp) {
> @@ -1409,11 +1413,13 @@
>  	    io.iov_len = uiop->uio_resid = bp->b_bcount;
>  	    io.iov_base = bp->b_data;
>  	    uiop->uio_rw = UIO_READ;
> +
>  	    switch (vp->v_type) {
>  	    case VREG:
>  		uiop->uio_offset = ((off_t)bp->b_blkno) * DEV_BSIZE;
>  		nfsstats.read_bios++;
>  		error = nfs_readrpc(vp, uiop, cr);
> +
>  		if (!error) {
>  		    if (uiop->uio_resid) {
>  			/*
> @@ -1425,7 +1431,7 @@
>  			 * writes, but that is not possible any longer.
>  			 */
>  			int nread = bp->b_bcount - uiop->uio_resid;
> -			int left  = bp->b_bcount - nread;
> +			int left  = uiop->uio_resid;
>  
>  			if (left > 0)
>  				bzero((char *)bp->b_data + nread, left);
> Index: nfs/nfs_vnops.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/nfs/Attic/nfs_vnops.c,v
> retrieving revision 1.150.2.4
> diff -u -r1.150.2.4 nfs_vnops.c
> --- nfs/nfs_vnops.c	2001/08/05 00:23:58	1.150.2.4
> +++ nfs/nfs_vnops.c	2001/12/13 04:23:51
> @@ -710,7 +710,22 @@
>  			 */
>  			if (vp->v_mount->mnt_flag & MNT_RDONLY)
>  				return (EROFS);
> -			vnode_pager_setsize(vp, vap->va_size);
> +
> +			/*
> +			 * We run vnode_pager_setsize() early (why?),
> +			 * we must set np->n_size now to avoid vinvalbuf
> +			 * V_SAVE races that might setsize a lower
> +			 * value.
> +			 */
> + 			tsize = np->n_size;
> + 			np->n_size = vap->va_size;
> +#if 1
> +			if (np->n_size < tsize)
> +			    error = vtruncbuf(vp, ap->a_cred, ap->a_p, vap->va_size, vp->v_mount->mnt_stat.f_iosize);
> +			else
> +#endif
> +			    vnode_pager_setsize(vp, vap->va_size);
> +
>   			if (np->n_flag & NMODIFIED) {
>   			    if (vap->va_size == 0)
>   				error = nfs_vinvalbuf(vp, 0,
> @@ -719,12 +734,12 @@
>   				error = nfs_vinvalbuf(vp, V_SAVE,
>   					ap->a_cred, ap->a_p, 1);
>   			    if (error) {
> +				np->n_size = tsize;
>  				vnode_pager_setsize(vp, np->n_size);
>   				return (error);
>  			    }
>   			}
> - 			tsize = np->n_size;
> - 			np->n_size = np->n_vattr.va_size = vap->va_size;
> +			np->n_vattr.va_size = vap->va_size;
>    		};
>    	} else if ((vap->va_mtime.tv_sec != VNOVAL ||
>  		vap->va_atime.tv_sec != VNOVAL) && (np->n_flag & NMODIFIED) &&
> @@ -1119,10 +1134,12 @@
>  		m_freem(mrep);
>  		tsiz -= retlen;
>  		if (v3) {
> -			if (eof || retlen == 0)
> +			if (eof || retlen == 0) {
>  				tsiz = 0;
> -		} else if (retlen < len)
> +			}
> +		} else if (retlen < len) {
>  			tsiz = 0;
> +		}
>  	}
>  nfsmout:
>  	return (error);
> Index: ufs/ffs/ffs_inode.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_inode.c,v
> retrieving revision 1.56.2.3
> diff -u -r1.56.2.3 ffs_inode.c
> --- ufs/ffs/ffs_inode.c	2001/11/20 20:27:27	1.56.2.3
> +++ ufs/ffs/ffs_inode.c	2001/12/12 23:43:36
> @@ -244,6 +244,10 @@
>  		if (error) {
>  			return (error);
>  		}
> +		if (length > 0 && DOINGSOFTDEP(ovp)) {
> +			if ((error = VOP_FSYNC(ovp, cred, MNT_WAIT, p)) != 0)
> +				return (error);
> +		}
>  		oip->i_size = length;
>  		size = blksize(fs, oip, lbn);
>  		if (ovp->v_type != VDIR)
> @@ -359,6 +363,10 @@
>  		if (newspace == 0)
>  			panic("ffs_truncate: newspace");
>  		if (oldspace - newspace > 0) {
> +			/*
> +			 * XXX Softupdates, adjust queued directblock
> +			 *     in place rather then the second FSYNC above?
> +			 */
>  			/*
>  			 * Block number of space to be free'd is
>  			 * the old block # plus the number of frags
> Index: vm/vnode_pager.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/vm/vnode_pager.c,v
> retrieving revision 1.116.2.5
> diff -u -r1.116.2.5 vnode_pager.c
> --- vm/vnode_pager.c	2001/11/09 03:21:22	1.116.2.5
> +++ vm/vnode_pager.c	2001/12/13 02:49:39
> @@ -310,6 +310,20 @@
>  				vm_pager_unmap_page(kva);
>  
>  				/*
> +				 * XXX work around SMP data integrity race
> +				 * by unmapping the page from user processes.
> +				 * The garbage we just cleared may be mapped
> +				 * to a user process running on another cpu
> +				 * and this code is not running through normal
> +				 * I/O channels which handle SMP issues for
> +				 * us, so unmap page to synchronize all cpus.
> +				 *
> +				 * XXX should vm_pager_unmap_page() have
> +				 * dealt with this?
> +				 */
> +				vm_page_protect(m, VM_PROT_NONE);
> +
> +				/*
>  				 * Clear out partial-page dirty bits.  This
>  				 * has the side effect of setting the valid
>  				 * bits, but that is ok.  There are a bunch
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-hackers" in the body of the message
> 

---
Geoff Mohler


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?Pine.BSF.4.10.10112122233540.11838-100000>