From owner-svn-src-head@freebsd.org  Wed Aug 17 11:11:01 2016
Return-Path: <owner-svn-src-head@freebsd.org>
Delivered-To: svn-src-head@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id C880DBBB5DE;
 Wed, 17 Aug 2016 11:11:01 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au
 [211.29.132.249])
 by mx1.freebsd.org (Postfix) with ESMTP id 750291484;
 Wed, 17 Aug 2016 11:11:01 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au
 (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109])
 by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 34ED1104750F;
 Wed, 17 Aug 2016 20:40:59 +1000 (AEST)
Date: Wed, 17 Aug 2016 20:40:58 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
cc: Jilles Tjoelker <jilles@stack.nl>, src-committers@freebsd.org, 
 svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r304180 - head/sys/ufs/ffs
In-Reply-To: <20160817081055.GM83214@kib.kiev.ua>
Message-ID: <20160817191239.A3117@besplex.bde.org>
References: <201608151922.u7FJMOmT099410@repo.freebsd.org>
 <20160816215355.GB12199@stack.nl> <20160817081055.GM83214@kib.kiev.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.1 cv=VIkg5I7X c=1 sm=1 tr=0
 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10
 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=I9sqMnmzHltNhR3enRgA:9
 a=kSGpwUHVDBIaL0fO:21 a=A_BSZxN6wfCI-H8q:21 a=CjuIK1q_8ugA:10
X-BeenThere: svn-src-head@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SVN commit messages for the src tree for head/-current
 <svn-src-head.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/svn-src-head>,
 <mailto:svn-src-head-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-head/>
List-Post: <mailto:svn-src-head@freebsd.org>
List-Help: <mailto:svn-src-head-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/svn-src-head>,
 <mailto:svn-src-head-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 17 Aug 2016 11:11:01 -0000

On Wed, 17 Aug 2016, Konstantin Belousov wrote:

> On Tue, Aug 16, 2016 at 11:53:55PM +0200, Jilles Tjoelker wrote:
>> Hmm, some people interpret POSIX differently than I do, but I think it
>> is clear that XBD 3.384 Synchronized I/O Data Integrity Completion
>> requires the indirect blocks to be written (because "all file system
>> information required to retrieve the data is successfully transferred").
>> The Linux man page matches this interpretation except that an fsync of
>> the parent directory may be required.

All related directories must be synced.  It is almost useless to sync
the data if the data cannot be found later using normal file system
operations.  Not quite useless if the file can be recovered by fsck
and saved in lost+found, but the metadata needed to guarantee that
that works probably requires a lot of writes anyway.

>> If the whole file is being considered, then any change to indirect
>> blocks is needed to access some data (because the file was extended, a
>> hole was filled or snapshotted data was overwritten). For other
>> overwrites, there is no change to indirect blocks and no conflict with
>> fdatasync's performance objectives.
> Note that the same argument is applicable to the inode block:  the di_db
> and di_ib pointers to the direct blocks and to root indirect blocks are
> required to retrieve the written data.

I would never have guessed that you left out either.

> ...
>
>> Ideally, a changed i_size would also cause the inode to be written, but
>> I don't know how to determine that (there is no i_flag for it). Always
>> writing the inode would defeat the point of fdatasync.

My version does a bcmp() of the in-core copy with the disk copy to avoid
null changes.  This turned out to be not very useful.  It was intended
mainly to avoid writing null changes for timestamps.  By I already
avoid most timestamp changes by mounting with -noatime and not having
large data.  I use only seconds granularity for timestamps.  With
nanoseconds granularity, any change to a file is guaranteed to change
its in-core inode, so the bcmp() wouldn't work.  But timestamps are about
the only metadata that obviously doesn't need syncing.  Especially atimes.
POSIX doesn't allow turning off atimes, but it also doesn't require
syncing them before the system crashes except probably using FSYNC of
metadata.

There is an i_flag for critical metadata like i_size.  It is IN_MODIFIED.
This flag is often mismanaged.  Many places set only IN_CHANGE after making
a critical change.  ufs_chown() is one such place.  Ownerships are fairly
critical metadata.  They are not considered critical enough to sync
immediately, but they should be synced by FSYNC of metadata.  But the
implementation is to set only IN_CHANGE.  i_ctime is usually updated much
later and IN_MODIFIED is set then to say that the relatively unimportant
i_ctime has been modified.  This inhibits the optimization of syncing for
chown() but not syncing for ctime changes.

> Which was my reasoning behind omitting the indirect block flushing.
> There is no practical difference between inode block and indirect blocks
> for metadata consistency.

Ugh.

> Note that the affected case is only the async mounts (not sync mounts,
> not any variants of softdeps).

I fixed the non-syncing of inodes for the async mount case ~15 years ago
in my version, and finally merged the changes to my version of -current
a few months OK.  It was stupid that ordinary fsync() synced the indirect
blocks but not the more important inode.  Directory metadata is still not
synced by fsync() in the async mount case.

> I can restore indirect block flushing and wait for them for DATA_ONLY sync,
> but then I should also add ffs_update() call.

Here are my ffs fixes for -current.  The are mostly to sprinkle missing
DOINGASYNC() checks and finally fix ffs_update():

X Index: ffs/ffs_balloc.c
X ===================================================================
X --- ffs/ffs_balloc.c	(revision 304069)
X +++ ffs/ffs_balloc.c	(working copy)
X @@ -155,4 +155,6 @@
X  			if (flags & IO_SYNC)
X  				bwrite(bp);
X +			else if (DOINGASYNC(vp))
X +				bdwrite(bp);
X  			else
X  				bawrite(bp);
X @@ -265,12 +267,10 @@
X  			    newb, 0, fs->fs_bsize, 0, bp);
X  			bdwrite(bp);
X +		} else if ((flags & IO_SYNC) == 0 && DOINGASYNC(vp)) {
X +			if (bp->b_bufsize == fs->fs_bsize)
X +				bp->b_flags |= B_CLUSTEROK;
X +			bdwrite(bp);

This restructures the code a little to fix the DOINGASYNC() check.  IO_SYNC
was not checked.

X  		} else {
X -			/*
X -			 * Write synchronously so that indirect blocks
X -			 * never point at garbage.
X -			 */
X -			if (DOINGASYNC(vp))
X -				bdwrite(bp);
X -			else if ((error = bwrite(bp)) != 0)
X +			if ((error = bwrite(bp)) != 0)
X  				goto fail;
X  		}

The comment was banal and duplicated ad nauseum, except it was wrong when
the DOINGASYNC() check was under it.

X @@ -335,9 +335,9 @@
X  			    indirs[i - 1].in_off, nb);
X  			bdwrite(nbp);
X +		} else if ((flags & IO_SYNC) == 0 && DOINGASYNC(vp)) {
X +			if (nbp->b_bufsize == fs->fs_bsize)
X +				nbp->b_flags |= B_CLUSTEROK;
X +			bdwrite(nbp);
X  		} else {
X -			/*
X -			 * Write synchronously so that indirect blocks
X -			 * never point at garbage.
X -			 */
X  			if ((error = bwrite(nbp)) != 0) {
X  				brelse(bp);

Here the DOINGASYNC() check was missing and the comment wasn't broken.

X @@ -829,12 +829,10 @@
X  			    newb, 0, fs->fs_bsize, 0, bp);
X  			bdwrite(bp);
X +		} else if ((flags & IO_SYNC) == 0 && DOINGASYNC(vp)) {
X +			if (bp->b_bufsize == fs->fs_bsize)
X +				bp->b_flags |= B_CLUSTEROK;
X +			bdwrite(bp);
X  		} else {
X -			/*
X -			 * Write synchronously so that indirect blocks
X -			 * never point at garbage.
X -			 */
X -			if (DOINGASYNC(vp))
X -				bdwrite(bp);
X -			else if ((error = bwrite(bp)) != 0)
X +			if ((error = bwrite(bp)) != 0)
X  				goto fail;
X  		}
X @@ -900,9 +898,9 @@
X  			    indirs[i - 1].in_off, nb);
X  			bdwrite(nbp);
X +		} else if ((flags & IO_SYNC) == 0 && DOINGASYNC(vp)) {
X +			if (nbp->b_bufsize == fs->fs_bsize)
X +				nbp->b_flags |= B_CLUSTEROK;
X +			bdwrite(nbp);
X  		} else {
X -			/*
X -			 * Write synchronously so that indirect blocks
X -			 * never point at garbage.
X -			 */
X  			if ((error = bwrite(nbp)) != 0) {
X  				brelse(bp);

There are 4 instances of almost the same code (not exactly the same).
I changed them to be as identical as possbile.

X Index: ffs/ffs_inode.c
X ===================================================================
X --- ffs/ffs_inode.c	(revision 304069)
X +++ ffs/ffs_inode.c	(working copy)
X @@ -155,5 +155,5 @@
X  		random_harvest_queue(&(ip->i_din2), sizeof(ip->i_din2), 1, RANDOM_FS_ATIME);
X  	}
X -	if (waitfor && !DOINGASYNC(vp))
X +	if (waitfor)
X  		error = bwrite(bp);
X  	else if (vm_page_count_severe() || buf_dirty_count_severe()) {

This fixes ffs_update() by removing dyson's old DOINGASYNC() hack.  This
is a very safe change, but it needs a havy sprinkling of DOINGASYNC() in
callers to avoid passing waitfor = TRUE too often.  Many callers have
already been fixed.

X @@ -265,5 +265,5 @@
X  			}
X  			ip->i_flag |= IN_CHANGE;
X -			if ((error = ffs_update(vp, !DOINGASYNC(vp))))
X +			if ((error = ffs_update(vp, (flags & IO_SYNC) != 0)))
X  				return (error);
X  			for (i = 0; i < NXADDR; i++) {

This was a broken caller.  It and the next few fixes are in ffs_truncate().
ffs_truncate() has already converted (!DOINGASYNC() && !DOINGSOFTDEP())
to IO_ASYNC in flags here, so (flags & IO_SYNC) must be checked here so
as to not lose the caller's setting of IO_SYNC.  This gives the same
precedence to IO_SYNC as in the checks above.  ffs_truncate() used to
do separate checks like the ones above.  This was more readable but not
as fail-safe when a check was forgotten.

X @@ -291,5 +291,5 @@
X  		if (needextclean)
X  			goto extclean;
X -		return (ffs_update(vp, !DOINGASYNC(vp)));
X +		return (ffs_update(vp, (flags & IO_SYNC) != 0));
X  	}
X  	if (ip->i_size == length) {
X @@ -329,5 +329,5 @@
X  			bawrite(bp);
X  		ip->i_flag |= IN_CHANGE | IN_UPDATE;
X -		return (ffs_update(vp, !DOINGASYNC(vp)));
X +		return (ffs_update(vp, (flags & IO_SYNC) != 0));
X  	}
X  	/*
X @@ -358,8 +358,8 @@
X  		if (blkno != 0)
X  			brelse(bp);
X -		else if (DOINGSOFTDEP(vp) || DOINGASYNC(vp))
X -			bdwrite(bp);
X -		else
X +		else if (flags & IO_SYNC)
X  			bwrite(bp);
X +		else
X +			bdwrite(bp);
X  	}
X  	/*

Better check the DOINGSOFTDEP() part of ths fix.  The caller's IO_SYNC
flag and previous logic were defeated by checking DOINGSOFTDEP() again
here.

Note that callers were changed to not check DOING*() before calling
ffs_update().  They just pass IO_SYNC if they want to force a sync.
Most callers don't want to force it, so they pass 0 and syncing is
controlled by DOINGSOFTDEP() and DOINGASYNC() in the usual way.  This
gives ufs's old fail-safe behaviour of sync metadata for ffs_truncate()
in the non-softdep non-async mount case.  Old versions were not so
careful and depending on callers calculating the flag.

X @@ -479,5 +479,5 @@
X  	}
X  	ip->i_flag |= IN_CHANGE | IN_UPDATE;
X -	allerror = ffs_update(vp, !DOINGASYNC(vp));
X +	allerror = ffs_update(vp, (flags & IO_SYNC) != 0);
X 
X  	/*
X @@ -611,5 +611,5 @@
X  	else
X  		softdep_setup_freeblocks(ip, length, IO_EXT);
X -	return (ffs_update(vp, (flags & IO_SYNC) != 0 || !DOINGASYNC(vp)));
X +	return (ffs_update(vp, (flags & IO_SYNC) != 0));
X  }
X

End of fixes for ffs_truncate().


X Index: ffs/ffs_vnops.c
X ===================================================================
X --- ffs/ffs_vnops.c	(revision 304069)
X +++ ffs/ffs_vnops.c	(working copy)
X @@ -717,5 +717,5 @@
X  	else
X  		flags = seqcount << BA_SEQSHIFT;
X -	if ((ioflag & IO_SYNC) && !DOINGASYNC(vp))
X +	if (ioflag & IO_SYNC)
X  		flags |= IO_SYNC;
X  	flags |= BA_UNMAPPED;
X @@ -1037,5 +1037,5 @@
X  	osize = dp->di_extsize;
X  	flags = IO_EXT;
X -	if ((ioflag & IO_SYNC) && !DOINGASYNC(vp))
X +	if (ioflag & IO_SYNC)
X  		flags |= IO_SYNC;
X

Examples of the ffs_write() and ffs_extwrite() callers calculating
the flag wrong.

There must be some magic for DOINGSOFTDEP() here and I hope I didn't
break it.  The old code forces IO_SYNC when the caller doesn't ask
for it.  This is the correct fail-safe behaviour for the old non-softdep
non-async non-sync mount case, but is pessimal.  It avoided here for
the async mount case, but for the softdep case IO_SYNC is bogusly set
here and then soft updates have to magically dishonor IO_SYNC to avoid
all writes become since (just ordered and delayed for soft updates).

I don't see how soft updates can work write here.  If it dishonors
IO_SYNC at lower levels, then this breaks the caller's reques tof
sync writes, and if it doesn't dishonor IO_SYNC at lower levels,
then the usual case is pessimized.

X Index: ufs/ufs_lookup.c
X ===================================================================
X --- ufs/ufs_lookup.c	(revision 304069)
X +++ ufs/ufs_lookup.c	(working copy)
X @@ -1249,5 +1249,5 @@
X  		if (flags & DOWHITEOUT)
X  			error = bwrite(bp);
X -		else if (DOINGASYNC(dvp) && dp->i_count != 0)
X +		else if (DOINGASYNC(dvp))
X  			bdwrite(bp);
X  		else

A pure async mount case fix -- don't do anything sync unless asked for.


X Index: ufs/ufs_vnops.c
X ===================================================================
X --- ufs/ufs_vnops.c	(revision 304069)
X +++ ufs/ufs_vnops.c	(working copy)
X @@ -992,5 +992,5 @@
X  	if (DOINGSOFTDEP(vp))
X  		softdep_setup_link(VTOI(tdvp), ip);
X -	error = UFS_UPDATE(vp, !(DOINGSOFTDEP(vp) | DOINGASYNC(vp)));
X +	error = UFS_UPDATE(vp, !DOINGSOFTDEP(vp) && !DOINGASYNC(vp));
X  	if (!error) {
X  		ufs_makedirentry(ip, cnp, &newdir);
X @@ -1336,5 +1336,5 @@
X  	if (DOINGSOFTDEP(fvp))
X  		softdep_setup_link(tdp, fip);
X -	error = UFS_UPDATE(fvp, !(DOINGSOFTDEP(fvp) | DOINGASYNC(fvp)));
X +	error = UFS_UPDATE(fvp, !DOINGSOFTDEP(fvp) && !DOINGASYNC(fvp));
X  	if (error)
X  		goto bad;
X @@ -1489,6 +1489,6 @@
X  			if (DOINGSOFTDEP(tdvp))
X  				softdep_setup_dotdot_link(tdp, fip);
X -			error = UFS_UPDATE(tdvp, !(DOINGSOFTDEP(tdvp) |
X -						   DOINGASYNC(tdvp)));
X +			error = UFS_UPDATE(tdvp, !DOINGSOFTDEP(tdvp) &&
X +			    !DOINGASYNC(tdvp));
X  			/* Don't go to bad here as the new link exists. */
X  			if (error)

Use consistent logic, and don't do manual optimizations of logical expressions.

X @@ -1534,5 +1534,6 @@
X  			ufsdirhash_dirtrunc(tdp, endoff);
X  #endif
X -		UFS_TRUNCATE(tdvp, endoff, IO_NORMAL | IO_SYNC, tcnp->cn_cred);
X +		UFS_TRUNCATE(tdvp, endoff, IO_NORMAL |
X +		    (DOINGASYNC(tdvp) ? 0 : IO_SYNC), tcnp->cn_cred);
X  	}
X  	if (error == 0 && tdp->i_flag & IN_NEEDSYNC)

This looks wrong.  Why do we want to override IO_NORMAL?  IIRC, I wrote
it like this since I didn't want to change the policy for soft updates.

This is near the directory compaction fix.  I completed this set of patches
after understanding the simplifications given by IO_NORMAL letting
ffs_truncate() handle the details.

X @@ -1879,5 +1880,5 @@
X  	if (DOINGSOFTDEP(dvp))
X  		softdep_setup_mkdir(dp, ip);
X -	error = UFS_UPDATE(dvp, !(DOINGSOFTDEP(dvp) | DOINGASYNC(dvp)));
X +	error = UFS_UPDATE(dvp, !DOINGSOFTDEP(dvp) && !DOINGASYNC(dvp));
X  	if (error)
X  		goto bad;
X @@ -1936,6 +1937,6 @@
X  		}
X  	}
X -	if ((error = UFS_UPDATE(tvp, !(DOINGSOFTDEP(tvp) |
X -				       DOINGASYNC(tvp)))) != 0) {
X +	if ((error = UFS_UPDATE(tvp, !DOINGSOFTDEP(tvp) &&
X +	    !DOINGASYNC(tvp))) != 0) {
X  		(void)bwrite(bp);
X  		goto bad;
X @@ -2671,5 +2672,5 @@
X  	 * Make sure inode goes to disk before directory entry.
X  	 */
X -	error = UFS_UPDATE(tvp, !(DOINGSOFTDEP(tvp) | DOINGASYNC(tvp)));
X +	error = UFS_UPDATE(tvp, !DOINGSOFTDEP(tvp) && !DOINGASYNC(tvp));
X  	if (error)
X  		goto bad;

Lots more style fixes to remove home made optimizations.

Bruce