From owner-svn-src-head@freebsd.org Wed Aug 17 11:11:01 2016 Return-Path: Delivered-To: svn-src-head@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C880DBBB5DE; Wed, 17 Aug 2016 11:11:01 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 750291484; Wed, 17 Aug 2016 11:11:01 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 34ED1104750F; Wed, 17 Aug 2016 20:40:59 +1000 (AEST) Date: Wed, 17 Aug 2016 20:40:58 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: Jilles Tjoelker , src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r304180 - head/sys/ufs/ffs In-Reply-To: <20160817081055.GM83214@kib.kiev.ua> Message-ID: <20160817191239.A3117@besplex.bde.org> References: <201608151922.u7FJMOmT099410@repo.freebsd.org> <20160816215355.GB12199@stack.nl> <20160817081055.GM83214@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=VIkg5I7X c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=I9sqMnmzHltNhR3enRgA:9 a=kSGpwUHVDBIaL0fO:21 a=A_BSZxN6wfCI-H8q:21 a=CjuIK1q_8ugA:10 X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 17 Aug 2016 11:11:01 -0000 On Wed, 17 Aug 2016, Konstantin Belousov wrote: > On Tue, Aug 16, 2016 at 11:53:55PM +0200, Jilles Tjoelker wrote: >> Hmm, some people interpret POSIX differently than I do, but I think it >> is clear that XBD 3.384 Synchronized I/O Data Integrity Completion >> requires the indirect blocks to be written (because "all file system >> information required to retrieve the data is successfully transferred"). >> The Linux man page matches this interpretation except that an fsync of >> the parent directory may be required. All related directories must be synced. It is almost useless to sync the data if the data cannot be found later using normal file system operations. Not quite useless if the file can be recovered by fsck and saved in lost+found, but the metadata needed to guarantee that that works probably requires a lot of writes anyway. >> If the whole file is being considered, then any change to indirect >> blocks is needed to access some data (because the file was extended, a >> hole was filled or snapshotted data was overwritten). For other >> overwrites, there is no change to indirect blocks and no conflict with >> fdatasync's performance objectives. > Note that the same argument is applicable to the inode block: the di_db > and di_ib pointers to the direct blocks and to root indirect blocks are > required to retrieve the written data. I would never have guessed that you left out either. > ... > >> Ideally, a changed i_size would also cause the inode to be written, but >> I don't know how to determine that (there is no i_flag for it). Always >> writing the inode would defeat the point of fdatasync. My version does a bcmp() of the in-core copy with the disk copy to avoid null changes. This turned out to be not very useful. It was intended mainly to avoid writing null changes for timestamps. By I already avoid most timestamp changes by mounting with -noatime and not having large data. I use only seconds granularity for timestamps. With nanoseconds granularity, any change to a file is guaranteed to change its in-core inode, so the bcmp() wouldn't work. But timestamps are about the only metadata that obviously doesn't need syncing. Especially atimes. POSIX doesn't allow turning off atimes, but it also doesn't require syncing them before the system crashes except probably using FSYNC of metadata. There is an i_flag for critical metadata like i_size. It is IN_MODIFIED. This flag is often mismanaged. Many places set only IN_CHANGE after making a critical change. ufs_chown() is one such place. Ownerships are fairly critical metadata. They are not considered critical enough to sync immediately, but they should be synced by FSYNC of metadata. But the implementation is to set only IN_CHANGE. i_ctime is usually updated much later and IN_MODIFIED is set then to say that the relatively unimportant i_ctime has been modified. This inhibits the optimization of syncing for chown() but not syncing for ctime changes. > Which was my reasoning behind omitting the indirect block flushing. > There is no practical difference between inode block and indirect blocks > for metadata consistency. Ugh. > Note that the affected case is only the async mounts (not sync mounts, > not any variants of softdeps). I fixed the non-syncing of inodes for the async mount case ~15 years ago in my version, and finally merged the changes to my version of -current a few months OK. It was stupid that ordinary fsync() synced the indirect blocks but not the more important inode. Directory metadata is still not synced by fsync() in the async mount case. > I can restore indirect block flushing and wait for them for DATA_ONLY sync, > but then I should also add ffs_update() call. Here are my ffs fixes for -current. The are mostly to sprinkle missing DOINGASYNC() checks and finally fix ffs_update(): X Index: ffs/ffs_balloc.c X =================================================================== X --- ffs/ffs_balloc.c (revision 304069) X +++ ffs/ffs_balloc.c (working copy) X @@ -155,4 +155,6 @@ X if (flags & IO_SYNC) X bwrite(bp); X + else if (DOINGASYNC(vp)) X + bdwrite(bp); X else X bawrite(bp); X @@ -265,12 +267,10 @@ X newb, 0, fs->fs_bsize, 0, bp); X bdwrite(bp); X + } else if ((flags & IO_SYNC) == 0 && DOINGASYNC(vp)) { X + if (bp->b_bufsize == fs->fs_bsize) X + bp->b_flags |= B_CLUSTEROK; X + bdwrite(bp); This restructures the code a little to fix the DOINGASYNC() check. IO_SYNC was not checked. X } else { X - /* X - * Write synchronously so that indirect blocks X - * never point at garbage. X - */ X - if (DOINGASYNC(vp)) X - bdwrite(bp); X - else if ((error = bwrite(bp)) != 0) X + if ((error = bwrite(bp)) != 0) X goto fail; X } The comment was banal and duplicated ad nauseum, except it was wrong when the DOINGASYNC() check was under it. X @@ -335,9 +335,9 @@ X indirs[i - 1].in_off, nb); X bdwrite(nbp); X + } else if ((flags & IO_SYNC) == 0 && DOINGASYNC(vp)) { X + if (nbp->b_bufsize == fs->fs_bsize) X + nbp->b_flags |= B_CLUSTEROK; X + bdwrite(nbp); X } else { X - /* X - * Write synchronously so that indirect blocks X - * never point at garbage. X - */ X if ((error = bwrite(nbp)) != 0) { X brelse(bp); Here the DOINGASYNC() check was missing and the comment wasn't broken. X @@ -829,12 +829,10 @@ X newb, 0, fs->fs_bsize, 0, bp); X bdwrite(bp); X + } else if ((flags & IO_SYNC) == 0 && DOINGASYNC(vp)) { X + if (bp->b_bufsize == fs->fs_bsize) X + bp->b_flags |= B_CLUSTEROK; X + bdwrite(bp); X } else { X - /* X - * Write synchronously so that indirect blocks X - * never point at garbage. X - */ X - if (DOINGASYNC(vp)) X - bdwrite(bp); X - else if ((error = bwrite(bp)) != 0) X + if ((error = bwrite(bp)) != 0) X goto fail; X } X @@ -900,9 +898,9 @@ X indirs[i - 1].in_off, nb); X bdwrite(nbp); X + } else if ((flags & IO_SYNC) == 0 && DOINGASYNC(vp)) { X + if (nbp->b_bufsize == fs->fs_bsize) X + nbp->b_flags |= B_CLUSTEROK; X + bdwrite(nbp); X } else { X - /* X - * Write synchronously so that indirect blocks X - * never point at garbage. X - */ X if ((error = bwrite(nbp)) != 0) { X brelse(bp); There are 4 instances of almost the same code (not exactly the same). I changed them to be as identical as possbile. X Index: ffs/ffs_inode.c X =================================================================== X --- ffs/ffs_inode.c (revision 304069) X +++ ffs/ffs_inode.c (working copy) X @@ -155,5 +155,5 @@ X random_harvest_queue(&(ip->i_din2), sizeof(ip->i_din2), 1, RANDOM_FS_ATIME); X } X - if (waitfor && !DOINGASYNC(vp)) X + if (waitfor) X error = bwrite(bp); X else if (vm_page_count_severe() || buf_dirty_count_severe()) { This fixes ffs_update() by removing dyson's old DOINGASYNC() hack. This is a very safe change, but it needs a havy sprinkling of DOINGASYNC() in callers to avoid passing waitfor = TRUE too often. Many callers have already been fixed. X @@ -265,5 +265,5 @@ X } X ip->i_flag |= IN_CHANGE; X - if ((error = ffs_update(vp, !DOINGASYNC(vp)))) X + if ((error = ffs_update(vp, (flags & IO_SYNC) != 0))) X return (error); X for (i = 0; i < NXADDR; i++) { This was a broken caller. It and the next few fixes are in ffs_truncate(). ffs_truncate() has already converted (!DOINGASYNC() && !DOINGSOFTDEP()) to IO_ASYNC in flags here, so (flags & IO_SYNC) must be checked here so as to not lose the caller's setting of IO_SYNC. This gives the same precedence to IO_SYNC as in the checks above. ffs_truncate() used to do separate checks like the ones above. This was more readable but not as fail-safe when a check was forgotten. X @@ -291,5 +291,5 @@ X if (needextclean) X goto extclean; X - return (ffs_update(vp, !DOINGASYNC(vp))); X + return (ffs_update(vp, (flags & IO_SYNC) != 0)); X } X if (ip->i_size == length) { X @@ -329,5 +329,5 @@ X bawrite(bp); X ip->i_flag |= IN_CHANGE | IN_UPDATE; X - return (ffs_update(vp, !DOINGASYNC(vp))); X + return (ffs_update(vp, (flags & IO_SYNC) != 0)); X } X /* X @@ -358,8 +358,8 @@ X if (blkno != 0) X brelse(bp); X - else if (DOINGSOFTDEP(vp) || DOINGASYNC(vp)) X - bdwrite(bp); X - else X + else if (flags & IO_SYNC) X bwrite(bp); X + else X + bdwrite(bp); X } X /* Better check the DOINGSOFTDEP() part of ths fix. The caller's IO_SYNC flag and previous logic were defeated by checking DOINGSOFTDEP() again here. Note that callers were changed to not check DOING*() before calling ffs_update(). They just pass IO_SYNC if they want to force a sync. Most callers don't want to force it, so they pass 0 and syncing is controlled by DOINGSOFTDEP() and DOINGASYNC() in the usual way. This gives ufs's old fail-safe behaviour of sync metadata for ffs_truncate() in the non-softdep non-async mount case. Old versions were not so careful and depending on callers calculating the flag. X @@ -479,5 +479,5 @@ X } X ip->i_flag |= IN_CHANGE | IN_UPDATE; X - allerror = ffs_update(vp, !DOINGASYNC(vp)); X + allerror = ffs_update(vp, (flags & IO_SYNC) != 0); X X /* X @@ -611,5 +611,5 @@ X else X softdep_setup_freeblocks(ip, length, IO_EXT); X - return (ffs_update(vp, (flags & IO_SYNC) != 0 || !DOINGASYNC(vp))); X + return (ffs_update(vp, (flags & IO_SYNC) != 0)); X } X End of fixes for ffs_truncate(). X Index: ffs/ffs_vnops.c X =================================================================== X --- ffs/ffs_vnops.c (revision 304069) X +++ ffs/ffs_vnops.c (working copy) X @@ -717,5 +717,5 @@ X else X flags = seqcount << BA_SEQSHIFT; X - if ((ioflag & IO_SYNC) && !DOINGASYNC(vp)) X + if (ioflag & IO_SYNC) X flags |= IO_SYNC; X flags |= BA_UNMAPPED; X @@ -1037,5 +1037,5 @@ X osize = dp->di_extsize; X flags = IO_EXT; X - if ((ioflag & IO_SYNC) && !DOINGASYNC(vp)) X + if (ioflag & IO_SYNC) X flags |= IO_SYNC; X Examples of the ffs_write() and ffs_extwrite() callers calculating the flag wrong. There must be some magic for DOINGSOFTDEP() here and I hope I didn't break it. The old code forces IO_SYNC when the caller doesn't ask for it. This is the correct fail-safe behaviour for the old non-softdep non-async non-sync mount case, but is pessimal. It avoided here for the async mount case, but for the softdep case IO_SYNC is bogusly set here and then soft updates have to magically dishonor IO_SYNC to avoid all writes become since (just ordered and delayed for soft updates). I don't see how soft updates can work write here. If it dishonors IO_SYNC at lower levels, then this breaks the caller's reques tof sync writes, and if it doesn't dishonor IO_SYNC at lower levels, then the usual case is pessimized. X Index: ufs/ufs_lookup.c X =================================================================== X --- ufs/ufs_lookup.c (revision 304069) X +++ ufs/ufs_lookup.c (working copy) X @@ -1249,5 +1249,5 @@ X if (flags & DOWHITEOUT) X error = bwrite(bp); X - else if (DOINGASYNC(dvp) && dp->i_count != 0) X + else if (DOINGASYNC(dvp)) X bdwrite(bp); X else A pure async mount case fix -- don't do anything sync unless asked for. X Index: ufs/ufs_vnops.c X =================================================================== X --- ufs/ufs_vnops.c (revision 304069) X +++ ufs/ufs_vnops.c (working copy) X @@ -992,5 +992,5 @@ X if (DOINGSOFTDEP(vp)) X softdep_setup_link(VTOI(tdvp), ip); X - error = UFS_UPDATE(vp, !(DOINGSOFTDEP(vp) | DOINGASYNC(vp))); X + error = UFS_UPDATE(vp, !DOINGSOFTDEP(vp) && !DOINGASYNC(vp)); X if (!error) { X ufs_makedirentry(ip, cnp, &newdir); X @@ -1336,5 +1336,5 @@ X if (DOINGSOFTDEP(fvp)) X softdep_setup_link(tdp, fip); X - error = UFS_UPDATE(fvp, !(DOINGSOFTDEP(fvp) | DOINGASYNC(fvp))); X + error = UFS_UPDATE(fvp, !DOINGSOFTDEP(fvp) && !DOINGASYNC(fvp)); X if (error) X goto bad; X @@ -1489,6 +1489,6 @@ X if (DOINGSOFTDEP(tdvp)) X softdep_setup_dotdot_link(tdp, fip); X - error = UFS_UPDATE(tdvp, !(DOINGSOFTDEP(tdvp) | X - DOINGASYNC(tdvp))); X + error = UFS_UPDATE(tdvp, !DOINGSOFTDEP(tdvp) && X + !DOINGASYNC(tdvp)); X /* Don't go to bad here as the new link exists. */ X if (error) Use consistent logic, and don't do manual optimizations of logical expressions. X @@ -1534,5 +1534,6 @@ X ufsdirhash_dirtrunc(tdp, endoff); X #endif X - UFS_TRUNCATE(tdvp, endoff, IO_NORMAL | IO_SYNC, tcnp->cn_cred); X + UFS_TRUNCATE(tdvp, endoff, IO_NORMAL | X + (DOINGASYNC(tdvp) ? 0 : IO_SYNC), tcnp->cn_cred); X } X if (error == 0 && tdp->i_flag & IN_NEEDSYNC) This looks wrong. Why do we want to override IO_NORMAL? IIRC, I wrote it like this since I didn't want to change the policy for soft updates. This is near the directory compaction fix. I completed this set of patches after understanding the simplifications given by IO_NORMAL letting ffs_truncate() handle the details. X @@ -1879,5 +1880,5 @@ X if (DOINGSOFTDEP(dvp)) X softdep_setup_mkdir(dp, ip); X - error = UFS_UPDATE(dvp, !(DOINGSOFTDEP(dvp) | DOINGASYNC(dvp))); X + error = UFS_UPDATE(dvp, !DOINGSOFTDEP(dvp) && !DOINGASYNC(dvp)); X if (error) X goto bad; X @@ -1936,6 +1937,6 @@ X } X } X - if ((error = UFS_UPDATE(tvp, !(DOINGSOFTDEP(tvp) | X - DOINGASYNC(tvp)))) != 0) { X + if ((error = UFS_UPDATE(tvp, !DOINGSOFTDEP(tvp) && X + !DOINGASYNC(tvp))) != 0) { X (void)bwrite(bp); X goto bad; X @@ -2671,5 +2672,5 @@ X * Make sure inode goes to disk before directory entry. X */ X - error = UFS_UPDATE(tvp, !(DOINGSOFTDEP(tvp) | DOINGASYNC(tvp))); X + error = UFS_UPDATE(tvp, !DOINGSOFTDEP(tvp) && !DOINGASYNC(tvp)); X if (error) X goto bad; Lots more style fixes to remove home made optimizations. Bruce