From: Kirk McKusick
To: Andreas Longwitz
cc: freebsd-fs@freebsd.org
Subject: Re: fsync: giving up on dirty on ufs partitions running vfs_write_suspend()
Date: Sat, 16 Sep 2017 13:20:26 -0700
Message-Id: <201709162020.v8GKKQj0033706@chez.mckusick.com>
In-reply-to: <59BD0EAC.8030206@incore.de>

> From: Konstantin Belousov
> Date: Sat, 16 Sep 2017 21:31:17 +0300
> To: Andreas Longwitz
> Subject: Re: fsync: giving up on dirty on ufs partitions running
>          vfs_write_suspend()
> Cc: Kirk McKusick, freebsd-fs@freebsd.org
>
> On Sat, Sep 16, 2017 at 01:44:44PM +0200, Andreas Longwitz wrote:
>> Ok, I understand your thoughts about the "big loop" and I agree. On the
>> other hand it is not easy to measure the progress of the dirty buffers
>> because these buffers are created by another process at the same time
>> we loop in vop_stdfsync(). I can explain from my tests, where I use the
>> following loop on a gjournaled partition:
>>
>>    while true; do
>>       cp -p bigfile bigfile.tmp
>>       rm bigfile
>>       mv bigfile.tmp bigfile
>>    done
>>
>> When g_journal_switcher starts vfs_write_suspend() immediately after the
>> rm command has started to do its "rm stuff" (ufs_inactive, ffs_truncate,
>> ffs_indirtrunc at different levels, ffs_blkfree, ...) then we must loop
>> (that means wait) in vop_stdfsync() until the rm process has finished
>> its work. A lot of locking overhead is needed for coordination.
>> Returning from bufobj_wwait() we always see one dirty buffer left (very
>> seldom two), which is not optimal. Therefore I have tried the following
>> patch (instead of bumping maxretry):
>>
>> --- vfs_default.c.orig  2016-10-24 12:26:57.000000000 +0200
>> +++ vfs_default.c       2017-09-15 12:30:44.792274000 +0200
>> @@ -688,6 +688,8 @@
>>                         bremfree(bp);
>>                         bawrite(bp);
>>                 }
>> +               if( maxretry < 1000)
>> +                       DELAY(waitns);
>>                 BO_LOCK(bo);
>>                 goto loop2;
>>         }
>>
>> with different values for waitns. If I run the test loop 5000 times on
>> my test server, the problem is always triggered roughly 10 times. The
>> results from several runs are given in the following table:
>>
>>    waitns      max time    max loops
>>    ---------------------------------
>>    no DELAY    0.5 sec     8650 (maxres = 100000)
>>    1000        0.2 sec     24
>>    10000       0.8 sec     3
>>    100000      7.2 sec     3
>>
>> "time" means the time spent in vop_stdfsync(), measured from entry to
>> return by a dtrace script. "loops" means the number of times
>> "--maxretry" is executed. I am not sure if DELAY() is the best way to
>> wait or if waiting has other drawbacks. Anyway with DELAY() it does not
>> take more than five iterations to finish.
>
> This is not explicitly stated in your message, but I suppose that
> vop_stdfsync() is called due to the VOP_FSYNC(devvp, MNT_SUSPEND) call
> in ffs_sync(). Am I right?
>
> If yes, then the solution is most likely to continue looping in
> vop_stdfsync() until there are no dirty buffers or the mount point's
> mnt_secondary_writes counter is zero. The pause trick you tried might
> still be useful, e.g. after some threshold of performed loop
> iterations.
>
> One problem with this suggestion is that vop_stdfsync(devvp) needs to
> know that the vnode is the devvp for some UFS mount. The struct cdev,
> accessible as v_rdev, has the pointer to struct mount. You need to be
> careful not to access a freed or reused struct mount.

I concur with Kostik's comments. It would be helpful if you could try
out his suggestions and see if that produces a better result. Once you
converge on a solution, I will ensure that it gets checked in.

	~Kirk
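
Kostik's suggested termination rule, combined with pauses after a threshold of iterations, can be sketched as a small user-space model. This is only an illustration under stated assumptions, not kernel code and not a proposed patch: struct sync_model, sync_next_action() and sync_simulate() are hypothetical names, the writer is assumed to re-dirty exactly one buffer per pass (matching the "one dirty buffer left" observation in the thread), and a pause is merely counted rather than slept. The rule is read literally here: stop as soon as there are no dirty buffers or the secondary-writes counter is zero.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical user-space model of the flush loop. None of these names
 * exist in the FreeBSD kernel; they are assumptions for illustration.
 */
struct sync_model {
	int dirty;             /* dirty buffers left after the last pass */
	int secondary_writes;  /* stand-in for mnt_secondary_writes */
};

enum sync_action { SYNC_DONE, SYNC_RETRY, SYNC_RETRY_AFTER_PAUSE };

struct sync_stats {
	int passes;  /* flush passes performed */
	int pauses;  /* passes that were followed by a pause */
};

/* Decide what to do after a flush pass, reading the suggestion literally. */
static enum sync_action
sync_next_action(const struct sync_model *m, int passes, int pause_threshold)
{
	assert(m != NULL && passes >= 0);
	/* Stop once nothing is dirty or no secondary writer is active. */
	if (m->dirty == 0 || m->secondary_writes == 0)
		return (SYNC_DONE);
	/* Past the threshold, back off before the next pass. */
	if (passes >= pause_threshold)
		return (SYNC_RETRY_AFTER_PAUSE);
	return (SYNC_RETRY);
}

/*
 * Simulate a writer that re-dirties one buffer per pass until it has
 * completed writer_steps units of work.
 */
static struct sync_stats
sync_simulate(int initial_dirty, int writer_steps, int pause_threshold)
{
	struct sync_model m = { initial_dirty, writer_steps > 0 };
	struct sync_stats st = { 0, 0 };

	assert(initial_dirty >= 0 && writer_steps >= 0);
	for (;;) {
		m.dirty = 0;		/* one pass writes out everything queued */
		st.passes++;
		if (writer_steps > 0 && --writer_steps > 0)
			m.dirty = 1;	/* the writer is still producing work */
		m.secondary_writes = writer_steps > 0;
		enum sync_action a =
		    sync_next_action(&m, st.passes, pause_threshold);
		if (a == SYNC_DONE)
			break;
		if (a == SYNC_RETRY_AFTER_PAUSE)
			st.pauses++;	/* the kernel would sleep here */
	}
	return (st);
}
```

In this model, sync_simulate(5, 3, 2) finishes after 3 passes with 1 pause, and raising the threshold to 100 gives the same 3 passes with no pause at all. The model leaves out the hard part raised in the thread, namely how vop_stdfsync() finds the struct mount safely through v_rdev.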