From: Andrey Zonov <andrey@zonov.org>
Date: Wed, 21 Dec 2011 21:03:02 +0400
To: Kostik Belousov
Cc: alc@freebsd.org, freebsd-stable@freebsd.org, Jeremy Chadwick
Subject: Re: directory listing hangs in "ufs" state

On 15.12.2011 17:01, Kostik Belousov wrote:
> On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:
>> On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick wrote:
>>
>>> On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
>>>> On 14.12.2011 22:22, Jeremy Chadwick wrote:
>>>>> On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
>>>>>> Hi Jeremy,
>>>>>>
>>>>>> This is not a hardware problem, I've already checked that.  I also
>>>>>> ran fsck today and got no errors.
>>>>>>
>>>>>> After some more exploration of how mongodb works, I found that when
>>>>>> the listing hangs, one of the mongodb threads is in the "biowr"
>>>>>> state for a long time.  According to the ktrace output, it
>>>>>> periodically calls msync(MS_SYNC).
>>>>>>
>>>>>> If I remove the msync() calls from mongodb, how often will the
>>>>>> data be synced by the OS?
>>>>>>
>>>>>> --
>>>>>> Andrey Zonov
>>>>>>
>>>>>> On 14.12.2011 2:15, Jeremy Chadwick wrote:
>>>>>>> On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
>>>>>>>>
>>>>>>>> Do you have any ideas what is going on, or how to catch the
>>>>>>>> problem?
>>>>>>>
>>>>>>> Assuming this isn't a file on the root filesystem, try booting
>>>>>>> the machine in single-user mode and using "fsck -f" on the
>>>>>>> filesystem in question.
>>>>>>>
>>>>>>> Can you verify there are no problems with the disk this file
>>>>>>> lives on as well (smartctl -a /dev/disk)?  I doubt this is the
>>>>>>> problem, but thought I'd mention it.
>>>>>
>>>>> I have no real answer, I'm sorry.  msync(2) indicates it's
>>>>> effectively deprecated (see BUGS).
>>>>> It looks to me like this is effectively an mmap version of
>>>>> fsync(2).
>>>>
>>>> I replaced msync(2) with fsync(2).  Unfortunately, from the man
>>>> pages it is not obvious that I can do this.  Anyway, thanks.
>>>
>>> Sorry, that wasn't what I was implying.  Let me try to explain
>>> differently.
>>>
>>> msync(2) looks, to me, like an mmap-specific version of fsync(2).
>>> Based on the man page, it seems that with msync() you can
>>> effectively guarantee flushing of certain pages within an mmap()'d
>>> region to disk.  fsync() would cause **all** buffers/internal pages
>>> to be flushed to disk.
>>>
>>> One would need to look at the mongodb code to find out what it's
>>> actually doing with msync().  That is to say, if it's doing
>>> something like this (I probably have the semantics wrong -- I've
>>> never spent much time with mmap()):
>>>
>>> fd = open("/some/file", O_RDWR);
>>> ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>>> ret = msync(ptr, 65536, MS_SYNC);
>>> /* or alternatively, this:
>>> ret = msync(ptr, 0, MS_SYNC);
>>> */
>>>
>>> Then this, to me, would be mostly equivalent to:
>>>
>>> fp = fopen("/some/file", "r+");
>>> ret = fsync(fileno(fp));
>>>
>>> Otherwise, if it's calling msync() only on an address/location
>>> within the region ptr points to, then that may be more efficient
>>> (fewer pages to flush).
>>>
>>
>> They call msync() for the whole file, so there will not be any
>> difference.
>>
>>> The mmap() arguments -- specifically flags (see man page) -- also
>>> play a role here.  The one that catches my attention is MAP_NOSYNC.
>>> So you may need to look at the mongodb code to figure out what its
>>> mmap() call is.
>>>
>>> One might wonder why they don't just use open() with O_SYNC.  I
>>> imagine that has to do with, again, performance; possibly they
>>> don't want all I/O to be synchronous, and would rather flush certain
>>> pages in the mmap'd region to disk as needed.  I see the legitimacy
>>> in that approach (vs. just using O_SYNC).
>>>
>>> There's really no easy way for me to tell you which is more
>>> efficient, better, etc. without spending a lot of time with a
>>> benchmarking program that tests all of this, *plus* an entire
>>> system (world) built with profiling.
>>>
>>
>> I ran mongodb with fsync() for two hours and got the following:
>>
>> STARTED                   INBLK    OUBLK  MAJFLT   MINFLT
>> Thu Dec 15 10:34:52 2011      3   192744     314  3080182
>>
>> This is the output of `ps -o lstart,inblock,oublock,majflt,minflt -U
>> mongodb'.
>>
>> Then I ran it with the default msync():
>>
>> STARTED                   INBLK    OUBLK  MAJFLT   MINFLT
>> Thu Dec 15 12:34:53 2011      0  7241555      79  5401945
>>
>> There are also two graphs of disk busy time [1] [2].
>>
>> The difference is significant: a factor of 37!  That is what I
>> expected to get.
>>
>> In the comments for vm_object_page_clean() I found this:
>>
>>  * When stuffing pages asynchronously, allow clustering.  XXX we need a
>>  * synchronous clustering mode implementation.
>>
>> To me this means that msync(MS_SYNC) flushes every page to disk in
>> its own single I/O transaction.  If we multiply 4K by 37 we get
>> 150K, which matches the size of a single clustered transaction in my
>> experience.
>>
>> +alc@, kib@
>>
>> Am I right?  Is there any plan to implement this?
> The current buffer clustering code can do only async writes.  In
> fact, I am not quite sure what would constitute sync clustering,
> because the ability to delay the write is important to be able to
> cluster at all.
>
> Also, I am not sure that the lack of clustering is the biggest
> problem.
> IMO, the fact that each write is sync is the first problem there.
> It would be quite a bit of work to add tracking of the issued writes
> to vm_object_page_clean() and down the stack, especially due to the
> custom page-write VOPs in several filesystems.
>
> The only guarantee that POSIX requires from msync(MS_SYNC) is that
> the writes are finished when the syscall returns, not that the writes
> are done synchronously.  Below is a hack which should help if the
> msync()ed region contains the mapping of the whole file, since it is
> then possible to schedule all writes asynchronously and fsync() the
> file afterwards.  It will cause an unneeded metadata update, but I
> think it would still be much faster.
>
>
> diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
> index 250b769..a9de554 100644
> --- a/sys/vm/vm_object.c
> +++ b/sys/vm/vm_object.c
> @@ -938,7 +938,7 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
>  	vm_object_t backing_object;
>  	struct vnode *vp;
>  	struct mount *mp;
> -	int flags;
> +	int flags, fsync_after;
>
>  	if (object == NULL)
>  		return;
> @@ -971,11 +971,26 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
>  		(void) vn_start_write(vp, &mp, V_WAIT);
>  		vfslocked = VFS_LOCK_GIANT(vp->v_mount);
>  		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
> -		flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
> -		flags |= invalidate ? OBJPC_INVAL : 0;
> +		if (syncio && !invalidate && offset == 0 &&
> +		    OFF_TO_IDX(size) == object->size) {
> +			/*
> +			 * If syncing the whole mapping of the file,
> +			 * it is faster to schedule all the writes in
> +			 * async mode, also allowing the clustering,
> +			 * and then wait for i/o to complete.
> +			 */
> +			flags = 0;
> +			fsync_after = TRUE;
> +		} else {
> +			flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
> +			flags |= invalidate ? (OBJPC_SYNC | OBJPC_INVAL) : 0;
> +			fsync_after = FALSE;
> +		}
>  		VM_OBJECT_LOCK(object);
>  		vm_object_page_clean(object, offset, offset + size, flags);
>  		VM_OBJECT_UNLOCK(object);
> +		if (fsync_after)
> +			(void) VOP_FSYNC(vp, MNT_WAIT, curthread);
>  		VOP_UNLOCK(vp, 0);
>  		VFS_UNLOCK_GIANT(vfslocked);
>  		vn_finished_write(mp);

Thanks, this patch works.  Performance is the same as with fsync().

Actually, Linux uses fsync() inside of msync() if MS_SYNC is set:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/msync.c;h=632df4527c0122062d9332a0d483835274ed62f6;hb=HEAD
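
For reference, here is a minimal standalone sketch of the whole-file
mmap()/msync()/fsync() pattern discussed in this thread.  The file
name and size are invented for illustration, and this is not mongodb's
actual code; either flush call alone would suffice, both are shown
only for comparison.

/*
 * Dirty an mmap(MAP_SHARED) region, then flush it either with
 * msync(MS_SYNC) on the whole mapping (the path sped up by the patch
 * above) or with fsync() on the descriptor.
 */
#include <sys/mman.h>

#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE	(64 * 1024)	/* hypothetical 64K data file */

int
main(void)
{
	char *p;
	int fd;

	fd = open("/tmp/msync-test", O_RDWR | O_CREAT, 0644);
	if (fd == -1)
		err(1, "open");
	if (ftruncate(fd, FILE_SIZE) == -1)
		err(1, "ftruncate");

	p = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
	    fd, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");

	memset(p, 'x', FILE_SIZE);	/* dirty every page */

	/* Flush the whole mapping; returns once the writes are done. */
	if (msync(p, FILE_SIZE, MS_SYNC) == -1)
		err(1, "msync");

	/* The equivalent flush through the descriptor, as tried above. */
	if (fsync(fd) == -1)
		err(1, "fsync");

	if (munmap(p, FILE_SIZE) == -1)
		err(1, "munmap");
	close(fd);
	return (0);
}

-- 
Andrey Zonov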