From owner-freebsd-stable@FreeBSD.ORG  Thu Dec 15 11:51:05 2011
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 681C9106564A;
	Thu, 15 Dec 2011 11:51:05 +0000 (UTC)
	(envelope-from andrey@zonov.org)
Received: from mail-ww0-f50.google.com (mail-ww0-f50.google.com [74.125.82.50])
	by mx1.freebsd.org (Postfix) with ESMTP id 959378FC0A;
	Thu, 15 Dec 2011 11:51:04 +0000 (UTC)
Received: by wgbdr11 with SMTP id dr11so3789389wgb.31
	for <multiple recipients>; Thu, 15 Dec 2011 03:51:03 -0800 (PST)
MIME-Version: 1.0
Received: by 10.180.94.195 with SMTP id de3mr4291716wib.14.1323949862696; Thu,
	15 Dec 2011 03:51:02 -0800 (PST)
Received: by 10.180.19.67 with HTTP; Thu, 15 Dec 2011 03:51:02 -0800 (PST)
X-Originating-IP: [95.108.170.17]
In-Reply-To: <20111214204201.GA7372@icarus.home.lan>
References: <4EE7BF77.5000504@zonov.org>
	<20111213221501.GA85563@icarus.home.lan>
	<4EE8E6E3.7050202@zonov.org>
	<20111214182252.GA5176@icarus.home.lan>
	<4EE8FD3E.8030902@zonov.org>
	<20111214204201.GA7372@icarus.home.lan>
Date: Thu, 15 Dec 2011 15:51:02 +0400
Message-ID: <CANU_PUGtjjxP-qLjEqb2wVnL_QGJvtApnaD8SSF4zLksY4ME6A@mail.gmail.com>
From: Andrey Zonov <andrey@zonov.org>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Cc: alc@freebsd.org, kib@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: directory listing hangs in "ufs" state
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 15 Dec 2011 11:51:05 -0000

On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick
<freebsd@jdc.parodius.com>wrote:

> On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
> > On 14.12.2011 22:22, Jeremy Chadwick wrote:
> > >On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
> > >>Hi Jeremy,
> > >>
> > >>This is not hardware problem, I've already checked that. I also ran
> > >>fsck today and got no errors.
> > >>
> > >>After some more exploration of how mongodb works, I found that then
> > >>listing hangs, one of mongodb thread is in "biowr" state for a long
> > >>time. It periodically calls msync(MS_SYNC) accordingly to ktrace
> > >>out.
> > >>
> > >>If I'll remove msync() calls from mongodb, how often data will be
> > >>sync by OS?
> > >>
> > >>--
> > >>Andrey Zonov
> > >>
> > >>On 14.12.2011 2:15, Jeremy Chadwick wrote:
> > >>>On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
> > >>>>
> > >>>>Have you any ideas what is going on? or how to catch the problem?
> > >>>
> > >>>Assuming this isn't a file on the root filesystem, try booting the
> > >>>machine in single-user mode and using "fsck -f" on the filesystem in
> > >>>question.
> > >>>
> > >>>Can you verify there's no problems with the disk this file lives on as
> > >>>well (smartctl -a /dev/disk)?  I'm doubting this is the problem, but
> > >>>thought I'd mention it.
> > >
> > >I have no real answer, I'm sorry.  msync(2) indicates it's effectively
> > >deprecated (see BUGS).  It looks like this is effectively a mmap-version
> > >of fsync(2).
> >
> > I replaced msync(2) with fsync(2).  Unfortunately, from man pages it
> > is not obvious that I can do this. Anyway, thanks.
>
> Sorry, that wasn't what I was implying.  Let me try to explain
> differently.
>
> msync(2) looks, to me, like an mmap-specific version of fsync(2).  Based
> on the man page, it seems that the with msync() you can effectively
> guaranteed flushing of certain pages within an mmap()'d region to disk.
> fsync() would flush **all** buffers/internal pages to be flushed to
> disk.
>
> One would need to look at the code to mongodb to find out what it's
> actually doing with msync().  That is to say, if it's doing something
> like this (I probably have the semantics wrong -- I've never spent much
> time with mmap()):
>
> fd = open("/some/file", O_RDWR);
> ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
> ret = msync(ptr, 65536, MS_SYNC);
> /* or alternatively, this:
> ret = msync(ptr, NULL, MS_SYNC);
> */
>
> Then this, to me, would be mostly the equivalent to:
>
> fd = fopen("/some/file", "r+");
> ret = fsync(fd);
>
> Otherwise, if it's calling msync() only on an address/location within
> the region ptr points to, then that may be more efficient (less pages to
> flush).
>

They call msync() for the whole file.  So, there will not be any difference.


> The mmap() arguments -- specifically flags (see man page) -- also play
> a role here.  The one that catches my attention is MAP_NOSYNC.  So you
> may need to look at the mongodb code to figure out what it's mmap()
> call is.
>
> One might wonder why they don't just use open() with the O_SYNC.  I
> imagine that has to do with, again, performance; possibly the don't want
> all I/O synchronous, and would rather flush certain pages in the mmap'd
> region to disk as needed.  I see the legitimacy in that approach (vs.
> just using O_SYNC).
>
> There's really no easy way for me to tell you which is more efficient,
> better, blah blah without spending a lot of time with a benchmarking
> program that tests all of this, *plus* an entire system (world) built
> with profiling.
>

I ran for two hours mongodb with fsync() and got the following:
STARTED                      INBLK OUBLK MAJFLT MINFLT
Thu Dec 15 10:34:52 2011         3 192744    314 3080182

This is output of `ps -o lstart,inblock,oublock,majflt,minflt -U mongodb'.

Then I ran it with default msync():
STARTED                      INBLK OUBLK MAJFLT MINFLT
Thu Dec 15 12:34:53 2011         0 7241555     79 5401945

There are also two graphics of disk business [1] [2].

The difference is significant, in 37 times!  That what I expected to get.

In commentaries for vm_object_page_clean() I found this:

 *      When stuffing pages asynchronously, allow clustering.  XXX we need a
 *      synchronous clustering mode implementation.

It means for me that msync(MS_SYNC) flush every page on disk in single IO
transaction.  If we multiply 4K and 37 we get 150K.  This number is size of
the single transaction in my experience.

+alc@, kib@

Am I right? Is there any plan to implement this?


> All of this would really fall into the hands of the mongodb people to
> figure out, if you ask me.  But I should note that mmap() on BSD behaves
> and performs very differently than on, say, Linux; so if the authors
> wrote what they did intended for Linux systems, I wouldn't be too
> surprised.  :-)
>

https://jira.mongodb.org/browse/SERVER-663


>  > >I'm extremely confused by this problem.  What you're describing above
> is
> > >that the process is "stuck in biowr state for a long time", but what you
> > >stated originally was that the process was "stuck in ufs state for a
> > >few minutes":
> >
> > Listing of the directory with mongodb files by ls(1) stuck in "ufs"
> > state when one of mongodb's thread in "biowr" state.  It looks like
> > system holds global lock of the file which is msync(2)-ed and can't
> > immediately return from lstat(2) call.
>
> Thanks for the clarification -- yes this helps.  To some degree it makes
> sense, some piece of the filesystem or VFS layer is blocking
> intentionally.  How to figure out what layer I do not know.  Kernel
> folks familiar with this aspect would need to chime in here.
>
> > >>I've got STABLE-8 (r221983) with mongodb-1.8.1 installed on it.  A
> > >>couple days ago I observed that listing of mongodb directory stuck in
> > >>a few minutes in "ufs" state.
> > >
> > >Can we narrow down what we're talking about here?  Does the process
> > >actually deadlock?  Or are you concerned about performance implications?
> > >
> > >I know nothing about this "mongodb" software, but the reason it's
> > >calling msync() is because it wants to try and ensure that the data it
> > >changed in an mmap()-mapped page to be reflected (fully written) on the
> > >disk.  This behaviour is fairly common within database software, but
> > >"how often" the software chooses to do this is entirely a design
> > >implementation choice by the authors.
> > >
> > >Meaning: if mongodb is either 1) continually calling msync(), or 2)
> > >waiting for too long a period of time before calling msync(),
> > >performance within the process will suffer.  #1 could result in overall
> > >bad performance, while #2 could result in a process that's spending a
> > >lot of time doing I/O (flushing to disk) and therefore appears
> > >"deadlocked" when in fact the kernel/subsystems are doing exactly what
> > >they were told to do.
> > >
> > >Removing the msync() call could result in inconsistent data (possibly
> > >non-recoverable) if the mongodb software crashes or if some other piece
> > >(thread or child?  Not sure) expects to open a new fd on that file which
> > >has mmap()'d data.
> >
> > Yes, I clearly understand this.  I think of any system tuning
> > instead, but nothing arose in my head.
>
> Nor I.  I think this is more of a userland/application thing than a
> kernel thing, but there is a love-and-hate relationship between userland
> and kernel when it comes to the above syscalls and framework.
>
> Wish I could be of more help -- sorry.  :-(
>

Your help was very important.  Thanks!


>
> --
> | Jeremy Chadwick                                jdc at parodius.com |
> | Parodius Networking                       http://www.parodius.com/ |
> | UNIX Systems Administrator                   Mountain View, CA, US |
> | Making life hard for others since 1977.               PGP 4BD6C0CB |
>
>
[1] http://zonov.org/mongodb/155647.png
[2] http://zonov.org/mongodb/155657.png

-- 
Andrey Zonov