FreeBSD Mail Archives

Date:      Wed, 14 Dec 2011 12:42:01 -0800
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Andrey Zonov <andrey@zonov.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: directory listing hangs in "ufs" state
Message-ID:  <20111214204201.GA7372@icarus.home.lan>
In-Reply-To: <4EE8FD3E.8030902@zonov.org>
References:  <4EE7BF77.5000504@zonov.org> <20111213221501.GA85563@icarus.home.lan> <4EE8E6E3.7050202@zonov.org> <20111214182252.GA5176@icarus.home.lan> <4EE8FD3E.8030902@zonov.org>

On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
> On 14.12.2011 22:22, Jeremy Chadwick wrote:
> >On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
> >>Hi Jeremy,
> >>
> >>This is not hardware problem, I've already checked that. I also ran
> >>fsck today and got no errors.
> >>
> >>After some more exploration of how mongodb works, I found that when
> >>listing hangs, one of mongodb's threads is in "biowr" state for a long
> >>time. It periodically calls msync(MS_SYNC) according to ktrace
> >>output.
> >>
> >>If I remove the msync() calls from mongodb, how often will data be
> >>synced by the OS?
> >>
> >>--
> >>Andrey Zonov
> >>
> >>On 14.12.2011 2:15, Jeremy Chadwick wrote:
> >>>On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
> >>>>
> >>>>Have you any ideas what is going on? or how to catch the problem?
> >>>
> >>>Assuming this isn't a file on the root filesystem, try booting the
> >>>machine in single-user mode and using "fsck -f" on the filesystem in
> >>>question.
> >>>
> >>>Can you verify there's no problems with the disk this file lives on as
> >>>well (smartctl -a /dev/disk)?  I'm doubting this is the problem, but
> >>>thought I'd mention it.
> >
> >I have no real answer, I'm sorry.  msync(2) indicates it's effectively
> >deprecated (see BUGS).  It looks like this is effectively a mmap-version
> >of fsync(2).
> 
> I replaced msync(2) with fsync(2).  Unfortunately, from man pages it
> is not obvious that I can do this. Anyway, thanks.

Sorry, that wasn't what I was implying.  Let me try to explain
differently.

msync(2) looks, to me, like an mmap-specific version of fsync(2).  Based
on the man page, it seems that with msync() you can effectively
guarantee flushing of certain pages within an mmap()'d region to disk.
fsync(), by contrast, would cause **all** of the file's buffers/internal
pages to be flushed to disk.
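
To make the distinction concrete, here is a minimal sketch of the two
paths: one modifies a file through a shared mapping and flushes it with
msync(), the other modifies it with write(2) and flushes it with
fsync().  The function names, REGION_SIZE and the 0600 mode are made up
for illustration; nothing here is taken from the mongodb source.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 65536   /* hypothetical mapping size */

/* Modify a file through a shared mapping, then flush that mapping. */
int update_with_msync(const char *path, const char *text)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1)
        return -1;
    /* The file must be at least as large as the region we map. */
    if (ftruncate(fd, REGION_SIZE) == -1) {
        close(fd);
        return -1;
    }
    char *ptr = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
        close(fd);
        return -1;
    }
    size_t n = strlen(text);
    if (n > REGION_SIZE)
        n = REGION_SIZE;
    memcpy(ptr, text, n);                         /* dirty the first page(s) */
    int ret = msync(ptr, REGION_SIZE, MS_SYNC);   /* flush the mapped range */
    munmap(ptr, REGION_SIZE);
    close(fd);
    return ret;
}

/* Modify a file through write(2), then flush everything for that file. */
int update_with_fsync(const char *path, const char *text)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1)
        return -1;
    size_t n = strlen(text);
    ssize_t w = write(fd, text, n);
    int ret = (w == (ssize_t)n) ? fsync(fd) : -1;
    close(fd);
    return ret;
}
```

Both functions return 0 on success and -1 on failure.  Which of the two
is cheaper on a given system is exactly the kind of thing that would
need to be measured rather than assumed.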

One would need to look at the mongodb code to find out what it's
actually doing with msync().  That is to say, if it's doing something
like this (I probably have the semantics wrong -- I've never spent much
time with mmap()):

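fd = open("/some/file", O_RDWR);
ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
/* ... modify the memory ptr points to ... */
ret = msync(ptr, 65536, MS_SYNC);
/* or alternatively, this (per msync(2), a len of 0 flushes all modified
   pages in the region containing ptr):
ret = msync(ptr, 0, MS_SYNC);
*/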

Then this, to me, would be mostly equivalent to:

fd = open("/some/file", O_RDWR);
ret = fsync(fd);

Otherwise, if it's calling msync() only on an address/location within
the region ptr points to, then that may be more efficient (fewer pages to
flush).

The mmap() arguments -- specifically flags (see man page) -- also play
a role here.  The one that catches my attention is MAP_NOSYNC.  So you
may need to look at the mongodb code to figure out what its mmap()
call is.
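
For reference, this is roughly where such a flag would show up in a
call.  The helper map_file_shared() and its want_nosync parameter are
invented for this sketch.  MAP_NOSYNC is a FreeBSD flag, so it is
guarded with #ifdef and simply ignored on systems that lack it.  As I
read mmap(2), MAP_NOSYNC only stops the update daemons from flushing
dirty pages gratuitously; an explicit msync() or fsync() should still
write them out.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first len bytes of a file read/write and shared, optionally
 * with MAP_NOSYNC where the platform provides it.
 * Returns MAP_FAILED on any error.
 */
void *map_file_shared(const char *path, size_t len, int want_nosync)
{
    int flags = MAP_SHARED;
#ifdef MAP_NOSYNC
    if (want_nosync)
        flags |= MAP_NOSYNC;    /* FreeBSD: no periodic background flush */
#else
    (void)want_nosync;          /* flag not available on this platform */
#endif

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1)
        return MAP_FAILED;
    if (ftruncate(fd, (off_t)len) == -1) {
        close(fd);
        return MAP_FAILED;
    }
    void *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    return ptr;
}
```

Whether mongodb passes this flag at all is something only its source
(or a ktrace of its mmap() calls) can answer.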

One might wonder why they don't just use open() with O_SYNC.  I
imagine that has to do with, again, performance; possibly they don't want
all I/O to be synchronous, and would rather flush certain pages in the
mmap'd region to disk as needed.  I see the legitimacy in that approach
(vs. just using O_SYNC).
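
For comparison, the O_SYNC approach would look something like the
sketch below.  The function name and file mode are made up.  Every
write(2) on such a descriptor waits for the data to reach the disk,
which is the cost being avoided.  As far as I know, O_SYNC governs
write(2) on that descriptor and does not make stores through a mapping
synchronous, so it would not be a drop-in replacement for msync().

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Write a buffer to a file opened with O_SYNC, so that write(2) does
 * not return until the data has been written to the disk.
 */
int write_synchronously(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0600);
    if (fd == -1)
        return -1;

    ssize_t w = write(fd, buf, len);
    int ret = (w == (ssize_t)len) ? 0 : -1;

    if (close(fd) == -1)
        ret = -1;
    return ret;
}
```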

There's really no easy way for me to tell you which is more efficient,
better, blah blah without spending a lot of time with a benchmarking
program that tests all of this, *plus* an entire system (world) built
with profiling.

All of this would really fall into the hands of the mongodb people to
figure out, if you ask me.  But I should note that mmap() on BSD behaves
and performs very differently than on, say, Linux; so if the authors
wrote their code with Linux systems in mind, I wouldn't be too
surprised.  :-)

> >I'm extremely confused by this problem.  What you're describing above is
> >that the process is "stuck in biowr state for a long time", but what you
> >stated originally was that the process was "stuck in ufs state for a
> >few minutes":
> 
> Listing of the directory with mongodb files by ls(1) gets stuck in "ufs"
> state when one of mongodb's threads is in "biowr" state.  It looks like
> the system holds a global lock on the file which is being msync(2)-ed and
> can't immediately return from the lstat(2) call.

Thanks for the clarification -- yes, this helps.  To some degree it makes
sense: some piece of the filesystem or VFS layer is blocking
intentionally.  How to figure out which layer, I do not know.  Kernel
folks familiar with this aspect would need to chime in here.

> >>I've got STABLE-8 (r221983) with mongodb-1.8.1 installed on it.  A
> >>couple days ago I observed that listing of the mongodb directory was
> >>stuck for a few minutes in "ufs" state.
> >
> >Can we narrow down what we're talking about here?  Does the process
> >actually deadlock?  Or are you concerned about performance implications?
> >
> >I know nothing about this "mongodb" software, but the reason it's
> >calling msync() is because it wants to try and ensure that the data it
> >changed in an mmap()-mapped page to be reflected (fully written) on the
> >disk.  This behaviour is fairly common within database software, but
> >"how often" the software chooses to do this is entirely a design
> >implementation choice by the authors.
> >
> >Meaning: if mongodb is either 1) continually calling msync(), or 2)
> >waiting for too long a period of time before calling msync(),
> >performance within the process will suffer.  #1 could result in overall
> >bad performance, while #2 could result in a process that's spending a
> >lot of time doing I/O (flushing to disk) and therefore appears
> >"deadlocked" when in fact the kernel/subsystems are doing exactly what
> >they were told to do.
> >
> >Removing the msync() call could result in inconsistent data (possibly
> >non-recoverable) if the mongodb software crashes or if some other piece
> >(thread or child?  Not sure) expects to open a new fd on that file which
> >has mmap()'d data.
> 
> Yes, I clearly understand this.  I think of any system tuning
> instead, but nothing arose in my head.

Nor I.  I think this is more of a userland/application thing than a
kernel thing, but there is a love-and-hate relationship between userland
and kernel when it comes to the above syscalls and framework.

Wish I could be of more help -- sorry.  :-(

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20111214204201.GA7372>
