From owner-freebsd-stable@FreeBSD.ORG Thu Dec 15 11:51:05 2011 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 681C9106564A; Thu, 15 Dec 2011 11:51:05 +0000 (UTC) (envelope-from andrey@zonov.org) Received: from mail-ww0-f50.google.com (mail-ww0-f50.google.com [74.125.82.50]) by mx1.freebsd.org (Postfix) with ESMTP id 959378FC0A; Thu, 15 Dec 2011 11:51:04 +0000 (UTC) Received: by wgbdr11 with SMTP id dr11so3789389wgb.31 for ; Thu, 15 Dec 2011 03:51:03 -0800 (PST) MIME-Version: 1.0 Received: by 10.180.94.195 with SMTP id de3mr4291716wib.14.1323949862696; Thu, 15 Dec 2011 03:51:02 -0800 (PST) Received: by 10.180.19.67 with HTTP; Thu, 15 Dec 2011 03:51:02 -0800 (PST) X-Originating-IP: [95.108.170.17] In-Reply-To: <20111214204201.GA7372@icarus.home.lan> References: <4EE7BF77.5000504@zonov.org> <20111213221501.GA85563@icarus.home.lan> <4EE8E6E3.7050202@zonov.org> <20111214182252.GA5176@icarus.home.lan> <4EE8FD3E.8030902@zonov.org> <20111214204201.GA7372@icarus.home.lan> Date: Thu, 15 Dec 2011 15:51:02 +0400 Message-ID: From: Andrey Zonov To: Jeremy Chadwick Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Cc: alc@freebsd.org, kib@freebsd.org, freebsd-stable@freebsd.org Subject: Re: directory listing hangs in "ufs" state X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 15 Dec 2011 11:51:05 -0000 On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick wrote: > On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote: > > On 14.12.2011 22:22, Jeremy Chadwick wrote: > > >On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote: > > >>Hi Jeremy, > > >> > > >>This is not hardware problem, I've already checked that. I also ran > > >>fsck today and got no errors. > > >> > > >>After some more exploration of how mongodb works, I found that then > > >>listing hangs, one of mongodb thread is in "biowr" state for a long > > >>time. It periodically calls msync(MS_SYNC) accordingly to ktrace > > >>out. > > >> > > >>If I'll remove msync() calls from mongodb, how often data will be > > >>sync by OS? > > >> > > >>-- > > >>Andrey Zonov > > >> > > >>On 14.12.2011 2:15, Jeremy Chadwick wrote: > > >>>On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote: > > >>>> > > >>>>Have you any ideas what is going on? or how to catch the problem? > > >>> > > >>>Assuming this isn't a file on the root filesystem, try booting the > > >>>machine in single-user mode and using "fsck -f" on the filesystem in > > >>>question. > > >>> > > >>>Can you verify there's no problems with the disk this file lives on as > > >>>well (smartctl -a /dev/disk)? I'm doubting this is the problem, but > > >>>thought I'd mention it. > > > > > >I have no real answer, I'm sorry. msync(2) indicates it's effectively > > >deprecated (see BUGS). It looks like this is effectively a mmap-version > > >of fsync(2). > > > > I replaced msync(2) with fsync(2). Unfortunately, from man pages it > > is not obvious that I can do this. Anyway, thanks. > > Sorry, that wasn't what I was implying. Let me try to explain > differently. > > msync(2) looks, to me, like an mmap-specific version of fsync(2). Based > on the man page, it seems that the with msync() you can effectively > guaranteed flushing of certain pages within an mmap()'d region to disk. > fsync() would flush **all** buffers/internal pages to be flushed to > disk. > > One would need to look at the code to mongodb to find out what it's > actually doing with msync(). That is to say, if it's doing something > like this (I probably have the semantics wrong -- I've never spent much > time with mmap()): > > fd = open("/some/file", O_RDWR); > ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); > ret = msync(ptr, 65536, MS_SYNC); > /* or alternatively, this: > ret = msync(ptr, NULL, MS_SYNC); > */ > > Then this, to me, would be mostly the equivalent to: > > fd = fopen("/some/file", "r+"); > ret = fsync(fd); > > Otherwise, if it's calling msync() only on an address/location within > the region ptr points to, then that may be more efficient (less pages to > flush). > They call msync() for the whole file. So, there will not be any difference. > The mmap() arguments -- specifically flags (see man page) -- also play > a role here. The one that catches my attention is MAP_NOSYNC. So you > may need to look at the mongodb code to figure out what it's mmap() > call is. > > One might wonder why they don't just use open() with the O_SYNC. I > imagine that has to do with, again, performance; possibly the don't want > all I/O synchronous, and would rather flush certain pages in the mmap'd > region to disk as needed. I see the legitimacy in that approach (vs. > just using O_SYNC). > > There's really no easy way for me to tell you which is more efficient, > better, blah blah without spending a lot of time with a benchmarking > program that tests all of this, *plus* an entire system (world) built > with profiling. > I ran for two hours mongodb with fsync() and got the following: STARTED INBLK OUBLK MAJFLT MINFLT Thu Dec 15 10:34:52 2011 3 192744 314 3080182 This is output of `ps -o lstart,inblock,oublock,majflt,minflt -U mongodb'. Then I ran it with default msync(): STARTED INBLK OUBLK MAJFLT MINFLT Thu Dec 15 12:34:53 2011 0 7241555 79 5401945 There are also two graphics of disk business [1] [2]. The difference is significant, in 37 times! That what I expected to get. In commentaries for vm_object_page_clean() I found this: * When stuffing pages asynchronously, allow clustering. XXX we need a * synchronous clustering mode implementation. It means for me that msync(MS_SYNC) flush every page on disk in single IO transaction. If we multiply 4K and 37 we get 150K. This number is size of the single transaction in my experience. +alc@, kib@ Am I right? Is there any plan to implement this? > All of this would really fall into the hands of the mongodb people to > figure out, if you ask me. But I should note that mmap() on BSD behaves > and performs very differently than on, say, Linux; so if the authors > wrote what they did intended for Linux systems, I wouldn't be too > surprised. :-) > https://jira.mongodb.org/browse/SERVER-663 > > >I'm extremely confused by this problem. What you're describing above > is > > >that the process is "stuck in biowr state for a long time", but what you > > >stated originally was that the process was "stuck in ufs state for a > > >few minutes": > > > > Listing of the directory with mongodb files by ls(1) stuck in "ufs" > > state when one of mongodb's thread in "biowr" state. It looks like > > system holds global lock of the file which is msync(2)-ed and can't > > immediately return from lstat(2) call. > > Thanks for the clarification -- yes this helps. To some degree it makes > sense, some piece of the filesystem or VFS layer is blocking > intentionally. How to figure out what layer I do not know. Kernel > folks familiar with this aspect would need to chime in here. > > > >>I've got STABLE-8 (r221983) with mongodb-1.8.1 installed on it. A > > >>couple days ago I observed that listing of mongodb directory stuck in > > >>a few minutes in "ufs" state. > > > > > >Can we narrow down what we're talking about here? Does the process > > >actually deadlock? Or are you concerned about performance implications? > > > > > >I know nothing about this "mongodb" software, but the reason it's > > >calling msync() is because it wants to try and ensure that the data it > > >changed in an mmap()-mapped page to be reflected (fully written) on the > > >disk. This behaviour is fairly common within database software, but > > >"how often" the software chooses to do this is entirely a design > > >implementation choice by the authors. > > > > > >Meaning: if mongodb is either 1) continually calling msync(), or 2) > > >waiting for too long a period of time before calling msync(), > > >performance within the process will suffer. #1 could result in overall > > >bad performance, while #2 could result in a process that's spending a > > >lot of time doing I/O (flushing to disk) and therefore appears > > >"deadlocked" when in fact the kernel/subsystems are doing exactly what > > >they were told to do. > > > > > >Removing the msync() call could result in inconsistent data (possibly > > >non-recoverable) if the mongodb software crashes or if some other piece > > >(thread or child? Not sure) expects to open a new fd on that file which > > >has mmap()'d data. > > > > Yes, I clearly understand this. I think of any system tuning > > instead, but nothing arose in my head. > > Nor I. I think this is more of a userland/application thing than a > kernel thing, but there is a love-and-hate relationship between userland > and kernel when it comes to the above syscalls and framework. > > Wish I could be of more help -- sorry. :-( > Your help was very important. Thanks! > > -- > | Jeremy Chadwick jdc at parodius.com | > | Parodius Networking http://www.parodius.com/ | > | UNIX Systems Administrator Mountain View, CA, US | > | Making life hard for others since 1977. PGP 4BD6C0CB | > > [1] http://zonov.org/mongodb/155647.png [2] http://zonov.org/mongodb/155657.png -- Andrey Zonov