From: Andrey Zonov <andrey@zonov.org>
Date: Wed, 21 Dec 2011 21:03:02 +0400
To: Kostik Belousov
Cc: alc@freebsd.org, freebsd-stable@freebsd.org, Jeremy Chadwick
Subject: Re: directory listing hangs in "ufs" state

On 15.12.2011 17:01, Kostik Belousov wrote:
> On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:
>> On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick wrote:
>>
>>> On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
>>>> On 14.12.2011 22:22, Jeremy Chadwick wrote:
>>>>> On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
>>>>>> Hi Jeremy,
>>>>>>
>>>>>> This is not a hardware problem, I've already checked that.  I also
>>>>>> ran fsck today and got no errors.
>>>>>>
>>>>>> After some more exploration of how mongodb works, I found that when
>>>>>> the listing hangs, one of the mongodb threads is in the "biowr"
>>>>>> state for a long time.  According to the ktrace output, it
>>>>>> periodically calls msync(MS_SYNC).
>>>>>>
>>>>>> If I remove the msync() calls from mongodb, how often will the
>>>>>> data be synced by the OS?
>>>>>>
>>>>>> --
>>>>>> Andrey Zonov
>>>>>>
>>>>>> On 14.12.2011 2:15, Jeremy Chadwick wrote:
>>>>>>> On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
>>>>>>>>
>>>>>>>> Do you have any ideas what is going on, or how to catch the
>>>>>>>> problem?
>>>>>>>
>>>>>>> Assuming this isn't a file on the root filesystem, try booting
>>>>>>> the machine in single-user mode and using "fsck -f" on the
>>>>>>> filesystem in question.
>>>>>>>
>>>>>>> Can you verify there are no problems with the disk this file
>>>>>>> lives on as well (smartctl -a /dev/disk)?  I doubt this is the
>>>>>>> problem, but thought I'd mention it.
>>>>>
>>>>> I have no real answer, I'm sorry.  msync(2) indicates it's
>>>>> effectively deprecated (see BUGS).
>>>>> It looks to me like this is effectively an mmap version of
>>>>> fsync(2).
>>>>
>>>> I replaced msync(2) with fsync(2).  Unfortunately, from the man
>>>> pages it is not obvious that I can do this.  Anyway, thanks.
>>>
>>> Sorry, that wasn't what I was implying.  Let me try to explain
>>> differently.
>>>
>>> msync(2) looks, to me, like an mmap-specific version of fsync(2).
>>> Based on the man page, it seems that with msync() you can
>>> effectively guarantee flushing of certain pages within an mmap()'d
>>> region to disk.  fsync() would cause **all** buffers/internal pages
>>> to be flushed to disk.
>>>
>>> One would need to look at the mongodb code to find out what it's
>>> actually doing with msync().  That is to say, if it's doing
>>> something like this (I probably have the semantics wrong -- I've
>>> never spent much time with mmap()):
>>>
>>> fd = open("/some/file", O_RDWR);
>>> ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>>> ret = msync(ptr, 65536, MS_SYNC);
>>> /* or alternatively, this:
>>> ret = msync(ptr, 0, MS_SYNC);
>>> */
>>>
>>> Then this, to me, would be mostly equivalent to:
>>>
>>> fp = fopen("/some/file", "r+");
>>> ret = fsync(fileno(fp));
>>>
>>> Otherwise, if it's calling msync() only on an address/location
>>> within the region ptr points to, then that may be more efficient
>>> (fewer pages to flush).
>>>
>>
>> They call msync() for the whole file, so there will not be any
>> difference.
>>
>>> The mmap() arguments -- specifically flags (see man page) -- also
>>> play a role here.  The one that catches my attention is MAP_NOSYNC.
>>> So you may need to look at the mongodb code to figure out what its
>>> mmap() call is.
>>>
>>> One might wonder why they don't just use open() with O_SYNC.  I
>>> imagine that has to do with, again, performance; possibly they
>>> don't want all I/O to be synchronous, and would rather flush certain
>>> pages in the mmap'd region to disk as needed.  I see the legitimacy
>>> in that approach (vs. just using O_SYNC).
>>>
>>> There's really no easy way for me to tell you which is more
>>> efficient, better, etc. without spending a lot of time with a
>>> benchmarking program that tests all of this, *plus* an entire
>>> system (world) built with profiling.
>>>
>>
>> I ran mongodb with fsync() for two hours and got the following:
>>
>> STARTED                   INBLK    OUBLK  MAJFLT   MINFLT
>> Thu Dec 15 10:34:52 2011      3   192744     314  3080182
>>
>> This is the output of `ps -o lstart,inblock,oublock,majflt,minflt -U
>> mongodb'.
>>
>> Then I ran it with the default msync():
>>
>> STARTED                   INBLK    OUBLK  MAJFLT   MINFLT
>> Thu Dec 15 12:34:53 2011      0  7241555      79  5401945
>>
>> There are also two graphs of disk busy time [1] [2].
>>
>> The difference is significant: a factor of 37!  That is what I
>> expected to get.
>>
>> In the comments for vm_object_page_clean() I found this:
>>
>>  * When stuffing pages asynchronously, allow clustering.  XXX we need a
>>  * synchronous clustering mode implementation.
>>
>> To me this means that msync(MS_SYNC) flushes every page to disk in
>> its own single I/O transaction.  If we multiply 4K by 37 we get
>> 150K, which matches the size of a single clustered transaction in my
>> experience.
>>
>> +alc@, kib@
>>
>> Am I right?  Is there any plan to implement this?
> The current buffer clustering code can do only async writes.  In
> fact, I am not quite sure what would constitute sync clustering,
> because the ability to delay the write is important to be able to
> cluster at all.
>
> Also, I am not sure that the lack of clustering is the biggest
> problem.
> IMO, the fact that each write is sync is the first problem there.
> It would be quite a bit of work to add tracking of the issued writes
> to vm_object_page_clean() and down the stack, especially due to the
> custom page-write VOPs in several filesystems.
>
> The only guarantee that POSIX requires from msync(MS_SYNC) is that
> the writes are finished when the syscall returns, not that the writes
> are done synchronously.  Below is a hack which should help if the
> msync()ed region contains the mapping of the whole file, since it is
> then possible to schedule all writes asynchronously and fsync() the
> file afterwards.  It will cause an unneeded metadata update, but I
> think it would still be much faster.
>
>
> diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
> index 250b769..a9de554 100644
> --- a/sys/vm/vm_object.c
> +++ b/sys/vm/vm_object.c
> @@ -938,7 +938,7 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
>  	vm_object_t backing_object;
>  	struct vnode *vp;
>  	struct mount *mp;
> -	int flags;
> +	int flags, fsync_after;
>
>  	if (object == NULL)
>  		return;
> @@ -971,11 +971,26 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
>  		(void) vn_start_write(vp, &mp, V_WAIT);
>  		vfslocked = VFS_LOCK_GIANT(vp->v_mount);
>  		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
> -		flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
> -		flags |= invalidate ? OBJPC_INVAL : 0;
> +		if (syncio && !invalidate && offset == 0 &&
> +		    OFF_TO_IDX(size) == object->size) {
> +			/*
> +			 * If syncing the whole mapping of the file,
> +			 * it is faster to schedule all the writes in
> +			 * async mode, also allowing the clustering,
> +			 * and then wait for i/o to complete.
> +			 */
> +			flags = 0;
> +			fsync_after = TRUE;
> +		} else {
> +			flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
> +			flags |= invalidate ? (OBJPC_SYNC | OBJPC_INVAL) : 0;
> +			fsync_after = FALSE;
> +		}
>  		VM_OBJECT_LOCK(object);
>  		vm_object_page_clean(object, offset, offset + size, flags);
>  		VM_OBJECT_UNLOCK(object);
> +		if (fsync_after)
> +			(void) VOP_FSYNC(vp, MNT_WAIT, curthread);
>  		VOP_UNLOCK(vp, 0);
>  		VFS_UNLOCK_GIANT(vfslocked);
>  		vn_finished_write(mp);

Thanks, this patch works.  Performance is the same as with fsync().

Actually, Linux uses fsync() inside of msync() if MS_SYNC is set:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/msync.c;h=632df4527c0122062d9332a0d483835274ed62f6;hb=HEAD
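
For reference, here is a minimal standalone sketch of the whole-file
mmap()/msync()/fsync() pattern discussed in this thread.  The file
name and size are invented for illustration, and this is not mongodb's
actual code; either flush call alone would suffice, both are shown
only for comparison.

/*
 * Dirty an mmap(MAP_SHARED) region, then flush it either with
 * msync(MS_SYNC) on the whole mapping (the path sped up by the patch
 * above) or with fsync() on the descriptor.
 */
#include <sys/mman.h>

#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE	(64 * 1024)	/* hypothetical 64K data file */

int
main(void)
{
	char *p;
	int fd;

	fd = open("/tmp/msync-test", O_RDWR | O_CREAT, 0644);
	if (fd == -1)
		err(1, "open");
	if (ftruncate(fd, FILE_SIZE) == -1)
		err(1, "ftruncate");

	p = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
	    fd, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");

	memset(p, 'x', FILE_SIZE);	/* dirty every page */

	/* Flush the whole mapping; returns once the writes are done. */
	if (msync(p, FILE_SIZE, MS_SYNC) == -1)
		err(1, "msync");

	/* The equivalent flush through the descriptor, as tried above. */
	if (fsync(fd) == -1)
		err(1, "fsync");

	if (munmap(p, FILE_SIZE) == -1)
		err(1, "munmap");
	close(fd);
	return (0);
}

-- 
Andrey Zonov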