Date:      Thu, 22 Dec 2011 11:48:36 +0200
From:      Kostik Belousov <kostikbel@gmail.com>
To:        Andrey Zonov <andrey@zonov.org>
Cc:        alc@freebsd.org, freebsd-stable@freebsd.org, Jeremy Chadwick <freebsd@jdc.parodius.com>
Subject:   Re: directory listing hangs in "ufs" state
Message-ID:  <20111222094836.GD50300@deviant.kiev.zoral.com.ua>
In-Reply-To: <4EF21146.9010107@zonov.org>
References:  <4EE7BF77.5000504@zonov.org> <20111213221501.GA85563@icarus.home.lan> <4EE8E6E3.7050202@zonov.org> <20111214182252.GA5176@icarus.home.lan> <4EE8FD3E.8030902@zonov.org> <20111214204201.GA7372@icarus.home.lan> <CANU_PUGtjjxP-qLjEqb2wVnL_QGJvtApnaD8SSF4zLksY4ME6A@mail.gmail.com> <20111215130111.GN50300@deviant.kiev.zoral.com.ua> <4EF21146.9010107@zonov.org>

On Wed, Dec 21, 2011 at 09:03:02PM +0400, Andrey Zonov wrote:
> On 15.12.2011 17:01, Kostik Belousov wrote:
> >On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:
> >>On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick
> >><freebsd@jdc.parodius.com>wrote:
> >>
> >>>On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
> >>>>On 14.12.2011 22:22, Jeremy Chadwick wrote:
> >>>>>On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
> >>>>>>Hi Jeremy,
> >>>>>>
> >>>>>>This is not a hardware problem, I've already checked that. I also ran
> >>>>>>fsck today and got no errors.
> >>>>>>
> >>>>>>After some more exploration of how mongodb works, I found that when
> >>>>>>the listing hangs, one of the mongodb threads is in the "biowr" state
> >>>>>>for a long time. It periodically calls msync(MS_SYNC), according to
> >>>>>>the ktrace output.
> >>>>>>
> >>>>>>If I remove the msync() calls from mongodb, how often will the data
> >>>>>>be synced by the OS?
> >>>>>>
> >>>>>>--
> >>>>>>Andrey Zonov
> >>>>>>
> >>>>>>On 14.12.2011 2:15, Jeremy Chadwick wrote:
> >>>>>>>On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
> >>>>>>>>
> >>>>>>>>Have you any ideas what is going on? or how to catch the problem?
> >>>>>>>
> >>>>>>>Assuming this isn't a file on the root filesystem, try booting the
> >>>>>>>machine in single-user mode and using "fsck -f" on the filesystem in
> >>>>>>>question.
> >>>>>>>
> >>>>>>>Can you verify there's no problems with the disk this file lives on
> >>>>>>>as well (smartctl -a /dev/disk)?  I'm doubting this is the problem, but
> >>>>>>>thought I'd mention it.
> >>>>>
> >>>>>I have no real answer, I'm sorry.  msync(2) indicates it's effectively
> >>>>>deprecated (see BUGS).  It looks like this is effectively a mmap-version
> >>>>>of fsync(2).
> >>>>
> >>>>I replaced msync(2) with fsync(2).  Unfortunately, from man pages it
> >>>>is not obvious that I can do this. Anyway, thanks.
> >>>
> >>>Sorry, that wasn't what I was implying.  Let me try to explain
> >>>differently.
> >>>
> >>>msync(2) looks, to me, like an mmap-specific version of fsync(2).  Based
> >>>on the man page, it seems that with msync() you can effectively
> >>>guarantee flushing of certain pages within an mmap()'d region to disk.
> >>>fsync() would cause **all** buffers/internal pages to be flushed to
> >>>disk.
> >>>
> >>>One would need to look at the code to mongodb to find out what it's
> >>>actually doing with msync().  That is to say, if it's doing something
> >>>like this (I probably have the semantics wrong -- I've never spent much
> >>>time with mmap()):
> >>>
> >>>fd = open("/some/file", O_RDWR);
> >>>ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
> >>>ret = msync(ptr, 65536, MS_SYNC);
> >>>/* or alternatively, this (the length is a size_t, so 0 rather than NULL):
> >>>ret = msync(ptr, 0, MS_SYNC);
> >>>*/
> >>>
> >>>Then this, to me, would be mostly the equivalent to:
> >>>
> >>>fd = open("/some/file", O_RDWR);
> >>>ret = fsync(fd);
> >>>
> >>>Otherwise, if it's calling msync() only on an address/location within
> >>>the region ptr points to, then that may be more efficient (fewer pages to
> >>>flush).
> >>>
> >>
> >>They call msync() for the whole file.  So, there will not be any
> >>difference.
> >>
> >>
> >>>The mmap() arguments -- specifically flags (see man page) -- also play
> >>>a role here.  The one that catches my attention is MAP_NOSYNC.  So you
> >>>may need to look at the mongodb code to figure out what its mmap()
> >>>call is.
> >>>
> >>>One might wonder why they don't just use open() with O_SYNC.  I
> >>>imagine that has to do with, again, performance; possibly they don't want
> >>>all I/O synchronous, and would rather flush certain pages in the mmap'd
> >>>region to disk as needed.  I see the legitimacy in that approach (vs.
> >>>just using O_SYNC).
> >>>
> >>>There's really no easy way for me to tell you which is more efficient,
> >>>better, blah blah without spending a lot of time with a benchmarking
> >>>program that tests all of this, *plus* an entire system (world) built
> >>>with profiling.
> >>>
> >>
> >>I ran mongodb for two hours with fsync() and got the following:
> >>STARTED                      INBLK OUBLK MAJFLT MINFLT
> >>Thu Dec 15 10:34:52 2011         3 192744    314 3080182
> >>
> >>This is output of `ps -o lstart,inblock,oublock,majflt,minflt -U mongodb'.
> >>
> >>Then I ran it with default msync():
> >>STARTED                      INBLK OUBLK MAJFLT MINFLT
> >>Thu Dec 15 12:34:53 2011         0 7241555     79 5401945
> >>
> >>There are also two graphs of disk busyness [1] [2].
> >>
> >>The difference is significant: a factor of 37 in output blocks!  That is
> >>what I expected to get.
> >>
> >>In the comments for vm_object_page_clean() I found this:
> >>
> >>  *      When stuffing pages asynchronously, allow clustering.  XXX we need a
> >>  *      synchronous clustering mode implementation.
> >>
> >>To me this means that msync(MS_SYNC) flushes each page to disk in its own
> >>I/O transaction.  If we multiply 4K by 37 we get about 150K.  In my
> >>experience, this number is the size of a single transaction.
> >>
> >>+alc@, kib@
> >>
> >>Am I right? Is there any plan to implement this?
> >Current buffer clustering code can only do async writes. In fact, I
> >am not quite sure what would constitute the sync clustering, because the
> >ability to delay the write is important to be able to cluster at all.
> >
> >Also, I am not sure that lack of clustering is the biggest problem.
> >IMO, the fact that each write is sync is the first problem there. It
> >would be quite a lot of work to add tracking of the issued writes to
> >vm_object_page_clean() and down the stack, especially due to custom page
> >write VOPs in several filesystems.
> >
> >The only guarantee that POSIX requires from msync(MS_SYNC) is that
> >the writes are finished when the syscall returns, and not that the
> >writes are done synchronously. Below is a hack which should help if
> >the msync()ed region contains the mapping of the whole file, since
> >it is then possible to fsync() the file after all writes are scheduled
> >asynchronously. It will cause an unneeded metadata update, but I think
> >it would still be much faster.
> >
> >
> >diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
> >index 250b769..a9de554 100644
> >--- a/sys/vm/vm_object.c
> >+++ b/sys/vm/vm_object.c
> >@@ -938,7 +938,7 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
> >  	vm_object_t backing_object;
> >  	struct vnode *vp;
> >  	struct mount *mp;
> >-	int flags;
> >+	int flags, fsync_after;
> >
> >  	if (object == NULL)
> >  		return;
> >@@ -971,11 +971,26 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
> >  		(void) vn_start_write(vp, &mp, V_WAIT);
> >  		vfslocked = VFS_LOCK_GIANT(vp->v_mount);
> >  		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
> >-		flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
> >-		flags |= invalidate ? OBJPC_INVAL : 0;
> >+		if (syncio && !invalidate && offset == 0 &&
> >+		    OFF_TO_IDX(size) == object->size) {
> >+			/*
> >+			 * If syncing the whole mapping of the file,
> >+			 * it is faster to schedule all the writes in
> >+			 * async mode, also allowing the clustering,
> >+			 * and then wait for i/o to complete.
> >+			 */
> >+			flags = 0;
> >+			fsync_after = TRUE;
> >+		} else {
> >+			flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
> >+			flags |= invalidate ? (OBJPC_SYNC | OBJPC_INVAL) : 0;
> >+			fsync_after =3D FALSE;
> >+		}
> >  		VM_OBJECT_LOCK(object);
> >  		vm_object_page_clean(object, offset, offset + size, flags);
> >  		VM_OBJECT_UNLOCK(object);
> >+		if (fsync_after)
> >+			(void) VOP_FSYNC(vp, MNT_WAIT, curthread);
> >  		VOP_UNLOCK(vp, 0);
> >  		VFS_UNLOCK_GIANT(vfslocked);
> >  		vn_finished_write(mp);
>
> Thanks, this patch works.  Performance is the same as when using fsync().
>
> Actually, Linux uses fsync() inside of msync() if MS_SYNC is set.
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/msync.c;h=632df4527c0122062d9332a0d483835274ed62f6;hb=HEAD
>
I see, indeed Linux fsyncs the whole file if even a single page of it
appears to be (non-shadowed) mmaped into the msync(MS_SYNC) region.
I am not sure that we should follow this behaviour.
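
For anyone who wants to reproduce the userland side of this comparison,
here is a minimal sketch, not taken from mongodb. It dirties a whole-file
MAP_SHARED mapping and then flushes it either with msync(MS_SYNC) over the
full length or with fsync() on the descriptor. The function name
flush_whole_mapping() and its parameters are invented for illustration.
That fsync() writes back pages dirtied through the mapping is an assumption
that matches what Andrey observed on FreeBSD in this thread; it is not
something I would claim for every system.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: dirty a whole-file MAP_SHARED mapping, then flush it.
 *   use_fsync == 0: msync(MS_SYNC) over the whole mapping
 *   use_fsync != 0: fsync() on the descriptor backing the mapping
 * Returns 0 on success, -1 on failure.
 */
int
flush_whole_mapping(const char *path, size_t len, int use_fsync)
{
	char *p;
	int fd, ret;

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd == -1)
		return (-1);
	/* Make the file exactly as long as the mapping. */
	if (ftruncate(fd, (off_t)len) == -1) {
		close(fd);
		return (-1);
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return (-1);
	}
	memset(p, 'x', len);		/* dirty every page of the mapping */
	ret = use_fsync ? fsync(fd) : msync(p, len, MS_SYNC);
	munmap(p, len);
	close(fd);
	return (ret);
}
```

Timing both variants with the same len, while watching ps or gstat, is one
way to look for the difference Andrey measured. The sketch itself makes no
claim about which variant is faster.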

Alan, do you agree with the patch above ?
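
To make the decision in the patch easier to review, it can be modelled in
userland roughly as follows. This is only a sketch of the flag selection,
not kernel code: plan_object_sync(), struct sync_plan and the MODEL_*
constants are hypothetical stand-ins with illustrative values, and
object_pages plays the role of object->size (the object size in pages).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel definitions; values are illustrative. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_OBJPC_SYNC	0x1
#define MODEL_OBJPC_INVAL	0x2

struct sync_plan {
	int	flags;		/* flags handed to the page cleaner */
	bool	fsync_after;	/* wait once, with an fsync, afterwards */
};

/*
 * Model of the choice made in the patch: when a synchronous,
 * non-invalidating request covers the whole object, schedule the writes
 * asynchronously and wait for them with a single fsync.  Otherwise keep
 * the per-page synchronous behaviour.
 */
struct sync_plan
plan_object_sync(uint64_t offset, uint64_t size, uint64_t object_pages,
    bool syncio, bool invalidate)
{
	struct sync_plan plan;

	if (syncio && !invalidate && offset == 0 &&
	    (size >> MODEL_PAGE_SHIFT) == object_pages) {
		plan.flags = 0;
		plan.fsync_after = true;
	} else {
		plan.flags = (syncio || invalidate) ? MODEL_OBJPC_SYNC : 0;
		plan.flags |= invalidate ?
		    (MODEL_OBJPC_SYNC | MODEL_OBJPC_INVAL) : 0;
		plan.fsync_after = false;
	}
	return (plan);
}
```

With a 16-page object, a request for offset 0 and 16 pages with syncio set
takes the new path, while a request that starts at the second page, or one
that also invalidates, keeps the old flags.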



