Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 28 Nov 2013 21:18:51 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        Kirk McKusick <mckusick@mckusick.com>, FreeBSD FS <freebsd-fs@freebsd.org>
Subject:   Re: RFC: NFS client patch to reduce sychronous writes
Message-ID:  <20131128191851.GM59496@kib.kiev.ua>
In-Reply-To: <820090900.22521963.1385648137865.JavaMail.root@uoguelph.ca>
References:  <20131128072435.GI59496@kib.kiev.ua> <820090900.22521963.1385648137865.JavaMail.root@uoguelph.ca>

next in thread | previous in thread | raw e-mail | index | archive | help

--rNM3Wo3yRXFr03ac
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Nov 28, 2013 at 09:15:37AM -0500, Rick Macklem wrote:
> Kostik wrote:
> > On Wed, Nov 27, 2013 at 07:19:47PM -0500, Rick Macklem wrote:
> > > Kirk wrote:
> > > > > Date: Wed, 27 Nov 2013 17:50:48 -0500 (EST)
> > > > > From: Rick Macklem <rmacklem@uoguelph.ca>
> > > > > To: Konstantin Belousov <kostikbel@gmail.com>
> > > > > Subject: Re: RFC: NFS client patch to reduce sychronous writes
> > > > >=20
> > > > > Kostik wrote:
> > > > >> Sorry, I do not understand the question. mmap(2) itself does
> > > > >> not
> > > > >> change
> > > > >> file size.  But if mmaped area includes the last page, I still
> > > > >> think
> > > > >> that the situation I described before is possible.
> > > > >=20
> > > > > Yes, I'll need to look at this. If it is a problem, all I can
> > > > > think
> > > > > of
> > > > > is bzeroing all new pages when they're allocated to the buffer
> > > > > cache.
> > > > >=20
> > > > > Thanks for looking at it, rick
> > > > > ps: Btw, jhb@'s patch didn't have the bzeroing in it.
> > > >=20
> > > > The ``fix'' of bzero'ing every buffer cache page was made to
> > > > UFS/FFS
> > > > for this problem and it killed write performance of the
> > > > filesystem
> > > > by nearly half. We corrected this by only doing the bzero when
> > > > the
> > > > file is mmap'ed which helped things considerably (since most
> > > > files
> > > > being written are not also bmap'ed).
> > > >=20
> > > > 	Kirk
> > > >=20
> > > Ok, thanks. I've been trying to reproduce the problem over NFS and
> > > haven't been able to break my patch. I was using the attached
> > > trivial
> > > test program and would simply make a copy of the source file (529
> > > bytes)
> > > to test on. I got the same results both locally and over NFS:
> > > - built without -DWRITEIT, the setting of a value after EOF would
> > > be
> > >   lost, because nothing grew the file from 529 bytes to over
> > >   4080bytes.
> > > - built with -DWRITEIT, both the 'A' and 'B' are in the result,
> > > since
> > >   my patch bzeros the grown segment in the write(2) syscall.
> > >=20
> > > - If I move the write (code in #ifdef WRITEIT) to after the "*cp"
> > >   of the mapped page, the 'A' assigned to "*cp" gets lost for
> > >   both UFS and NFS.
> > >   Is this correct behaviour?
> > >=20
> > > If it is correct behaviour, I can't see how the patch is broken,
> > > but
> > > if you think it might still be, I'll look at doing what Kirk
> > > suggests,
> > > which is bzeroing all new buffer cache pages when the file is
> > > mmap()d.
> > >=20
> > Replying there, since text description is more informative than the
> > code.
> >=20
> > You cannot get the situation I described, with single process.
> > You should have a writer in one thread, and reader through the mmaped
> > area in another.  Even than, the race window is thin.
> >=20
> > Let me describe the issue which could exist one more time:
> >=20
> > Thread A (writer) issued write(2).  The kernel does two things:
> > 1. zeroes part of the last buffer of the affected file.
> > 2. kernel uiomove()s the write data into the buffer b_data.
> >=20
> > Now, assume that thread B has the same file mmaped somewhere, and
> > accesses the page of the buffer after the [1] but before [2].  Than,
> > it would see zeroes instead of the valid content.
> >=20
> > I said that this breaks write/mmap consistency, since thread B can
> > see a content in the file which was never written there.  The
> > condition
> > is transient, it self-repairs after thread A passes point 2.
> >=20
> Ok, but doesn't that exist now?
>=20
> Without the patch, when the Thread A (writer) appends to the file,
> the np->n_size is grown and vnode_pager_setsize() is called with
> the new, larger size, followed by an allocbuf().
> { at around line# 1066,1073 of nfs_clbio.c. Same code exists in the
>   old client and was cloned. }
> Then vn_io_fault_uiomove() is called at line#1195.
>=20
> Without the patch, Thread B would get whatever garbage is in the
> page(s) if it accesses the latter (just grown) part of the buffer between
> 1 and 2, would it not?
>=20
> With the patch, it will most likely get 0s instead of garbage, if
> it accesses the latter (just grown) part of the buffer between 1. and 2.
> There is actually a short time just after the vnode_pager_setsize()
> and before the bzeroing, that it would still get garbage.
> (Note that the patch only zeros out the part of the last page that
>  was "grown" by the write in 1. that is past the previous EOF.
>  If I bzeroing data that was before the previous EOF, I can see
>  the breakage, but the patch doesn't do that. Unless I've written
>  buggy code, of course, but I think it's correct.)
>=20
> So, I don't see any difference between the unpatched and patched
> version?
AFAIU, you bzero the whole region in the buffer which is subject to i/o,
not the new extent.  E.g., if the file size mod biosize is non-zero,
but the current write starts on the buffer boundary, than then
range from the buffer start to the old EOF is zeroed for some moment,
before being overwritten by user data.

--rNM3Wo3yRXFr03ac
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (FreeBSD)

iQIcBAEBAgAGBQJSl5caAAoJEJDCuSvBvK1BDxoQAJ3izrlf3y3xpjTglmsFXGJ0
AbLjaiptVr35dLPK0cPMuia/zScjmOPx1gZCjg4AZVuQ6dPMqsadIIhORDu+ekGF
r2RaTPW570P0a0IDoAkZwpNVhBf8WnJ9SEvt9N43H6DaEVVoRSGtwqSlV/vbk10o
qC4B//QhRlNpQa4drP1AJRd1DPCKOzjTmPnWeSYJiYmp8x2aZUWfb3N6OykG8m/f
o9O4Y20YrIg9gAdw207niunH3umzanrV40TDzH+8BGs66mUrH2rmwNZbbdAW8SEE
fZkOgG76Ixc8OhHfT1GT6xQbfr925lh9aa+ODAMQ4SDzWvfCRGA0H24iLCXe+LCy
GVBeTCO1B7y5O271a3BunxzvuRuBeo0fELh0wPQo365iM2WJ9Nk8C4tCCuxqNNlQ
aCBymFehLKekCUTK3plhMoxmy5tnf3y8s9rkZdg9j24ATGUrGKFAMZZ69AEeyGS7
qFV5BDGsAWWyXg3xVmZhgPVcMhs1KyX4kkoAloAbY/nkdtvmFSdE2NpFGZrNdm8Q
zBeJRV1P4g6mKuSXHF8FDLsqYMxdtwhsRtXJYOpdvqxU6wDEwEdR4LeIdxMFU9n7
YY+81si4GPZe0cxE5Yz0Az4obyhLYbj+7DQ5nWBWgWjuOLvuuaE3Eq0IZqrmSkEm
Hbe6CVwMdQPDoxdGuFiq
=EGIt
-----END PGP SIGNATURE-----

--rNM3Wo3yRXFr03ac--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20131128191851.GM59496>