Date:      Thu, 28 Nov 2013 18:50:20 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Kirk McKusick <mckusick@mckusick.com>, FreeBSD FS <freebsd-fs@freebsd.org>
Subject:   Re: RFC: NFS client patch to reduce sychronous writes
Message-ID:  <1355437347.22943634.1385682620696.JavaMail.root@uoguelph.ca>
In-Reply-To: <20131128191851.GM59496@kib.kiev.ua>

Kostik wrote:
> On Thu, Nov 28, 2013 at 09:15:37AM -0500, Rick Macklem wrote:
> > Kostik wrote:
> > > On Wed, Nov 27, 2013 at 07:19:47PM -0500, Rick Macklem wrote:
> > > > Kirk wrote:
> > > > > > Date: Wed, 27 Nov 2013 17:50:48 -0500 (EST)
> > > > > > From: Rick Macklem <rmacklem@uoguelph.ca>
> > > > > > To: Konstantin Belousov <kostikbel@gmail.com>
> > > > > > Subject: Re: RFC: NFS client patch to reduce sychronous
> > > > > > writes
> > > > > > 
> > > > > > Kostik wrote:
> > > > > >> Sorry, I do not understand the question. mmap(2) itself does
> > > > > >> not change file size.  But if the mmaped area includes the
> > > > > >> last page, I still think that the situation I described
> > > > > >> before is possible.
> > > > > > 
> > > > > > Yes, I'll need to look at this. If it is a problem, all I can
> > > > > > think of is bzeroing all new pages when they're allocated to
> > > > > > the buffer cache.
> > > > > > 
> > > > > > Thanks for looking at it, rick
> > > > > > ps: Btw, jhb@'s patch didn't have the bzeroing in it.
> > > > > 
> > > > > The ``fix'' of bzero'ing every buffer cache page was made to
> > > > > UFS/FFS for this problem and it killed write performance of the
> > > > > filesystem by nearly half. We corrected this by only doing the
> > > > > bzero when the file is mmap'ed, which helped things considerably
> > > > > (since most files being written are not also mmap'ed).
> > > > > 
> > > > > 	Kirk
> > > > > 
> > > > Ok, thanks. I've been trying to reproduce the problem over NFS and
> > > > haven't been able to break my patch. I was using the attached
> > > > trivial test program and would simply make a copy of the source
> > > > file (529 bytes) to test on. I got the same results both locally
> > > > and over NFS:
> > > > - built without -DWRITEIT, the setting of a value after EOF would
> > > >   be lost, because nothing grew the file from 529 bytes to over
> > > >   4080 bytes.
> > > > - built with -DWRITEIT, both the 'A' and 'B' are in the result,
> > > >   since my patch bzeros the grown segment in the write(2) syscall.
> > > > 
> > > > - If I move the write (code in #ifdef WRITEIT) to after the "*cp"
> > > >   of the mapped page, the 'A' assigned to "*cp" gets lost for both
> > > >   UFS and NFS.
> > > >   Is this correct behaviour?
> > > > 
> > > > If it is correct behaviour, I can't see how the patch is broken,
> > > > but if you think it might still be, I'll look at doing what Kirk
> > > > suggests, which is bzeroing all new buffer cache pages when the
> > > > file is mmap()d.
> > > > 
> > > Replying there, since the text description is more informative than
> > > the code.
> > > 
> > > You cannot get the situation I described with a single process.
> > > You should have a writer in one thread, and a reader through the
> > > mmaped area in another.  Even then, the race window is thin.
> > > 
> > > Let me describe the issue which could exist one more time:
> > > 
> > > Thread A (writer) issues write(2).  The kernel does two things:
> > > 1. zeroes part of the last buffer of the affected file.
> > > 2. uiomove()s the write data into the buffer's b_data.
> > > 
> > > Now, assume that thread B has the same file mmaped somewhere, and
> > > accesses the page of the buffer after [1] but before [2].  Then,
> > > it would see zeroes instead of the valid content.
> > > 
> > > I said that this breaks write/mmap consistency, since thread B can
> > > see content in the file which was never written there.  The
> > > condition is transient; it self-repairs after thread A passes
> > > point 2.
> > > 
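To make that ordering concrete, a rough userland checker for this kind of
window might look like the sketch below. The file name, sizes and loop count
are made up, it assumes the file sits on an NFS mount, and since the window
is kernel-internal and very narrow it may well never trip even if the
problem is real.

#include <sys/mman.h>

#include <err.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define FNAME   "racefile"      /* run it on an NFS mount */
#define OLDSZ   529             /* pre-existing data, all bytes 'X' */
#define NLOOP   100000

static volatile int stop;

/* Thread B: bytes before the old EOF should never read back as zero. */
static void *
reader(void *arg)
{
        volatile char *p = arg;
        int i;

        while (!stop)
                for (i = 0; i < OLDSZ; i++)
                        if (p[i] != 'X') {
                                printf("byte %d momentarily read 0x%x\n", i,
                                    (unsigned char)p[i]);
                                stop = 1;
                                break;
                        }
        return (NULL);
}

int
main(void)
{
        pthread_t t;
        char old[OLDSZ], app[100];
        void *p;
        int fd, i;

        memset(old, 'X', sizeof(old));
        memset(app, 'Y', sizeof(app));
        fd = open(FNAME, O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                err(1, "open");
        if (pwrite(fd, old, sizeof(old), 0) != sizeof(old))
                err(1, "pwrite");
        p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                err(1, "mmap");
        if (pthread_create(&t, NULL, reader, p) != 0)
                errx(1, "pthread_create");

        /*
         * Thread A: repeatedly shrink the file back and then append past
         * the old EOF, so that write(2) keeps growing the file's last
         * buffer and leaves a gap (529..1999) for the patch to bzero.
         */
        for (i = 0; i < NLOOP && !stop; i++) {
                if (ftruncate(fd, OLDSZ) < 0)
                        err(1, "ftruncate");
                if (pwrite(fd, app, sizeof(app), 2000) != sizeof(app))
                        err(1, "pwrite append");
        }
        stop = 1;
        pthread_join(t, NULL);
        return (0);
}
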
> > Ok, but doesn't that exist now?
> > 
> > Without the patch, when Thread A (the writer) appends to the file,
> > the np->n_size is grown and vnode_pager_setsize() is called with
> > the new, larger size, followed by an allocbuf().
> > { at around line# 1066,1073 of nfs_clbio.c. Same code exists in the
> >   old client and was cloned. }
> > Then vn_io_fault_uiomove() is called at line#1195.
> > 
> > Without the patch, Thread B would get whatever garbage is in the
> > page(s) if it accesses the latter (just grown) part of the buffer
> > between
> > 1 and 2, would it not?
> > 
> > With the patch, it will most likely get 0s instead of garbage, if
> > it accesses the latter (just grown) part of the buffer between 1.
> > and 2.
> > There is actually a short time just after the vnode_pager_setsize()
> > and before the bzeroing, during which it would still get garbage.
> > (Note that the patch only zeros out the part of the last page that
> >  was "grown" by the write in 1. that is past the previous EOF.
> >  If I were bzeroing data that was before the previous EOF, I could see
> >  the breakage, but the patch doesn't do that. Unless I've written
> >  buggy code, of course, but I think it's correct.)
> > 
> > So, I don't see any difference between the unpatched and patched
> > versions?
> AFAIU, you bzero the whole region in the buffer which is subject to
> i/o, not the new extent.  E.g., if the file size mod biosize is
> non-zero, but the current write starts on the buffer boundary, then
> the range from the buffer start to the old EOF is zeroed for some
> moment, before being overwritten by user data.
> 
Why does the fact that the write starts on the buffer boundary affect this?
obcount == file size mod biosize, and that is where the bzero'ng starts.
(Now, it is true that if the current write starts on the buffer boundary,
 then bzero'ng isn't necessary, but that doesn't mean the bzero starts at
 the buffer boundary instead of the old EOF. The patch I'm currently testing
 wouldn't do a bzero at all for the case where "on" (the offset of the write
 within the buffer) isn't after the old EOF, which will be true when the
 write starts at the buffer boundary, or when "on" == 0 if you prefer. But
 this is an optimization and doesn't change where the bzero starts.)
Ok, here's the code snippet with some annotations:
		if ((uio->uio_offset == np->n_size ||
		    (noncontig_write != 0 &&
		    lbn == (np->n_size / biosize) &&
		    uio->uio_offset + n > np->n_size)) && n) {
*** In this code block we know:
       lbn == np->n_size / biosize (for the case of uio_offset == n_size, it must be)
       lbn is the last buffer cache block for the file
       the size of the file is growing (ie. n_size will get larger)
			mtx_unlock(&np->n_mtx);
			/*
			 * Get the buffer (in its pre-append state to maintain
			 * B_CACHE if it was previously set).  Resize the
			 * nfsnode after we have locked the buffer to prevent
			 * readers from reading garbage.
			 */
			obcount = np->n_size - (lbn * biosize);
*** This sets obcount to the byte offset within the buffer cache block of
    EOF before it is grown by this write (or file size mod biosize, if you prefer)
			bp = nfs_getcacheblk(vp, lbn, obcount, td);

			if (bp != NULL) {
				long save;

				mtx_lock(&np->n_mtx);
  				np->n_size = uio->uio_offset + n;
				np->n_flag |= NMODIFIED;
				vnode_pager_setsize(vp, np->n_size);
*** n_size is now grown to the new EOF for after the write.
				mtx_unlock(&np->n_mtx);

				save = bp->b_flags & B_CACHE;
				bcount = on + n;
				allocbuf(bp, bcount);
				bp->b_flags |= save;
				if (noncontig_write != 0 && bcount > obcount)
					vfs_bio_bzero_buf(bp, obcount, bcount -
					    obcount);
*** This zeros bytes from "obcount" (the offset of the old EOF) to "bcount"
    (which is the offset of EOF after the write).
Sorry, but I can't see why this would bzero from the start of this write,
since this write starts at "on" calculated from uio_offset, whereas
"obcount" is calculated from the pre-write value of n_size.
Or am I just brain farting when I look at this code?
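To put example numbers on it (these particular values are just for
illustration): take biosize == 8192, the 529 byte test file from above as
the old n_size, and a write(2) of 10 bytes at offset 4080, so lbn == 0 and:

	obcount = 529 - (0 * 8192) = 529   (old EOF within the buffer)
	on      = 4080                     (offset of this write in the buffer)
	bcount  = on + n = 4090            (new EOF within the buffer)

so the vfs_bio_bzero_buf() call above zeroes buffer offsets 529 through
4089, i.e. from the old EOF to the end of the new write, and never anything
from the buffer start up to the old EOF.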

--> For the patch I am now testing, I have changed the above 3 lines to:
				if (noncontig_write != 0 && on > obcount)
					vfs_bio_bzero_buf(bp, obcount, on -
					    obcount);
   I realized that using "bcount" (the new EOF offset) was overkill (although
   I believe harmless) since the bytes from offset "on" to offset "bcount"
   will be written soon. By using "on" (the start of this write) as the
   end of the bzero'ng range, I minimize the bytes being bzero'd and avoid
   the call completely when the new write isn't leaving a gap after the
   old EOF, such as the case you mentioned.
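   With the same example numbers as above, the revised check only zeroes
   offsets 529 through 4079, since offsets 4080 through 4089 are about to be
   overwritten by the write itself. And for the case you mentioned, where the
   write starts right at the buffer boundary, "on" == 0, so "on > obcount" is
   false and no bzero is done at all.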

rick




