Date:      Sat, 27 Feb 2016 20:45:01 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        fs@freebsd.org
Subject:   Re: silly write caching in nfs3
Message-ID:  <1367050076.11783367.1456623900997.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <20160227131353.V1337@besplex.bde.org>
References:  <20160226164613.N2180@besplex.bde.org> <20160227131353.V1337@besplex.bde.org>

Bruce Evans wrote:
> On Fri, 26 Feb 2016, Bruce Evans wrote:
> 
> > nfs3 is slower than in old versions of FreeBSD.  I debugged one of the
> > reasons today.
> > ...
> > oldnfs was fixed many years ago to use timestamps with nanosecond
> > resolution, but it doesn't suffer from the discarding in nfs_open()
> > in the !NMODIFIED case which is reached by either fsync() before close
> > or commit on close.  I think this is because it updates n_mtime to
> > the server's new timestamp in nfs_writerpc().  This seems to be wrong,
> > since the file might have been written to by other clients and then
> > the change would not be noticed until much later if ever (setting the
> > timestamp prevents seeing it change when it is checked later, but you
> > might be able to see another metadata change).
> >
> > newnfs has quite different code for nfs_writerpc().  Most of it was
> > moved to another function in another file.  I understand this even
> > less, but it doesn't seem to fetch the server's new timestamp or
> > update n_mtime in the v3 case.
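Just to spell out the access pattern that hits this, here is a trivial
reproducer (the path and sizes are arbitrary; it assumes an NFSv3 mount):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char buf[8192];
	int fd;

	memset(buf, 'x', sizeof(buf));
	fd = open("/mnt/nfs/f", O_RDWR | O_CREAT, 0644);
	if (fd == -1)
		return (1);
	(void)write(fd, buf, sizeof(buf));
	(void)fsync(fd);	/* flush + commit; NMODIFIED is cleared */
	(void)close(fd);

	/*
	 * Reopen.  nfs_open() compares the cached n_mtime with the
	 * server's attributes; since the post-write mtime was never
	 * recorded, the comparison fails and the just-written buffers
	 * are invalidated, so the read goes over the wire again.
	 */
	fd = open("/mnt/nfs/f", O_RDONLY);
	(void)read(fd, buf, sizeof(buf));
	(void)close(fd);
	return (0);
}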
> 
> This quick fix seems to give the same behaviour as in oldnfs.  It also
> fixes some bugs in comments in nfs_fsync() (where I tried to pass a
> non-null cred, but none is available; the ARGSUSED bug is in many
> other functions):
> 
> X Index: nfs_clvnops.c
> X ===================================================================
> X --- nfs_clvnops.c	(revision 296089)
> X +++ nfs_clvnops.c	(working copy)
> X @@ -1425,6 +1425,23 @@
> X  	}
> X  	if (DOINGASYNC(vp))
> X  		*iomode = NFSWRITE_FILESYNC;
> X +	if (error == 0 && NFS_ISV3(vp)) {
> X +		/*
> X +		 * Break seeing concurrent changes by other clients,
> X +		 * since without this the next nfs_open() would
> X +		 * invalidate our write buffers.  This is worse than
> X +		 * useless unless the write is committed on close or
> X +		 * fsynced, since otherwise NMODIFIED remains set so
> X +		 * the next nfs_open() will still invalidate the write
> X +		 * buffers.  Unfortunately, this cannot be placed in
> X +		 * ncl_flush() where NMODIFIED is cleared since
> X +		 * credentials are unavailable there for at least
> X +		 * calls by nfs_fsync().
> X +		 */
> X +		mtx_lock(&(VTONFS(vp))->n_mtx);
> X +		VTONFS(vp)->n_mtime = nfsva.na_mtime;
> X +		mtx_unlock(&(VTONFS(vp))->n_mtx);
> X +	}
> X  	if (error && NFS_ISV4(vp))
> X  		error = nfscl_maperr(uiop->uio_td, error, (uid_t)0, (gid_t)0);
> X  	return (error);
> X @@ -2613,9 +2630,8 @@
> X  }
> X
The fix I attached to the other email should fix this without breaking the
weak cache consistency case (where another client has changed the mtime on
the server).
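For reference, the guts of that variant look roughly like the fragment
below (the "pre_mtime" name is made up here to stand in for the pre-op
mtime decoded from the v3 WRITE reply's wcc_data; this is a sketch, not
the actual patch).  The point is to advance n_mtime only when the
server's pre-op mtime matches our cached copy, so a write done by
another client in between is still detected:

	struct nfsnode *np = VTONFS(vp);

	mtx_lock(&np->n_mtx);
	/* No intervening writer: safe to step to the post-op mtime. */
	if (timespeccmp(&pre_mtime, &np->n_mtime, ==))
		np->n_mtime = nfsva.na_mtime;
	mtx_unlock(&np->n_mtx);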

> X  /*
> X - * fsync vnode op. Just call ncl_flush() with commit == 1.
> X + * fsync vnode op.
> X   */
> X -/* ARGSUSED */
> X  static int
> X  nfs_fsync(struct vop_fsync_args *ap)
> X  {
> X @@ -2622,8 +2638,12 @@
> X
> X  	if (ap->a_vp->v_type != VREG) {
> X  		/*
> X +		 * XXX: this comment is misformatted (after fixing its
> X +		 * internal errors) and misplaced.
> X +		 *
> X  		 * For NFS, metadata is changed synchronously on the server,
> X -		 * so there is nothing to flush. Also, ncl_flush() clears
> X +		 * so the only thing to flush is data for regular files.
> X +		 * Also, ncl_flush() clears
> X  		 * the NMODIFIED flag and that shouldn't be done here for
> X  		 * directories.
> X  		 */
> 
> > There are many other reasons why nfs is slower than in old versions.
> > One is that writes are more often done out of order.  This tends to
> > give a slowness factor of about 2 unless the server can fix up the
> > order.  I use an old server which can do the fixup for old clients but
> > not for newer clients starting in about FreeBSD-9 (or 7?).  I suspect
> > that this is just because Giant locking in old clients gave accidental
> > serialization.  Multiple nfsiod's and/or nfsd's are clearly needed
> > for performance if you have multiple NICs serving multiple mounts.
I believe that you want at least one nfsiod for each concurrent process/thread
reading/writing a file. If your readahead is set to a value larger than the
default of 1, I think you would want that many nfsiod threads per process/thread
doing concurrent I/O on files. This is just conjecture; I have not done any
benchmarking.
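If someone wants to experiment with that rule of thumb, the sketch below
sizes the pool and pokes the result in via sysctl (it assumes the
vfs.nfs.iodmax name; the process count and readahead values are made-up
examples, and setting it needs root):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	int nproc = 4;		/* expected concurrent readers/writers */
	int readahead = 1;	/* the mount's readahead (default 1) */
	int iodmax = nproc * readahead;	/* the conjecture above */

	if (sysctlbyname("vfs.nfs.iodmax", NULL, NULL, &iodmax,
	    sizeof(iodmax)) == -1) {
		perror("sysctlbyname(vfs.nfs.iodmax)");
		return (1);
	}
	printf("vfs.nfs.iodmax set to %d\n", iodmax);
	return (0);
}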

> > Other cases are less clear.  For the iozone benchmark, there is only
> > 1 stream and multiple nfsiod's pessimize it into multiple streams that
> > give buffers which arrive out of order on the server if the multiple
> > nfsiod's are actually active.
Since NFS specs have no ordering requirement or recommendation, I would actually
like to find a way to make ZFS (and maybe UFS too) perform well when the I/O
ops are out of order. I have been told that ZFS uses some sequential writing heuristic.
(I'm tempted to look for a way for the NFS server to force it to always assume
 sequential and just ignore the ordering of the VOP_WRITE()s.)

> > I use the following configuration to
> > ameliorate this, but the slowness factor is still often about 2 for
> > iozone:
> > - limit nfsd's to 4
> > - limit nfsiod's to 4
> > - limit nfs i/o sizes to 8K.  The server fs block size is 16K, and
> >  using a smaller block size usually helps by giving some delayed
> >  writes which can be clustered better.  (The non-nfs parts of the
> >  server could be smarter and do this intentionally.  The out-of-order
> >  buffers look like random writes to the server.)  16K i/o sizes
> >  otherwise work OK, but 32K i/o sizes are much slower for unknown
> >  reasons.
> 
> Size 16K seems to work better now.
> 
> I also use:
> 
> - turn off most interrupt moderation.  This reduces (ping) latency from
>    ~125 usec to ~75 usec for em on PCIe (after already turning off interrupt
>    moderation on the server to reduce it from 150-200 usec).  75 usec
>    is still a lot, though it is about 3 times lower than the default
>    misconfiguration.  Downgrading to older lem on PCI/33 reduces it to
>    52.  Downgrading to DEVICE_POLLING reduces it to about 40.  The
>    downgrades are upgrades :-(.  Not using a switch reduces it by about
>    another 20.
> 
>    Low latency is important for small i/o's.  I was surprised that it also
>    helps a lot for large i/o's.  Apparently it changes the timing enough
>    to reduce the out-of-order buffers significantly.
> 
> The default misconfiguration with 20 nfsiod's is worse than I expected
> (on an 8 core system).  For (old) "iozone auto" which starts with a file
> size of 1MB, the write speed is about 2MB/sec with 20 nfsiod's and
> 22 MB/sec with 1 nfsiod.  2-4 nfsiod's work best.  They give 30-40MB/sec
> for most file sizes.  Apparently, with 20 nfsiod's the write of 1MB is
> split up into almost twenty pieces of 50K each (6 or 7 8K buffers each),
> and the final order is perhaps even worse than random.  I think it is
> basically sequential with about <number of nfsiods> seeks for all file
> sizes between 1MB and many MB.
> 
Unfortunately, the number of concurrent processes doing I/O will vary all
the time, and I don't think trying to dynamically change the number of nfsiods
to track this would be practical. (Of course, since you know you are only
running one thread you can tune for that.)

I'd like to do two things:
1 - Find a way to keep the requests more ordered in the client.
    (I wish I hadn't lost the patches. They were at least a starting
     point for this.)
2 - Find a way to make the FreeBSD server file systems less sensitive to
    the ordering of I/O requests; see the sketch below.
    --> Out of order for the nfsd doesn't imply random access.
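To make (2) concrete, here is a little userspace model of the idea
(everything here is invented for illustration; the real nfsd would do
something like this per file on its request queue): buffer a burst of
write RPCs and hand them to the file system in offset order, so
out-of-order arrival does not turn into random I/O:

#include <sys/types.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

struct wreq {			/* one write RPC's offset/length/data */
	off_t	 off;
	size_t	 len;
	char	*data;
};

static int
by_offset(const void *a, const void *b)
{
	const struct wreq *wa = a, *wb = b;

	return (wa->off < wb->off ? -1 : wa->off > wb->off);
}

/* Sort a burst by offset, then issue it; the fs sees sequential I/O. */
static void
flush_burst(int fd, struct wreq *w, int n)
{
	qsort(w, n, sizeof(*w), by_offset);
	for (int i = 0; i < n; i++)
		(void)pwrite(fd, w[i].data, w[i].len, w[i].off);
}

int
main(void)
{
	char a[] = "aaaa", b[] = "bbbb", c[] = "cccc";
	struct wreq burst[] = {		/* arrived out of order */
		{ 8192, 4, c }, { 0, 4, a }, { 4096, 4, b },
	};
	int fd = open("model.out", O_RDWR | O_CREAT | O_TRUNC, 0644);

	if (fd == -1)
		return (1);
	flush_burst(fd, burst, 3);
	close(fd);
	return (0);
}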

Have fun with it, rick

> I also use:
> 
> - no PREEMPTION and no IPI_PREEMPTION on SMP systems.  This limits context
>    switching.
> - no SCHED_ULE.  HZ = 100.  This also limits context switching.
> 
> With more or fairer context switching, all nfsiods are more likely to run,
> causing more damage.
> 
> More detailed result for iozone 1 65536 with nfsiodmax=64 and oldnfs and
> mostly best known other tuning:
> 
> - first run write speed 2MB/S (probably still using 20)
>    (all rates use disk marketing MB)
> - second run 9MB/S
> - after repeated runs, 250MB/S
> - the speed kept mostly dropping, and reached 21K/S
> - server stats for next run at 29K/S: 139 blocks tested and order of
>    24 fixed (the server has an early version of what is in -current,
>    with more debugging)
> 
> with nfsiodmax=20:
> - most runs 2-2.2MB/S; one at 750K/S
> - server stats for a run at 2.2MB/S: 135 blocks tested and 86 fixed
> 
> with nfsiodmax=4:
> - 5.8-6.5MB/S
> - server stats for a run at 6.0MB/S: 135 blocks tested and 0 fixed
> 
> with nfsiodmax=2:
> - 4.8-5.2MB/S
> - server stats for a run at 5.1MB/S: 138 blocks tested and 0 fixed
> 
> with nfsiodmax=1:
> - 3.4MB/S
> - server stats: 138 blocks tested and 0 fixed
> 
> For iozone 512 65536:
> 
> with nfsiodmax=1:
> - 34.7MB/S
> - server stats: 65543 blocks tested and 0 fixed
> 
> with nfsiodmax=2:
> - 45.9MB/S (this is close to the drive's speed and faster than direct on the
>    server.  It is faster because the clustering accidentally works
>    better)
> - server stats: 65550 blocks tested and 578 fixed
> 
> with nfsiodmax=4:
> - 45.6MB/S
> - server stats: 65550 blocks tested and 2067 fixed
> 
> with nfsiodmax=20:
> - 21.4MB/S
> - server stats: 65576 blocks tested and 12057 fixed
>    (it is easy to see how 7 nfsiods could give 1/7 = 14% of blocks
>    out of order.  The server is fixing up almost 20%, but that is
>    not enough)
> 
> with nfsiodmax=64 (caused server to not respond):
> - test aborted at 500+MB
> - server stats: about 10000 blocks fixed
> 
> with nfsiodmax=64 again:
> - 9.6MB/S
> - server stats: 65598 blocks tested and 14034 fixed
> 
> The nfsiod's get scheduled almost equally.
> 
> Bruce