Date:      Tue, 28 Jan 2014 10:35:41 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        John-Mark Gurney <jmg@funkthat.com>
Cc:        freebsd-net@freebsd.org, Adam McDougall <mcdouga9@egr.msu.edu>
Subject:   Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID:  <372707859.17587309.1390923341323.JavaMail.root@uoguelph.ca>
In-Reply-To: <20140128021450.GY13704@funkthat.com>

John-Mark Gurney wrote:
> Rick Macklem wrote this message on Mon, Jan 27, 2014 at 20:32 -0500:
> > John-Mark Gurney wrote:
> > > Rick Macklem wrote this message on Mon, Jan 27, 2014 at 18:47
> > > -0500:
> > > > John-Mark Gurney wrote:
> > > > > Rick Macklem wrote this message on Sun, Jan 26, 2014 at 21:16
> > > > > -0500:
> > > > > > Btw, thanks go to Garrett Wollman for suggesting the
> > > > > > change to MJUMPAGESIZE clusters.
> > > > > > 
> > > > > > rick
> > > > > > ps: If the attachment doesn't make it through and you want
> > > > > >     the patch, just email me and I'll send you a copy.
> > > > > 
> > > > > The patch looks good, but we probably shouldn't change
> > > > > _readlink..  The chances of a link being >2k are pretty
> > > > > slim, and the chances of the link being >32k are even
> > > > > smaller...
> > > > > 
> > > > Yea, I already thought of that, actually. However, see below
> > > > w.r.t. NFSv4.
> > > > 
> > > > However, at this point I mostly want to find out if it is the
> > > > long mbuf chain that causes problems for TSO enabled network
> > > > interfaces.
> > > 
> > > I agree, though a long mbuf chain is more of a driver issue
> > > than an NFS issue...
> > > 
> > Yes, if my hunch is correct, it is. If my hunch gets verified, I
> > will be posting w.r.t. how best to deal with the problem. I
> > suspect a patch like this one might serve as a useful work-around
> > while the drivers get fixed.
> 
> It would be nice to have a way to force such a segment to go out to
> the drivers to make debugging/testing drivers easier...  I'm not sure
> the best way to handle that though...
> 
> > > > > In fact, we might want to switch _readlink to MGET (could
> > > > > be conditional upon cnt) so that if it fits in an mbuf we
> > > > > don't allocate a cluster for it...
> > > > > 
> > > > For NFSv4, what was an RPC for NFSv3 becomes one of several
> > > > Ops. in a compound RPC. As such, there is no way to know how
> > > > much additional RPC message data there will be. So, although
> > > > the readlink reply won't use much of the 4K allocation,
> > > > replies for subsequent Ops. in the compound certainly could.
> > > > (Is it more efficient to allocate 4K now and use part of it
> > > > for subsequent message reply stuff, or to allocate additional
> > > > mbuf clusters later for subsequent stuff, as required? On a
> > > > small memory constrained machine, I suspect the latter is
> > > > correct, but for the kind of hardware that has TSO
> > > > scatter/gather enabled network interfaces, I'm not so sure.
> > > > At this point, I wouldn't even say that using 4K clusters is
> > > > going to be a win, and my hunch is that any win wouldn't
> > > > apply to small memory constrained machines.)
> > > 
> > > Though the code that was patched wasn't using any partial
> > > buffers, it was always allocating a new buffer...  If the code
> > > in _read/_readlink starts using a previous mbuf chain, then
> > > obviously things are different and I'd agree, always allocating
> > > a 2k/4k cluster makes sense...
> > > 
> > Yes, but nd_mb and nd_bpos are set, which means subsequent
> > replies can use the remainder of the cluster.
> 
> Couldn't we scan the list of replies, find out how much data we
> need, and m_getm the space for it all (which will use 4k clusters
> as necessary)?
> 
The NFSv4 server parses the compound as it processes it. It must keep
things like the current-filehandle and saved-filehandle between Ops.,
and things like the attributes are a lot of work to parse, so I don't
think two passes through a request are warranted. Also, there is no way
of knowing how big a reply is until you execute the Op., although you
can "guess" at it.
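
To make the one-pass structure concrete, here is a toy userspace model
of the loop (my sketch, for illustration only; the struct and function
names are made up and this is not the real nfsrvd code):

#include <stddef.h>
#include <stdio.h>

/* Toy model: one-pass NFSv4 compound processing (hypothetical names). */
struct op {
	const char *name;
	size_t (*execute)(void);	/* returns reply bytes produced */
};

static size_t do_putfh(void)    { return (8); }
static size_t do_readlink(void) { return (24); }
static size_t do_getattr(void)  { return (112); }

int
main(void)
{
	struct op compound[] = {
		{ "PutFH",    do_putfh },
		{ "Readlink", do_readlink },
		{ "Getattr",  do_getattr },
	};
	size_t i, replylen = 0;

	/*
	 * Each Op. is decoded and executed in order and its reply is
	 * appended immediately, so the total reply length is only
	 * known after the last Op. has run. A pre-scan to size the
	 * reply mbuf chain up front would amount to executing the
	 * compound twice.
	 */
	for (i = 0; i < sizeof(compound) / sizeof(compound[0]); i++) {
		replylen += compound[i].execute();
		printf("after %s: %zu reply bytes\n", compound[i].name,
		    replylen);
	}
	return (0);
}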

I never intended to imply that the patch I emailed is ready for head.
It uses 4K clusters for all RPCs, even ones known to be small (such as
client side Getattr/Lookup requests).

Since messages are sent quickly and their mbufs then released, except
for the DRC in the server, the case to avoid is large allocations for
server replies that may be cached by the DRC. Fortunately the large
replies will be for read and readdir, and these don't need to be cached
by the DRC. As such, a patch that uses 4K clusters in the server for
read and readdir, and 4K clusters for write requests in the client,
should be appropriate, I think?
(And, yes, I think you are correct that readlink is better off with
 a MCLBYTES cluster.)
The coding is straightforward, but the patch will be fairly large,
since readdir in the server uses NFSM_BUILD(), which in turn uses
NFSMCLGET(). These will need an extra "do a big cluster" argument.
For initial testing, it was just simpler to make them all big.
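
Roughly, I am picturing something like this untested sketch
(NFSMCLGET_BIG is a made-up name, and the real NFSMCLGET also has that
retry loop, which I have omitted here):

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Untested sketch: NFSMCLGET with an extra "dobig" argument, so that
 * callers like NFSM_BUILD() can ask for a page-size (MJUMPAGESIZE)
 * cluster for large replies while small RPCs keep MCLBYTES clusters.
 */
#define	NFSMCLGET_BIG(m, how, dobig)					\
	do {								\
		if (dobig) {						\
			/* mbuf + 4K cluster in a single allocation. */	\
			(m) = m_getjcl((how), MT_DATA, 0,		\
			    MJUMPAGESIZE);				\
		} else {						\
			MGET((m), (how), MT_DATA);			\
			if ((m) != NULL)				\
				MCLGET((m), (how));			\
		}							\
	} while (0)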

rick

> > Why does it always allocate a new cluster? Well, because the code
> > is OLD. It was written for OpenBSD 2.6 and, at that time, I tried
> > to make it portable across the BSDen. I'm not so concerned w.r.t.
> > its portability now, since no one else is porting it and I don't
> > plan to, but I still think it would be nice if it were portable to
> > other BSDen.
> > Back when I wrote it, I believe that MCLBYTES was 1K and an entire
> > cluster was needed. (To be honest, I found out that FreeBSD's
> > MCLBYTES is 2K about 2 days ago, when I started looking at this
> > stuff.)
> > 
> > Could it now look to see if enough bytes (a little over 1K) were
> > available in the current cluster and use that? Yes, but it would
> > reduce the portability of the code and I don't think it would make
> > a measurable difference performance-wise.
> 
> Are you sure it would reduce the portability?  I can't think of a way
> it would...  Some code will always need to be written for
> portability..
> 
Well, I had it ported to OpenBSD, FreeBSD 6 and Mac OS X 10.3 by using
the NFSMCLGET() macro. If it uses things like m_getm2() and separate
uma zones, I don't know how much extra work would be needed for other
BSDen, since I have no idea which BSDen have these things.
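
For what it's worth, the FreeBSD-only path would be short. A sketch of
what a FreeBSD-specific replacement might look like (nfsm_getchain() is
just a name I made up):

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Sketch, FreeBSD kernel context: allocate a reply chain for "len"
 * bytes in one call. m_getm2() picks the cluster sizes itself, using
 * MJUMPAGESIZE (4K) clusters for larger lengths, and with M_WAITOK
 * the allocation should not fail.
 */
static struct mbuf *
nfsm_getchain(int len)
{

	return (m_getm2(NULL, len, M_WAITOK, MT_DATA, 0));
}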

> > > > My test server has 256Mbytes of ram and it certainly doesn't
> > > > show any improvement (big surprise;-), but it also doesn't
> > > > show any degradation for the limited testing I've done.
> > > 
> > > I'm not too surprised, unless you're on a heavy server pushing
> > > >200MB/sec, the allocation cost is probably cheap enough that
> > > it doesn't show up...  going to 4k means immediately half as
> > > many mbufs are needed/allocated, and as they are page sized,
> > > don't have the problems of physical memory fragmentation, nor
> > > do they have to do an IPI/tlb shoot down in the case of
> > > multipage allocations...  (I'm dealing w/ this for geli.)
> > > 
> > Yes, Garrett Wollman proposed this and I suspect there might be a
> > performance gain for larger systems. He has a more involved patch.
> > To be honest, if Garrett is convinced that his patch is of benefit
> > performance-wise, I will do a separate posting w.r.t. it and
> > whether or not it is appropriate to be committed to head, etc.
> > 
> > > > Again, my main interest at this point is whether reducing the
> > > > number of mbufs in the chain fixes the TSO issues. I think
> > > > the question of whether or not 4K clusters are a performance
> > > > improvement in general is an interesting one that comes later.
> > > 
> > > Another thing I noticed is that we are getting an mbuf and then
> > > allocating a cluster... Is there a reason we aren't using
> > > something like m_getm or m_getcl?  We have a special uma zone
> > > that has mbuf and mbuf cluster already paired, meaning we save
> > > some lock operations for each segment allocated...
> > > 
> > See above w.r.t. OLD portable code. There was a time when MGETCL()
> > wasn't guaranteed to succeed even when M_WAITOK is specified.
> > This is also why there is that weird loop in the NFSMCLGET() macro.
> 
> Correct, but as you wrapped them in NFS* macros, that doesn't mean
> you can't merge the MGETCL w/ NFSMCLGET into a new function that
> combines the two...  It's just another (not too difficult) wrapper
> that the porter has to write...
> 
> Though apparently portability has been given up, since you use
> MCLGET directly in nfsserver/nfs_nfsdport.c instead of NFSMCLGET...
> 
> Sounds like nfsport.h needs some updating....
> 
The files with "port" in the names are re-written for each port. They
were generated by cribbing code from the extant client/server. (Without
looking, I'd guess you'd find MGET(), MCLGET() in the old FreeBSD
server, or maybe it was inherited from OpenBSD 2.6.)

Everything can be re-written, but why do so if the old code still
works? I'm one guy who does this as a spare-time, unpaid hobby, and I'm
working on NFSv4.1 server code these days.

> > (I think there was a time in FreeBSD's past when allocation was
> >  never guaranteed and the rest of the code doesn't tolerate a
> >  NULL mbuf ptr.  Something like M_TRYWAIT in old versions of
> >  FreeBSD?)
> 
> Correct, there was a time when M_WAITOK could still return NULL, but
> it was many years ago and many releases ago...
> 
> > Btw, Garrett Wollman's patch uses m_getm2() to get the mbuf list.
> 
> Interestingly, m_getm2 will use 4k clusters as necessary, and in
> the _readlink case, do the correct thing...
> 
> Hmmm... m_getm2 isn't documented...  It was added by andre almost 7
> years ago...  It does appear to be a public interface, as ofed,
> sctp, iscsi and ng(_tty) all use it, though only sctp appears to use
> it any differently than m_getm...  The rest could simply use m_getm
> instead of m_getm2...  Considering it was committed the day before
> SCTP was committed, I'm not too surprised...
> 
> P.S. if someone wants to submit a patch to mbuf.9 to update the docs
> that would be helpful... I'll review and commit...  and m_append is
> also undocumented...
> --
>   John-Mark Gurney				Voice: +1 415 225 5579
> 
>      "All that I will do, has been done, All that I have, has not."


