Date:      Thu, 26 Feb 2015 17:46:51 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        freebsd-fs@freebsd.org, freebsd-net@freebsd.org, Garrett Wollman <wollman@hergotha.csail.mit.edu>
Subject:   Re: NFS: kernel modules (loading/unloading) and scheduling
Message-ID:  <422345651.1296741.1424990811199.JavaMail.root@uoguelph.ca>
In-Reply-To: <20150226092147.GC2379@kib.kiev.ua>

Kostik wrote:
> On Wed, Feb 25, 2015 at 10:55:35PM -0500, Rick Macklem wrote:
> > Garrett Wollman wrote:
> > > In article
> > > <388835013.10159778.1424820357923.JavaMail.root@uoguelph.ca>,
> > > rmacklem@uoguelph.ca writes:
> > > 
> > > >I tend to think that a bias towards doing Getattr/Lookup over
> > > >Read/Write may help performance (the old "shortest job first"
> > > >principle), but I'm not sure you'll have a big enough queue of
> > > >outstanding RPCs under normal load for this to make a real
> > > >difference.
> > > 
> > > I don't think this is a particularly relevant condition here.  There
> > > are lots of ways RPCs can pile up where you really need to do better
> > > work-sharing than the current implementation does.  One example is a
> > > client that issues lots of concurrent reads (e.g., a compute node
> > > running dozens of parallel jobs).  Two such systems on gigabit NICs
> > > can easily issue large reads fast enough to cause 64 nfsd service
> > > threads to be blocked while waiting for the socket send buffer to
> > > drain.  Meanwhile, the file server is completely idle, but unable to
> > > respond to incoming requests, and the other users get angry.  Rather
> > > than assigning new threads to requests from the slow clients, it
> > > would be better to let the requests sit until the send buffer drains,
> > > and process other clients' requests instead of letting the resources
> > > get monopolized by a single user.
> > > 
> > > Lest you think this is purely hypothetical: we actually experienced
> > > this problem today, and I verified with "procstat -kk" that all of
> > > the nfsd threads were in fact blocked waiting for send buffer space
> > > to open up.  I was able to restore service immediately by increasing
> > > the number of nfsd threads, but I'm unsure to what extent I can do
> > > this without breaking other things or hitting other bottlenecks.[1]
> > > So I have a user asking me why I haven't enabled fair-share
> > > scheduling for NFS, and I'm going to have to tell him the answer is
> > > "no such thing".
> > > 
> > > -GAWollman
> > > 
> > > [1] What would the right number actually be?  We could potentially
> > > have many thousands of threads in a compute cluster all operating
> > > simultaneously on the same filesystem, well within the I/O capacity
> > > of the server, and we'd really like to degrade gracefully rather
> > > than falling over when a single slow client soaks up all of the
> > > nfsd worker threads.
> > Well, each of these threads has two structures allocated to it:
> > 1 - The kthread info (sched_sizeof_thread() <-- struct thread + the
> >     scheduler info one).
> > 2 - A structure used by the krpc for each thread.
> > Since allocating two moderate-sized structures isn't a lot of kernel
> > memory, I would think a server like yours would be fine with several
> > thousand nfsd threads.
> The biggest memory consumer for any thread, kernel or not, is the
> kernel thread stack.  It consumes both physical memory and KVA; the
> latter is not too scarce on amd64.
> 
Yes, thanks, I should have thought of that.
- For amd64, it appears to be a 16K stack (plus a KVA page to catch
  stack overflows?)
  --> Figure around 16Mbytes for 1000 kernel threads. I don't think
      this would be an issue for a 64bit arch with quite a few Gbytes of RAM?
- For i386, it appears to be an 8K stack (plus a KVA page to catch
  stack overflows?)
  --> Figure 8Mbytes for 1000 kernel threads. Still sounds ok to me,
      although I think the KVA limit is about 430Mbytes by default.
I'd guess that KVA exhaustion due to mbuf and other malloc() allocations
would occur long before the extra nfsd thread stacks became a problem.
I have succeeded in exhausting KVA with the server running on a 256Mbyte
i386, but it took several days of heavy load to make it happen and
I never found a reliable way to reproduce it.
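
Just to make that arithmetic explicit, here is the back-of-the-envelope
calculation as a trivial userland C sketch (the 4-page/2-page stack sizes
and the single guard page are my recollection of the KSTACK_PAGES
defaults, so treat them as assumptions):

#include <stdio.h>

int
main(void)
{
        const double page = 4096.0;
        const double nthreads = 1000.0;

        /* amd64: 4-page (16K) stacks; i386: 2-page (8K) stacks. */
        printf("amd64: %.1f Mbytes of kernel stacks for %.0f threads\n",
            nthreads * 4 * page / (1024 * 1024), nthreads);
        printf("i386:  %.1f Mbytes of kernel stacks for %.0f threads\n",
            nthreads * 2 * page / (1024 * 1024), nthreads);
        /* Each stack also uses one more page of KVA for the guard page. */
        return (0);
}

That prints roughly 15.6 and 7.8, i.e. the "around 16Mbytes"/"8Mbytes"
figures above, and doubling nthreads to 2048 still only gets you to
about 32Mbytes on amd64.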

> > 
> > What would be interesting would be the receive queue lengths for the
> > sockets of the NFS client TCP connections when the server is running
> > normally. (This would be an indication of how many outstanding RPC
> > requests any scheduling effort would select between.)
> > I'll admit (given basic queuing theory) I would have expected these
> > receive queues to be small unless the server is overloaded.
> > 
> > Oh, and I now realize my response related to your first idea,
> > "Admission", was way off and didn't make much sense. Somehow, I
> > thought "receive queue" when you were talking about the send queue.
> > (Basically, just ignore that response.)
> > However, given the different sizes of RPC replies, it might
> > be hard to come up with a reasonable high water mark for the
> > send queue. Also, the networking code would have to do some
> > sort of upcall to the krpc when the send queue shrinks.
> > (So, still not trivial to implement, I think?)
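
To make the above a bit more concrete, the sort of thing I'm imagining
is a send-space check before a reply is queued, plus an SO_SND socket
upcall to wake the transport up again once the buffer drains. This is
only a rough sketch, not a patch: svc_resume_xprt() is a made-up helper
and the threshold choice is arbitrary; sbspace(), SOCKBUF_LOCK() and
soupcall_set() are the existing socket primitives I have in mind, but
the details would need checking against the current krpc code.

#include <sys/param.h>
#include <sys/socket.h>
#include <sys/socketvar.h>

/* Hypothetical helper that would mark the xprt ready for replies again. */
void    svc_resume_xprt(void *xprt);

/* Would the reply fit in the socket's send buffer right now? */
static bool
svc_vc_sendspace_ok(struct socket *so, long replylen)
{
        bool ok;

        SOCKBUF_LOCK(&so->so_snd);
        ok = (sbspace(&so->so_snd) >= replylen);
        SOCKBUF_UNLOCK(&so->so_snd);
        return (ok);
}

/* SO_SND upcall: called when send buffer space opens up again. */
static int
svc_vc_sosnd_upcall(struct socket *so, void *arg, int waitflag)
{

        svc_resume_xprt(arg);
        return (SU_OK);
}

/* Registration would look something like:
 *      soupcall_set(so, SO_SND, svc_vc_sosnd_upcall, xprt);
 */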
> > 
> > I do agree with Alfred in that I think you are experiencing nfsd
> > thread starvation, and that increasing the number of nfsd threads
> > a lot is the simple way to resolve this.
> 
> This also increases indirect scheduler costs.  Direct management of
> the runqueues costs proportionally to the number of runnable threads,
> but some rare operations have to account for all threads.
> 
Since the extra nfsd threads won't be runnable, I'd guess that the rare
operations will be the only ones affected.

Thanks for pointing this out, rick
ps: Does increasing MAXNFSDCNT to something like 2048 sound reasonable
    for FreeBSD 11? (I'm not talking about the default, but the maximum
    a server can be set to.)
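
    For concreteness, the change I'm thinking of is just the compile-time
    cap in the nfsd sources (I'm going from memory that the #define lives
    in usr.sbin/nfsd/nfsd.c and that the current value is 256, so treat
    both of those as assumptions):

        -#define MAXNFSDCNT      256
        +#define MAXNFSDCNT      2048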


