Date:      Wed, 25 Feb 2015 18:29:45 -0500
From:      Alfred Perlstein <alfred@freebsd.org>
To:        Garrett Wollman <wollman@csail.mit.edu>, freebsd-fs@freebsd.org
Cc:        freebsd-net@freebsd.org
Subject:   Re: Implementing backpressure in the NFS server
Message-ID:  <54EE5AE9.1000908@freebsd.org>
In-Reply-To: <21742.18390.976511.707403@khavrinen.csail.mit.edu>
References:  <21742.18390.976511.707403@khavrinen.csail.mit.edu>


On 2/25/15 5:08 PM, Garrett Wollman wrote:
> Here's the scenario:
>
> 1) A small number of (Linux) clients run a large number of processes
> (compute jobs) that read large files sequentially out of an NFS
> filesystem.  Each process is reading from a different file.
>
> 2) The clients are behind a network bottleneck.
>
> 3) The Linux NFS client will issue NFS3PROC_READ RPCs (potentially
> including read-ahead) independently for each process.
>
> 4) The network bottleneck does not serve to limit the rate at which
> read RPCs can be issued, because the requests are small (it's only the
> responses that are large).
>
> 5) Even if the responses are delayed, causing one process to block,
> there are sufficient other processes that are still runnable to allow
> more reads to be issued.
>
> 6) On the server side, because these are requests for different file
> handles, they will get steered to different NFS service threads by the
> generic RPC queueing code.
>
> 7) Each service thread will process the read to completion, and then
> block when the reply is transmitted because the socket buffer is full.
>
> 8) As more reads continue to be issued by the clients, more and more
> service threads are stuck waiting for the socket buffer until all of
> the nfsd threads are blocked.
>
> 9) The server is now almost completely idle.  Incoming requests can
> only be serviced when one of the nfsd threads finally manages to put
> its pending reply on the socket send queue, at which point it can
> return to the RPC code and pick up one request -- which, because the
> incoming queues are full of pending reads from the problem clients, is
> likely to get stuck in the same place.  Lather, rinse, repeat.
>
> What should happen here?  As an administrator, I can certainly
> increase the number of NFS service threads until there are sufficient
> threads available to handle all of the offered load -- but the load
> varies widely over time, and it's likely that I would run into other
> resource constraints if I did this without limit.  (Is 1000 threads
> practical? What happens when a different mix of RPCs comes in -- will
> it livelock the server?)
>
> I'm of the opinion that we need at least one of the following things
> to mitigate this issue, but I don't have a good knowledge of the RPC
> code to have an idea how feasible this is:
>
> a) Admission control.  RPCs should not be removed from the receive
> queue if the transmit queue is over some high-water mark.  This will
> ensure that a problem client behind a network bottleneck like this one
> will eventually feel backpressure via TCP window contraction if
> nothing else.  This will also make it more likely that other clients
> will still get their RPCs processed even if most service threads are
> taken up by the problem clients.
>
> b) Fairness scheduling.  There should be some parameter, configurable
> by the administrator, that restricts the number of nfsd threads any
> one client can occupy, independent of how many requests it has
> pending.  A really advanced scheduler would allow bursting over the
> limit for some small number of requests.
>
> Does anyone else have thoughts, or even implementation ideas, on this?
The default number of threads is insanely low; the only reason I didn't
bump them to FreeNAS levels (or higher) was the inevitable
bikeshed/cryfest about Alfred touching defaults, so I didn't bother.  I
kept them really small, because y'know, people whine, and they are
capped at ncpu * 8.  It really should be higher, imo.

Just increase the number of nfs server threads to something higher.  I
think we were at 256 threads in FreeNAS and it did us just fine.
Higher seemed ok, except we lost a bit of performance.
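
For example, something along these lines (the exact knobs are
illustrative and vary a bit between FreeBSD versions):

    # /etc/rc.conf: run 256 nfs server threads
    nfs_server_enable="YES"
    nfs_server_flags="-u -t -n 256"

or, if your nfsd exposes the vfs.nfsd sysctls, resize the pool on a
running server:

    sysctl vfs.nfsd.minthreads=32
    sysctl vfs.nfsd.maxthreads=256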

The only problem you might see is on SMALL machines, where people will
complain.  So you probably want an arch-specific override or perhaps a
memory-based sliding scale.

If that could become a FreeBSD default (with overrides for small-memory
machines and arches), that would be even better.
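
Something like the following is what I mean by a sliding scale.  It is
just a throwaway userland sketch: every multiplier, divisor and clamp
is made up for illustration, not taken from any existing FreeBSD or
FreeNAS code.

    #include <stdio.h>

    /*
     * One possible default for the nfsd thread count, scaled by both
     * CPU count and physical memory.  All constants are invented for
     * this sketch.
     */
    static int
    default_nfsd_threads(int ncpu, long mem_mib)
    {
        int by_cpu = ncpu * 32;           /* more generous than ncpu * 8 */
        int by_mem = (int)(mem_mib / 64); /* ~1 thread per 64 MiB */
        int n = (by_cpu < by_mem) ? by_cpu : by_mem;

        if (n < 4)
            n = 4;                        /* floor for small machines */
        if (n > 256)
            n = 256;                      /* ceiling near the FreeNAS value */
        return (n);
    }

    int
    main(void)
    {
        printf("1 cpu, 512 MiB:  %d\n", default_nfsd_threads(1, 512));
        printf("8 cpu, 32 GiB:   %d\n", default_nfsd_threads(8, 32L * 1024));
        printf("32 cpu, 256 GiB: %d\n", default_nfsd_threads(32, 256L * 1024));
        return (0);
    }

The real version would seed this from hw.ncpu and the physical memory
size at boot, and still honor an explicit nfs_server_flags or sysctl
override.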

I think your other suggestions are fine; however, the problems are that:
1) they seem complex for an edge case
2) turning them on may tank performance for no good reason if the
heuristic is met but we're not actually in the bad situation

That said, if you want to pursue those options, by all means please do.
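
For what it's worth, here is roughly how I picture (a) and (b) fitting
together.  This is a toy userland sketch: the struct and field names
are invented for illustration, and the real change would have to live
in the kernel RPC request queueing code (sys/rpc/svc.c), not in
anything like this.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /*
     * Toy stand-in for per-client transport state; not the real
     * SVCXPRT layout.
     */
    struct toy_xprt {
        size_t snd_queued;     /* reply bytes not yet drained by TCP */
        size_t snd_hiwat;      /* (a) admission-control high-water mark */
        int    active_threads; /* nfsd threads serving this client now */
        int    thread_limit;   /* (b) per-client thread cap */
    };

    /*
     * May an nfsd thread pick up another request from this client?
     * Returning false means "leave it queued and let TCP push back".
     */
    static bool
    svc_may_dequeue(const struct toy_xprt *xp)
    {
        if (xp->snd_queued >= xp->snd_hiwat)
            return (false);    /* (a) reply queue over the mark */
        if (xp->active_threads >= xp->thread_limit)
            return (false);    /* (b) client already has its share */
        return (true);
    }

    int
    main(void)
    {
        struct toy_xprt busy = { 4UL << 20, 2UL << 20, 12, 8 };
        struct toy_xprt idle = { 0, 2UL << 20, 1, 8 };

        printf("busy client admitted: %d\n", svc_may_dequeue(&busy));
        printf("idle client admitted: %d\n", svc_may_dequeue(&idle));
        return (0);
    }

The hard part in the real code is not the check itself but remembering
a skipped transport and revisiting it once its send buffer drains,
which this sketch hand-waves away.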

-Alfred


