Date:      Sat, 14 Apr 2018 23:18:27 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Niels Kobschätzki <niels@kobschaetzki.net>, "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject:   Re: High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release
Message-ID:  <YQBPR0101MB1042087832CE6FDCCA3B4216DDB20@YQBPR0101MB1042.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <f3cea179-75d7-916b-68d1-61fe75c0bb80@kobschaetzki.net>
References:  <ce3712c0-626e-c8f2-3bba-933cf359bcef@kobschaetzki.net> <YQBPR0101MB1042D2F0CE2575EB4F17588ADDB20@YQBPR0101MB1042.CANPRD01.PROD.OUTLOOK.COM>, <f3cea179-75d7-916b-68d1-61fe75c0bb80@kobschaetzki.net>

Niels Kobschätzki wrote:
>On 04/14/2018 03:49 AM, Rick Macklem wrote:
>> Niels Kobschätzki wrote:
>>> Sorry for the cross-posting, but so far I have had no real luck on the forum
>>> or on questions, so I want to try my luck here as well.
>> I read email lists but don't do the other stuff, so I just saw this yesterday.
>> Short answer: I haven't a clue why the cache hit rate would have changed.
>>
>> The code that decides if there is a hit/miss for the attribute cache is in
>> ncl_getattrcache() and the code hasn't changed between 10.3 and 11.1,
>> except that the old code did a mtx_lock(&Giant), but I can't imagine how that
>> would affect the code.
>>
>> You might want to:
>> # sysctl -a | fgrep vfs.nfs
>> for both the 10.3 and 11.1 systems, to check if any defaults have somehow
>> been changed. (I don't recall any being changed, but??)
>
>I did that and nothing had changed.
>
>> You could go into ncl_getattrcache() {it's in sys/fs/nfsclient/nfs_clsubs.c}
>> and add a printf() for "time_second" and "np->n_mtime.tv_sec" near the
>> top, where it calculates "timeo" from them.
>> Running this hacked kernel might show you if either of these fields is bogus.
>> (You could then printf() "timeo" and "np->n_attrtimeo" just before the "if"
>> clause that increments "attrcache_misses", which is where the cache misses
>> happen, to see why it is missing the cache.)
>> If you could do this for the 10.3 kernel as well, this might indicate why the
>> miss rate has increased?
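To make the suggestion above concrete, here is a minimal user-space model of the hit/miss decision, with diagnostic printf()s placed roughly where Rick suggests. It assumes the classic BSD heuristic (timeout = one tenth of the time since the file was last modified, clamped to [acregmin, acregmax], or [acdirmin, acdirmax] for directories). The structs and fields (mount_model, node_model, attrstamp, and so on) are hypothetical stand-ins for the nfsnode/nfsmount fields, not the kernel's code, and the real function may check further conditions.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the per-mount attribute cache options (seconds). */
struct mount_model {
    int acregmin, acregmax;
    int acdirmin, acdirmax;
};

/* Hypothetical stand-in for the per-file cached state (nfsnode-like). */
struct node_model {
    time_t mtime_sec;  /* server modification time, cf. np->n_mtime.tv_sec */
    time_t attrstamp;  /* client time at which the attributes were cached */
    int is_dir;        /* nonzero for directories (acdir* limits apply) */
    int modified;      /* nonzero if the client has local modifications */
};

/* Timeout under the assumed heuristic: one tenth of the file's age,
 * clamped to [min, max] for its type; forced to min if modified. */
int attr_timeout(const struct mount_model *m, const struct node_model *n,
                 time_t now)
{
    int lo = n->is_dir ? m->acdirmin : m->acregmin;
    int hi = n->is_dir ? m->acdirmax : m->acregmax;
    int timeo = (int)((now - n->mtime_sec) / 10);

    assert(lo <= hi); /* inconsistent options would make the clamp meaningless */
    if (n->modified || timeo < lo)
        timeo = lo;
    else if (timeo > hi)
        timeo = hi;
    return timeo;
}

/* Returns 1 on a hit, 0 on a miss, printing the values of interest. */
int attr_cache_hit(const struct mount_model *m, const struct node_model *n,
                   time_t now, unsigned long *hits, unsigned long *misses)
{
    /* First diagnostic: the inputs to the timeout calculation. */
    printf("time_second=%lld mtime=%lld\n",
           (long long)now, (long long)n->mtime_sec);

    int timeo = attr_timeout(m, n, now);

    /* Second diagnostic: just before the miss test. */
    printf("timeo=%d cached_for=%lld\n", timeo, (long long)(now - n->attrstamp));
    if (now - n->attrstamp >= timeo) {
        (*misses)++;
        return 0;
    }
    (*hits)++;
    return 1;
}
```

For example, with acregmin=3 and acregmax=60, a regular file last modified 200 s ago gets timeo = 20, so attributes cached 15 s ago are a hit and attributes cached 30 s ago are a miss. A bogus time_second that lands before the mtime makes the age negative, which the clamp silently turns into acregmin.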
>
>I will do this next week. On Monday we are switching to other NFS servers
>for other reasons, and once we see that they run stably, I will do this.
With a miss rate of 2.7%, I doubt printing the above will help. I thought
you were seeing a high miss rate.

>By the way, I have now calculated the percentages. The old servers had an
>attr miss rate of about 0.004%, while the upgraded one is closer to
>2.7%. This is still low from what I've read (I remember that you should
>start adjusting acreg* when you hit more than 40% misses), but far higher
>than before.
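The percentages above are presumably misses divided by total lookups. A minimal sketch of that calculation, assuming the inputs are the client's attribute-cache hit and miss counters (for instance as summarized by nfsstat -c); the function name is hypothetical:

```c
/* Percentage of attribute-cache lookups that missed.
 * hits and misses are assumed to be the client's attribute-cache
 * hit/miss counters; returns 0.0 when no lookups were counted. */
double miss_rate_pct(unsigned long long hits, unsigned long long misses)
{
    unsigned long long total = hits + misses;

    if (total == 0)
        return 0.0;
    return 100.0 * (double)misses / (double)total;
}
```

With the rates quoted above, 2.7 / 0.004 = 675, so the upgraded client misses roughly 675 times as often, while still missing on fewer than 3 of every 100 lookups.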
You could try increasing acregmin and acregmax and see if the misses are reduced.
(The only risk with increasing the cache timeout is that, if another client changes
 the attributes, then this client will use stale ones for longer. Usually, this doesn't
 cause serious problems.)
To be honest, a Getattr RPC is pretty low overhead, so I doubt the increase
to 2.7% will affect your application's performance, but it is interesting that
it increased.

You might also try increasing acdirmin, acdirmax in case it is the directory
attributes that are having cache misses.

Oh, and check that your time-of-day clocks are in sync with the server,
since the caches are time based (there is no cache coherency protocol
in NFS).
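Why clock sync matters can be sketched under the same assumed age/10 heuristic: the file's age is computed from the client's clock and the server-supplied modification time, so a skewed client clock shifts the timeout. In this sketch (function and parameter names are hypothetical, and the "locally modified" case is ignored), a client running behind the server computes a shorter timeout (more Getattr RPCs), while one running ahead computes a longer one (staler attributes, up to the maximum).

```c
#include <time.h>

/* Timeout (seconds) under the assumed age/10 heuristic, clamped to [lo, hi]. */
int timeout_for(time_t client_now, time_t server_mtime, int lo, int hi)
{
    long age_based = (long)(client_now - server_mtime) / 10;

    if (age_based < lo)
        return lo;
    if (age_based > hi)
        return hi;
    return (int)age_based;
}

/* Timeout computed by a client whose clock is off by skew seconds
 * (positive = client ahead of the server, negative = behind). */
int timeout_with_skew(time_t true_now, time_t server_mtime, long skew,
                      int lo, int hi)
{
    return timeout_for(true_now + (time_t)skew, server_mtime, lo, hi);
}
```

For a file really modified 300 s ago with limits of 3 and 60 s, an in-sync client uses a 30 s timeout, a client 250 s behind uses 5 s, and a client 500 s ahead is pinned at 60 s.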
[good stuff snipped]
rick


