Date:      Sun, 15 Apr 2018 13:10:24 +0200
From:      Niels Kobschaetzki <niels@kobschaetzki.net>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject:   Re: High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release
Message-ID:  <36907CE0-EAD3-4E11-8023-5BCEA1239813@kobschaetzki.net>
In-Reply-To: <YQBPR0101MB1042087832CE6FDCCA3B4216DDB20@YQBPR0101MB1042.CANPRD01.PROD.OUTLOOK.COM>
References:  <ce3712c0-626e-c8f2-3bba-933cf359bcef@kobschaetzki.net> <YQBPR0101MB1042D2F0CE2575EB4F17588ADDB20@YQBPR0101MB1042.CANPRD01.PROD.OUTLOOK.COM> <f3cea179-75d7-916b-68d1-61fe75c0bb80@kobschaetzki.net> <YQBPR0101MB1042087832CE6FDCCA3B4216DDB20@YQBPR0101MB1042.CANPRD01.PROD.OUTLOOK.COM>


> On 15. Apr 2018, at 01:18, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
> Niels Kobschätzki wrote:
>>> On 04/14/2018 03:49 AM, Rick Macklem wrote:
>>> Niels Kobschätzki wrote:
>>>> Sorry for the cross-posting, but so far I had no real luck on the forum
>>>> or on questions, thus I want to try my luck here as well.
>>> I read email lists but don't do the other stuff, so I just saw this
>>> yesterday.
>>> Short answer, I haven't a clue why the cache hit rate would have changed.
>>>
>>> The code that decides if there is a hit/miss for the attribute cache is in
>>> ncl_getattrcache() and the code hasn't changed between 10.3->11.1,
>>> except the old code did a mtx_lock(&Giant), but I can't imagine how that
>>> would affect the code.
>>>
>>> You might want to:
>>> # sysctl -a | fgrep vfs.nfs
>>> for both the 10.3 and 11.1 systems, to check if any defaults have somehow
>>> been changed. (I don't recall any being changed, but??)
>>
>> I did that and nothing had changed.
>>
>>> If you go into ncl_getattrcache() {it's in sys/fs/nfsclient/nfs_clsubs.c}
>>> and add a printf() for "time_second" and "np->n_mtime.tv_sec" near the
>>> top, where it calculates "timeo" from it.
>>> Running this hacked kernel might show you if either of these fields is
>>> bogus.
>>> (You could then printf() "timeo" and "np->n_attrtimeo" just before the "if"
>>> clause that increments "attrcache_misses", which is where the cache misses
>>> happen to see why it is missing the cache.)
>>> If you could do this for the 10.3 kernel as well, this might indicate why
>>> the miss rate has increased?
>>
>> I will do this next week. On Monday we switch, for other reasons, to other
>> nfs-servers, and once we see that they run stably, I will do this next.
> With a miss rate of 2.7%, I doubt printing the above will help. I thought
> you were seeing a high miss rate.

It is low, but it increased by nearly a factor of 1000 compared to before. I
hope the print will help. It is just a lot of grepping through wherever I can
get this data.

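As a quick check on the figures quoted in this thread (about 0.004% before and 2.7% after), here is a minimal sketch of the arithmetic. The helper names are hypothetical and not taken from any FreeBSD tool; the inputs are assumed to be the attribute-cache hit and miss counters that the client reports.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Miss rate in percent, computed from hit and miss counters.
 * Hypothetical helper for illustration only.
 */
double
miss_rate_pct(uint64_t hits, uint64_t misses)
{
	uint64_t total = hits + misses;

	if (total == 0)
		return (0.0);
	return (100.0 * (double)misses / (double)total);
}

/*
 * How many times larger the new miss rate is than the old one.
 * The old rate must be positive, otherwise the ratio is undefined.
 */
double
rate_increase_factor(double old_pct, double new_pct)
{
	assert(old_pct > 0.0);
	return (new_pct / old_pct);
}
```

With the two percentages above, rate_increase_factor(0.004, 2.7) comes out at roughly 675, which is a large relative jump even though the absolute miss rate stays small.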

>> Btw. I have now calculated the percentages. The old servers had an attr miss
>> rate of something like 0.004%, while the upgraded one has more like
>> 2.7%. This is still low from what I've read (I remember that you should
>> start adjusting acreg* when you hit more than 40% misses) but far higher
>> than before.
> You could try increasing acregmin, acregmax and see if the misses are reduced.
> (The only risk with increasing the cache timeout is that, if another client
> changes the attributes, then the client will use stale ones for longer.
> Usually, this doesn't cause serious problems.)

I tried that and it had no effect at all.

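To make the role of acregmin and acregmax more concrete, the following is a simplified userland model of a time-based attribute cache check of the kind discussed for ncl_getattrcache(). It is a sketch under assumptions, not the kernel source: the struct, the function names, and the "one tenth of the file's age" rule are illustrative, and the real function has additional conditions that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the per-file state a client keeps. */
struct attr_cache_entry {
	time_t mtime;     /* modification time as reported by the server */
	time_t attrstamp; /* client time when the attributes were cached */
	bool   modified;  /* file has been modified locally */
};

/*
 * Timeout in seconds: one tenth of the file's age, clamped to
 * [ac_min, ac_max]. A locally modified file always gets ac_min.
 */
time_t
attr_timeout(const struct attr_cache_entry *e, time_t now,
    time_t ac_min, time_t ac_max)
{
	time_t timeo;

	assert(ac_min <= ac_max);
	timeo = (now - e->mtime) / 10;
	if (e->modified || timeo < ac_min)
		timeo = ac_min;
	else if (timeo > ac_max)
		timeo = ac_max;
	return (timeo);
}

/* True if the cached attributes are still young enough to be used. */
bool
attr_cache_hit(const struct attr_cache_entry *e, time_t now,
    time_t ac_min, time_t ac_max)
{
	return ((now - e->attrstamp) < attr_timeout(e, now, ac_min, ac_max));
}
```

In this model, with illustrative limits of 3 and 60 seconds, a file last modified 1002 seconds ago gets the maximum timeout of 60, while a file modified 7 seconds ago gets the minimum of 3. A modification time that lies ahead of the client clock also ends up at the minimum, which is one reason the clocks matter for a cache like this.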
> To be honest, a Getattr RPC is pretty low overhead, so I doubt the increase
> to 2.7% will affect your application's performance, but it is interesting that
> it increased.

It is a website with quite some traffic, handled by three webservers behind a
pair of load balancers.
We see a loss of 20% in speed (TTFB is about 100 ms slower; that does not sound
like a lot, but Google et al. don't like it at all) after upgrading to 11.1 with
a combined upgrade to php7.1. On another server without NFS, that upgrade
improved performance considerably (I was told ca. 30% by the front-end dev).

> You might also try increasing acdirmin, acdirmax in case it is the directory
> attributes that are having cache misses.

I did that, too.

> Oh, and check that your time of day clocks are in sync with the server,
> since the caches are time based, since there is no cache coherency protocol
> in NFS.

I checked that. All three frontends are using the same server for NTP.

Thanks so far,

Niels



