From: Peter Eriksson <pen@lysator.liu.se>
Subject: ZFS/NFS hickups and some tools to monitor stuff...
Date: Sat, 14 Mar 2020 16:36:49 +0100
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>

The last couple of weeks I've been fighting with a severe case of NFS users complaining about slow response times from our (5) FreeBSD 11.3-RELEASE-p6 file servers.
Now, even though our SMB (Windows) users (thankfully, since they number about 500 per server vs 50 NFS users) didn't see the same slowdown (or at least didn't complain about it), the root cause is probably ZFS-related.

We've identified a number of cases where some ZFS operation can cause severe slowdown of NFS operations, and I've been trying to figure out what the cause is and find ways to mitigate the problem…

Some operations that have caused issues:

1. Resilver (basically made NFS service useless during the week it took…) with response times for NFS operations regularly up to 10 seconds or more (vs the normal 1-10 ms).

2. Recursive deferred snapshot destruction ("zfs destroy -dr DATA@snapnam"). Especially bad together with filesystems at or near quota.

3. Rsync cloning of data onto the servers. Response times of up to 15 minutes were seen… Yes, 15 minutes to do a mkdir("test-dir"). Possibly in conjunction with #1 above….

Previously #1 and #2 hadn't caused this many problems, and #3 definitely hadn't. Something has changed in the last half year or so, but so far I haven't been able to figure out what.

In order to test and figure things out I've written a couple of tools that others might find useful:

A couple of DTrace scripts to watch and monitor the timing of kernel calls related to NFS, GSS & ZFS operations (the "nfsuser" script is a simple "nfs user top" tool).
Also a patch that adds a sysctl that can be set to have the NFS kernel code be verbose about slow NFS operations:

- https://github.com/ptrrkssn/freebsd-stuff

With the kernel patch (nfsd-verbose-timing-11.3.patch or nfsd-verbose-timing-12.1.patch) it'll look like this (alert if an op takes more than 1000 µs):

# sysctl vfs.nfsd.verbose_timing=1000

nfsrvd_dorpc(vers=4.1, uid=65534, procnum=9, repstat=0) took 2853 µs
nfsrvd_dorpc(vers=4.0, uid=1003258, procnum=17, repstat=10009) took 5433 µs
nfsrvd_dorpc(vers=4.1, uid=65534, procnum=9, repstat=0) took 2026 µs

(The DTrace scripts can do similar stuff, but I wanted a real in-kernel way to see whether delays are caused by something on the machine or by something else, such as network delays.)

A very simple C program that in a loop tests some simple filesystem operations and measures the time they take:

- https://github.com/ptrrkssn/pfst

Mount a number of file servers on /mnt/filur01, /mnt/filur02, /mnt/filur03 etc, then run this to have it complain if an operation takes more than 100 ms (with "-v" it will print the time each operation takes):

$ ./pfst -t100ms /mnt/filur0*
2020-03-14 10:56:36 [ 109 ms]: /mnt/filur03: mkdir("t-omnibus-615-53202") [Time limit exceeded]
2020-03-14 11:14:22 [ 536 s ]: /mnt/filur03: mkdir("t-omnibus-615-53729") [Time limit exceeded]
2020-03-14 11:31:47 [  13 m ]: /mnt/filur03: mkdir("t-omnibus-615-53965") [Time limit exceeded]
2020-03-14 11:35:44 [ 182 ms]: /mnt/filur03: mkdir("t-omnibus-615-54201") [Time limit exceeded]
2020-03-14 12:20:03 [  15 m ]: /mnt/filur03: mkdir("t-omnibus-615-55908") [Time limit exceeded]
2020-03-14 12:39:09 [  15 m ]: /mnt/filur03: mkdir("t-omnibus-615-56103") [Time limit exceeded]
2020-03-14 12:50:58 [ 466 s ]: /mnt/filur03: mkdir("t-omnibus-615-56344") [Time limit exceeded]

With "-v":

2020-03-14 16:27:48 [1349 µs]: /mnt/filur01: mkdir("t-omnibus-637-2")
2020-03-14 16:27:48 [ 327 µs]: /mnt/filur01: rmdir("t-omnibus-637-2")

It would be interesting to see if others too are seeing ZFS and/or NFS slowdowns during heavy write operations (resilver, snapshot destroy, rsync)…

Our DATA pools are basically 2xRAIDZ2(4+2) of 10TB 7200rpm disks + 400GB SSDs for ZIL + 400GB SSDs for L2ARC. 256GB RAM, configured with ARC-MAX set to 64GB (it used to be 128GB, but we ran into out-of-memory situations with the 500+ Samba smbd daemons competing for the RAM…)

We've tried it with and without L2ARC, and replaced the SSDs. Disabled TRIM. Not much difference. Tried tuning various sysctls, but no difference seen so far.

Annoying problem, this…

- Peter