From: Peter Eriksson <pen@lysator.liu.se>
Subject: ZFS/NFS hickups and some tools to monitor stuff...
Date: Sat, 14 Mar 2020 16:36:49 +0100
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>

The last couple of weeks I've been fighting with a severe case of NFS users complaining about slow response times from our (5) FreeBSD 11.3-RELEASE-p6 file servers.
Now, even though our SMB (Windows) users (thankfully, since they number about 500 per server vs 50 NFS users) didn't see the same slowdown (or at least didn't complain about it), the root cause is probably ZFS-related.

We've identified a number of cases where some ZFS operation can cause severe slowdown of NFS operations, and I've been trying to figure out what the cause is and find ways to mitigate the problem…

Some operations that have caused issues:

1. Resilver (basically made NFS service useless during the week it took…) with response times for NFS operations regularly up to 10 seconds or more (vs the normal 1-10 ms).

2. Recursive deferred snapshot destruction ("zfs destroy -dr DATA@snapnam"). Especially bad together with filesystems at or near quota.

3. Rsync cloning of data onto the servers. Response times of up to 15 minutes were seen… Yes, 15 minutes to do a mkdir("test-dir"). Possibly in conjunction with #1 above….

Previously #1 and #2 hadn't caused this many problems, and #3 definitely hadn't. Something has changed in the last half year or so, but so far I haven't been able to figure out what.

In order to test and figure things out I've written a couple of tools that others might find useful:

A couple of DTrace scripts to watch and monitor the timing of kernel calls related to NFS, GSS & ZFS operations (the "nfsuser" script is a simple "nfs user top" tool).
Also a patch that adds a sysctl that can be set to have the NFS kernel code be verbose about slow NFS operations:

- https://github.com/ptrrkssn/freebsd-stuff

With the kernel patch (nfsd-verbose-timing-11.3.patch or nfsd-verbose-timing-12.1.patch) it'll look like this (alert if an op takes more than 1000 µs):

# sysctl vfs.nfsd.verbose_timing=1000

nfsrvd_dorpc(vers=4.1, uid=65534, procnum=9, repstat=0) took 2853 µs
nfsrvd_dorpc(vers=4.0, uid=1003258, procnum=17, repstat=10009) took 5433 µs
nfsrvd_dorpc(vers=4.1, uid=65534, procnum=9, repstat=0) took 2026 µs

(The DTrace scripts can do similar stuff, but I wanted a real in-kernel way to see whether delays are caused by something on the machine or by something else, such as network delays.)

A very simple C program that in a loop tests some simple filesystem operations and measures the time they take:

- https://github.com/ptrrkssn/pfst

Mount a number of file servers on /mnt/filur01, /mnt/filur02, /mnt/filur03 etc, then run this to have it complain if an operation takes more than 100 ms (with "-v" it will print the time each operation takes):

$ ./pfst -t100ms /mnt/filur0*
2020-03-14 10:56:36 [ 109 ms]: /mnt/filur03: mkdir("t-omnibus-615-53202") [Time limit exceeded]
2020-03-14 11:14:22 [ 536 s ]: /mnt/filur03: mkdir("t-omnibus-615-53729") [Time limit exceeded]
2020-03-14 11:31:47 [  13 m ]: /mnt/filur03: mkdir("t-omnibus-615-53965") [Time limit exceeded]
2020-03-14 11:35:44 [ 182 ms]: /mnt/filur03: mkdir("t-omnibus-615-54201") [Time limit exceeded]
2020-03-14 12:20:03 [  15 m ]: /mnt/filur03: mkdir("t-omnibus-615-55908") [Time limit exceeded]
2020-03-14 12:39:09 [  15 m ]: /mnt/filur03: mkdir("t-omnibus-615-56103") [Time limit exceeded]
2020-03-14 12:50:58 [ 466 s ]: /mnt/filur03: mkdir("t-omnibus-615-56344") [Time limit exceeded]

With "-v":

2020-03-14 16:27:48 [1349 µs]: /mnt/filur01: mkdir("t-omnibus-637-2")
2020-03-14 16:27:48 [ 327 µs]: /mnt/filur01: rmdir("t-omnibus-637-2")

It would be interesting to see if others too are seeing ZFS and/or NFS slowdowns during heavy write operations (resilver, snapshot destroy, rsync)…

Our DATA pools are basically 2xRAIDZ2(4+2) of 10TB 7200rpm disks + 400GB SSDs for ZIL + 400GB SSDs for L2ARC. 256GB RAM, configured with ARC-MAX set to 64GB (it used to be 128GB, but we ran into out-of-memory situations with the 500+ Samba smbd daemons competing for the RAM…)

We've tried it with and without L2ARC, and replaced the SSDs. Disabled TRIM. Not much difference. Tried tuning various sysctls, but no difference seen so far.

Annoying problem, this…

- Peter