Date:      Sun, 29 Mar 2020 21:16:00 +0200
From:      Peter Eriksson <pen@lysator.liu.se>
To:        FreeBSD Filesystems <freebsd-fs@freebsd.org>
Cc:        "PK1048.COM" <info@pk1048.com>
Subject:   Re: ZFS/NFS hickups and some tools to monitor stuff...
Message-ID:  <7790DD37-4F95-409E-9E33-1A330B1B49C8@lysator.liu.se>
In-Reply-To: <CDB51790-ED6B-4670-B256-43CDF98BD26D@pk1048.com>
References:  <CFD0E4E5-EF2B-4789-BF14-F46AC569A191@lysator.liu.se> <66AB88C0-12E8-48A0-9CD7-75B30C15123A@pk1048.com> <E6171E44-F677-4926-9F55-775F538900E4@lysator.liu.se> <FE244C11-44CA-4DCC-8CD9-A8C7A7C5F059@pk1048.com> <982F9A21-FF1C-4DAB-98B3-610D70714ED3@lysator.liu.se> <CDB51790-ED6B-4670-B256-43CDF98BD26D@pk1048.com>

> I thought that snapshot deletion was single threaded within a zpool,
> since TXGs are zpool-wide, not per dataset. So you may not be able to
> destroy snapshot in parallel.

Yeah, I thought so too, but decided to try it anyway. It sometimes goes
faster, and since I decoupled "read snapshots to delete" and "do the
deletion" into separate threads, it no longer has to read all the
snapshots to delete first and then delete them all; it can interleave
the two jobs.

Basically what my code now does is:

	for all datasets (recursively)
	   collect_snapshots_to_delete
	   if more than a configurable number of snapshots is queued
		start a deletion worker (up to a configurable number of workers)

So it can continue gathering snapshots to delete while deleting a batch,
and it doesn't have to wait until all snapshots in all datasets have been
read before it starts deleting. So if it is slow for some reason, it will
at least have deleted _some_ snapshots by the time we terminate the
"clean" command.
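
In Python-ish pseudocode the whole pipeline is roughly the following
(just a sketch of the idea, not the actual code in my patch; the dataset
name, the batch size of 500 and the 10 workers are only illustrative,
and the expiry check is stubbed out):

# One reader thread walks the datasets and queues snapshots to delete;
# N worker threads drain the queue and issue batched "zfs destroy" calls,
# so deletion can start before the full snapshot list has been read.
import queue
import subprocess
import threading

BATCH = 500       # snapshots per destroy, cf. -L500
WORKERS = 10      # parallel deletion workers, cf. -P10
work = queue.Queue(maxsize=10 * BATCH)

def should_delete(snap):
    return True   # placeholder; the real check looks at the expiry property

def reader(datasets):
    # Collect snapshots to delete and queue them as they are found.
    for ds in datasets:
        out = subprocess.run(
            ["zfs", "list", "-H", "-t", "snapshot", "-o", "name", "-r", ds],
            capture_output=True, text=True, check=True).stdout
        for snap in out.splitlines():
            if should_delete(snap):
                work.put(snap)
    for _ in range(WORKERS):
        work.put(None)                # one "done" sentinel per worker

def worker():
    # Group queued snapshots per dataset and destroy them in batches.
    batch, queued = {}, 0
    while True:
        snap = work.get()
        if snap is None:
            break
        ds, name = snap.split("@", 1)
        batch.setdefault(ds, []).append(name)
        queued += 1
        if queued >= BATCH:
            flush(batch)
            queued = 0
    flush(batch)

def flush(batch):
    # "zfs destroy" accepts comma-separated snapshots of one dataset.
    for ds, names in batch.items():
        subprocess.run(["zfs", "destroy", ds + "@" + ",".join(names)],
                       check=True)
    batch.clear()

if __name__ == "__main__":
    workers = [threading.Thread(target=worker) for _ in range(WORKERS)]
    for t in workers:
        t.start()
    reader(["DATA/staff"])
    for t in workers:
        t.join()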

I did some tests with different numbers of "worker" threads and actually
did see some speed improvements (the time was cut in half in some cases).
But it varies a lot, I guess; if all the metadata is in the ARC it is
normally pretty quick anyway.

I've also been thinking of adding separate read workers, so that if one
dataset takes a long time to list its snapshots the others can continue,
but that's a bit harder to code in a good way :-)
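
Roughly, the reader side could be fanned out like this (again only a
sketch; the awkward bit is that the deletion workers must not stop until
every reader has finished):

# Per-dataset reader threads feeding one shared queue, so one slow
# dataset does not stall the listing of the others. The workers may
# only be released (via the sentinels) after *all* readers are done.
from concurrent.futures import ThreadPoolExecutor
import queue
import subprocess

work = queue.Queue()

def read_one(ds):
    out = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name", "-r", ds],
        capture_output=True, text=True, check=True).stdout
    for snap in out.splitlines():
        work.put(snap)

def read_all(datasets, nworkers, nreaders=4):
    with ThreadPoolExecutor(max_workers=nreaders) as pool:
        for ds in datasets:
            pool.submit(read_one, ds)
    # Leaving the "with" block waits for all readers to finish.
    for _ in range(nworkers):
        work.put(None)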

What we do now is (simplified):

	# Create hourly snapshots that expire in 2 days:
	zfs snap -r -E "se.liu.it:expires" -e 2d "DATA/staff@${DATETIME}"

	# Clean expired snapshots (10 workers, at least 500 snapshots per delete):
	zfs clean -r -E "se.liu.it:expires" -P10 -L500 -e DATA/staff
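
The selection step on the "clean" side basically boils down to comparing
that user property against the current time. In Python-ish pseudocode
(treating the property value as an absolute Unix timestamp purely for the
sake of the example):

# List snapshots under a dataset and yield those whose
# "se.liu.it:expires" user property lies in the past.
# Assumption for this sketch only: the property holds an absolute Unix
# timestamp; unset properties show up as "-".
import subprocess
import time

def expired_snapshots(dataset, prop="se.liu.it:expires"):
    out = subprocess.run(
        ["zfs", "get", "-H", "-t", "snapshot", "-r",
         "-o", "name,value", prop, dataset],
        capture_output=True, text=True, check=True).stdout
    now = time.time()
    for line in out.splitlines():
        name, value = line.split("\t")
        if value != "-" and float(value) <= now:
            yield name

# Example: print what would be cleaned under DATA/staff.
if __name__ == "__main__":
    for snap in expired_snapshots("DATA/staff"):
        print(snap)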

I have my patch available on GitHub (https://github.com/ptrrkssn/freebsd-stuff)
if it would be of interest.

(At first I modified the "zfs destroy" command, but I always feel nervous
about using that one, since a slip of the finger could have catastrophic
consequences, so I decided to create a separate command that only works on
snapshots and nothing else.)


> I expect zpool/zfs commands to be very slow when large zfs operations
> are in flight. The fact that you are not seeing the issue locally means
> the issue is not directly with the zpool/dataset but somehow with the
> interaction between NFS Client <-> NFS Server <-> ZFS dataset … NFS does
> not have to be sync, but can you force the NFS client to always use sync
> writes? That might better leverage the SLOG. Since our use case is VMs
> and VirtualBox does sync writes, we get the benefit of the SLOG.
>
>>> If this is during typical activity, you are already using 13% of your
>>> capacity. I also don't like the 80ms per operation times.
>>
>> The spinning rust drives are HGST He10 (10TB SAS 7200rpm) drives on
>> Dell HBA330 controllers (LSI SAS3008). We also use HP server with their
>> own Smart H241 HBAs and are seeing similar latencies there.
>
> That should be Ok, but I have heard some reports of issues with the HP
> Smart 4xx series controllers with FreeBSD. Why are you seeing higher
> disk latency with SAS than we are with SATA? I assume you checked logs
> for device communication errors and retries?

Yeah, no errors. The HP H241 HBAs are not as well supported as the
SAS3008 ones, but they work OK, at least if you force them into "HBA"
mode (changeable from the BIOS). Until we did that they had their
problems, yes… and there were also firmware issues on certain releases.

Anyway, we are going to expand the RAM in the servers from 256GB to
512GB (or 768GB). A test I did on our test server seems to indicate that
with more RAM the metadata set fits much better in the ARC, so everything
is much faster.
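
One quick way to eyeball how much of the ARC goes to metadata on FreeBSD
is to dump the arcstats sysctls and look at the "meta" counters, e.g.:

# Dump the ZFS arcstats sysctls and print the metadata-related counters
# (arc_meta_used, arc_meta_limit, ...). Counter names differ a bit
# between ZFS versions, hence the simple substring filter.
import subprocess

out = subprocess.run(["sysctl", "kstat.zfs.misc.arcstats"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    if "meta" in line:
        print(line)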

(Now I'd also like to see persistent L2ARC support - it would be great to
have the metadata cached on faster SSDs and have it survive a reboot -
but that won't happen until the switch to OpenZFS (FreeBSD 13, hopefully),
so…)

- Peter



