Date:      Mon, 15 Feb 2016 10:05:45 -0500
From:      Paul Kraus <paul@kraus-haus.org>
To:        Andrew Reilly <areilly@bigpond.net.au>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: Hours of tiny transfers at the end of a ZFS resilver?
Message-ID:  <44B57B63-C9C5-4166-8737-D4866E6A9D08@kraus-haus.org>
In-Reply-To: <120226C8-3003-4334-9F5F-882CCB0D28C5@bigpond.net.au>
References:  <120226C8-3003-4334-9F5F-882CCB0D28C5@bigpond.net.au>

On Feb 15, 2016, at 5:18, Andrew Reilly <areilly@bigpond.net.au> wrote:

> Hi Filesystem experts,
>
> I have a question about the nature of ZFS and the resilvering
> that occurs after a drive replacement in a raidz array.

How many snapshots do you have? I have seen this behavior on pools with
many snapshots and ongoing creation of snapshots during the resilver.
The resilver gets to somewhere above 95% (usually 99.xxx% for me) and
then slows to a crawl, often for days.
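
A quick way to see how many snapshots a resilver has to walk (the pool
name "tank" below is just a placeholder):

    # count every snapshot in the pool
    zfs list -H -t snapshot -r tank | wc -l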

Most of the ZFS pools I manage have automated jobs to create hourly
snapshots, so I am always creating snapshots.
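
Those jobs are nothing exotic; a minimal sketch of that kind of hourly
job (the dataset name and naming scheme here are only examples):

    #!/bin/sh
    # take a timestamped hourly snapshot of one dataset
    zfs snapshot "tank/data@hourly-$(date +%Y%m%d-%H%M)"
    # pruning old hourly snapshots is left to a separate job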

More below...

>
> I have a fairly simple home file server that (by way of

<snip>

> have had the system off-line for many hours (I guess).
>
> Now, one thing that I didn't realise at the start of this
> process was that the zpool has the original 512B sector size
> baked in at a fairly low level, so it is using some sort of
> work-around for the fact that the new drives actually have 4096B
> sectors (although they lie about that in smartctl -i queries):

Running 4K native drives in a 512B pool will cause a performance hit.
When I ran into this I rebuilt the pool from scratch as a 4K native
pool. If there is at least one 4K native drive in a given vdev, the vdev
will be created native 4K (at least under FreeBSD 10.x). My home server
has a pool of mixed 512B and 4K drives; I made sure each vdev was built 4K.
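
If you do rebuild, you can check what a pool actually got and force 4K
alignment up front. A rough sketch for FreeBSD (pool and disk names are
placeholders; the sysctl is available on reasonably recent releases):

    # show the ashift each vdev was created with (9 = 512B, 12 = 4K)
    zdb -C tank | grep ashift

    # force new vdevs to be created 4K-aligned before building the pool
    sysctl vfs.zfs.min_auto_ashift=12
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5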

The code in the drive that emulates 512B behavior has not been very fast,
and that is the crux of the performance issues. I just had to rebuild a
pool because 2TB WD Red Pro drives are 4K while 2TB WD RE drives are 512B.
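
To see whether a drive is one of the 512B-emulating 4K models, FreeBSD
will tell you directly; a quick sketch (the device name is only an
example):

    # logical sector size vs. physical stripe size as the OS sees them
    diskinfo -v /dev/ada0 | grep -E 'sectorsize|stripesize'

    # what the drive itself reports (512e drives show 512 logical / 4096 physical)
    smartctl -i /dev/ada0 | grep -i 'sector size'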

<snip>

> While clearly sub-optimal, I expect that the performance will
> still be good enough for my purposes: I can build a new,
> properly aligned file system when I do the next re-build.
>
> The odd thing is that after charging through the resilver using
> large blocks (around 64k according to systat), when they get to
> the end, as this one is now, the process drags on for hours with
> millions of tiny, sub-2K transfers:

Yup.

The resilver process walks through the transaction groups (TXGs),
replaying them onto the new (replacement) drive. This is different from
other, traditional resync methods. It also means that the early TXGs will
be large (as you loaded data) and then the size of the TXGs will vary with
the size of the data written.

<snip>

> So there's a problem with the zpool status output: it's
> predicting half an hour to go based on the averaged 67M/s over
> the whole drive, not the <2MB/s that it's actually doing, and
> will probably continue to do so for several hours, if tonight
> goes the same way as last night.  Last night zpool status said
> "0h05m to go" for more than three hours, before I gave up
> waiting to start the next drive.

Yup, the code that estimates time to go is based on the overall average
transfer rate, not the current one. In my experience the transfer rate
peaks somewhere in the middle of the resilver.
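
To see what the resilver is actually doing right now, rather than the
lifetime average, watch the disks themselves; a rough sketch (pool and
device names are examples):

    # per-disk throughput and transfer size, refreshed every 5 seconds
    gstat -p -I 5s

    # or sample the pool's own progress line and extended disk stats
    zpool status tank | grep -A 2 'scan:'
    iostat -x -w 5 ada0 ada1 ada2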

> Is this expected behaviour, or something bad and peculiar about
> my system?

Expected? I'm not sure if the designers of ZFS expected this behavior :-)

But it is the typical behavior and is correct.

> I'm confused about how ZFS really works, given this state.  I
> had thought that the zpool layer did parity calculation in big
> 256k-ish stripes across the drives, and the zfs filesystem layer
> coped with that large block size because it had lots of caching
> and wrote everything in log-structure.  Clearly that mental
> model must be incorrect, because then it would only ever be
> doing large transfers.  Anywhere I could go to find a nice
> write-up of how ZFS is working?

You really can't think about ZFS the same way as older systems, with a
volume manager and a filesystem; they are fully integrated. For example,
stripe size (across all the top level vdevs) is dynamic, changing with
each write operation. I believe that it tries to include every top level
vdev in each write operation. In your case that does not apply as you
only have one top level vdev, but note that performance really scales
with the number of top level vdevs more than with the number of drives
per vdev.

Also note that striping within a RAIDz<n> vdev is separate from the top
level vdev striping.
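
To make the top level vdev distinction concrete, here is a sketch of two
layouts built from the same eight disks (device names are placeholders);
the first is a single top level vdev, the second is four:

    # one top level vdev: writes stripe only within the raidz2
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5 da6 da7

    # four top level vdevs: writes are spread dynamically across all four mirrors
    zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5 mirror da6 da7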

Take a look here:
http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for a good
discussion of ZFS striping for RAIDz<n> vdevs. And don't forget to
follow the links at the bottom of the page for more details.

P.S. For performance it is generally recommended to use mirrors, while
for capacity use RAIDz<n>, all tempered by the mean time to data loss
(MTTDL) you need. Hint: a 3-way mirror has about the same MTTDL as a
RAIDz2.
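
As a sketch of that trade-off (pool and device names are made up), both
layouts below survive any two drive failures, but the usable capacity is
very different:

    # 3-way mirror: roughly one disk of usable space
    zpool create fast mirror da0 da1 da2

    # 6-disk raidz2: roughly four disks of usable space
    zpool create big raidz2 da0 da1 da2 da3 da4 da5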

--
Paul Kraus
paul@kraus-haus.org



