Date:      Wed, 22 Feb 2017 22:50:01 +0100
From:      Wiktor Niesiobedzki <bsd@vink.pl>
To:        "Eric A. Borisch" <eborisch@gmail.com>
Cc:        "Eugene M. Zheganin" <emz@norma.perm.ru>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: zfs raidz overhead
Message-ID:  <CAH17caWPRtJVpTQNrqaabtYt7xR+oc-eL87tvea=pXjG12oEJg@mail.gmail.com>
In-Reply-To: <CAASnNnpB7NFWUbBLxKidXzsDMAwzcJzRc_f4R-9JG_=BZ9fA+A@mail.gmail.com>
References:  <1b54a2fe35407a95edca1f992fa08a71@norman-vivat.ru> <CAASnNnpB7NFWUbBLxKidXzsDMAwzcJzRc_f4R-9JG_=BZ9fA+A@mail.gmail.com>

I can add that this is not only seen on raidz, but also on mirror
pools, such as this one:
# zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 3h22m with 0 errors on Thu Feb  9 06:47:07 2017
config:

        NAME               STATE     READ WRITE CKSUM
        tank               ONLINE       0     0     0
          mirror-0         ONLINE       0     0     0
            gpt/tank1.eli  ONLINE       0     0     0
            gpt/tank2.eli  ONLINE       0     0     0

errors: No known data errors


When I created test zvols:
# zfs create -V10gb -o volblocksize=8k tank/tst-8k
# zfs create -V10gb -o volblocksize=16k tank/tst-16k
# zfs create -V10gb -o volblocksize=32k tank/tst-32k
# zfs create -V10gb -o volblocksize=64k tank/tst-64k
# zfs create -V10gb -o volblocksize=128k tank/tst-128k

# zfs get used tank/tst-8k
NAME         PROPERTY  VALUE  SOURCE
tank/tst-8k  used      10.3G  -
root@kadlubek:~ # zfs get used tank/tst-16k
NAME          PROPERTY  VALUE  SOURCE
tank/tst-16k  used      10.2G  -
root@kadlubek:~ # zfs get used tank/tst-32k
NAME          PROPERTY  VALUE  SOURCE
tank/tst-32k  used      10.1G  -
root@kadlubek:~ # zfs get used tank/tst-64k
NAME          PROPERTY  VALUE  SOURCE
tank/tst-64k  used      10.0G  -
root@kadlubek:~ # zfs get used tank/tst-128k
NAME           PROPERTY  VALUE  SOURCE
tank/tst-128k  used      10.0G  -
root@kadlubek:~ #

So it might be related not only to raidz pools.
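
One way to see which component accounts for the difference is to compare
the reservation and logical usage across the test zvols (all standard zfs
properties), for example:

# zfs list -t volume -o name,volblocksize,used,refreservation,logicalused -r tank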

I also noted that snapshots impact the used stat far more than the
usedbysnapshots value would suggest:

zfs get volsize,used,referenced,compressratio,volblocksize,usedbysnapshots,\
usedbydataset,usedbychildren tank/dkr-thinpool
NAME               PROPERTY         VALUE      SOURCE
tank/dkr-thinpool  volsize          10G        local
tank/dkr-thinpool  used             12.0G      -
tank/dkr-thinpool  referenced       1.87G      -
tank/dkr-thinpool  compressratio    1.91x      -
tank/dkr-thinpool  volblocksize     64K        -
tank/dkr-thinpool  usedbysnapshots  90.4M      -
tank/dkr-thinpool  usedbydataset    1.87G      -
tank/dkr-thinpool  usedbychildren   0          -


On a 10G volume, filled with about 2G of data and with only 90M used by
snapshots, used is 12.0G, i.e. 2G more than the volsize. When I destroy
the snapshots, used drops back to 10.0G.
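
To check whether the extra space is held by the refreservation rather than
by the snapshot blocks themselves, something like this should show it:

# zfs get volsize,refreservation,usedbyrefreservation,usedbysnapshots tank/dkr-thinpool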

Cheers,

Wiktor

2017-02-22 0:31 GMT+01:00 Eric A. Borisch <eborisch@gmail.com>:
> On Tue, Feb 21, 2017 at 2:45 AM, Eugene M. Zheganin <emz@norma.perm.ru>
> wrote:
>
>
>
> Hi.
>
> There's an interesting case described here:
> http://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol
>
> It's a story from a user who found that in some situations zfs on
> raidz can use up to 200% of the nominal space for a zvol.
>
> I have also seen this. For instance:
>
> [root@san1:~]# zfs get volsize gamestop/reference1
>  NAME PROPERTY VALUE SOURCE
>  gamestop/reference1 volsize 2,50T local
>  [root@san1:~]# zfs get all gamestop/reference1
>  NAME PROPERTY VALUE SOURCE
>  gamestop/reference1 type volume -
>  gamestop/reference1 creation Thu Nov 24 9:09 2016 -
>  gamestop/reference1 used 4,38T -
>  gamestop/reference1 available 1,33T -
>  gamestop/reference1 referenced 4,01T -
>  gamestop/reference1 compressratio 1.00x -
>  gamestop/reference1 reservation none default
>  gamestop/reference1 volsize 2,50T local
>  gamestop/reference1 volblocksize 8K -
>  gamestop/reference1 checksum on default
>  gamestop/reference1 compression off default
>  gamestop/reference1 readonly off default
>  gamestop/reference1 copies 1 default
>  gamestop/reference1 refreservation none received
>  gamestop/reference1 primarycache all default
>  gamestop/reference1 secondarycache all default
>  gamestop/reference1 usedbysnapshots 378G -
>  gamestop/reference1 usedbydataset 4,01T -
>  gamestop/reference1 usedbychildren 0 -
>  gamestop/reference1 usedbyrefreservation 0 -
>  gamestop/reference1 logbias latency default
>  gamestop/reference1 dedup off default
>  gamestop/reference1 mlslabel -
>  gamestop/reference1 sync standard default
>  gamestop/reference1 refcompressratio 1.00x -
>  gamestop/reference1 written 4,89G -
>  gamestop/reference1 logicalused 2,72T -
>  gamestop/reference1 logicalreferenced 2,49T -
>  gamestop/reference1 volmode default default
>  gamestop/reference1 snapshot_limit none default
>  gamestop/reference1 snapshot_count none default
>  gamestop/reference1 redundant_metadata all default
>
> [root@san1:~]# zpool status gamestop
>  pool: gamestop
>  state: ONLINE
>  scan: none requested
>  config:
>
>  NAME          STATE     READ WRITE CKSUM
>  gamestop      ONLINE       0     0     0
>    raidz1-0    ONLINE       0     0     0
>      da6       ONLINE       0     0     0
>      da7       ONLINE       0     0     0
>      da8       ONLINE       0     0     0
>      da9       ONLINE       0     0     0
>      da11      ONLINE       0     0     0
>
>  errors: No known data errors
>
> or, another server (overhead in this case isn't that big, but still
> considerable):
>
> [root@san01:~]# zfs get all data/reference1
>  NAME PROPERTY VALUE SOURCE
>  data/reference1 type volume -
>  data/reference1 creation Fri Jan 6 11:23 2017 -
>  data/reference1 used 3.82T -
>  data/reference1 available 13.0T -
>  data/reference1 referenced 3.22T -
>  data/reference1 compressratio 1.00x -
>  data/reference1 reservation none default
>  data/reference1 volsize 2T local
>  data/reference1 volblocksize 8K -
>  data/reference1 checksum on default
>  data/reference1 compression off default
>  data/reference1 readonly off default
>  data/reference1 copies 1 default
>  data/reference1 refreservation none received
>  data/reference1 primarycache all default
>  data/reference1 secondarycache all default
>  data/reference1 usedbysnapshots 612G -
>  data/reference1 usedbydataset 3.22T -
>  data/reference1 usedbychildren 0 -
>  data/reference1 usedbyrefreservation 0 -
>  data/reference1 logbias latency default
>  data/reference1 dedup off default
>  data/reference1 mlslabel -
>  data/reference1 sync standard default
>  data/reference1 refcompressratio 1.00x -
>  data/reference1 written 498K -
>  data/reference1 logicalused 2.37T -
>  data/reference1 logicalreferenced 2.00T -
>  data/reference1 volmode default default
>  data/reference1 snapshot_limit none default
>  data/reference1 snapshot_count none default
>  data/reference1 redundant_metadata all default
>  [root@san01:~]# zpool status data
>  pool: data
>  state: ONLINE
>  scan: none requested
>  config:
>
>  NAME          STATE     READ WRITE CKSUM
>  data          ONLINE       0     0     0
>    raidz1-0    ONLINE       0     0     0
>      da3       ONLINE       0     0     0
>      da4       ONLINE       0     0     0
>      da5       ONLINE       0     0     0
>      da6       ONLINE       0     0     0
>      da7       ONLINE       0     0     0
>    raidz1-1    ONLINE       0     0     0
>      da8       ONLINE       0     0     0
>      da9       ONLINE       0     0     0
>      da10      ONLINE       0     0     0
>      da11      ONLINE       0     0     0
>      da12      ONLINE       0     0     0
>    raidz1-2    ONLINE       0     0     0
>      da13      ONLINE       0     0     0
>      da14      ONLINE       0     0     0
>      da15      ONLINE       0     0     0
>      da16      ONLINE       0     0     0
>      da17      ONLINE       0     0     0
>
>  errors: No known data errors
>
> So my question is - how to avoid it? Right now I'm experimenting with
> the volblocksize, making it around 64k. I'm also suspecting that such
> overhead may be the consequence of various resizing operations, like
> extending the volsize of the volume or adding new disks into the pool,
> because I have a couple of servers with raidz where the initial
> disk/volsize configuration didn't change, and there the referenced/volsize
> numbers are pretty close to each other.
>
> Eugene.
>
>
> It comes down to the zpool's sector size (2^ashift) and the volblocksize --
> I'm guessing your old servers are at ashift=9 (512), and the new one is at
> 12 (4096), likely with 4k drives. This is the smallest/atomic size of reads
> & writes to a drive from ZFS.
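>
> If in doubt, the ashift of an existing pool can be read from its cached
> configuration with zdb, for example:
>
>   # zdb -C gamestop | grep ashift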
>
> As described in [1]:
>  * Allocations need to be a multiple of (p+1) sectors, where p is your
> parity level; for raidz1, p==1, and allocations need to be in multiples of
> (1+1)=2 sectors, or 8k (for ashift=12; this is the physical size /
> alignment on drive).
>  * It also needs to have enough parity for failures, so it also depends [2]
> on the number of drives in the pool at larger block/record sizes.
>
> So considering those requirements, and your zvol with volblocksize=8k and
> compression=off, allocations for one logical 8k block are always composed
> physically of two (4k) data sectors, one (p=1) parity sector (4k), and one
> padding sector (4k) to satisfy being a multiple of (p+1=) 2, or 16k
> (allocated on disk space), hence your observed 2x data size being actually
> allocated. Each of these blocks will be on a different drive. This is
> different from the sector-level parity in RAID5.
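>
> To make that arithmetic concrete, here is a rough back-of-the-envelope
> sketch of the allocation rules above (just the rules as described, not the
> actual allocator code); block size, sector size, drive count and parity
> level are parameters:
>
>   #!/bin/sh
>   # Estimate on-disk allocation of one logical zvol block on raidz:
>   # data sectors, plus one parity sector per row of (ndisks - p) data
>   # sectors, with the total rounded up to a multiple of (p + 1) sectors.
>   vb=${1:-8192}       # volblocksize in bytes
>   sector=${2:-4096}   # 2^ashift in bytes
>   ndisks=${3:-5}      # drives in the raidz vdev
>   p=${4:-1}           # parity level (raidz1 -> 1)
>   ndata=$(( (vb + sector - 1) / sector ))
>   nrows=$(( (ndata + ndisks - p - 1) / (ndisks - p) ))
>   total=$(( ndata + nrows * p ))
>   pad=$(( ((p + 1) - total % (p + 1)) % (p + 1) ))
>   echo "$vb bytes logical -> $(( (total + pad) * sector )) bytes allocated"
>
> For 8192/4096/5/1 that prints 16384 bytes allocated (the 2x above); for a
> 64k volblocksize it prints 81920 (1.25x).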
>
> As Matthew Ahrens [1] points out: "Note that setting a small recordsize
> with 4KB sector devices results in universally poor space efficiency --
> RAIDZ-p is no better than p-way mirrors for recordsize=4K or 8K."
>
> Things you can do:
>
>  * Use ashift=9 (and perhaps 512-byte sector drives). The same layout rules
> still apply, but now your 'atomic' size is 512b. You will want to test
> performance.
>  * Use a larger volblocksize, especially if the filesystem on the zvol uses
> a larger block size. If you aren't performance sensitive, use a larger
> volblocksize even if the hosted filesystem doesn't. (But test this out to
> see how performance sensitive you really are! ;) You'll need to use
> something like dd to move data between different block size zvols; see the
> example commands after this list.
>  * Enable compression if the contents are compressible (some likely will
> be.)
>  * Use a pool created from mirrors instead of raidz if you need
> high-performance small blocks while retaining redundancy.
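>
> For example, migrating to a 64k zvol (dataset names here are just
> placeholders) could look roughly like this, with whatever uses the old
> zvol stopped first:
>
>   # zfs create -V 2.5T -o volblocksize=64k -o compression=lz4 gamestop/reference1-64k
>   # dd if=/dev/zvol/gamestop/reference1 of=/dev/zvol/gamestop/reference1-64k bs=1m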
>
> You don't get efficient (better than mirrors) redundancy, performant small
> (as in small multiple of zpool's sector size) block sizes, and zfs's
> flexibility all at once.
>
>  - Eric
>
> [1] https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
> [2] My spin on Ahrens' spreadsheet:
> https://docs.google.com/spreadsheets/d/13sJPc6ZW6_441vWAUiSvKMReJW4z34Ix5JSs44YXRyM/edit?usp=sharing


