Date:      Wed, 12 Oct 2011 21:38:02 +0300
From:      Daniel Kalchev <daniel@digsys.bg>
To:        Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: AF (4096 byte sector) drives: Can you mix/match in a ZFS pool?
Message-ID:  <C9D5BB73-37C6-42FA-88BA-78DA2A4780B9@digsys.bg>
In-Reply-To: <20111012172912.GA27013@icarus.home.lan>
References:  <4E95AE08.7030105@lerctr.org> <20111012155938.GA24649@icarus.home.lan> <4E95C546.70904@digsys.bg> <20111012172912.GA27013@icarus.home.lan>


On Oct 12, 2011, at 20:29, Jeremy Chadwick wrote:

>> The gnop trick is used not because you will ask a 512-byte sector
>> drive to write 8 sectors with one I/O, but because you may ask a
>> 4096-byte sector drive to write only 512 bytes -- which for the
>> drive means it has to read 4096 bytes, modify 512 of these bytes and
>> write back 4096 bytes.
>
> If I'm reading this correctly, you're effectively stating ashift
> actually just defines (or helps in calculating) an LBA offset for the
> start of the pool-related data on that device?  "ashift" seems like a
> badly-named term/variable for what this does, but oh well.

ashift defines the minimum block size of the vdev. The choice of name
is fine, I believe, as it describes how one gets a power-of-2 size
(by shifting 1 left that number of times) :-)
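
For example, 1 << 9 = 512 and 1 << 12 = 4096, so ashift=9 means a
512-byte minimum block and ashift=12 a 4k one.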

>> The proper way to handle this is to create your zpool with 4096-byte
>> alignment, that is, for the time being by using the above gnop
>> 'hack'.
>
> ...which brings into question why this is needed at all, meaning, why
> the ZFS code cannot be changed to default to an ashift value that's
> calculated as 12 (or equivalent) regardless of 512-byte or 4096-byte
> sector drives.

Currently the ZFS block size ranges from 512 bytes to 128 kilobytes.
That is with an ashift of 9. If you have an ashift of 12, that
effectively means a minimum block size of 4k and a maximum block size
of 128k.
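
To see what ashift a pool actually got, zdb can dump the cached vdev
configuration (the pool name 'tank' here is just an example); it
should print something like:

  # zdb -C tank | grep ashift
              ashift: 12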


> How was this addressed on Solaris/OpenSolaris?
>

I don't think they have addressed it.

>> There should be no implications to having one vdev with 512 byte
>> alignment and another with 4096 byte alignment. ZFS is smart enough
>> to issue minimum of 512 byte writes to the former and 4096 bytes to
>> the latter thus not creating any bottleneck.
>
> How does ZFS determine this?  I was under the impression that this
> behaviour was determined by (or "assisted by") shift.

ZFS has a piece of data, say a 20-kbyte block, to write. Say you have
4 vdevs, two with ashift=9 (512 bytes) and two with ashift=12 (4096
bytes). All other issues ignored (equal-size vdevs, full to the same
capacity, etc.), the minimum it could write is 9k
(512+512+4096+4096). Apparently ZFS wants to fill all vdevs equally,
so it will likely issue one 4k write each to vdev1 and vdev2, and
twelve 512b writes (6k) each to vdev3 and vdev4, for 20k in total.

If, for example, it had 16k to write, it would write one 4k I/O to
each of the 4k vdevs and eight 512b I/Os (or a single 4k write,
depending on the layering abstraction) to each of the 512b vdevs.
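
Spelled out: 16k = 4k (vdev1) + 4k (vdev2) + 8 x 512b (vdev3) +
8 x 512b (vdev4).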

So yes, it is assisted by ashift.

But, for the time being, you need to tell ZFS how to create the vdevs
with the proper ashift value. This is because today's 4k drives lie
and report their geometry as 512b. As mentioned, there are patches
for FreeBSD to 'discover' this behaviour. Another approach is via
gnop -- needed only at vdev creation time. I haven't seen anything
like this for Solaris.
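
As a rough sketch of the gnop approach (the device name ada0 and pool
name tank are just placeholders):

  gnop create -S 4096 /dev/ada0    (present ada0 as a 4096-byte-sector device)
  zpool create tank /dev/ada0.nop  (the vdev is created with ashift=12)
  zpool export tank
  gnop destroy /dev/ada0.nop       (ashift is stored in the vdev label,
                                    so the nop is no longer needed)
  zpool import tank                (the pool imports via the plain ada0
                                    and keeps ashift=12)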

Daniel


