Date:      Mon, 29 Oct 2012 03:59:05 -0700
From:      Jeremy Chadwick <>
Subject:   Re: 9.1 and gmirror with GPT?
Message-ID:  <20121029105905.GA358@icarus.home.lan>

(I won't be responding to any public or private mails relating to this
topic after this point, just as an FYI)

Just a reminder for readers:

If you're truly using 4096-byte-sector disks -- specifically MECHANICAL
hard disks (MHDDs) -- 4KByte alignment is fine.

But if you ever plan on using an SSD in the future, you need to align
things to 1MByte or 2MBytes.

I have read on the mailing lists where some users "don't know why / what
the justification is" behind this, so I'll explain it:

The reason is that FTLs within SSDs do not issue erases (resetting bits
to one, the erased state of NAND flash) on a per-flash-page basis (a
flash page is commonly 4KBytes), but on a "block" basis (a group of
pages).  The size of that group is usually referred to as the "NAND
erase block size".

Let me make this clear: this is not the same thing as filesystem block
size or similar "block size" you might see mentioned throughout the
zillions of layers of I/O abstraction in a *IX system and its kernel.
Do not mix up the terms (yes I know it's confusing).  Anyway...

Most SSD vendors do not disclose what the NAND erase block size is in
their products, and that's disappointing.

However poking and prodding (usually performance testing) has shown that
most vendors use either 1MByte or 2MByte NAND erase block sizes (as of
this writing).  I haven't seen larger in the field yet, for consumer
products anyway (i.e. don't ask me about FusionIO).

This is why Windows Vista and Windows 7 align their partitions to 1MByte...

...and quite honestly FreeBSD should too.  I am aware 9.1-RELEASE
supposedly addresses this -- however I have not determined if the
alignment size chosen by the committer was 4096 or 1MB/2MB.  I have a
gut feeling it's the former, and that's bad.

With 1MByte or 2MByte alignment, performance on 512-byte MHDDs would be
fine, performance on 4096-byte MHDDs would be fine, and performance on
SSDs would be fine.

If folks want to be on the "extra super duper safe side", align to 2MB.
Otherwise align to 1MB and don't worry about it.

Lack of proper alignment to NAND erase block size can result in excess
wear and tear on the NAND flash, which diminishes both the effectiveness
of wear levelling and the performance of your drive.  Do not ask me for
numbers; I do not have them.  Read Wikipedia's article on wear levelling
for details.
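
To make the erase-block point concrete, here is a minimal Python sketch
of a deliberately simplified model: it just counts how many erase-block
sized regions a write overlaps.  Real FTLs remap logical to physical
addresses, so treat this as an illustration only, not as a description
of any actual drive.  The function name and the 1MByte erase block size
are assumptions made for the example.

```python
MIB = 1024 * 1024  # 1MByte, the erase block size assumed for this example


def erase_blocks_touched(offset: int, length: int, erase_block: int = MIB) -> int:
    """Count the erase-block-sized regions overlapped by the byte range
    [offset, offset + length). Simplified model: no FTL remapping."""
    if length <= 0:
        return 0
    first = offset // erase_block
    last = (offset + length - 1) // erase_block
    return last - first + 1


# A 1MByte write that starts exactly on a 1MByte boundary.
aligned = erase_blocks_touched(1 * MIB, MIB)

# The same write when everything is shifted by 4KBytes, as could happen
# with a partition that is only 4KByte-aligned.
shifted = erase_blocks_touched(1 * MIB + 4096, MIB)

print(aligned, shifted)  # prints "1 2"
```

In this model the shifted write straddles two erase blocks instead of
one, which is the kind of extra work that alignment is meant to avoid.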

Next: in case it's not clear to readers from Warren's statements: the
magical "8" divisor he's using comes from 4096/512 ("how many 512-byte
sectors are there in a 4096-byte sector").  Thus, for 1MByte alignment the
value would be 1048576/512 or 2048.  For 2MByte alignment the value
would be 2097152/512 or 4096.
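
The same arithmetic as a small Python sketch, assuming 512-byte LBAs as
above (the function name is made up for illustration):

```python
SECTOR_SIZE = 512  # bytes per LBA, as assumed in the arithmetic above


def alignment_in_sectors(alignment_bytes: int, sector_size: int = SECTOR_SIZE) -> int:
    """Convert an alignment in bytes to a count of sectors."""
    if alignment_bytes % sector_size != 0:
        raise ValueError("alignment must be a whole number of sectors")
    return alignment_bytes // sector_size


for label, nbytes in (("4KByte", 4096), ("1MByte", 1048576), ("2MByte", 2097152)):
    print(f"{label}: {alignment_in_sectors(nbytes)} sectors")
# 4KByte: 8 sectors
# 1MByte: 2048 sectors
# 2MByte: 4096 sectors
```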

My general rule of thumb is to use GPT, start my FreeBSD partitions at
LBA 4096, and make sure all the partition sizes are divisible by
2MBytes.  If there is a GPT+GEOM conflict, then now that graid(8) exists
I tend to recommend that people use BIOS-level RAID and then use GPT.
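
A rough Python sketch of how that rule of thumb could be checked against
a partition table.  The layout below is entirely hypothetical (names,
start LBAs and sizes are invented), the helper names are my own, and
512-byte LBAs are assumed; on a real system the numbers would come from
something like "gpart show" output.

```python
SECTOR_SIZE = 512            # bytes per LBA (assumption)
ALIGNMENT = 2 * 1024 * 1024  # 2MBytes, per the rule of thumb above


def round_up_lba(lba, alignment=ALIGNMENT, sector_size=SECTOR_SIZE):
    """Return the first LBA >= lba that sits on an alignment boundary."""
    step = alignment // sector_size
    return -(-lba // step) * step  # ceiling division, then scale back up


def layout_problems(partitions, alignment=ALIGNMENT, sector_size=SECTOR_SIZE):
    """partitions: iterable of (name, start_lba, size_in_sectors).
    Returns a list of human-readable complaints, empty if all is well."""
    step = alignment // sector_size
    problems = []
    for name, start, size in partitions:
        if start % step:
            problems.append(f"{name}: start LBA {start} is not a multiple of {step}")
        if size % step:
            problems.append(f"{name}: size {size} sectors is not a multiple of {step}")
    return problems


# Hypothetical layout: the third partition is one sector short.
layout = [
    ("p1", 4096, 4096),
    ("p2", 8192, 41943040),
    ("p3", 41951232, 8388607),
]

for line in layout_problems(layout):
    print(line)
# p3: size 8388607 sectors is not a multiple of 4096

print(round_up_lba(34))  # prints 4096
```

With 2MByte alignment and 512-byte LBAs the step is 4096 sectors, so
rounding a low starting LBA such as 34 up to the next boundary lands on
LBA 4096, matching the starting point mentioned above.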

There is one known caveat to this (as of this writing) where a ZFS root
filesystem used on top of graid(8) results in a problem, but mav@ is
looking into that.  And don't ask me why you'd want to do that anyway --
some people apparently like complicating their lives and shunning the
KISS principle entirely.

P.S. -- Linux md solved their equivalent of the "GEOM vs. GPT" issue
with the introduction of md superblock version 1.2 (superblock=metadata
in this context).  They stuck the superblock 4096 bytes after the start
of the device.  This does limit the number of GPT partitions supported
(from 128 down to 8), but I question the reasoning/sanity of anyone
who's got more than 8 GPT partitions on a single disk anyway (use a
volume manager already).

| Jeremy Chadwick                          |
| UNIX Systems Administrator       |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
