From: Johan Hendriks
Date: Thu, 11 Sep 2014 09:12:06 +0200
To: Stefan Esser, freebsd-stable@freebsd.org
Subject: Re: getting to 4K disk blocks in ZFS
Message-ID: <54114B46.3020407@gmail.com>
In-Reply-To: <54100258.2000505@freebsd.org>

On 10-09-14 09:48, Stefan Esser wrote:
> On 10.09.2014 08:46, Aristedes Maniatis wrote:
>> As we all know, it is important to ensure that modern disks are set
>> up properly with the correct block size. Everything is good if all
>> the disks and the pool are "ashift=9" (512-byte blocks). But as soon
>> as one new drive requires 4K blocks, performance of the entire pool
>> drops through the floor.
>>
>> In order to upgrade, there appear to be two separate things that
>> must be done for a ZFS pool.
>>
>> 1. Create partitions on 4K boundaries. This is simple with the
>> "-a 4k" option in gpart, and it isn't hard to remove disks one at a
>> time from a pool, reformat them on the right boundaries and put them
>> back. Hopefully you've left a few spare bytes on the disk to ensure
>> that your partition doesn't get smaller when you reinsert it into
>> the pool.
>>
>> 2. Create a brand new pool which has ashift=12 and zfs send|receive
>> all the data over.
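For what it's worth, on FreeBSD those two steps look roughly like the
sketch below. Device names, sizes and pool names are only examples;
the gnop step is the usual trick for forcing ashift=12 at pool
creation (newer systems can set the vfs.zfs.min_auto_ashift sysctl to
12 instead, which avoids the gnop dance entirely).

  # Step 1: a 4K-aligned partition, leaving a little headroom at the end
  gpart create -s gpt ada0
  gpart add -t freebsd-zfs -a 4k -s 1862G ada0

  # Step 2: make one member look like a native 4K device so the new
  # vdev is created with ashift=12 (one member is enough for this)
  gnop create -S 4096 /dev/ada0p1
  zpool create newpool raidz1 ada0p1.nop ada1p1 ada2p1
  zpool export newpool
  gnop destroy ada0p1.nop
  zpool import newpool

  # ... then copy the data over
  zfs snapshot -r oldpool@migrate
  zfs send -R oldpool@migrate | zfs receive -Fu newpool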
>> I guess I don't understand enough about zpool to know why the pool
>> itself has a block size, since I understood ZFS to have variable
>> stripe widths.
>
> I'm not a ZFS internals expert, just a long time user, but I'll try
> to answer your questions.
>
> ZFS is based on a copy-on-write paradigm, which ensures that no data
> is ever overwritten in place. All writes go to new blank blocks, and
> only after the last reference to an "old" block is lost (when no TXG
> or snapshot references it) is the old block freed and put back on
> the free block map.
>
> ZFS uses variable block sizes by breaking large blocks down into
> smaller fragments as suitable for the data to be stored. The largest
> block to be used is configurable (128 KByte by default) and the
> smallest fragment is the sector size (i.e. 512 or 4096 bytes), as
> configured by "ashift".
>
> The problem with 4K-sector disks that report 512-byte sectors is
> that ZFS still assumes no data is overwritten in place, while the
> disk drive does exactly that behind the curtains. ZFS thinks it can
> atomically write 512 bytes, but the drive reads 4K, places the 512
> bytes of data within that 4K physical sector in the drive's cache,
> and then writes back the 4K of data in one go.
>
> The cost is not only the latency of this read-modify-write sequence,
> but also that an elementary ZFS assumption is violated: data in the
> other (logical) 512-byte sectors of the physical 4 KByte sector can
> be lost if that write operation fails, resulting in loss of data in
> those files that happen to share the physical sector with the one
> that received the write operation.
>
> This may never hit you, but ZFS is built on the assumption that it
> cannot happen at all, which is no longer true with 4 KB drives that
> are used with ashift=9.
>
>> The problem with step 2 is that you need to have enough hard disks
>> spare to create a whole new pool and throw away the old disks. Plus
>> a disk controller with lots of spare ports. Plus the ability to
>> take the system offline for hours or days while the migration
>> happens.
>>
>> One way to reduce this slightly is to create a new pool with
>> reduced redundancy. For example, create a RAIDZ2 with two fake
>> disks, then off-line those disks.
>
> Both methods are dangerous! Studies have found that the risk of
> another disk failure during resilvering is substantial. That was the
> reason for the higher RAIDZ redundancy levels (raidz2, raidz3).
>
> With 1) you have to copy the data multiple times, and the load could
> lead to the loss of one of the source drives (and since you are in
> the process of overwriting the drive that provided redundancy, you
> lose your pool that way).
>
> The copying to a degraded pool that you describe in 2) is a
> possibility (and I've done it, once). You should make sure that all
> source data is still available until a successful resilvering of the
> "new" pool with the fake disks replaced. You could do this by moving
> the redundant disks from the old pool to the new pool (i.e.
> degrading the old pool, after all data has been copied, to use the
> redundant drives to complete the new pool). But this assumes that
> the drive technologies match - I'll soon go from 4*2TB to 3*4TB
> (raidz1 in both cases), since I had 2 of the 2TB drives fail over
> the course of the last year (replaced under warranty).
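For reference, the "fake disk" variant can be set up with sparse
files, roughly as below. Sizes, file names and device names are only
placeholders; the files merely have to claim at least the size of the
real disks, and they are offlined immediately so nothing is ever
written to them. Combine this with the gnop or min_auto_ashift trick
from the earlier sketch so the new pool actually gets ashift=12.

  # two sparse files standing in for the two missing drives
  truncate -s 2T /tmp/fake0 /tmp/fake1

  zpool create newpool raidz2 ada0p1 ada1p1 ada2p1 ada3p1 \
      /tmp/fake0 /tmp/fake1

  # degrade the new pool right away (raidz2 survives two missing members)
  zpool offline newpool /tmp/fake0
  zpool offline newpool /tmp/fake1

  # copy the data over (zfs send -R | zfs receive, as in the earlier
  # sketch), then swap real disks in for the fake members
  zpool replace newpool /tmp/fake0 ada4p1
  zpool replace newpool /tmp/fake1 ada5p1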
>
>> So, given how much this problem sucks (it is extremely easy to add
>> a 4K disk by mistake as a replacement for a failed disk), and how
>> painful the workaround is... will ZFS ever gain the ability to
>> change the block size for the pool? Or is this so deep in the
>> internals of ZFS that it is as likely as being able to dynamically
>> add disks to an existing vdev - in the "never going to happen"
>> basket?
>
> You can add a 4 KB physical drive that emulates 512-byte sectors
> (nearly all drives do) to an ashift=9 ZFS pool, but performance will
> suffer and you'll be violating a ZFS assumption, as explained above.
>
>> And secondly, is it also bad to have ashift=9 disks inside an
>> ashift=12 pool? That is, do we need to replace all our disks in one
>> go and forever keep big sticky labels on each disk so we never mix
>> them?
>
> The ashift parameter is per pool, not per disk. You can have a drive
> with emulated 512-byte sectors in an ashift=9 pool, but you cannot
> change the ashift value of a pool after creation.

The ashift parameter is per vdev, not per pool (see the zdb sketch at
the end of this mail).

> Regards, STefan
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
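P.S. For anyone following along: zdb shows which ashift each
top-level vdev of a pool actually got, and diskinfo shows whether a
drive is a 4K model emulating 512-byte sectors. The pool and device
names below are just examples.

  # prints one "ashift:" line per top-level vdev from the cached config
  zdb -C tank | grep ashift

  # stripesize 4096 together with sectorsize 512 indicates a 4K drive
  # emulating 512-byte sectors
  diskinfo -v /dev/ada0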