From: Johan Hendriks
Date: Thu, 11 Sep 2014 09:12:06 +0200
To: Stefan Esser, freebsd-stable@freebsd.org
Subject: Re: getting to 4K disk blocks in ZFS
Message-ID: <54114B46.3020407@gmail.com>
In-Reply-To: <54100258.2000505@freebsd.org>

On 10-09-14 09:48, Stefan Esser wrote:
> On 10.09.2014 08:46, Aristedes Maniatis wrote:
>> As we all know, it is important to ensure that modern disks are set
>> up properly with the correct block size. Everything is good if all
>> the disks and the pool are "ashift=9" (512-byte blocks). But as soon
>> as one new drive requires 4K blocks, performance of the entire pool
>> drops through the floor.
>>
>> In order to upgrade, there appear to be two separate things that
>> must be done for a ZFS pool.
>>
>> 1. Create partitions on 4K boundaries. This is simple with the
>> "-a 4k" option in gpart, and it isn't hard to remove disks one at a
>> time from a pool, reformat them on the right boundaries and put them
>> back. Hopefully you've left a few spare bytes on the disk to ensure
>> that your partition doesn't get smaller when you reinsert it into
>> the pool.
>>
>> 2. Create a brand new pool which has ashift=12 and zfs send|receive
>> all the data over.
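For what it's worth, on FreeBSD those two steps look roughly like the
sketch below. Device names, sizes and pool names are only examples;
the gnop step is the usual trick for forcing ashift=12 at pool
creation (newer systems can set the vfs.zfs.min_auto_ashift sysctl to
12 instead, which avoids the gnop dance entirely).

  # Step 1: a 4K-aligned partition, leaving a little headroom at the end
  gpart create -s gpt ada0
  gpart add -t freebsd-zfs -a 4k -s 1862G ada0

  # Step 2: make one member look like a native 4K device so the new
  # vdev is created with ashift=12 (one member is enough for this)
  gnop create -S 4096 /dev/ada0p1
  zpool create newpool raidz1 ada0p1.nop ada1p1 ada2p1
  zpool export newpool
  gnop destroy ada0p1.nop
  zpool import newpool

  # ... then copy the data over
  zfs snapshot -r oldpool@migrate
  zfs send -R oldpool@migrate | zfs receive -Fu newpool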
>> I guess I don't understand enough about zpool to know why the pool
>> itself has a block size, since I understood ZFS to have variable
>> stripe widths.
>
> I'm not a ZFS internals expert, just a long time user, but I'll try
> to answer your questions.
>
> ZFS is based on a copy-on-write paradigm, which ensures that no data
> is ever overwritten in place. All writes go to new blank blocks, and
> only after the last reference to an "old" block is lost (when no TXG
> or snapshot references it) is the old block freed and put back on
> the free block map.
>
> ZFS uses variable block sizes by breaking large blocks down into
> smaller fragments as suitable for the data to be stored. The largest
> block to be used is configurable (128 KByte by default) and the
> smallest fragment is the sector size (i.e. 512 or 4096 bytes), as
> configured by "ashift".
>
> The problem with 4K-sector disks that report 512-byte sectors is
> that ZFS still assumes no data is overwritten in place, while the
> disk drive does exactly that behind the curtains. ZFS thinks it can
> atomically write 512 bytes, but the drive reads 4K, places the 512
> bytes of data within that 4K physical sector in the drive's cache,
> and then writes back the 4K of data in one go.
>
> The cost is not only the latency of this read-modify-write sequence,
> but also that an elementary ZFS assumption is violated: data in the
> other (logical) 512-byte sectors of the physical 4 KByte sector can
> be lost if that write operation fails, resulting in loss of data in
> those files that happen to share the physical sector with the one
> that received the write operation.
>
> This may never hit you, but ZFS is built on the assumption that it
> cannot happen at all, which is no longer true with 4 KB drives that
> are used with ashift=9.
>
>> The problem with step 2 is that you need to have enough hard disks
>> spare to create a whole new pool and throw away the old disks. Plus
>> a disk controller with lots of spare ports. Plus the ability to
>> take the system offline for hours or days while the migration
>> happens.
>>
>> One way to reduce this slightly is to create a new pool with
>> reduced redundancy. For example, create a RAIDZ2 with two fake
>> disks, then off-line those disks.
>
> Both methods are dangerous! Studies have found that the risk of
> another disk failure during resilvering is substantial. That was the
> reason for the higher RAIDZ redundancy levels (raidz2, raidz3).
>
> With 1) you have to copy the data multiple times, and the load could
> lead to the loss of one of the source drives (and since you are in
> the process of overwriting the drive that provided redundancy, you
> lose your pool that way).
>
> The copying to a degraded pool that you describe in 2) is a
> possibility (and I've done it, once). You should make sure that all
> source data is still available until a successful resilvering of the
> "new" pool with the fake disks replaced. You could do this by moving
> the redundant disks from the old pool to the new pool (i.e.
> degrading the old pool, after all data has been copied, to use the
> redundant drives to complete the new pool). But this assumes that
> the drive technologies match - I'll soon go from 4*2TB to 3*4TB
> (raidz1 in both cases), since I had 2 of the 2TB drives fail over
> the course of the last year (replaced under warranty).
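For reference, the "fake disk" variant can be set up with sparse
files, roughly as below. Sizes, file names and device names are only
placeholders; the files merely have to claim at least the size of the
real disks, and they are offlined immediately so nothing is ever
written to them. Combine this with the gnop or min_auto_ashift trick
from the earlier sketch so the new pool actually gets ashift=12.

  # two sparse files standing in for the two missing drives
  truncate -s 2T /tmp/fake0 /tmp/fake1

  zpool create newpool raidz2 ada0p1 ada1p1 ada2p1 ada3p1 \
      /tmp/fake0 /tmp/fake1

  # degrade the new pool right away (raidz2 survives two missing members)
  zpool offline newpool /tmp/fake0
  zpool offline newpool /tmp/fake1

  # copy the data over (zfs send -R | zfs receive, as in the earlier
  # sketch), then swap real disks in for the fake members
  zpool replace newpool /tmp/fake0 ada4p1
  zpool replace newpool /tmp/fake1 ada5p1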
>
>> So, given how much this problem sucks (it is extremely easy to add
>> a 4K disk by mistake as a replacement for a failed disk), and how
>> painful the workaround is... will ZFS ever gain the ability to
>> change the block size for the pool? Or is this so deep in the
>> internals of ZFS that it is as likely as being able to dynamically
>> add disks to an existing vdev - in the "never going to happen"
>> basket?
>
> You can add a 4 KB physical drive that emulates 512-byte sectors
> (nearly all drives do) to an ashift=9 ZFS pool, but performance will
> suffer and you'll be violating a ZFS assumption, as explained above.
>
>> And secondly, is it also bad to have ashift=9 disks inside an
>> ashift=12 pool? That is, do we need to replace all our disks in one
>> go and forever keep big sticky labels on each disk so we never mix
>> them?
>
> The ashift parameter is per pool, not per disk. You can have a drive
> with emulated 512-byte sectors in an ashift=9 pool, but you cannot
> change the ashift value of a pool after creation.

The ashift parameter is per vdev, not per pool (see the zdb sketch at
the end of this mail).

> Regards, STefan
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
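P.S. For anyone following along: zdb shows which ashift each
top-level vdev of a pool actually got, and diskinfo shows whether a
drive is a 4K model emulating 512-byte sectors. The pool and device
names below are just examples.

  # prints one "ashift:" line per top-level vdev from the cached config
  zdb -C tank | grep ashift

  # stripesize 4096 together with sectorsize 512 indicates a 4K drive
  # emulating 512-byte sectors
  diskinfo -v /dev/ada0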