From owner-freebsd-fs@FreeBSD.ORG Wed Nov 14 05:21:13 2012 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9FE02B1D; Wed, 14 Nov 2012 05:21:13 +0000 (UTC) (envelope-from ryao@gentoo.org) Received: from smtp.gentoo.org (smtp.gentoo.org [140.211.166.183]) by mx1.freebsd.org (Postfix) with ESMTP id 0D5838FC08; Wed, 14 Nov 2012 05:21:13 +0000 (UTC) Received: from [192.168.1.2] (pool-173-77-245-118.nycmny.fios.verizon.net [173.77.245.118]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) (Authenticated sender: ryao) by smtp.gentoo.org (Postfix) with ESMTPSA id EC3D033DA81; Wed, 14 Nov 2012 05:21:11 +0000 (UTC) Message-ID: <50A329A4.9090304@gentoo.org> Date: Wed, 14 Nov 2012 00:18:28 -0500 From: Richard Yao User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.10) Gecko/20121107 Thunderbird/10.0.10 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Subject: Port of ZFSOnLinux solution for illumos-gate issue #2663 to FreeBSD X-Enigmail-Version: 1.3.5 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig11BA0E7B797DBD67F4D1FE6F" Cc: Eitan Adler X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 14 Nov 2012 05:21:13 -0000 This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig11BA0E7B797DBD67F4D1FE6F Content-Type: multipart/mixed; boundary="------------000801070506070506070302" This is a multi-part message in MIME format. --------------000801070506070506070302 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Dear Everyone, I am the Gentoo Linux ZFS maintainer as well as part of the Gentoo BSD team. I have attached a patch that ports the ZFSOnLinux solution of illumos-gate issue #2663 to FreeBSD-HEAD. It should also apply against stable, with fuzz. This permits users to avoid fiddling with gnop when making pools on drives that lie about their sector sizes. There are a few things to note about this patch: 1. This does not apply to `zpool add`, `zpool attach` and `zpool replace`. A separate patch for that is being reviewed in ZFSOnLinux at this time. I will port it separately after it is committed. 2. This has not been sent to Illumos upstream. As a Gentoo BSD team developer, I am in a much better position to send code to FreeBSD than to send code to Illumos. I expect that Martin Matuska will port this change to Illumos after it is accepted into FreeBSD, so I assume that this is okay. 3. ZFSOnLinux enforces the CDDL's attribution requirement by relying on commit messages and metadata. FreeBSD and Illumos satisfy it by adding copyright notices to files. I have tried to translate the ZFS attribution policy by adding appropriate copyright notices for non-trivial changes. I would expect this to pass review by the Gentoo Foundation members that review licensing for Gentoo, so I assume that this is okay. I have discussed committing this patch to FreeBSD with Eitan Adler. He requires one of the FreeBSD Filesystem developers to acknowledge it as being appropriate for the tree before he will commit it. Yours truly, Richard Yao --------------000801070506070506070302 Content-Type: text/x-patch; name="0001-Add-ashift-property-to-zpool-create.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0001-Add-ashift-property-to-zpool-create.patch" =46rom 6a07263af0d51eb5031ba43e299e5d6cdc7ea1f0 Mon Sep 17 00:00:00 2001 From: =3D?UTF-8?q?Christian=3D20Kohlsch=3DC3=3DBCtter?=3D Date: Sat, 10 Nov 2012 12:09:35 -0500 Subject: [PATCH] Add "ashift" property to zpool create Some disks with internal sectors larger than 512 bytes (e.g., 4k) can suffer from bad write performance when ashift is not configured correctly. This is caused by the disk not reporting its actual sector size, but a sector size of 512 bytes. The drive may behave this way for compatibility reasons. For example, the WDC WD20EARS disks are known to exhibit this behavior. When creating a zpool, ZFS takes that wrong sector size and sets the "ashift" property accordingly (to 9: 1<<9=3D512), whereas it should be set to 12 for 4k sectors (1<<12=3D4096). This patch allows an adminstrator to manual specify the known correct ashift size at 'zpool create' time. This can significantly improve performance in certain cases. However, it will have an impact on your total pool capacity. See the updated ashift property description in the zpool.8 man page for additional details. Valid values for the ashift property range from 9 to 13 (512B-8KB). Additionally, you may set the ashift to 0 if you wish to auto-detect the sector size based on what the disk reports, this is the default behavior. The most common ashift values are 9 and 12. Example: zpool create -o ashift=3D12 tank raidz2 sda sdb sdc sdd Closed zfsonlinux/zfs#280 This patch was modified during its port by by Richard Yao to include a la= ter change he wrote to reduce the effective maximum ashift value to 13 due to= pool corruption issues discovered with higher values. Original-patch-by: Richard Laager Signed-off-by: Brian Behlendorf Ported-by: Richard Yao Man-page-ported-by: Eitan Adler --- cddl/contrib/opensolaris/cmd/zpool/zpool.8 | 20 ++++++++++++++- cddl/contrib/opensolaris/cmd/zpool/zpool_main.c | 6 ++--- cddl/contrib/opensolaris/cmd/zpool/zpool_util.h | 4 +-- cddl/contrib/opensolaris/cmd/zpool/zpool_vdev.c | 29 ++++++++++++++++= ------ .../opensolaris/lib/libzfs/common/libzfs_pool.c | 21 ++++++++++++++++= .../contrib/opensolaris/common/zfs/zpool_prop.c | 4 +++ .../contrib/opensolaris/uts/common/sys/fs/zfs.h | 1 + 7 files changed, 72 insertions(+), 13 deletions(-) diff --git a/cddl/contrib/opensolaris/cmd/zpool/zpool.8 b/cddl/contrib/op= ensolaris/cmd/zpool/zpool.8 index 88fc79b..8ed120a 100644 --- a/cddl/contrib/opensolaris/cmd/zpool/zpool.8 +++ b/cddl/contrib/opensolaris/cmd/zpool/zpool.8 @@ -22,10 +22,11 @@ .\" Copyright (c) 2011, Justin T. Gibbs .\" Copyright (c) 2012 by Delphix. All Rights Reserved. .\" Copyright (c) 2012, Glen Barber +.\" Copyright (c) 2012, Richard Laager .\" .\" $FreeBSD$ .\" -.Dd November 28, 2011 +.Dd November 11, 2012 .Dt ZPOOL 8 .Os .Sh NAME @@ -589,6 +590,23 @@ command does not. For non-full pools of a reasonable= size, these effects should be invisible. For small pools, or pools that are close to being complete= ly full, these discrepancies may become more noticeable. .Pp +The following property can be set at creation time: +.Bl -tag -width 2n +.It Sy afigt Ns =3D Ns Ar number +Pool sector size exponent, to the power of 2 (internally referred +to as "ashift"). +I/O operations will be aligned to the specified size boundaries. +Additionally, the minimum (disk) write size will be set to the +specified size, +so this represents a space vs. performance trade-off. +The typical case for setting this property is when performance is +important and the underlying disks use 4KiB sectors but report 512B +sectors to the OS for compatibility reasons; +in that case, set +.Cm ashift=3D12 +(which is 1<<12 =3D 4096). +.El +.Pp The following property can be set at creation time and import time: .Bl -tag -width 2n .It Sy altroot diff --git a/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c b/cddl/contr= ib/opensolaris/cmd/zpool/zpool_main.c index b57c816..9840dc1 100644 --- a/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c +++ b/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c @@ -544,7 +544,7 @@ zpool_do_add(int argc, char **argv) } =20 /* pass off to get_vdev_spec for processing */ - nvroot =3D make_root_vdev(zhp, force, !force, B_FALSE, dryrun, + nvroot =3D make_root_vdev(zhp, NULL, force, !force, B_FALSE, dryrun, argc, argv); if (nvroot =3D=3D NULL) { zpool_close(zhp); @@ -884,7 +884,7 @@ zpool_do_create(int argc, char **argv) } =20 /* pass off to get_vdev_spec for bulk processing */ - nvroot =3D make_root_vdev(NULL, force, !force, B_FALSE, dryrun, + nvroot =3D make_root_vdev(NULL, props, force, !force, B_FALSE, dryrun, argc - 1, argv + 1); if (nvroot =3D=3D NULL) goto errout; @@ -3162,7 +3162,7 @@ zpool_do_attach_or_replace(int argc, char **argv, i= nt replacing) return (1); } =20 - nvroot =3D make_root_vdev(zhp, force, B_FALSE, replacing, B_FALSE, + nvroot =3D make_root_vdev(zhp, NULL, force, B_FALSE, replacing, B_FALSE= , argc, argv); if (nvroot =3D=3D NULL) { zpool_close(zhp); diff --git a/cddl/contrib/opensolaris/cmd/zpool/zpool_util.h b/cddl/contr= ib/opensolaris/cmd/zpool/zpool_util.h index 134c730..b67ff8b 100644 --- a/cddl/contrib/opensolaris/cmd/zpool/zpool_util.h +++ b/cddl/contrib/opensolaris/cmd/zpool/zpool_util.h @@ -43,8 +43,8 @@ uint_t num_logs(nvlist_t *nv); * Virtual device functions */ =20 -nvlist_t *make_root_vdev(zpool_handle_t *zhp, int force, int check_rep, - boolean_t replacing, boolean_t dryrun, int argc, char **argv); +nvlist_t *make_root_vdev(zpool_handle_t *zhp, nvlist_t *props, int force= , + int check_rep, boolean_t replacing, boolean_t dryrun, int argc, char= **argv); nvlist_t *split_mirror_vdev(zpool_handle_t *zhp, char *newname, nvlist_t *props, splitflags_t flags, int argc, char **argv); =20 diff --git a/cddl/contrib/opensolaris/cmd/zpool/zpool_vdev.c b/cddl/contr= ib/opensolaris/cmd/zpool/zpool_vdev.c index 5ffd39a..75dbb94 100644 --- a/cddl/contrib/opensolaris/cmd/zpool/zpool_vdev.c +++ b/cddl/contrib/opensolaris/cmd/zpool/zpool_vdev.c @@ -21,6 +21,8 @@ =20 /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights re= served. + * Copyright (c) 2011 by Christian Kohlsch=C3=BCtter . + * All rights reserved. */ =20 /* @@ -414,7 +416,7 @@ is_whole_disk(const char *arg) * xxx Shorthand for /dev/dsk/xxx */ static nvlist_t * -make_leaf_vdev(const char *arg, uint64_t is_log) +make_leaf_vdev(nvlist_t *props, const char *arg, uint64_t is_log) { char path[MAXPATHLEN]; struct stat64 statbuf; @@ -512,6 +514,19 @@ make_leaf_vdev(const char *arg, uint64_t is_log) verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_WHOLE_DISK, (uint64_t)wholedisk) =3D=3D 0); =20 + if (props !=3D NULL) { + uint64_t ashift =3D 0; + char *value =3D NULL; + + if (nvlist_lookup_string(props, + zpool_prop_to_name(ZPOOL_PROP_ASHIFT), &value) =3D=3D 0) + zfs_nicestrtonum(NULL, value, &ashift); + + if (ashift > 0) + verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_ASHIFT, + ashift) =3D=3D 0); + } + /* * For a whole disk, defer getting its devid until after labeling it. */ @@ -1198,7 +1213,7 @@ is_grouping(const char *type, int *mindev, int *max= dev) * because the program is just going to exit anyway. */ nvlist_t * -construct_spec(int argc, char **argv) +construct_spec(nvlist_t *props, int argc, char **argv) { nvlist_t *nvroot, *nv, **top, **spares, **l2cache; int t, toplevels, mindev, maxdev, nspares, nlogs, nl2cache; @@ -1287,7 +1302,7 @@ construct_spec(int argc, char **argv) children * sizeof (nvlist_t *)); if (child =3D=3D NULL) zpool_no_memory(); - if ((nv =3D make_leaf_vdev(argv[c], B_FALSE)) + if ((nv =3D make_leaf_vdev(props, argv[c], B_FALSE)) =3D=3D NULL) return (NULL); child[children - 1] =3D nv; @@ -1343,7 +1358,7 @@ construct_spec(int argc, char **argv) * We have a device. Pass off to make_leaf_vdev() to * construct the appropriate nvlist describing the vdev. */ - if ((nv =3D make_leaf_vdev(argv[0], is_log)) =3D=3D NULL) + if ((nv =3D make_leaf_vdev(props, argv[0], is_log)) =3D=3D NULL) return (NULL); if (is_log) nlogs++; @@ -1409,7 +1424,7 @@ split_mirror_vdev(zpool_handle_t *zhp, char *newnam= e, nvlist_t *props, uint_t c, children; =20 if (argc > 0) { - if ((newroot =3D construct_spec(argc, argv)) =3D=3D NULL) { + if ((newroot =3D construct_spec(props, argc, argv)) =3D=3D NULL) { (void) fprintf(stderr, gettext("Unable to build a " "pool from the specified devices\n")); return (NULL); @@ -1461,7 +1476,7 @@ split_mirror_vdev(zpool_handle_t *zhp, char *newnam= e, nvlist_t *props, * added, even if they appear in use. */ nvlist_t * -make_root_vdev(zpool_handle_t *zhp, int force, int check_rep, +make_root_vdev(zpool_handle_t *zhp, nvlist_t *props, int force, int chec= k_rep, boolean_t replacing, boolean_t dryrun, int argc, char **argv) { nvlist_t *newroot; @@ -1473,7 +1488,7 @@ make_root_vdev(zpool_handle_t *zhp, int force, int = check_rep, * that we have a valid specification, and that all devices can be * opened. */ - if ((newroot =3D construct_spec(argc, argv)) =3D=3D NULL) + if ((newroot =3D construct_spec(props, argc, argv)) =3D=3D NULL) return (NULL); =20 if (zhp && ((poolconfig =3D zpool_get_config(zhp, NULL)) =3D=3D NULL)) diff --git a/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c b/c= ddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c index 03bc3e6..a904f4f 100644 --- a/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c +++ b/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c @@ -22,6 +22,8 @@ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights re= served. * Copyright 2011 Nexenta Systems, Inc. All rights reserved. + * Copyright (c) 2011 by Christian Kohlsch=C3=BCtter . + * All rights reserved. * Copyright (c) 2012 by Delphix. All rights reserved. */ =20 @@ -304,6 +306,7 @@ zpool_get_prop(zpool_handle_t *zhp, zpool_prop_t prop= , char *buf, size_t len, case ZPOOL_PROP_FREE: case ZPOOL_PROP_FREEING: case ZPOOL_PROP_EXPANDSZ: + case ZPOOL_PROP_ASHIFT: (void) zfs_nicenum(intval, buf, len); break; =20 @@ -512,6 +515,24 @@ zpool_valid_proplist(libzfs_handle_t *hdl, const cha= r *poolname, } break; =20 + case ZPOOL_PROP_ASHIFT: + if (!flags.create) { + zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, + "property '%s' can only be set at " + "creation time"), propname); + (void) zfs_error(hdl, EZFS_BADPROP, errbuf); + goto error; + } + + if (intval !=3D 0 && (intval < 9 || intval > 13)) { + zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, + "property '%s' number %d is invalid."), + propname, intval); + (void) zfs_error(hdl, EZFS_BADPROP, errbuf); + goto error; + } + break; + case ZPOOL_PROP_BOOTFS: if (flags.create || flags.import) { zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, diff --git a/sys/cddl/contrib/opensolaris/common/zfs/zpool_prop.c b/sys/c= ddl/contrib/opensolaris/common/zfs/zpool_prop.c index 72db879..06e1cff 100644 --- a/sys/cddl/contrib/opensolaris/common/zfs/zpool_prop.c +++ b/sys/cddl/contrib/opensolaris/common/zfs/zpool_prop.c @@ -95,6 +95,10 @@ zpool_prop_init(void) PROP_READONLY, ZFS_TYPE_POOL, "<1.00x or higher if deduped>", "DEDUP"); =20 + /* readonly onetime number properties */ + zprop_register_number(ZPOOL_PROP_ASHIFT, "ashift", 0, PROP_ONETIME, + ZFS_TYPE_POOL, "", "ASHIFT"); + /* default number properties */ zprop_register_number(ZPOOL_PROP_VERSION, "version", SPA_VERSION, PROP_DEFAULT, ZFS_TYPE_POOL, "", "VERSION"); diff --git a/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h b/sys/c= ddl/contrib/opensolaris/uts/common/sys/fs/zfs.h index 64fd2e6..e113064 100644 --- a/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h +++ b/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h @@ -170,6 +170,7 @@ typedef enum { ZPOOL_PROP_FREE, ZPOOL_PROP_ALLOCATED, ZPOOL_PROP_READONLY, + ZPOOL_PROP_ASHIFT, ZPOOL_PROP_COMMENT, ZPOOL_PROP_EXPANDSZ, ZPOOL_PROP_FREEING, --=20 1.8.0 --------------000801070506070506070302-- --------------enig11BA0E7B797DBD67F4D1FE6F Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBAgAGBQJQoymrAAoJECDuEZm+6ExkULAQAIY8/oC81TxnbtL+AOHBr2hL A8Z0goK6HB/eI/4syOquP42CUcBBWSMNw9/trkqEMqI+VUuRVYk+G4A17+iZO6wo L6EnGU6EvZmh5Uj44yGSez33ffe4nrtiLzIOmrKm2SLHkyXTZ+C360ACLkAOIZr0 tT0n/cr9p0wR3EUv809Hi/ofzzXS+Pb8MmT1rmQgTYQNAuNYddPxB3vOdDpz4khA v0PhFKKB6wsje2L4KRCbxK8zdvWaNOE0ITF9gY+n52+eHfh9ZSlINhzGcfFV8UT/ YAbvhvejDqo4KsDYovp/hkDZCk7amRUn2mPBP2IDGZTyCaXsnxlgesjeJG6Ptd/O DtKaN8afsiipAtzWCynv5YJzwwDkLt1uCaFOvxwO9/rNgyBdxfWA0FtDezKnKOxn rjStdCujtWkmDF9ensAH/Vk7LElL6fSIq4PX9IK/Y+43OpSd9wOhX/iUPgOPE4Ar cV0okw68aLrJcuhIjMXMsaj6z50TOurHUmcqKnzZpFEKPgH77As5iQMovF8aZFOG pRzcSszxuSUNZo9M5j7kcn/8cB7v1jul7tJ7PjF5Qw5rUdqNx197VPDY5ZOUfIu4 HEovTxLk5kJaYrgqi8miY+3R4TKbRIDZ1wEhEL5yqUp5qz2DNnSFTZkIC+PbKjtJ 3hkGG0jC1hudRk/gjbxW =kWL+ -----END PGP SIGNATURE----- --------------enig11BA0E7B797DBD67F4D1FE6F--