Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 14 Nov 2012 00:18:28 -0500
From:      Richard Yao <ryao@gentoo.org>
To:        freebsd-fs@freebsd.org
Cc:        Eitan Adler <eadler@freebsd.org>
Subject:   Port of ZFSOnLinux solution for illumos-gate issue #2663 to FreeBSD
Message-ID:  <50A329A4.9090304@gentoo.org>

next in thread | raw e-mail | index | archive | help
This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enig11BA0E7B797DBD67F4D1FE6F
Content-Type: multipart/mixed;
 boundary="------------000801070506070506070302"

This is a multi-part message in MIME format.
--------------000801070506070506070302
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Dear Everyone,

I am the Gentoo Linux ZFS maintainer as well as part of the Gentoo BSD
team. I have attached a patch that ports the ZFSOnLinux solution of
illumos-gate issue #2663 to FreeBSD-HEAD. It should also apply against
stable, with fuzz. This permits users to avoid fiddling with gnop when
making pools on drives that lie about their sector sizes.

There are a few things to note about this patch:

1. This does not apply to `zpool add`, `zpool attach` and `zpool
replace`. A separate patch for that is being reviewed in ZFSOnLinux at
this time. I will port it separately after it is committed.

2. This has not been sent to Illumos upstream. As a Gentoo BSD team
developer, I am in a much better position to send code to FreeBSD than
to send code to Illumos. I expect that Martin Matuska will port this
change to Illumos after it is accepted into FreeBSD, so I assume that
this is okay.

3. ZFSOnLinux enforces the CDDL's attribution requirement by relying on
commit messages and metadata. FreeBSD and Illumos satisfy it by adding
copyright notices to files. I have tried to translate the ZFS
attribution policy by adding appropriate copyright notices for
non-trivial changes. I would expect this to pass review by the Gentoo
Foundation members that review licensing for Gentoo, so I assume that
this is okay.

I have discussed committing this patch to FreeBSD with Eitan Adler. He
requires one of the FreeBSD Filesystem developers to acknowledge it as
being appropriate for the tree before he will commit it.

Yours truly,
Richard Yao

--------------000801070506070506070302
Content-Type: text/x-patch;
 name="0001-Add-ashift-property-to-zpool-create.patch"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
 filename="0001-Add-ashift-property-to-zpool-create.patch"

=46rom 6a07263af0d51eb5031ba43e299e5d6cdc7ea1f0 Mon Sep 17 00:00:00 2001
From: =3D?UTF-8?q?Christian=3D20Kohlsch=3DC3=3DBCtter?=3D <christian@kohl=
schutter.com>
Date: Sat, 10 Nov 2012 12:09:35 -0500
Subject: [PATCH] Add "ashift" property to zpool create

Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly.  This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes.  The drive may behave this way
for compatibility reasons.  For example, the WDC WD20EARS disks are
known to exhibit this behavior.

When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=3D512), whereas it should be
set to 12 for 4k sectors (1<<12=3D4096).

This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time.  This can significantly improve
performance in certain cases.  However, it will have an impact on your
total pool capacity.  See the updated ashift property description
in the zpool.8 man page for additional details.

Valid values for the ashift property range from 9 to 13 (512B-8KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior.  The most common ashift values are 9 and 12.

Example:
zpool create -o ashift=3D12 tank raidz2 sda sdb sdc sdd

Closed zfsonlinux/zfs#280

This patch was modified during its port by by Richard Yao to include a la=
ter
change he wrote to reduce the effective maximum ashift value to 13 due to=
 pool
corruption issues discovered with higher values.

Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Richard Yao <ryao@gentoo.org>
Man-page-ported-by: Eitan Adler <eadler@FreeBSD.org>
---
 cddl/contrib/opensolaris/cmd/zpool/zpool.8         | 20 ++++++++++++++-
 cddl/contrib/opensolaris/cmd/zpool/zpool_main.c    |  6 ++---
 cddl/contrib/opensolaris/cmd/zpool/zpool_util.h    |  4 +--
 cddl/contrib/opensolaris/cmd/zpool/zpool_vdev.c    | 29 ++++++++++++++++=
------
 .../opensolaris/lib/libzfs/common/libzfs_pool.c    | 21 ++++++++++++++++=

 .../contrib/opensolaris/common/zfs/zpool_prop.c    |  4 +++
 .../contrib/opensolaris/uts/common/sys/fs/zfs.h    |  1 +
 7 files changed, 72 insertions(+), 13 deletions(-)

diff --git a/cddl/contrib/opensolaris/cmd/zpool/zpool.8 b/cddl/contrib/op=
ensolaris/cmd/zpool/zpool.8
index 88fc79b..8ed120a 100644
--- a/cddl/contrib/opensolaris/cmd/zpool/zpool.8
+++ b/cddl/contrib/opensolaris/cmd/zpool/zpool.8
@@ -22,10 +22,11 @@
 .\" Copyright (c) 2011, Justin T. Gibbs <gibbs@FreeBSD.org>
 .\" Copyright (c) 2012 by Delphix. All Rights Reserved.
 .\" Copyright (c) 2012, Glen Barber <gjb@FreeBSD.org>
+.\" Copyright (c) 2012, Richard Laager <rlaager@wiktel.com>
 .\"
 .\" $FreeBSD$
 .\"
-.Dd November 28, 2011
+.Dd November 11, 2012
 .Dt ZPOOL 8
 .Os
 .Sh NAME
@@ -589,6 +590,23 @@ command does not. For non-full pools of a reasonable=
 size, these effects should
 be invisible. For small pools, or pools that are close to being complete=
ly
 full, these discrepancies may become more noticeable.
 .Pp
+The following property can be set at creation time:
+.Bl -tag -width 2n
+.It Sy afigt Ns =3D Ns Ar number
+Pool sector size exponent, to the power of 2 (internally referred
+to as "ashift").
+I/O operations will be aligned to the specified size boundaries.
+Additionally, the minimum (disk) write size will be set to the
+specified size,
+so this represents a space vs. performance trade-off.
+The typical case for setting this property is when performance is
+important and the underlying disks use 4KiB sectors but report 512B
+sectors to the OS for compatibility reasons;
+in that case, set
+.Cm ashift=3D12
+(which is 1<<12 =3D 4096).
+.El
+.Pp
 The following property can be set at creation time and import time:
 .Bl -tag -width 2n
 .It Sy altroot
diff --git a/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c b/cddl/contr=
ib/opensolaris/cmd/zpool/zpool_main.c
index b57c816..9840dc1 100644
--- a/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
+++ b/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
@@ -544,7 +544,7 @@ zpool_do_add(int argc, char **argv)
 	}
=20
 	/* pass off to get_vdev_spec for processing */
-	nvroot =3D make_root_vdev(zhp, force, !force, B_FALSE, dryrun,
+	nvroot =3D make_root_vdev(zhp, NULL, force, !force, B_FALSE, dryrun,
 	    argc, argv);
 	if (nvroot =3D=3D NULL) {
 		zpool_close(zhp);
@@ -884,7 +884,7 @@ zpool_do_create(int argc, char **argv)
 	}
=20
 	/* pass off to get_vdev_spec for bulk processing */
-	nvroot =3D make_root_vdev(NULL, force, !force, B_FALSE, dryrun,
+	nvroot =3D make_root_vdev(NULL, props, force, !force, B_FALSE, dryrun,
 	    argc - 1, argv + 1);
 	if (nvroot =3D=3D NULL)
 		goto errout;
@@ -3162,7 +3162,7 @@ zpool_do_attach_or_replace(int argc, char **argv, i=
nt replacing)
 		return (1);
 	}
=20
-	nvroot =3D make_root_vdev(zhp, force, B_FALSE, replacing, B_FALSE,
+	nvroot =3D make_root_vdev(zhp, NULL, force, B_FALSE, replacing, B_FALSE=
,
 	    argc, argv);
 	if (nvroot =3D=3D NULL) {
 		zpool_close(zhp);
diff --git a/cddl/contrib/opensolaris/cmd/zpool/zpool_util.h b/cddl/contr=
ib/opensolaris/cmd/zpool/zpool_util.h
index 134c730..b67ff8b 100644
--- a/cddl/contrib/opensolaris/cmd/zpool/zpool_util.h
+++ b/cddl/contrib/opensolaris/cmd/zpool/zpool_util.h
@@ -43,8 +43,8 @@ uint_t num_logs(nvlist_t *nv);
  * Virtual device functions
  */
=20
-nvlist_t *make_root_vdev(zpool_handle_t *zhp, int force, int check_rep,
-    boolean_t replacing, boolean_t dryrun, int argc, char **argv);
+nvlist_t *make_root_vdev(zpool_handle_t *zhp, nvlist_t *props, int force=
,
+    int check_rep, boolean_t replacing, boolean_t dryrun, int argc, char=
 **argv);
 nvlist_t *split_mirror_vdev(zpool_handle_t *zhp, char *newname,
     nvlist_t *props, splitflags_t flags, int argc, char **argv);
=20
diff --git a/cddl/contrib/opensolaris/cmd/zpool/zpool_vdev.c b/cddl/contr=
ib/opensolaris/cmd/zpool/zpool_vdev.c
index 5ffd39a..75dbb94 100644
--- a/cddl/contrib/opensolaris/cmd/zpool/zpool_vdev.c
+++ b/cddl/contrib/opensolaris/cmd/zpool/zpool_vdev.c
@@ -21,6 +21,8 @@
=20
 /*
  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights re=
served.
+ * Copyright (c) 2011 by Christian Kohlsch=C3=BCtter <christian@kohlschu=
tter.com>.
+ * 	All rights reserved.
  */
=20
 /*
@@ -414,7 +416,7 @@ is_whole_disk(const char *arg)
  * 	xxx		Shorthand for /dev/dsk/xxx
  */
 static nvlist_t *
-make_leaf_vdev(const char *arg, uint64_t is_log)
+make_leaf_vdev(nvlist_t *props, const char *arg, uint64_t is_log)
 {
 	char path[MAXPATHLEN];
 	struct stat64 statbuf;
@@ -512,6 +514,19 @@ make_leaf_vdev(const char *arg, uint64_t is_log)
 		verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_WHOLE_DISK,
 		    (uint64_t)wholedisk) =3D=3D 0);
=20
+	if (props !=3D NULL) {
+		uint64_t ashift =3D 0;
+		char *value =3D NULL;
+
+		if (nvlist_lookup_string(props,
+		    zpool_prop_to_name(ZPOOL_PROP_ASHIFT), &value) =3D=3D 0)
+			zfs_nicestrtonum(NULL, value, &ashift);
+
+		if (ashift > 0)
+			verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_ASHIFT,
+			    ashift) =3D=3D 0);
+	}
+
 	/*
 	 * For a whole disk, defer getting its devid until after labeling it.
 	 */
@@ -1198,7 +1213,7 @@ is_grouping(const char *type, int *mindev, int *max=
dev)
  * because the program is just going to exit anyway.
  */
 nvlist_t *
-construct_spec(int argc, char **argv)
+construct_spec(nvlist_t *props, int argc, char **argv)
 {
 	nvlist_t *nvroot, *nv, **top, **spares, **l2cache;
 	int t, toplevels, mindev, maxdev, nspares, nlogs, nl2cache;
@@ -1287,7 +1302,7 @@ construct_spec(int argc, char **argv)
 				    children * sizeof (nvlist_t *));
 				if (child =3D=3D NULL)
 					zpool_no_memory();
-				if ((nv =3D make_leaf_vdev(argv[c], B_FALSE))
+				if ((nv =3D make_leaf_vdev(props, argv[c], B_FALSE))
 				    =3D=3D NULL)
 					return (NULL);
 				child[children - 1] =3D nv;
@@ -1343,7 +1358,7 @@ construct_spec(int argc, char **argv)
 			 * We have a device.  Pass off to make_leaf_vdev() to
 			 * construct the appropriate nvlist describing the vdev.
 			 */
-			if ((nv =3D make_leaf_vdev(argv[0], is_log)) =3D=3D NULL)
+			if ((nv =3D make_leaf_vdev(props, argv[0], is_log)) =3D=3D NULL)
 				return (NULL);
 			if (is_log)
 				nlogs++;
@@ -1409,7 +1424,7 @@ split_mirror_vdev(zpool_handle_t *zhp, char *newnam=
e, nvlist_t *props,
 	uint_t c, children;
=20
 	if (argc > 0) {
-		if ((newroot =3D construct_spec(argc, argv)) =3D=3D NULL) {
+		if ((newroot =3D construct_spec(props, argc, argv)) =3D=3D NULL) {
 			(void) fprintf(stderr, gettext("Unable to build a "
 			    "pool from the specified devices\n"));
 			return (NULL);
@@ -1461,7 +1476,7 @@ split_mirror_vdev(zpool_handle_t *zhp, char *newnam=
e, nvlist_t *props,
  * added, even if they appear in use.
  */
 nvlist_t *
-make_root_vdev(zpool_handle_t *zhp, int force, int check_rep,
+make_root_vdev(zpool_handle_t *zhp, nvlist_t *props, int force, int chec=
k_rep,
     boolean_t replacing, boolean_t dryrun, int argc, char **argv)
 {
 	nvlist_t *newroot;
@@ -1473,7 +1488,7 @@ make_root_vdev(zpool_handle_t *zhp, int force, int =
check_rep,
 	 * that we have a valid specification, and that all devices can be
 	 * opened.
 	 */
-	if ((newroot =3D construct_spec(argc, argv)) =3D=3D NULL)
+	if ((newroot =3D construct_spec(props, argc, argv)) =3D=3D NULL)
 		return (NULL);
=20
 	if (zhp && ((poolconfig =3D zpool_get_config(zhp, NULL)) =3D=3D NULL))
diff --git a/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c b/c=
ddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c
index 03bc3e6..a904f4f 100644
--- a/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c
+++ b/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c
@@ -22,6 +22,8 @@
 /*
  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights re=
served.
  * Copyright 2011 Nexenta Systems, Inc. All rights reserved.
+ * Copyright (c) 2011 by Christian Kohlsch=C3=BCtter <christian@kohlschu=
tter.com>.
+ * 	All rights reserved.
  * Copyright (c) 2012 by Delphix. All rights reserved.
  */
=20
@@ -304,6 +306,7 @@ zpool_get_prop(zpool_handle_t *zhp, zpool_prop_t prop=
, char *buf, size_t len,
 		case ZPOOL_PROP_FREE:
 		case ZPOOL_PROP_FREEING:
 		case ZPOOL_PROP_EXPANDSZ:
+		case ZPOOL_PROP_ASHIFT:
 			(void) zfs_nicenum(intval, buf, len);
 			break;
=20
@@ -512,6 +515,24 @@ zpool_valid_proplist(libzfs_handle_t *hdl, const cha=
r *poolname,
 			}
 			break;
=20
+		case ZPOOL_PROP_ASHIFT:
+			if (!flags.create) {
+				zfs_error_aux(hdl, dgettext(TEXT_DOMAIN,
+				    "property '%s' can only be set at "
+				    "creation time"), propname);
+				(void) zfs_error(hdl, EZFS_BADPROP, errbuf);
+				goto error;
+			}
+
+			if (intval !=3D 0 && (intval < 9 || intval > 13)) {
+				zfs_error_aux(hdl, dgettext(TEXT_DOMAIN,
+				    "property '%s' number %d is invalid."),
+				    propname, intval);
+				(void) zfs_error(hdl, EZFS_BADPROP, errbuf);
+				goto error;
+			}
+			break;
+
 		case ZPOOL_PROP_BOOTFS:
 			if (flags.create || flags.import) {
 				zfs_error_aux(hdl, dgettext(TEXT_DOMAIN,
diff --git a/sys/cddl/contrib/opensolaris/common/zfs/zpool_prop.c b/sys/c=
ddl/contrib/opensolaris/common/zfs/zpool_prop.c
index 72db879..06e1cff 100644
--- a/sys/cddl/contrib/opensolaris/common/zfs/zpool_prop.c
+++ b/sys/cddl/contrib/opensolaris/common/zfs/zpool_prop.c
@@ -95,6 +95,10 @@ zpool_prop_init(void)
 	    PROP_READONLY, ZFS_TYPE_POOL, "<1.00x or higher if deduped>",
 	    "DEDUP");
=20
+	/* readonly onetime number properties */
+	zprop_register_number(ZPOOL_PROP_ASHIFT, "ashift", 0, PROP_ONETIME,
+	    ZFS_TYPE_POOL, "<ashift, 9-13, or 0=3Ddefault>", "ASHIFT");
+
 	/* default number properties */
 	zprop_register_number(ZPOOL_PROP_VERSION, "version", SPA_VERSION,
 	    PROP_DEFAULT, ZFS_TYPE_POOL, "<version>", "VERSION");
diff --git a/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h b/sys/c=
ddl/contrib/opensolaris/uts/common/sys/fs/zfs.h
index 64fd2e6..e113064 100644
--- a/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h
+++ b/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h
@@ -170,6 +170,7 @@ typedef enum {
 	ZPOOL_PROP_FREE,
 	ZPOOL_PROP_ALLOCATED,
 	ZPOOL_PROP_READONLY,
+	ZPOOL_PROP_ASHIFT,
 	ZPOOL_PROP_COMMENT,
 	ZPOOL_PROP_EXPANDSZ,
 	ZPOOL_PROP_FREEING,
--=20
1.8.0


--------------000801070506070506070302--

--------------enig11BA0E7B797DBD67F4D1FE6F
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJQoymrAAoJECDuEZm+6ExkULAQAIY8/oC81TxnbtL+AOHBr2hL
A8Z0goK6HB/eI/4syOquP42CUcBBWSMNw9/trkqEMqI+VUuRVYk+G4A17+iZO6wo
L6EnGU6EvZmh5Uj44yGSez33ffe4nrtiLzIOmrKm2SLHkyXTZ+C360ACLkAOIZr0
tT0n/cr9p0wR3EUv809Hi/ofzzXS+Pb8MmT1rmQgTYQNAuNYddPxB3vOdDpz4khA
v0PhFKKB6wsje2L4KRCbxK8zdvWaNOE0ITF9gY+n52+eHfh9ZSlINhzGcfFV8UT/
YAbvhvejDqo4KsDYovp/hkDZCk7amRUn2mPBP2IDGZTyCaXsnxlgesjeJG6Ptd/O
DtKaN8afsiipAtzWCynv5YJzwwDkLt1uCaFOvxwO9/rNgyBdxfWA0FtDezKnKOxn
rjStdCujtWkmDF9ensAH/Vk7LElL6fSIq4PX9IK/Y+43OpSd9wOhX/iUPgOPE4Ar
cV0okw68aLrJcuhIjMXMsaj6z50TOurHUmcqKnzZpFEKPgH77As5iQMovF8aZFOG
pRzcSszxuSUNZo9M5j7kcn/8cB7v1jul7tJ7PjF5Qw5rUdqNx197VPDY5ZOUfIu4
HEovTxLk5kJaYrgqi8miY+3R4TKbRIDZ1wEhEL5yqUp5qz2DNnSFTZkIC+PbKjtJ
3hkGG0jC1hudRk/gjbxW
=kWL+
-----END PGP SIGNATURE-----

--------------enig11BA0E7B797DBD67F4D1FE6F--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?50A329A4.9090304>