Date:      Sat, 26 Jul 2014 10:20:48 +0000 (UTC)
From:      Xin LI <delphij@FreeBSD.org>
To:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   svn commit: r269118 - in head: cddl/contrib/opensolaris/cmd/zdb cddl/contrib/opensolaris/cmd/zpool cddl/contrib/opensolaris/lib/libzfs/common sys/cddl/contrib/opensolaris/common/zfs sys/cddl/contri...
Message-ID:  <201407261020.s6QAKmuX034649@svn.freebsd.org>

Author: delphij
Date: Sat Jul 26 10:20:48 2014
New Revision: 269118
URL: http://svnweb.freebsd.org/changeset/base/269118

Log:
  MFV r269010:
  
  Import Illumos changes to address the following Illumos issues:
    4976 zfs should only avoid writing to a failing non-redundant
         top-level vdev
    4978 ztest fails in get_metaslab_refcount()
    4979 extend free space histogram to device and pool
    4980 metaslabs should have a fragmentation metric
    4981 remove fragmented ops vector from block allocator
    4982 space_map object should proactively upgrade when feature
         is enabled
    4984 device selection should use fragmentation metric
  
  MFC after:	2 weeks
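
  Editorial sketch (not part of the commit): the fragmentation metric from
  issues 4980 and 4984 is aggregated at two levels in the metaslab.c changes
  further down. A metaslab group takes a plain average over its metaslabs
  that have a valid value, and the class (pool) value weights each group by
  its space. The standalone C below mirrors that arithmetic under stated
  assumptions: FRAG_INVALID stands in for ZFS_FRAG_INVALID, the function and
  parameter names are hypothetical, plain arrays replace the kernel
  structures, and the class total is taken as the sum of the group sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stand-in for ZFS_FRAG_INVALID, which is defined in zfs.h. */
#define	FRAG_INVALID	UINT64_MAX

/*
 * Group metric: simple average over metaslabs with a valid value.
 * Yields FRAG_INVALID unless more than half of the metaslabs are valid.
 */
static uint64_t
group_fragmentation(const uint64_t *ms_frag, size_t ms_count)
{
	uint64_t sum = 0, valid = 0;

	for (size_t m = 0; m < ms_count; m++) {
		if (ms_frag[m] == FRAG_INVALID)
			continue;
		assert(ms_frag[m] <= 100);
		valid++;
		sum += ms_frag[m];
	}
	if (valid <= ms_count / 2)
		return (FRAG_INVALID);
	return (sum / valid);
}

/*
 * Class metric: group values weighted by group space. A single group
 * without a valid metric makes the whole result invalid. Returning
 * FRAG_INVALID for an empty class is a choice made for this sketch.
 */
static uint64_t
class_fragmentation(const uint64_t *mg_frag, const uint64_t *mg_space,
    size_t groups)
{
	uint64_t weighted = 0, total = 0;

	for (size_t g = 0; g < groups; g++) {
		if (mg_frag[g] == FRAG_INVALID)
			return (FRAG_INVALID);
		weighted += mg_frag[g] * mg_space[g];
		total += mg_space[g];
	}
	if (total == 0)
		return (FRAG_INVALID);
	return (weighted / total);
}

/*
 * Eligibility test in the style of metaslab_group_alloc_update(): enough
 * free space, and either no valid metric yet or one within the threshold.
 * Returns 1 when the group is allocatable, 0 otherwise.
 */
static int
group_allocatable(uint64_t free_pct, uint64_t frag,
    uint64_t noalloc_threshold, uint64_t frag_threshold)
{
	return (free_pct > noalloc_threshold &&
	    (frag == FRAG_INVALID || frag <= frag_threshold));
}
```

  With the default zfs_mg_fragmentation_threshold of 85 from this change, a
  group reporting 90% fragmentation would fail group_allocatable(), while a
  group with no valid metric yet would still pass on free space alone.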

Modified:
  head/cddl/contrib/opensolaris/cmd/zdb/zdb.8
  head/cddl/contrib/opensolaris/cmd/zdb/zdb.c
  head/cddl/contrib/opensolaris/cmd/zpool/zpool.8
  head/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
  head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c
  head/sys/cddl/contrib/opensolaris/common/zfs/zpool_prop.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/range_tree.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/space_map.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_debug.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c
  head/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h
Directory Properties:
  head/cddl/contrib/opensolaris/   (props changed)
  head/cddl/contrib/opensolaris/lib/libzfs/   (props changed)
  head/sys/cddl/contrib/opensolaris/   (props changed)

Modified: head/cddl/contrib/opensolaris/cmd/zdb/zdb.8
==============================================================================
--- head/cddl/contrib/opensolaris/cmd/zdb/zdb.8	Sat Jul 26 09:09:14 2014	(r269117)
+++ head/cddl/contrib/opensolaris/cmd/zdb/zdb.8	Sat Jul 26 10:20:48 2014	(r269118)
@@ -19,7 +19,7 @@
 .\"
 .\" $FreeBSD$
 .\"
-.Dd July 1, 2014
+.Dd July 26, 2014
 .Dt ZDB 8
 .Os
 .Sh NAME
@@ -27,11 +27,11 @@
 .Nd Display zpool debugging and consistency information
 .Sh SYNOPSIS
 .Nm
-.Op Fl CumdibcsDvhLXFPA
+.Op Fl CumdibcsDvhLMXFPA
 .Op Fl e Op Fl p Ar path...
 .Op Fl t Ar txg
 .Op Fl U Ar cache
-.Op Fl M Ar inflight I/Os
+.Op Fl I Ar inflight I/Os
 .Op Fl x Ar dumpdir
 .Ar poolname
 .Op Ar object ...
@@ -42,7 +42,7 @@
 .Ar dataset
 .Op Ar object ...
 .Nm
-.Fl m Op Fl LXFPA
+.Fl m Op Fl MLXFPA
 .Op Fl t Ar txg
 .Op Fl e Op Fl p Ar path...
 .Op Fl U Ar cache
@@ -155,6 +155,13 @@ By default,
 verifies that all non-free blocks are referenced, which can be very expensive.
 .It Fl m
 Display the offset, spacemap, and free space of each metaslab.
+When specified twice, also display information about the on-disk free
+space histogram associated with each metaslab. When specified three times,
+display the maximum contiguous free space, the in-core free space histogram,
+and the percentage of free space in each space map.  When specified
+four times display every spacemap record.
+.It Fl M
+Display the offset, spacemap, and free space of each metaslab.
 When specified twice, also display information about the maximum contiguous
 free space and the percentage of free space in each space map.
 When specified three times display every spacemap record.
@@ -229,7 +236,7 @@ all metadata on the pool.
 .It Fl F
 Attempt to make an unreadable pool readable by trying progressively older
 transactions.
-.It Fl M Ar inflight I/Os
+.It Fl I Ar inflight I/Os
 Limit the number of outstanding checksum I/Os to the specified value.
 The default value is 200. This option affects the performance of the
 .Fl c

Modified: head/cddl/contrib/opensolaris/cmd/zdb/zdb.c
==============================================================================
--- head/cddl/contrib/opensolaris/cmd/zdb/zdb.c	Sat Jul 26 09:09:14 2014	(r269117)
+++ head/cddl/contrib/opensolaris/cmd/zdb/zdb.c	Sat Jul 26 10:20:48 2014	(r269118)
@@ -111,11 +111,11 @@ static void
 usage(void)
 {
 	(void) fprintf(stderr,
-	    "Usage: %s [-CumdibcsDvhLXFPA] [-t txg] [-e [-p path...]] "
-	    "[-U config] [-M inflight I/Os] [-x dumpdir] poolname [object...]\n"
+	    "Usage: %s [-CumMdibcsDvhLXFPA] [-t txg] [-e [-p path...]] "
+	    "[-U config] [-I inflight I/Os] [-x dumpdir] poolname [object...]\n"
 	    "       %s [-divPA] [-e -p path...] [-U config] dataset "
 	    "[object...]\n"
-	    "       %s -m [-LXFPA] [-t txg] [-e [-p path...]] [-U config] "
+	    "       %s -mM [-LXFPA] [-t txg] [-e [-p path...]] [-U config] "
 	    "poolname [vdev [metaslab...]]\n"
 	    "       %s -R [-A] [-e [-p path...]] poolname "
 	    "vdev:offset:size[:flags]\n"
@@ -138,6 +138,7 @@ usage(void)
 	(void) fprintf(stderr, "        -h pool history\n");
 	(void) fprintf(stderr, "        -b block statistics\n");
 	(void) fprintf(stderr, "        -m metaslabs\n");
+	(void) fprintf(stderr, "        -M metaslab groups\n");
 	(void) fprintf(stderr, "        -c checksum all metadata (twice for "
 	    "all data) blocks\n");
 	(void) fprintf(stderr, "        -s report stats on zdb's I/O\n");
@@ -168,7 +169,7 @@ usage(void)
 	(void) fprintf(stderr, "        -P print numbers in parseable form\n");
 	(void) fprintf(stderr, "        -t <txg> -- highest txg to use when "
 	    "searching for uberblocks\n");
-	(void) fprintf(stderr, "        -M <number of inflight I/Os> -- "
+	(void) fprintf(stderr, "        -I <number of inflight I/Os> -- "
 	    "specify the maximum number of "
 	    "checksumming I/Os [default is 200]\n");
 	(void) fprintf(stderr, "Specify an option more than once (e.g. -bb) "
@@ -548,7 +549,7 @@ get_metaslab_refcount(vdev_t *vd)
 {
 	int refcount = 0;
 
-	if (vd->vdev_top == vd) {
+	if (vd->vdev_top == vd && !vd->vdev_removing) {
 		for (int m = 0; m < vd->vdev_ms_count; m++) {
 			space_map_t *sm = vd->vdev_ms[m]->ms_sm;
 
@@ -686,9 +687,10 @@ dump_metaslab(metaslab_t *msp)
 		 * The space map histogram represents free space in chunks
 		 * of sm_shift (i.e. bucket 0 refers to 2^sm_shift).
 		 */
-		(void) printf("\tOn-disk histogram:\n");
+		(void) printf("\tOn-disk histogram:\t\tfragmentation %llu\n",
+		    (u_longlong_t)msp->ms_fragmentation);
 		dump_histogram(sm->sm_phys->smp_histogram,
-		    SPACE_MAP_HISTOGRAM_SIZE(sm), sm->sm_shift);
+		    SPACE_MAP_HISTOGRAM_SIZE, sm->sm_shift);
 	}
 
 	if (dump_opt['d'] > 5 || dump_opt['m'] > 3) {
@@ -713,6 +715,47 @@ print_vdev_metaslab_header(vdev_t *vd)
 }
 
 static void
+dump_metaslab_groups(spa_t *spa)
+{
+	vdev_t *rvd = spa->spa_root_vdev;
+	metaslab_class_t *mc = spa_normal_class(spa);
+	uint64_t fragmentation;
+
+	metaslab_class_histogram_verify(mc);
+
+	for (int c = 0; c < rvd->vdev_children; c++) {
+		vdev_t *tvd = rvd->vdev_child[c];
+		metaslab_group_t *mg = tvd->vdev_mg;
+
+		if (mg->mg_class != mc)
+			continue;
+
+		metaslab_group_histogram_verify(mg);
+		mg->mg_fragmentation = metaslab_group_fragmentation(mg);
+
+		(void) printf("\tvdev %10llu\t\tmetaslabs%5llu\t\t"
+		    "fragmentation",
+		    (u_longlong_t)tvd->vdev_id,
+		    (u_longlong_t)tvd->vdev_ms_count);
+		if (mg->mg_fragmentation == ZFS_FRAG_INVALID) {
+			(void) printf("%3s\n", "-");
+		} else {
+			(void) printf("%3llu%%\n",
+			    (u_longlong_t)mg->mg_fragmentation);
+		}
+		dump_histogram(mg->mg_histogram, RANGE_TREE_HISTOGRAM_SIZE, 0);
+	}
+
+	(void) printf("\tpool %s\tfragmentation", spa_name(spa));
+	fragmentation = metaslab_class_fragmentation(mc);
+	if (fragmentation == ZFS_FRAG_INVALID)
+		(void) printf("\t%3s\n", "-");
+	else
+		(void) printf("\t%3llu%%\n", (u_longlong_t)fragmentation);
+	dump_histogram(mc->mc_histogram, RANGE_TREE_HISTOGRAM_SIZE, 0);
+}
+
+static void
 dump_metaslabs(spa_t *spa)
 {
 	vdev_t *vd, *rvd = spa->spa_root_vdev;
@@ -2369,8 +2412,7 @@ zdb_leak(void *arg, uint64_t start, uint
 }
 
 static metaslab_ops_t zdb_metaslab_ops = {
-	NULL,	/* alloc */
-	NULL	/* fragmented */
+	NULL	/* alloc */
 };
 
 static void
@@ -2865,6 +2907,8 @@ dump_zpool(spa_t *spa)
 
 	if (dump_opt['d'] > 2 || dump_opt['m'])
 		dump_metaslabs(spa);
+	if (dump_opt['M'])
+		dump_metaslab_groups(spa);
 
 	if (dump_opt['d'] || dump_opt['i']) {
 		dump_dir(dp->dp_meta_objset);
@@ -3360,7 +3404,7 @@ main(int argc, char **argv)
 	dprintf_setup(&argc, argv);
 
 	while ((c = getopt(argc, argv,
-	    "bcdhilmM:suCDRSAFLXx:evp:t:U:P")) != -1) {
+	    "bcdhilmMI:suCDRSAFLXx:evp:t:U:P")) != -1) {
 		switch (c) {
 		case 'b':
 		case 'c':
@@ -3373,6 +3417,7 @@ main(int argc, char **argv)
 		case 'u':
 		case 'C':
 		case 'D':
+		case 'M':
 		case 'R':
 		case 'S':
 			dump_opt[c]++;
@@ -3386,10 +3431,7 @@ main(int argc, char **argv)
 		case 'P':
 			dump_opt[c]++;
 			break;
-		case 'v':
-			verbose++;
-			break;
-		case 'M':
+		case 'I':
 			max_inflight = strtoull(optarg, NULL, 0);
 			if (max_inflight == 0) {
 				(void) fprintf(stderr, "maximum number "
@@ -3413,9 +3455,6 @@ main(int argc, char **argv)
 			}
 			searchdirs[nsearch++] = optarg;
 			break;
-		case 'x':
-			vn_dumpdir = optarg;
-			break;
 		case 't':
 			max_txg = strtoull(optarg, NULL, 0);
 			if (max_txg < TXG_INITIAL) {
@@ -3427,6 +3466,12 @@ main(int argc, char **argv)
 		case 'U':
 			spa_config_path = optarg;
 			break;
+		case 'v':
+			verbose++;
+			break;
+		case 'x':
+			vn_dumpdir = optarg;
+			break;
 		default:
 			usage();
 			break;

Modified: head/cddl/contrib/opensolaris/cmd/zpool/zpool.8
==============================================================================
--- head/cddl/contrib/opensolaris/cmd/zpool/zpool.8	Sat Jul 26 09:09:14 2014	(r269117)
+++ head/cddl/contrib/opensolaris/cmd/zpool/zpool.8	Sat Jul 26 10:20:48 2014	(r269118)
@@ -21,12 +21,12 @@
 .\" Copyright (c) 2010, Sun Microsystems, Inc. All Rights Reserved.
 .\" Copyright 2011, Nexenta Systems, Inc. All Rights Reserved.
 .\" Copyright (c) 2011, Justin T. Gibbs <gibbs@FreeBSD.org>
-.\" Copyright (c) 2012 by Delphix. All Rights Reserved.
+.\" Copyright (c) 2013 by Delphix. All Rights Reserved.
 .\" Copyright (c) 2012, Glen Barber <gjb@FreeBSD.org>
 .\"
 .\" $FreeBSD$
 .\"
-.Dd July 25, 2014
+.Dd July 26, 2014
 .Dt ZPOOL 8
 .Os
 .Sh NAME
@@ -543,6 +543,15 @@ For example, a
 value of 1.76 indicates that 1.76 units of data were stored but only 1 unit of disk space was actually consumed. See
 .Xr zfs 8
 for a description of the deduplication feature.
+.It Sy expandsize
+Amount of uninitialized space within the pool or device that can be used to
+increase the total capacity of the pool.
+Uninitialized space consists of
+any space on an EFI labeled vdev which has not been brought online
+.Pq i.e. zpool online -e .
+This space occurs when a LUN is dynamically expanded.
+.It Sy fragmentation
+The amount of fragmentation in the pool.
 .It Sy free
 Number of blocks within the pool that are not allocated.
 .It Sy freeing
@@ -555,13 +564,6 @@ Over time
 will decrease while
 .Sy free
 increases.
-.It Sy expandsize
-Amount of uninitialized space within the pool or device that can be used to
-increase the total capacity of the pool.
-Uninitialized space consists of
-any space on an EFI labeled vdev which has not been brought online
-.Pq i.e. zpool online -e .
-This space occurs when a LUN is dynamically expanded.
 .It Sy guid
 A unique identifier for the pool.
 .It Sy health
@@ -1408,6 +1410,7 @@ section for a list of valid properties. 
 .Sy size ,
 .Sy used ,
 .Sy available ,
+.Sy fragmentation ,
 .Sy expandsize ,
 .Sy capacity  ,
 .Sy health ,
@@ -1794,9 +1797,9 @@ is immediately available to any datasets
 The following command lists all available pools on the system.
 .Bd -literal -offset 2n
 .Li # Ic zpool list
-NAME   SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
-pool  2.70T   473G  2.24T         -    17%  1.00x  ONLINE  -
-test  1.98G  89.5K  1.98G         -     0%  1.00x  ONLINE  -
+NAME   SIZE  ALLOC   FREE   FRAG  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
+pool  2.70T   473G  2.24T    33%         -    17%  1.00x  ONLINE  -
+test  1.98G  89.5K  1.98G    48%         -     0%  1.00x  ONLINE  -
 .Ed
 .It Sy Example 7 No Listing All Properties for a Pool
 .Pp
@@ -1924,7 +1927,35 @@ subcommand as follows:
 .Bd -literal -offset 2n
 .Li # Ic zpool iostat -v pool 5
 .Ed
-.It Sy Example 15 No Removing a Mirrored Log Device
+.It Xo
+.Sy Example 15
+Displaying expanded space on a device
+.Xc
+.Pp
+The following command displays the detailed information for the
+.Em data
+pool.
+This pool is comprised of a single
+.Em raidz
+vdev where one of its
+devices increased its capacity by 10GB.
+In this example, the pool will not
+be able to utilize this extra capacity until all the devices under the
+.Em raidz
+vdev have been expanded.
+.Bd -literal -offset 2n
+.Li # Ic zpool list -v data
+NAME       SIZE  ALLOC   FREE   FRAG  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
+data      23.9G  14.6G  9.30G    48%         -    61%  1.00x  ONLINE  -
+  raidz1  23.9G  14.6G  9.30G    48%         -
+    ada0      -      -      -      -         -
+    ada1      -      -      -      -       10G
+    ada2      -      -      -      -         -
+.Ed
+.It Xo
+.Sy Example 16
+Removing a Mirrored Log Device
+.Xc
 .Pp
 The following command removes the mirrored log device
 .Em mirror-2 .
@@ -1956,7 +1987,12 @@ is:
 .Bd -literal -offset 2n
 .Li # Ic zpool remove tank mirror-2
 .Ed
-.It Sy Example 16 No Recovering a Faulted Tn ZFS No Pool
+.It Xo
+.Sy Example 17
+Recovering a Faulted
+.Tn ZFS
+Pool
+.Xc
 .Pp
 If a pool is faulted but recoverable, a message indicating this state is
 provided by

Modified: head/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
==============================================================================
--- head/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c	Sat Jul 26 09:09:14 2014	(r269117)
+++ head/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c	Sat Jul 26 10:20:48 2014	(r269118)
@@ -2900,10 +2900,15 @@ print_one_column(zpool_prop_t prop, uint
 	boolean_t fixed;
 	size_t width = zprop_width(prop, &fixed, ZFS_TYPE_POOL);
 
-	zfs_nicenum(value, propval, sizeof (propval));
 
 	if (prop == ZPOOL_PROP_EXPANDSZ && value == 0)
 		(void) strlcpy(propval, "-", sizeof (propval));
+	else if (prop == ZPOOL_PROP_FRAGMENTATION && value == ZFS_FRAG_INVALID)
+		(void) strlcpy(propval, "-", sizeof (propval));
+	else if (prop == ZPOOL_PROP_FRAGMENTATION)
+		(void) snprintf(propval, sizeof (propval), "%llu%%", value);
+	else
+		zfs_nicenum(value, propval, sizeof (propval));
 
 	if (scripted)
 		(void) printf("\t%s", propval);
@@ -2936,9 +2941,9 @@ print_list_stats(zpool_handle_t *zhp, co
 		/* only toplevel vdevs have capacity stats */
 		if (vs->vs_space == 0) {
 			if (scripted)
-				(void) printf("\t-\t-\t-");
+				(void) printf("\t-\t-\t-\t-");
 			else
-				(void) printf("      -      -      -");
+				(void) printf("      -      -      -      -");
 		} else {
 			print_one_column(ZPOOL_PROP_SIZE, vs->vs_space,
 			    scripted);
@@ -2946,6 +2951,8 @@ print_list_stats(zpool_handle_t *zhp, co
 			    scripted);
 			print_one_column(ZPOOL_PROP_FREE,
 			    vs->vs_space - vs->vs_alloc, scripted);
+			print_one_column(ZPOOL_PROP_FRAGMENTATION,
+			    vs->vs_fragmentation, scripted);
 		}
 		print_one_column(ZPOOL_PROP_EXPANDSZ, vs->vs_esize,
 		    scripted);
@@ -3031,8 +3038,8 @@ zpool_do_list(int argc, char **argv)
 	int ret;
 	list_cbdata_t cb = { 0 };
 	static char default_props[] =
-	    "name,size,allocated,free,expandsize,capacity,dedupratio,"
-	    "health,altroot";
+	    "name,size,allocated,free,fragmentation,expandsize,capacity,"
+	    "dedupratio,health,altroot";
 	char *props = default_props;
 	unsigned long interval = 0, count = 0;
 	zpool_list_t *list;

Modified: head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c
==============================================================================
--- head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c	Sat Jul 26 09:09:14 2014	(r269117)
+++ head/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c	Sat Jul 26 10:20:48 2014	(r269118)
@@ -322,6 +322,14 @@ zpool_get_prop(zpool_handle_t *zhp, zpoo
 				    (u_longlong_t)intval);
 			}
 			break;
+		case ZPOOL_PROP_FRAGMENTATION:
+			if (intval == UINT64_MAX) {
+				(void) strlcpy(buf, "-", len);
+			} else {
+				(void) snprintf(buf, len, "%llu%%",
+				    (u_longlong_t)intval);
+			}
+			break;
 
 		case ZPOOL_PROP_DEDUPRATIO:
 			(void) snprintf(buf, len, "%llu.%02llux",
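
Editorial sketch (not part of the commit): both zpool_main.c and
libzfs_pool.c above render the new property the same way, as "-" when no
metric is available and as a whole-number percentage otherwise. The
standalone function below reproduces that rule. The name
format_fragmentation is hypothetical, and FRAG_INVALID is assumed to be
UINT64_MAX, which is the value the libzfs hunk compares against.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: same sentinel the libzfs_pool.c hunk tests for. */
#define	FRAG_INVALID	UINT64_MAX

/*
 * Write the display form of a fragmentation value into buf:
 * "-" for the invalid sentinel, "<n>%" for anything else.
 */
static void
format_fragmentation(uint64_t value, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	if (value == FRAG_INVALID)
		(void) snprintf(buf, len, "-");
	else
		(void) snprintf(buf, len, "%" PRIu64 "%%", value);
}
```

Keeping the sentinel check ahead of the numeric formatting matters here:
without it, a pool with no metric would be printed as a 20-digit
percentage instead of a dash.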

Modified: head/sys/cddl/contrib/opensolaris/common/zfs/zpool_prop.c
==============================================================================
--- head/sys/cddl/contrib/opensolaris/common/zfs/zpool_prop.c	Sat Jul 26 09:09:14 2014	(r269117)
+++ head/sys/cddl/contrib/opensolaris/common/zfs/zpool_prop.c	Sat Jul 26 10:20:48 2014	(r269118)
@@ -21,7 +21,7 @@
 /*
  * Copyright (c) 2007, 2010, Oracle and/or its affiliates. All rights reserved.
  * Copyright 2011 Nexenta Systems, Inc. All rights reserved.
- * Copyright (c) 2012 by Delphix. All rights reserved.
+ * Copyright (c) 2012, 2014 by Delphix. All rights reserved.
  */
 
 #include <sys/zio.h>
@@ -87,6 +87,8 @@ zpool_prop_init(void)
 	    PROP_READONLY, ZFS_TYPE_POOL, "<size>", "ALLOC");
 	zprop_register_number(ZPOOL_PROP_EXPANDSZ, "expandsize", 0,
 	    PROP_READONLY, ZFS_TYPE_POOL, "<size>", "EXPANDSZ");
+	zprop_register_number(ZPOOL_PROP_FRAGMENTATION, "fragmentation", 0,
+	    PROP_READONLY, ZFS_TYPE_POOL, "<percent>", "FRAG");
 	zprop_register_number(ZPOOL_PROP_CAPACITY, "capacity", 0, PROP_READONLY,
 	    ZFS_TYPE_POOL, "<size>", "CAP");
 	zprop_register_number(ZPOOL_PROP_GUID, "guid", 0, PROP_READONLY,

Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c
==============================================================================
--- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c	Sat Jul 26 09:09:14 2014	(r269117)
+++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c	Sat Jul 26 10:20:48 2014	(r269118)
@@ -32,6 +32,7 @@
 #include <sys/vdev_impl.h>
 #include <sys/zio.h>
 #include <sys/spa_impl.h>
+#include <sys/zfeature.h>
 
 SYSCTL_DECL(_vfs_zfs);
 SYSCTL_NODE(_vfs_zfs, OID_AUTO, metaslab, CTLFLAG_RW, 0, "ZFS metaslab");
@@ -89,7 +90,7 @@ int zfs_metaslab_condense_block_threshol
 /*
  * The zfs_mg_noalloc_threshold defines which metaslab groups should
  * be eligible for allocation. The value is defined as a percentage of
- * a free space. Metaslab groups that have more free space than
+ * free space. Metaslab groups that have more free space than
  * zfs_mg_noalloc_threshold are always eligible for allocations. Once
  * a metaslab group's free space is less than or equal to the
  * zfs_mg_noalloc_threshold the allocator will avoid allocating to that
@@ -106,6 +107,23 @@ SYSCTL_INT(_vfs_zfs, OID_AUTO, mg_noallo
     " to make it eligible for allocation");
 
 /*
+ * Metaslab groups are considered eligible for allocations if their
+ * fragmentation metric (measured as a percentage) is less than or equal to
+ * zfs_mg_fragmentation_threshold. If a metaslab group exceeds this threshold
+ * then it will be skipped unless all metaslab groups within the metaslab
+ * class have also crossed this threshold.
+ */
+int zfs_mg_fragmentation_threshold = 85;
+
+/*
+ * Allow metaslabs to keep their active state as long as their fragmentation
+ * percentage is less than or equal to zfs_metaslab_fragmentation_threshold. An
+ * active metaslab that exceeds this threshold will no longer keep its active
+ * status allowing better metaslabs to be selected.
+ */
+int zfs_metaslab_fragmentation_threshold = 70;
+
+/*
  * When set will load all metaslabs when pool is first opened.
  */
 int metaslab_debug_load = 0;
@@ -173,13 +191,6 @@ SYSCTL_INT(_vfs_zfs_metaslab, OID_AUTO, 
     "Number of TXGs that an unused metaslab can be kept in memory");
 
 /*
- * Should we be willing to write data to degraded vdevs?
- */
-boolean_t zfs_write_to_degraded = B_FALSE;
-SYSCTL_INT(_vfs_zfs, OID_AUTO, write_to_degraded, CTLFLAG_RWTUN,
-    &zfs_write_to_degraded, 0, "Allow writing data to degraded vdevs");
-
-/*
  * Max number of metaslabs per group to preload.
  */
 int metaslab_preload_limit = SPA_DVAS_PER_BP;
@@ -196,13 +207,30 @@ SYSCTL_INT(_vfs_zfs_metaslab, OID_AUTO, 
     "Max number of metaslabs per group to preload");
 
 /*
- * Enable/disable additional weight factor for each metaslab.
+ * Enable/disable fragmentation weighting on metaslabs.
+ */
+boolean_t metaslab_fragmentation_factor_enabled = B_TRUE;
+SYSCTL_INT(_vfs_zfs_metaslab, OID_AUTO, fragmentation_factor_enabled, CTLFLAG_RWTUN,
+    &metaslab_fragmentation_factor_enabled, 0,
+    "Enable fragmentation weighting on metaslabs");
+
+/*
+ * Enable/disable lba weighting (i.e. outer tracks are given preference).
+ */
+boolean_t metaslab_lba_weighting_enabled = B_TRUE;
+SYSCTL_INT(_vfs_zfs_metaslab, OID_AUTO, lba_weighting_enabled, CTLFLAG_RWTUN,
+    &metaslab_lba_weighting_enabled, 0,
+    "Enable LBA weighting (i.e. outer tracks are given preference)");
+
+/*
+ * Enable/disable metaslab group biasing.
  */
-boolean_t metaslab_weight_factor_enable = B_FALSE;
-SYSCTL_INT(_vfs_zfs_metaslab, OID_AUTO, weight_factor_enable, CTLFLAG_RWTUN,
-    &metaslab_weight_factor_enable, 0,
-    "Enable additional weight factor for each metaslab");
+boolean_t metaslab_bias_enabled = B_TRUE;
+SYSCTL_INT(_vfs_zfs_metaslab, OID_AUTO, bias_enabled, CTLFLAG_RWTUN,
+    &metaslab_bias_enabled, 0,
+    "Enable metaslab group biasing");
 
+static uint64_t metaslab_fragmentation(metaslab_t *);
 
 /*
  * ==========================================================================
@@ -322,6 +350,121 @@ metaslab_class_get_minblocksize(metaslab
 	return (mc->mc_minblocksize);
 }
 
+void
+metaslab_class_histogram_verify(metaslab_class_t *mc)
+{
+	vdev_t *rvd = mc->mc_spa->spa_root_vdev;
+	uint64_t *mc_hist;
+	int i;
+
+	if ((zfs_flags & ZFS_DEBUG_HISTOGRAM_VERIFY) == 0)
+		return;
+
+	mc_hist = kmem_zalloc(sizeof (uint64_t) * RANGE_TREE_HISTOGRAM_SIZE,
+	    KM_SLEEP);
+
+	for (int c = 0; c < rvd->vdev_children; c++) {
+		vdev_t *tvd = rvd->vdev_child[c];
+		metaslab_group_t *mg = tvd->vdev_mg;
+
+		/*
+		 * Skip any holes, uninitialized top-levels, or
+		 * vdevs that are not in this metaslab class.
+		 */
+		if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
+		    mg->mg_class != mc) {
+			continue;
+		}
+
+		for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
+			mc_hist[i] += mg->mg_histogram[i];
+	}
+
+	for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++)
+		VERIFY3U(mc_hist[i], ==, mc->mc_histogram[i]);
+
+	kmem_free(mc_hist, sizeof (uint64_t) * RANGE_TREE_HISTOGRAM_SIZE);
+}
+
+/*
+ * Calculate the metaslab class's fragmentation metric. The metric
+ * is weighted based on the space contribution of each metaslab group.
+ * The return value will be a number between 0 and 100 (inclusive), or
+ * ZFS_FRAG_INVALID if the metric has not been set. See comment above the
+ * zfs_frag_table for more information about the metric.
+ */
+uint64_t
+metaslab_class_fragmentation(metaslab_class_t *mc)
+{
+	vdev_t *rvd = mc->mc_spa->spa_root_vdev;
+	uint64_t fragmentation = 0;
+
+	spa_config_enter(mc->mc_spa, SCL_VDEV, FTAG, RW_READER);
+
+	for (int c = 0; c < rvd->vdev_children; c++) {
+		vdev_t *tvd = rvd->vdev_child[c];
+		metaslab_group_t *mg = tvd->vdev_mg;
+
+		/*
+		 * Skip any holes, uninitialized top-levels, or
+		 * vdevs that are not in this metaslab class.
+		 */
+		if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
+		    mg->mg_class != mc) {
+			continue;
+		}
+
+		/*
+		 * If a metaslab group does not contain a fragmentation
+		 * metric then just bail out.
+		 */
+		if (mg->mg_fragmentation == ZFS_FRAG_INVALID) {
+			spa_config_exit(mc->mc_spa, SCL_VDEV, FTAG);
+			return (ZFS_FRAG_INVALID);
+		}
+
+		/*
+		 * Determine how much this metaslab_group is contributing
+		 * to the overall pool fragmentation metric.
+		 */
+		fragmentation += mg->mg_fragmentation *
+		    metaslab_group_get_space(mg);
+	}
+	fragmentation /= metaslab_class_get_space(mc);
+
+	ASSERT3U(fragmentation, <=, 100);
+	spa_config_exit(mc->mc_spa, SCL_VDEV, FTAG);
+	return (fragmentation);
+}
+
+/*
+ * Calculate the amount of expandable space that is available in
+ * this metaslab class. If a device is expanded then its expandable
+ * space will be the amount of allocatable space that is currently not
+ * part of this metaslab class.
+ */
+uint64_t
+metaslab_class_expandable_space(metaslab_class_t *mc)
+{
+	vdev_t *rvd = mc->mc_spa->spa_root_vdev;
+	uint64_t space = 0;
+
+	spa_config_enter(mc->mc_spa, SCL_VDEV, FTAG, RW_READER);
+	for (int c = 0; c < rvd->vdev_children; c++) {
+		vdev_t *tvd = rvd->vdev_child[c];
+		metaslab_group_t *mg = tvd->vdev_mg;
+
+		if (tvd->vdev_ishole || tvd->vdev_ms_shift == 0 ||
+		    mg->mg_class != mc) {
+			continue;
+		}
+
+		space += tvd->vdev_max_asize - tvd->vdev_asize;
+	}
+	spa_config_exit(mc->mc_spa, SCL_VDEV, FTAG);
+	return (space);
+}
+
 /*
  * ==========================================================================
  * Metaslab groups
@@ -374,7 +517,15 @@ metaslab_group_alloc_update(metaslab_gro
 	mg->mg_free_capacity = ((vs->vs_space - vs->vs_alloc) * 100) /
 	    (vs->vs_space + 1);
 
-	mg->mg_allocatable = (mg->mg_free_capacity > zfs_mg_noalloc_threshold);
+	/*
+	 * A metaslab group is considered allocatable if it has plenty
+	 * of free space or is not heavily fragmented. We only take
+	 * fragmentation into account if the metaslab group has a valid
+	 * fragmentation metric (i.e. a value between 0 and 100).
+	 */
+	mg->mg_allocatable = (mg->mg_free_capacity > zfs_mg_noalloc_threshold &&
+	    (mg->mg_fragmentation == ZFS_FRAG_INVALID ||
+	    mg->mg_fragmentation <= zfs_mg_fragmentation_threshold));
 
 	/*
 	 * The mc_alloc_groups maintains a count of the number of
@@ -395,6 +546,7 @@ metaslab_group_alloc_update(metaslab_gro
 		mc->mc_alloc_groups--;
 	else if (!was_allocatable && mg->mg_allocatable)
 		mc->mc_alloc_groups++;
+
 	mutex_exit(&mg->mg_lock);
 }
 
@@ -485,6 +637,7 @@ metaslab_group_passivate(metaslab_group_
 	}
 
 	taskq_wait(mg->mg_taskq);
+	metaslab_group_alloc_update(mg);
 
 	mgprev = mg->mg_prev;
 	mgnext = mg->mg_next;
@@ -502,20 +655,113 @@ metaslab_group_passivate(metaslab_group_
 	metaslab_class_minblocksize_update(mc);
 }
 
+uint64_t
+metaslab_group_get_space(metaslab_group_t *mg)
+{
+	return ((1ULL << mg->mg_vd->vdev_ms_shift) * mg->mg_vd->vdev_ms_count);
+}
+
+void
+metaslab_group_histogram_verify(metaslab_group_t *mg)
+{
+	uint64_t *mg_hist;
+	vdev_t *vd = mg->mg_vd;
+	uint64_t ashift = vd->vdev_ashift;
+	int i;
+
+	if ((zfs_flags & ZFS_DEBUG_HISTOGRAM_VERIFY) == 0)
+		return;
+
+	mg_hist = kmem_zalloc(sizeof (uint64_t) * RANGE_TREE_HISTOGRAM_SIZE,
+	    KM_SLEEP);
+
+	ASSERT3U(RANGE_TREE_HISTOGRAM_SIZE, >=,
+	    SPACE_MAP_HISTOGRAM_SIZE + ashift);
+
+	for (int m = 0; m < vd->vdev_ms_count; m++) {
+		metaslab_t *msp = vd->vdev_ms[m];
+
+		if (msp->ms_sm == NULL)
+			continue;
+
+		for (i = 0; i < SPACE_MAP_HISTOGRAM_SIZE; i++)
+			mg_hist[i + ashift] +=
+			    msp->ms_sm->sm_phys->smp_histogram[i];
+	}
+
+	for (i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i ++)
+		VERIFY3U(mg_hist[i], ==, mg->mg_histogram[i]);
+
+	kmem_free(mg_hist, sizeof (uint64_t) * RANGE_TREE_HISTOGRAM_SIZE);
+}
+
 static void
-metaslab_group_add(metaslab_group_t *mg, metaslab_t *msp)
+metaslab_group_histogram_add(metaslab_group_t *mg, metaslab_t *msp)
 {
+	metaslab_class_t *mc = mg->mg_class;
+	uint64_t ashift = mg->mg_vd->vdev_ashift;
+
+	ASSERT(MUTEX_HELD(&msp->ms_lock));
+	if (msp->ms_sm == NULL)
+		return;
+
 	mutex_enter(&mg->mg_lock);
+	for (int i = 0; i < SPACE_MAP_HISTOGRAM_SIZE; i++) {
+		mg->mg_histogram[i + ashift] +=
+		    msp->ms_sm->sm_phys->smp_histogram[i];
+		mc->mc_histogram[i + ashift] +=
+		    msp->ms_sm->sm_phys->smp_histogram[i];
+	}
+	mutex_exit(&mg->mg_lock);
+}
+
+void
+metaslab_group_histogram_remove(metaslab_group_t *mg, metaslab_t *msp)
+{
+	metaslab_class_t *mc = mg->mg_class;
+	uint64_t ashift = mg->mg_vd->vdev_ashift;
+
+	ASSERT(MUTEX_HELD(&msp->ms_lock));
+	if (msp->ms_sm == NULL)
+		return;
+
+	mutex_enter(&mg->mg_lock);
+	for (int i = 0; i < SPACE_MAP_HISTOGRAM_SIZE; i++) {
+		ASSERT3U(mg->mg_histogram[i + ashift], >=,
+		    msp->ms_sm->sm_phys->smp_histogram[i]);
+		ASSERT3U(mc->mc_histogram[i + ashift], >=,
+		    msp->ms_sm->sm_phys->smp_histogram[i]);
+
+		mg->mg_histogram[i + ashift] -=
+		    msp->ms_sm->sm_phys->smp_histogram[i];
+		mc->mc_histogram[i + ashift] -=
+		    msp->ms_sm->sm_phys->smp_histogram[i];
+	}
+	mutex_exit(&mg->mg_lock);
+}
+
+static void
+metaslab_group_add(metaslab_group_t *mg, metaslab_t *msp)
+{
 	ASSERT(msp->ms_group == NULL);
+	mutex_enter(&mg->mg_lock);
 	msp->ms_group = mg;
 	msp->ms_weight = 0;
 	avl_add(&mg->mg_metaslab_tree, msp);
 	mutex_exit(&mg->mg_lock);
+
+	mutex_enter(&msp->ms_lock);
+	metaslab_group_histogram_add(mg, msp);
+	mutex_exit(&msp->ms_lock);
 }
 
 static void
 metaslab_group_remove(metaslab_group_t *mg, metaslab_t *msp)
 {
+	mutex_enter(&msp->ms_lock);
+	metaslab_group_histogram_remove(mg, msp);
+	mutex_exit(&msp->ms_lock);
+
 	mutex_enter(&mg->mg_lock);
 	ASSERT(msp->ms_group == mg);
 	avl_remove(&mg->mg_metaslab_tree, msp);
@@ -528,9 +774,9 @@ metaslab_group_sort(metaslab_group_t *mg
 {
 	/*
 	 * Although in principle the weight can be any value, in
-	 * practice we do not use values in the range [1, 510].
+	 * practice we do not use values in the range [1, 511].
 	 */
-	ASSERT(weight >= SPA_MINBLOCKSIZE-1 || weight == 0);
+	ASSERT(weight >= SPA_MINBLOCKSIZE || weight == 0);
 	ASSERT(MUTEX_HELD(&msp->ms_lock));
 
 	mutex_enter(&mg->mg_lock);
@@ -542,9 +788,42 @@ metaslab_group_sort(metaslab_group_t *mg
 }
 
 /*
+ * Calculate the fragmentation for a given metaslab group. We can use
+ * a simple average here since all metaslabs within the group must have
+ * the same size. The return value will be a value between 0 and 100
+ * (inclusive), or ZFS_FRAG_INVALID if less than half of the metaslabs in this
+ * group have a fragmentation metric.
+ */
+uint64_t
+metaslab_group_fragmentation(metaslab_group_t *mg)
+{
+	vdev_t *vd = mg->mg_vd;
+	uint64_t fragmentation = 0;
+	uint64_t valid_ms = 0;
+
+	for (int m = 0; m < vd->vdev_ms_count; m++) {
+		metaslab_t *msp = vd->vdev_ms[m];
+
+		if (msp->ms_fragmentation == ZFS_FRAG_INVALID)
+			continue;
+
+		valid_ms++;
+		fragmentation += msp->ms_fragmentation;
+	}
+
+	if (valid_ms <= vd->vdev_ms_count / 2)
+		return (ZFS_FRAG_INVALID);
+
+	fragmentation /= valid_ms;
+	ASSERT3U(fragmentation, <=, 100);
+	return (fragmentation);
+}
+
+/*
  * Determine if a given metaslab group should skip allocations. A metaslab
- * group should avoid allocations if its used capacity has crossed the
- * zfs_mg_noalloc_threshold and there is at least one metaslab group
+ * group should avoid allocations if its free capacity is less than the
+ * zfs_mg_noalloc_threshold or its fragmentation metric is greater than
+ * zfs_mg_fragmentation_threshold and there is at least one metaslab group
  * that can still handle allocations.
  */
 static boolean_t
@@ -555,12 +834,19 @@ metaslab_group_allocatable(metaslab_grou
 	metaslab_class_t *mc = mg->mg_class;
 
 	/*
-	 * A metaslab group is considered allocatable if its free capacity
-	 * is greater than the set value of zfs_mg_noalloc_threshold, it's
-	 * associated with a slog, or there are no other metaslab groups
-	 * with free capacity greater than zfs_mg_noalloc_threshold.
-	 */
-	return (mg->mg_free_capacity > zfs_mg_noalloc_threshold ||
+	 * We use two key metrics to determine if a metaslab group is
+	 * considered allocatable -- free space and fragmentation. If
+	 * the free space is greater than the free space threshold and
+	 * the fragmentation is less than the fragmentation threshold then
+	 * consider the group allocatable. There are two cases when we will
+	 * not consider these key metrics. The first is if the group is
+	 * associated with a slog device and the second is if all groups
+	 * in this metaslab class have already been considered ineligible
+	 * for allocations.
+	 */
+	return ((mg->mg_free_capacity > zfs_mg_noalloc_threshold &&
+	    (mg->mg_fragmentation == ZFS_FRAG_INVALID ||
+	    mg->mg_fragmentation <= zfs_mg_fragmentation_threshold)) ||
 	    mc != spa_normal_class(spa) || mc->mc_alloc_groups == 0);
 }
 
@@ -784,16 +1070,8 @@ metaslab_ff_alloc(metaslab_t *msp, uint6
 	return (metaslab_block_picker(t, cursor, size, align));
 }
 
-/* ARGSUSED */
-static boolean_t
-metaslab_ff_fragmented(metaslab_t *msp)
-{
-	return (B_TRUE);
-}
-
 static metaslab_ops_t metaslab_ff_ops = {
-	metaslab_ff_alloc,
-	metaslab_ff_fragmented
+	metaslab_ff_alloc
 };
 
 /*
@@ -840,23 +1118,8 @@ metaslab_df_alloc(metaslab_t *msp, uint6
 	return (metaslab_block_picker(t, cursor, size, 1ULL));
 }
 
-static boolean_t
-metaslab_df_fragmented(metaslab_t *msp)
-{
-	range_tree_t *rt = msp->ms_tree;
-	uint64_t max_size = metaslab_block_maxsize(msp);
-	int free_pct = range_tree_space(rt) * 100 / msp->ms_size;
-
-	if (max_size >= metaslab_df_alloc_threshold &&
-	    free_pct >= metaslab_df_free_pct)
-		return (B_FALSE);
-
-	return (B_TRUE);
-}
-
 static metaslab_ops_t metaslab_df_ops = {
-	metaslab_df_alloc,
-	metaslab_df_fragmented
+	metaslab_df_alloc
 };
 
 /*
@@ -899,15 +1162,8 @@ metaslab_cf_alloc(metaslab_t *msp, uint6
 	return (offset);
 }
 
-static boolean_t
-metaslab_cf_fragmented(metaslab_t *msp)
-{
-	return (metaslab_block_maxsize(msp) < metaslab_min_alloc_size);
-}
-
 static metaslab_ops_t metaslab_cf_ops = {
-	metaslab_cf_alloc,
-	metaslab_cf_fragmented
+	metaslab_cf_alloc
 };
 
 /*
@@ -964,16 +1220,8 @@ metaslab_ndf_alloc(metaslab_t *msp, uint
 	return (-1ULL);
 }
 
-static boolean_t
-metaslab_ndf_fragmented(metaslab_t *msp)
-{
-	return (metaslab_block_maxsize(msp) <=
-	    (metaslab_min_alloc_size << metaslab_ndf_clump_shift));
-}
-
 static metaslab_ops_t metaslab_ndf_ops = {
-	metaslab_ndf_alloc,
-	metaslab_ndf_fragmented
+	metaslab_ndf_alloc
 };
 
 metaslab_ops_t *zfs_metaslab_ops = &metaslab_df_ops;
@@ -1075,6 +1323,7 @@ metaslab_init(metaslab_group_t *mg, uint
 	msp->ms_tree = range_tree_create(&metaslab_rt_ops, msp, &msp->ms_lock);
 	metaslab_group_add(mg, msp);
 
+	msp->ms_fragmentation = metaslab_fragmentation(msp);
 	msp->ms_ops = mg->mg_class->mc_ops;
 
 	/*
@@ -1140,69 +1389,113 @@ metaslab_fini(metaslab_t *msp)
 	kmem_free(msp, sizeof (metaslab_t));
 }
 
+#define	FRAGMENTATION_TABLE_SIZE	17
+
 /*
- * Apply a weighting factor based on the histogram information for this
- * metaslab. The current weighting factor is somewhat arbitrary and requires
- * additional investigation. The implementation provides a measure of
- * "weighted" free space and gives a higher weighting for larger contiguous
- * regions. The weighting factor is determined by counting the number of
- * sm_shift sectors that exist in each region represented by the histogram.
- * That value is then multiplied by the power of 2 exponent and the sm_shift
- * value.
+ * This table defines a segment size based fragmentation metric that will
+ * allow each metaslab to derive its own fragmentation value. This is done
+ * by calculating the space in each bucket of the spacemap histogram and
+ * multiplying that by the fragmentation metric in this table. Doing
+ * this for all buckets and dividing it by the total amount of free
+ * space in this metaslab (i.e. the total free space in all buckets) gives
+ * us the fragmentation metric. This means that a high fragmentation metric
+ * equates to most of the free space being comprised of small segments.
+ * Conversely, if the metric is low, then most of the free space is in
+ * large segments. A 10% change in fragmentation equates to approximately
+ * double the number of segments.
  *
- * For example, assume the 2^21 histogram bucket has 4 2MB regions and the
- * metaslab has an sm_shift value of 9 (512B):

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
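
For readers who want to try the group-level logic outside the kernel, the
averaging rule in metaslab_group_fragmentation() and the capacity and
fragmentation test in metaslab_group_allocatable() shown in the diff can be
sketched as a standalone C fragment. This is a minimal illustration, not the
committed code: FRAG_INVALID, group_fragmentation() and
group_passes_thresholds() are hypothetical names, the sentinel value is an
assumption (the definition of ZFS_FRAG_INVALID is not visible in this part of
the diff), and the slog and "no other allocatable group" exceptions are left
out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for ZFS_FRAG_INVALID. The real definition is not
 * shown in this diff, so the value used here is an assumption.
 */
#define	FRAG_INVALID	UINT64_MAX

/*
 * Average the per-metaslab fragmentation values, skipping entries that
 * have no metric. Mirrors the diff: the result is FRAG_INVALID when the
 * number of valid entries is at most count / 2 (integer division).
 */
static uint64_t
group_fragmentation(const uint64_t *ms_frag, size_t count)
{
	uint64_t total = 0;
	uint64_t valid = 0;

	for (size_t m = 0; m < count; m++) {
		if (ms_frag[m] == FRAG_INVALID)
			continue;
		valid++;
		total += ms_frag[m];
	}

	/* Also covers count == 0, so the division below is safe. */
	if (valid <= count / 2)
		return (FRAG_INVALID);

	total /= valid;
	assert(total <= 100);
	return (total);
}

/*
 * Capacity and fragmentation part of the allocatable test only. A group
 * passes when it has enough free capacity and its fragmentation is either
 * unknown or at or below the threshold. Thresholds are passed in as
 * parameters here instead of being read from the zfs_mg_* tunables.
 */
static bool
group_passes_thresholds(uint64_t free_capacity, uint64_t fragmentation,
    uint64_t noalloc_threshold, uint64_t frag_threshold)
{
	return (free_capacity > noalloc_threshold &&
	    (fragmentation == FRAG_INVALID ||
	    fragmentation <= frag_threshold));
}
```

Note that a group whose metric is unknown is treated as passing the
fragmentation check, so pools without the histogram feature enabled are not
penalized by this test.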


