From owner-svn-src-all@freebsd.org Fri Jan 20 15:01:05 2017 Return-Path: Delivered-To: svn-src-all@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id EEF39CB93AD; Fri, 20 Jan 2017 15:01:05 +0000 (UTC) (envelope-from jpaetzel@FreeBSD.org) Received: from repo.freebsd.org (repo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:0]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B0C181CA2; Fri, 20 Jan 2017 15:01:05 +0000 (UTC) (envelope-from jpaetzel@FreeBSD.org) Received: from repo.freebsd.org ([127.0.1.37]) by repo.freebsd.org (8.15.2/8.15.2) with ESMTP id v0KF14CT084441; Fri, 20 Jan 2017 15:01:04 GMT (envelope-from jpaetzel@FreeBSD.org) Received: (from jpaetzel@localhost) by repo.freebsd.org (8.15.2/8.15.2/Submit) id v0KF14QP084438; Fri, 20 Jan 2017 15:01:04 GMT (envelope-from jpaetzel@FreeBSD.org) Message-Id: <201701201501.v0KF14QP084438@repo.freebsd.org> X-Authentication-Warning: repo.freebsd.org: jpaetzel set sender to jpaetzel@FreeBSD.org using -f From: Josh Paetzel Date: Fri, 20 Jan 2017 15:01:04 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: svn commit: r312535 - in head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs: . sys X-SVN-Group: head MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Jan 2017 15:01:06 -0000 Author: jpaetzel Date: Fri Jan 20 15:01:04 2017 New Revision: 312535 URL: https://svnweb.freebsd.org/changeset/base/312535 Log: MFV 312436 6569 large file delete can starve out write ops illumos/illumos-gate@ff5177ee8bf9a355131ce2cc61ae2da6a5a6fdd6 https://github.com/illumos/illumos-gate/commit/ff5177ee8bf9a355131ce2cc61ae2da6a5a6fdd6 https://www.illumos.org/issues/6569 The core issue I've found is that there is no throttle for how many deletes get assigned to one TXG. As a results when deleting large files we end up filling consecutive TXGs with deletes/frees, then write throttling other (more important) ops. There is an easy test case for this problem. Try deleting several large files (at least 1/2 TB) while you do write ops on the same pool. What we've seen is performance of these write ops (let's call it sideload I/O) would drop to zero. More specifically the problem is that dmu_free_long_range_impl() can/will fill up all of the dirty data in the pool "instantly", before many of the sideload ops can get in. So sideload performance will be impacted until all the files are freed. The solution we have tested at Nexenta (with positive results) creates a relatively simple throttle for how many "free" ops we let into one TXG. However this solution exposes other problems that should also be addressed. If we are to slow down freeing of data that means one has to wait even longer (assuming vnode ref count of 1) to get shell back after an rm or for NFS thread to finish the free-ing op. To avoid this the proposed solution is to call zfs_inactive() async for "large" files. Async freeing then begs for the reclaimed space to be accounted for in the zpool's "freeing" prop. The other issue with having a longer delete is the inability to export/unmount for a longer period of time. The proposed solution is to interrupt freeing of blocks when a fs is unmounted. Author: Alek Pinchuk Reviewed by: Matt Ahrens Reviewed by: Sanjay Nadkarni Reviewed by: Pavel Zakharov Approved by: Dan McDonald Reviewed by: avg Differential Revision: D9008 Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_pool.h Directory Properties: head/sys/cddl/contrib/opensolaris/ (props changed) Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c Fri Jan 20 14:59:56 2017 (r312534) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c Fri Jan 20 15:01:04 2017 (r312535) @@ -60,6 +60,16 @@ SYSCTL_DECL(_vfs_zfs); SYSCTL_INT(_vfs_zfs, OID_AUTO, nopwrite_enabled, CTLFLAG_RDTUN, &zfs_nopwrite_enabled, 0, "Enable nopwrite feature"); +/* + * Tunable to control percentage of dirtied blocks from frees in one TXG. + * After this threshold is crossed, additional dirty blocks from frees + * wait until the next TXG. + * A value of zero will disable this throttle. + */ +uint32_t zfs_per_txg_dirty_frees_percent = 30; +SYSCTL_INT(_vfs_zfs, OID_AUTO, per_txg_dirty_frees_percent, CTLFLAG_RWTUN, + &zfs_per_txg_dirty_frees_percent, 0, "Percentage of dirtied blocks from frees in one txg"); + const dmu_object_type_info_t dmu_ot[DMU_OT_NUMTYPES] = { { DMU_BSWAP_UINT8, TRUE, "unallocated" }, { DMU_BSWAP_ZAP, TRUE, "object directory" }, @@ -718,15 +728,25 @@ dmu_free_long_range_impl(objset_t *os, d { uint64_t object_size = (dn->dn_maxblkid + 1) * dn->dn_datablksz; int err; + uint64_t dirty_frees_threshold; + dsl_pool_t *dp = dmu_objset_pool(os); if (offset >= object_size) return (0); + if (zfs_per_txg_dirty_frees_percent <= 100) + dirty_frees_threshold = + zfs_per_txg_dirty_frees_percent * zfs_dirty_data_max / 100; + else + dirty_frees_threshold = zfs_dirty_data_max / 4; + if (length == DMU_OBJECT_END || offset + length > object_size) length = object_size - offset; while (length != 0) { - uint64_t chunk_end, chunk_begin; + uint64_t chunk_end, chunk_begin, chunk_len; + uint64_t long_free_dirty_all_txgs = 0; + dmu_tx_t *tx; chunk_end = chunk_begin = offset + length; @@ -737,9 +757,28 @@ dmu_free_long_range_impl(objset_t *os, d ASSERT3U(chunk_begin, >=, offset); ASSERT3U(chunk_begin, <=, chunk_end); - dmu_tx_t *tx = dmu_tx_create(os); - dmu_tx_hold_free(tx, dn->dn_object, - chunk_begin, chunk_end - chunk_begin); + chunk_len = chunk_end - chunk_begin; + + mutex_enter(&dp->dp_lock); + for (int t = 0; t < TXG_SIZE; t++) { + long_free_dirty_all_txgs += + dp->dp_long_free_dirty_pertxg[t]; + } + mutex_exit(&dp->dp_lock); + + /* + * To avoid filling up a TXG with just frees wait for + * the next TXG to open before freeing more chunks if + * we have reached the threshold of frees + */ + if (dirty_frees_threshold != 0 && + long_free_dirty_all_txgs >= dirty_frees_threshold) { + txg_wait_open(dp, 0); + continue; + } + + tx = dmu_tx_create(os); + dmu_tx_hold_free(tx, dn->dn_object, chunk_begin, chunk_len); /* * Mark this transaction as typically resulting in a net @@ -751,10 +790,18 @@ dmu_free_long_range_impl(objset_t *os, d dmu_tx_abort(tx); return (err); } - dnode_free_range(dn, chunk_begin, chunk_end - chunk_begin, tx); + + mutex_enter(&dp->dp_lock); + dp->dp_long_free_dirty_pertxg[dmu_tx_get_txg(tx) & TXG_MASK] += + chunk_len; + mutex_exit(&dp->dp_lock); + DTRACE_PROBE3(free__long__range, + uint64_t, long_free_dirty_all_txgs, uint64_t, chunk_len, + uint64_t, dmu_tx_get_txg(tx)); + dnode_free_range(dn, chunk_begin, chunk_len, tx); dmu_tx_commit(tx); - length -= chunk_end - chunk_begin; + length -= chunk_len; } return (0); } Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c Fri Jan 20 14:59:56 2017 (r312534) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c Fri Jan 20 15:01:04 2017 (r312535) @@ -24,6 +24,7 @@ * Copyright (c) 2013 Steven Hartland. All rights reserved. * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved. * Copyright (c) 2014 Integros [integros.com] + * Copyright 2016 Nexenta Systems, Inc. All rights reserved. */ #include @@ -593,6 +594,16 @@ dsl_pool_sync(dsl_pool_t *dp, uint64_t t dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg); /* + * Update the long range free counter after + * we're done syncing user data + */ + mutex_enter(&dp->dp_lock); + ASSERT(spa_sync_pass(dp->dp_spa) == 1 || + dp->dp_long_free_dirty_pertxg[txg & TXG_MASK] == 0); + dp->dp_long_free_dirty_pertxg[txg & TXG_MASK] = 0; + mutex_exit(&dp->dp_lock); + + /* * After the data blocks have been written (ensured by the zio_wait() * above), update the user/group space accounting. */ Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_pool.h ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_pool.h Fri Jan 20 14:59:56 2017 (r312534) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_pool.h Fri Jan 20 15:01:04 2017 (r312535) @@ -21,6 +21,7 @@ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2013 by Delphix. All rights reserved. + * Copyright 2016 Nexenta Systems, Inc. All rights reserved. */ #ifndef _SYS_DSL_POOL_H @@ -103,6 +104,7 @@ typedef struct dsl_pool { kcondvar_t dp_spaceavail_cv; uint64_t dp_dirty_pertxg[TXG_SIZE]; uint64_t dp_dirty_total; + uint64_t dp_long_free_dirty_pertxg[TXG_SIZE]; uint64_t dp_mos_used_delta; uint64_t dp_mos_compressed_delta; uint64_t dp_mos_uncompressed_delta;