Date:      Thu, 27 Jul 2017 10:26:58 +0000 (UTC)
From:      Alexander Motin <mav@FreeBSD.org>
To:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-11@freebsd.org
Subject:   svn commit: r321611 - in stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs: . sys
Message-ID:  <201707271026.v6RAQwjB015526@repo.freebsd.org>

Author: mav
Date: Thu Jul 27 10:26:58 2017
New Revision: 321611
URL: https://svnweb.freebsd.org/changeset/base/321611

Log:
  MFC r320237: MFV r318947: 7578 Fix/improve some aspects of ZIL writing.
  
  FreeBSD note: this commit removes small differences between what mav
  committed to FreeBSD in r308782 and what ended up committed to illumos
  after addressing all review comments.
  
  illumos/illumos-gate@c5ee46810f82e8a53d2cc5a487568a573f449039
  https://github.com/illumos/illumos-gate/commit/c5ee46810f82e8a53d2cc5a487568a573f449039
  
  https://www.illumos.org/issues/7578
    After some ZIL changes 6 years ago, zil_slog_limit got partially broken
    because zl_itx_list_sz was no longer updated when async itx'es were
    upgraded to sync.  Due to other changes made around the same time,
    zl_itx_list_sz is not actually required to implement the functionality,
    so this patch removes the unneeded broken code and variables.
    The original idea of zil_slog_limit was to reduce the chance of SLOG abuse
    by a single heavy logger, which increased latency for other (more
    latency-critical) loggers, by pushing the heavy log out into the main pool
    instead of the SLOG.  Besides a huge latency increase for heavy writers,
    that implementation caused a double write of all data, since the log
    records were explicitly prepared for the SLOG.  Now that we have an I/O
    scheduler, it is much more efficient to reduce the priority of a heavy
    logger's SLOG writes from ZIO_PRIORITY_SYNC_WRITE to
    ZIO_PRIORITY_ASYNC_WRITE while still leaving them on the SLOG.
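    On FreeBSD this threshold is exposed as the vfs.zfs.zil_slog_bulk sysctl
    and loader tunable (see the zil.c hunk below).  For example, to let up to
    2 MB of log writes per commit keep sync priority (the value here is only
    an illustration; the default is 768 kB):

        sysctl vfs.zfs.zil_slog_bulk=2097152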
    The existing ZIL implementation had a space-efficiency problem when it had
    to write large chunks of data into log blocks of limited size.  In some
    cases efficiency dropped to almost as low as 50%: since a record could not
    be split across log blocks, a record slightly larger than half a block
    forced one record per block, leaving almost half of each block unused.
    For a ZIL stored on spinning rust that also cut log write speed in half,
    since the head had to uselessly fly over allocated but never written
    areas.  This change improves the situation by offloading the problematic
    operations from the z*_log_write() functions to zil_lwb_commit(), which
    knows the real state of log block allocation and can split large requests
    into pieces much more efficiently.  As a side effect, it also removes one
    of the two data copy operations done by the ZIL code in the WR_COPIED
    case.
    While there, untangle and unify the code of the z*_log_write() functions.
    Also, zfs_log_write(), like zvol_log_write(), can now handle writes that
    cross a block boundary, which may further improve efficiency if the ZPL
    is made to issue such writes.
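    To illustrate the splitting idea, here is a minimal stand-alone C sketch
    (not the committed code: the block and header sizes and all names are
    invented for the example; the real logic is in zil_lwb_commit() in the
    diff below, where dnow = MIN(dlen, lwb_sp - reclen)):

        #include <stdint.h>
        #include <stdio.h>

        #define LWB_PAYLOAD     131072  /* toy log block payload size */
        #define REC_HDR         192     /* toy per-record header size */

        /*
         * Split a large write of dlen bytes into chunks, each sized to
         * the space remaining in the current log block, opening a new
         * block only when the current one cannot hold another header.
         */
        static void
        split_write(uint64_t dlen)
        {
                uint64_t lwb_sp = LWB_PAYLOAD;  /* space left in block */

                while (dlen > 0) {
                        if (lwb_sp <= REC_HDR)  /* full: open a new block */
                                lwb_sp = LWB_PAYLOAD;
                        uint64_t dnow = (dlen < lwb_sp - REC_HDR) ?
                            dlen : lwb_sp - REC_HDR;
                        printf("chunk of %ju bytes\n", (uintmax_t)dnow);
                        lwb_sp -= REC_HDR + dnow;       /* header + data */
                        dlen -= dnow;
                }
        }

        int
        main(void)
        {
                split_write(300000);    /* e.g. one ~293 kB write */
                return (0);
        }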
  
  Reviewed by: Matthew Ahrens <mahrens@delphix.com>
  Reviewed by: Prakash Surya <prakash.surya@delphix.com>
  Reviewed by: Andriy Gapon <avg@FreeBSD.org>
  Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
  Reviewed by: Brad Lewis <brad.lewis@delphix.com>
  Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
  Approved by: Robert Mustacchi <rm@joyent.com>
  Author: Alexander Motin <mav@FreeBSD.org>

Modified:
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c
Directory Properties:
  stable/11/   (props changed)

Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h
==============================================================================
--- stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h	Thu Jul 27 10:25:18 2017	(r321610)
+++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h	Thu Jul 27 10:26:58 2017	(r321611)
@@ -139,10 +139,25 @@ typedef struct zil_bp_node {
 	avl_node_t	zn_node;
 } zil_bp_node_t;
 
+/*
+ * Maximum amount of write data that can be put into single log block.
+ */
 #define	ZIL_MAX_LOG_DATA (SPA_OLD_MAXBLOCKSIZE - sizeof (zil_chain_t) - \
     sizeof (lr_write_t))
-#define	ZIL_MAX_COPIED_DATA \
-    ((SPA_OLD_MAXBLOCKSIZE - sizeof (zil_chain_t)) / 2 - sizeof (lr_write_t))
+
+/*
+ * Maximum amount of log space we agree to waste to reduce number of
+ * WR_NEED_COPY chunks to reduce zl_get_data() overhead (~12%).
+ */
+#define	ZIL_MAX_WASTE_SPACE (ZIL_MAX_LOG_DATA / 8)
+
+/*
+ * Maximum amount of write data for WR_COPIED.  Fall back to WR_NEED_COPY
+ * as more space efficient if we can't fit at least two log records into
+ * maximum sized log block.
+ */
+#define	ZIL_MAX_COPIED_DATA ((SPA_OLD_MAXBLOCKSIZE - \
+    sizeof (zil_chain_t)) / 2 - sizeof (lr_write_t))
 
 #ifdef	__cplusplus
 }

Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c
==============================================================================
--- stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c	Thu Jul 27 10:25:18 2017	(r321610)
+++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c	Thu Jul 27 10:26:58 2017	(r321611)
@@ -90,12 +90,12 @@ SYSCTL_INT(_vfs_zfs_trim, OID_AUTO, enabled, CTLFLAG_R
 
 /*
  * Limit SLOG write size per commit executed with synchronous priority.
- * Any writes above that executed with lower (asynchronous) priority to
- * limit potential SLOG device abuse by single active ZIL writer.
+ * Any writes above that will be executed with lower (asynchronous) priority
+ * to limit potential SLOG device abuse by single active ZIL writer.
  */
-uint64_t zil_slog_limit = 768 * 1024;
-SYSCTL_QUAD(_vfs_zfs, OID_AUTO, zil_slog_limit, CTLFLAG_RWTUN,
-    &zil_slog_limit, 0, "Maximal SLOG commit size with sync priority");
+uint64_t zil_slog_bulk = 768 * 1024;
+SYSCTL_QUAD(_vfs_zfs, OID_AUTO, zil_slog_bulk, CTLFLAG_RWTUN,
+    &zil_slog_bulk, 0, "Maximal SLOG commit size with sync priority");
 
 static kmem_cache_t *zil_lwb_cache;
 
@@ -923,7 +923,7 @@ zil_lwb_write_init(zilog_t *zilog, lwb_t *lwb)
 	if (lwb->lwb_zio == NULL) {
 		abd_t *lwb_abd = abd_get_from_buf(lwb->lwb_buf,
 		    BP_GET_LSIZE(&lwb->lwb_blk));
-		if (zilog->zl_cur_used <= zil_slog_limit || !lwb->lwb_slog)
+		if (!lwb->lwb_slog || zilog->zl_cur_used <= zil_slog_bulk)
 			prio = ZIO_PRIORITY_SYNC_WRITE;
 		else
 			prio = ZIO_PRIORITY_ASYNC_WRITE;
@@ -1068,36 +1068,38 @@ zil_lwb_write_start(zilog_t *zilog, lwb_t *lwb, boolea
 static lwb_t *
 zil_lwb_commit(zilog_t *zilog, itx_t *itx, lwb_t *lwb)
 {
-	lr_t *lrcb, *lrc = &itx->itx_lr; /* common log record */
-	lr_write_t *lrwb, *lrw = (lr_write_t *)lrc;
+	lr_t *lrcb, *lrc;
+	lr_write_t *lrwb, *lrw;
 	char *lr_buf;
-	uint64_t txg = lrc->lrc_txg;
-	uint64_t reclen = lrc->lrc_reclen;
-	uint64_t dlen = 0;
-	uint64_t dnow, lwb_sp;
+	uint64_t dlen, dnow, lwb_sp, reclen, txg;
 
 	if (lwb == NULL)
 		return (NULL);
 
 	ASSERT(lwb->lwb_buf != NULL);
 
-	if (lrc->lrc_txtype == TX_WRITE && itx->itx_wr_state == WR_NEED_COPY)
+	lrc = &itx->itx_lr;		/* Common log record inside itx. */
+	lrw = (lr_write_t *)lrc;	/* Write log record inside itx. */
+	if (lrc->lrc_txtype == TX_WRITE && itx->itx_wr_state == WR_NEED_COPY) {
 		dlen = P2ROUNDUP_TYPED(
 		    lrw->lr_length, sizeof (uint64_t), uint64_t);
-
+	} else {
+		dlen = 0;
+	}
+	reclen = lrc->lrc_reclen;
 	zilog->zl_cur_used += (reclen + dlen);
+	txg = lrc->lrc_txg;
 
 	zil_lwb_write_init(zilog, lwb);
 
 cont:
 	/*
 	 * If this record won't fit in the current log block, start a new one.
-	 * For WR_NEED_COPY optimize layout for minimal number of chunks, but
-	 * try to keep wasted space withing reasonable range (12%).
+	 * For WR_NEED_COPY optimize layout for minimal number of chunks.
 	 */
 	lwb_sp = lwb->lwb_sz - lwb->lwb_nused;
 	if (reclen > lwb_sp || (reclen + dlen > lwb_sp &&
-	    lwb_sp < ZIL_MAX_LOG_DATA / 8 && (dlen % ZIL_MAX_LOG_DATA == 0 ||
+	    lwb_sp < ZIL_MAX_WASTE_SPACE && (dlen % ZIL_MAX_LOG_DATA == 0 ||
 	    lwb_sp < reclen + dlen % ZIL_MAX_LOG_DATA))) {
 		lwb = zil_lwb_write_start(zilog, lwb, B_FALSE);
 		if (lwb == NULL)
@@ -1105,14 +1107,14 @@ cont:
 		zil_lwb_write_init(zilog, lwb);
 		ASSERT(LWB_EMPTY(lwb));
 		lwb_sp = lwb->lwb_sz - lwb->lwb_nused;
-		ASSERT3U(reclen + MIN(dlen, sizeof(uint64_t)), <=, lwb_sp);
+		ASSERT3U(reclen + MIN(dlen, sizeof (uint64_t)), <=, lwb_sp);
 	}
 
 	dnow = MIN(dlen, lwb_sp - reclen);
 	lr_buf = lwb->lwb_buf + lwb->lwb_nused;
 	bcopy(lrc, lr_buf, reclen);
-	lrcb = (lr_t *)lr_buf;
-	lrwb = (lr_write_t *)lrcb;
+	lrcb = (lr_t *)lr_buf;		/* Like lrc, but inside lwb. */
+	lrwb = (lr_write_t *)lrcb;	/* Like lrw, but inside lwb. */
 
 	/*
 	 * If it's a write, fetch the data or get its blkptr as appropriate.
@@ -1328,6 +1330,8 @@ zil_itx_assign(zilog_t *zilog, itx_t *itx, dmu_tx_t *t
 			 * this itxg. Save the itxs for release below.
 			 * This should be rare.
 			 */
+			zfs_dbgmsg("zil_itx_assign: missed itx cleanup for "
+			    "txg %llu", itxg->itxg_txg);
 			clean = itxg->itxg_itxs;
 		}
 		itxg->itxg_txg = txg;

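A note on the new zfs_dbgmsg() call in zil_itx_assign(): FreeBSD keeps the ZFS
debug message buffer readable at runtime, so if the "missed itx cleanup" case
is ever hit it should be visible with something like the following (assuming
the kstat.zfs.misc.dbgmsg sysctl is present in your build):

    sysctl -n kstat.zfs.misc.dbgmsg | grep 'missed itx'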

