From owner-svn-src-all@freebsd.org Mon Aug 24 08:10:53 2015 Return-Path: Delivered-To: svn-src-all@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 1E9FF9BF29B; Mon, 24 Aug 2015 08:10:53 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from repo.freebsd.org (repo.freebsd.org [IPv6:2001:1900:2254:2068::e6a:0]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 0BD311177; Mon, 24 Aug 2015 08:10:53 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from repo.freebsd.org ([127.0.1.70]) by repo.freebsd.org (8.15.2/8.15.2) with ESMTP id t7O8AqwO096320; Mon, 24 Aug 2015 08:10:52 GMT (envelope-from avg@FreeBSD.org) Received: (from avg@localhost) by repo.freebsd.org (8.15.2/8.15.2/Submit) id t7O8Aq0J096319; Mon, 24 Aug 2015 08:10:52 GMT (envelope-from avg@FreeBSD.org) Message-Id: <201508240810.t7O8Aq0J096319@repo.freebsd.org> X-Authentication-Warning: repo.freebsd.org: avg set sender to avg@FreeBSD.org using -f From: Andriy Gapon Date: Mon, 24 Aug 2015 08:10:52 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: svn commit: r287099 - head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs X-SVN-Group: head MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 24 Aug 2015 08:10:53 -0000 Author: avg Date: Mon Aug 24 08:10:52 2015 New Revision: 287099 URL: https://svnweb.freebsd.org/changeset/base/287099 Log: account for ashift when gathering buffers to be written to l2arc device The change that introduced the L2ARC compression support also introduced a bug where the on-disk size of the selected buffers could end up larger than the target size if the ashift is greater than 9. This was because the buffer selection could did not take into account the fact that on-disk size could be larger than the in-memory buffer size due to the alignment requirements. At the moment b_asize is a misnomer as it does not always represent the allocated size: if a buffer is compressed, then the compressed size is properly rounded (on FreeBSD), but if the compression fails or it is not applied, then the original size is kept and it could be smaller than what ashift requires. For the same reasons arcstat_l2_asize and the reported used space on the cache device could be smaller than the actual allocated size if ashift > 9. That problem is not fixed by this change. This change only ensures that l2ad_hand is not advanced by more than target_sz. Otherwise we would overwrite active (unevicted) L2ARC buffers. That problem is manifested as growing l2_cksum_bad and l2_io_error counters. This change also changes 'p' prefix to 'a' prefix in a few places where variables represent allocated rather than physical size. The resolved problem could also result in the reported allocated size being greater than the cache device's capacity, because of the overwritten buffers (more than one buffer claiming the same disk space). This change is already in ZFS-on-Linux: zfsonlinux/zfs@ef56b0780c80ebb0b1e637b8b8c79530a8ab3201 PR: 198242 PR: 195746 (possibly related) Reviewed by: mahrens (https://reviews.csiden.org/r/229/) Tested by: gkontos@aicom.gr (most recently) MFC after: 15 days X-MFC note: patch does not apply as is at the moment Relnotes: yes Sponsored by: ClusterHQ Differential Revision: https://reviews.freebsd.org/D2764 Reviewed by: noone (@FreeBSD.org) Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c ============================================================================== --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c Mon Aug 24 07:49:27 2015 (r287098) +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c Mon Aug 24 08:10:52 2015 (r287099) @@ -6182,8 +6182,7 @@ l2arc_write_buffers(spa_t *spa, l2arc_de boolean_t *headroom_boost) { arc_buf_hdr_t *hdr, *hdr_prev, *head; - uint64_t write_asize, write_psize, write_sz, headroom, - buf_compress_minsz; + uint64_t write_asize, write_sz, headroom, buf_compress_minsz; void *buf_data; boolean_t full; l2arc_write_callback_t *cb; @@ -6198,7 +6197,7 @@ l2arc_write_buffers(spa_t *spa, l2arc_de *headroom_boost = B_FALSE; pio = NULL; - write_sz = write_asize = write_psize = 0; + write_sz = write_asize = 0; full = B_FALSE; head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE); head->b_flags |= ARC_FLAG_L2_WRITE_HEAD; @@ -6240,6 +6239,7 @@ l2arc_write_buffers(spa_t *spa, l2arc_de for (; hdr; hdr = hdr_prev) { kmutex_t *hash_lock; uint64_t buf_sz; + uint64_t buf_a_sz; if (arc_warm == B_FALSE) hdr_prev = multilist_sublist_next(mls, hdr); @@ -6271,7 +6271,15 @@ l2arc_write_buffers(spa_t *spa, l2arc_de continue; } - if ((write_sz + hdr->b_size) > target_sz) { + /* + * Assume that the buffer is not going to be compressed + * and could take more space on disk because of a larger + * disk block size. + */ + buf_sz = hdr->b_size; + buf_a_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz); + + if ((write_asize + buf_a_sz) > target_sz) { full = B_TRUE; mutex_exit(hash_lock); ARCSTAT_BUMP(arcstat_l2_write_full); @@ -6336,8 +6344,6 @@ l2arc_write_buffers(spa_t *spa, l2arc_de * using it to denote the header's state change. */ hdr->b_l2hdr.b_daddr = L2ARC_ADDR_UNSET; - - buf_sz = hdr->b_size; hdr->b_flags |= ARC_FLAG_HAS_L2HDR; mutex_enter(&dev->l2ad_mtx); @@ -6354,6 +6360,7 @@ l2arc_write_buffers(spa_t *spa, l2arc_de mutex_exit(hash_lock); write_sz += buf_sz; + write_asize += buf_a_sz; } multilist_sublist_unlock(mls); @@ -6373,6 +6380,19 @@ l2arc_write_buffers(spa_t *spa, l2arc_de mutex_enter(&dev->l2ad_mtx); /* + * Note that elsewhere in this file arcstat_l2_asize + * and the used space on l2ad_vdev are updated using b_asize, + * which is not necessarily rounded up to the device block size. + * Too keep accounting consistent we do the same here as well: + * stats_size accumulates the sum of b_asize of the written buffers, + * while write_asize accumulates the sum of b_asize rounded up + * to the device block size. + * The latter sum is used only to validate the corectness of the code. + */ + uint64_t stats_size = 0; + write_asize = 0; + + /* * Now start writing the buffers. We're starting at the write head * and work backwards, retracing the course of the buffer selector * loop above. @@ -6425,7 +6445,7 @@ l2arc_write_buffers(spa_t *spa, l2arc_de /* Compression may have squashed the buffer to zero length. */ if (buf_sz != 0) { - uint64_t buf_p_sz; + uint64_t buf_a_sz; wzio = zio_write_phys(pio, dev->l2ad_vdev, dev->l2ad_hand, buf_sz, buf_data, ZIO_CHECKSUM_OFF, @@ -6436,14 +6456,14 @@ l2arc_write_buffers(spa_t *spa, l2arc_de zio_t *, wzio); (void) zio_nowait(wzio); - write_asize += buf_sz; + stats_size += buf_sz; /* * Keep the clock hand suitably device-aligned. */ - buf_p_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz); - write_psize += buf_p_sz; - dev->l2ad_hand += buf_p_sz; + buf_a_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz); + write_asize += buf_a_sz; + dev->l2ad_hand += buf_a_sz; } } @@ -6453,8 +6473,8 @@ l2arc_write_buffers(spa_t *spa, l2arc_de ARCSTAT_BUMP(arcstat_l2_writes_sent); ARCSTAT_INCR(arcstat_l2_write_bytes, write_asize); ARCSTAT_INCR(arcstat_l2_size, write_sz); - ARCSTAT_INCR(arcstat_l2_asize, write_asize); - vdev_space_update(dev->l2ad_vdev, write_asize, 0, 0); + ARCSTAT_INCR(arcstat_l2_asize, stats_size); + vdev_space_update(dev->l2ad_vdev, stats_size, 0, 0); /* * Bump device hand to the device start if it is approaching the end.