Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 03 Mar 2014 11:17:05 +0200
From:      Andriy Gapon <avg@FreeBSD.org>
To:        freebsd-current@FreeBSD.org
Cc:        "Justin T. Gibbs" <gibbs@FreeBSD.org>, Steven Hartland <killing@multiplay.co.uk>
Subject:   Re: ZFS secondarycache on SSD problem on r255173
Message-ID:  <53144891.9050001@FreeBSD.org>
In-Reply-To: <4459A6FAB7B8445C97CCB9EFF34FD4F0@multiplay.co.uk>
References:  <20131016080100.GA27758@hell.ukr.net> <3A44A8F6-8B62-4A23-819D-B91A3E6E5EF9@freebsd.org> <E5E6AB7C-C067-4B92-8A38-9DD811011D6F@FreeBSD.org> <7059AA6DCC0D46B8B1D33FC883C31643@multiplay.co.uk> <20131017061248.GA15980@hell.ukr.net> <326B470C65A04BC4BC83E118185B935F@multiplay.co.uk> <20131017073925.GA34958@hell.ukr.net> <2AFE1CBD9B124E3AB9E05A4E483CCE03@multiplay.co.uk> <20131018080148.GA75226@hell.ukr.net> <256B2E5A0BA44DCBB45BB3F3E820E190@multiplay.co.uk> <20131018144524.GA30018@hell.ukr.net> <4459A6FAB7B8445C97CCB9EFF34FD4F0@multiplay.co.uk>

next in thread | previous in thread | raw e-mail | index | archive | help
on 18/10/2013 17:57 Steven Hartland said the following:
> I think we we may well need the following patch to set the minblock
> size based on the vdev ashift and not SPA_MINBLOCKSIZE.
> 
> svn diff -x -p sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
> Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
> ===================================================================
> --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c        (revision 256554)
> +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c        (working copy)
> @@ -5147,7 +5147,7 @@ l2arc_compress_buf(l2arc_buf_hdr_t *l2hdr)
>        len = l2hdr->b_asize;
>        cdata = zio_data_buf_alloc(len);
>        csize = zio_compress_data(ZIO_COMPRESS_LZ4, l2hdr->b_tmp_cdata,
> -           cdata, l2hdr->b_asize, (size_t)SPA_MINBLOCKSIZE);
> +           cdata, l2hdr->b_asize, (size_t)(1ULL <<
> l2hdr->b_dev->l2ad_vdev->vdev_ashift));
> 
>        if (csize == 0) {
>                /* zero block, indicate that there's nothing to write */


This is a rather old thread and change, but I think that I have identified
another problem with 4KB cache devices.

I noticed that on some of our systems we were getting a clearly abnormal number
of l2arc checksum errors accounted in l2_cksum_bad.  The hardware appeared to be
in good health.  Using DTrace I noticed that the data seemed to be overwritten
with other data.  After more DTrace analysis I observed that sometimes
l2arc_write_buffers() would advance l2ad_hand by more than target_sz.
This meant that l2arc_write_buffers() would write beyond a region cleared by
l2arc_evict() and thus overwrite data belonging to non-evicted buffers.  Havoc
ensues.

The cache devices in question are all SSDs with logical sector size of 4KB.
I am not sure about other ZFS platforms, but on FreeBSD this fact is detected
and ashift of 12 is used for the cache vdevs.

Looking at l2arc_write_buffers() code you can see that it properly accounts for
ashift when actually writing buffers and advancing l2ad_hand:
                        /*
                         * Keep the clock hand suitably device-aligned.
                         */
                        buf_p_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz);
                        write_psize += buf_p_sz;
                        dev->l2ad_hand += buf_p_sz;

But the same is not done when selecting buffers to be written and checking that
target_sz is not exceeded.
So, if ARC contains a lot of buffers smaller than 4K that means that an aligned
on-disk size of the L2ARC buffers could be quite larger than their non-aligned size.

I propose the following patch which has been tested and seems to fix the problem
without introducing any side effects:
https://github.com/avg-I/freebsd/compare/review;l2arc-write-target-size.diff
https://github.com/avg-I/freebsd/compare/review;l2arc-write-target-size

-- 
Andriy Gapon



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?53144891.9050001>