From: Freddie Cash <fjwcash@gmail.com>
To: Brendan Gregg
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Date: Wed, 8 May 2013 15:22:46 -0700
Subject: Re: Strange slowdown when cache devices enabled in ZFS

On Wed, May 8, 2013 at 3:02 PM, Brendan Gregg wrote:

> On Wed, May 8, 2013 at 2:45 PM, Freddie Cash wrote:
>
>> On Wed, May 8, 2013 at 2:35 PM, Brendan Gregg wrote:
>>
>>> Freddie Cash wrote (Mon Apr 29 16:01:55 UTC 2013):
>>> |
>>> | The following settings in /etc/sysctl.conf prevent the "stalls"
>>> | completely,
>>> [...]
>>>
>>> To feed at 160 Mbytes/sec, with an 8 Kbyte recsize, you'll need at
>>> least 20,000 random read disk IOPS. How many spindles does that take?
>>> A lot. Do you have a lot?
>>
>> 45x 2 TB SATA hard drives, configured in raidz2 vdevs of 6 disks each
>> for a total of 7 vdevs (with a few spare disks). Plus 2x SSD for
>> log+OS and 2x SSD for cache.
>
> What's the max random read rate? I'd expect (7 vdevs, modern disks) it
> to be something like 1,000. What is your recsize? (Or, if it is tiny
> files, the average file size?)
>
> On the other hand, if it's caching streaming workloads, then do those
> 2 SSDs outperform 45 spindles?
>
> If you are getting 120 Mbytes/sec warmup, then I'm guessing it's either
> 128 Kbyte recsize random reads, or sequential.

There's 128 GB of RAM in the box, with arc_max set to 124 GB and
arc_meta_max set to 120 GB, and 16 CPU cores (2x 8-core CPUs at 2.0 GHz).

The recordsize property for the pool is left at the default (128 KB).
LZJB compression is enabled. Dedupe is enabled.

"zpool list" shows 76 TB of total storage space in the pool, with 29 TB
available (61% cap). "zfs list" shows just over 18 TB of actual usable
space left in the pool.
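As a rough sanity check on those numbers (a back-of-the-envelope estimate,
assuming the usual "entries times reported in-core size" arithmetic, not
something zdb prints itself), the dedup table alone needs on the order of
60 GB of ARC, using the entry counts and in-core sizes from the "zdb -DD"
output below:

  # rough estimate only: DDT entries * per-entry in-core size, with the
  # values taken from the zdb -DD summary lines below
  echo '(110879014 * 170 + 259870524 * 181) / (1024^3)' | bc -l
  # => ~61 (GiB), roughly half of the 124 GB arc_max
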
"zdb -DD" shows the following: DDT-sha256-zap-duplicate: 110879014 entries, size 557 on disk, 170 in core DDT-sha256-zap-unique: 259870524 entries, size 571 on disk, 181 in core DDT histogram (aggregated over all DDTs): bucket allocated referenced ______ ______________________________ ______________________________ refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE ------ ------ ----- ----- ----- ------ ----- ----- ----- 1 248M 27.2T 18.6T 19.3T 248M 27.2T 18.6T 19.3T 2 80.0M 9.07T 7.56T 7.72T 175M 19.8T 16.5T 16.9T 4 16.0M 1.80T 1.40T 1.44T 77.2M 8.67T 6.72T 6.91T 8 4.51M 498G 345G 358G 47.6M 5.13T 3.51T 3.65T 16 2.53M 293G 137G 146G 53.8M 6.09T 2.84T 3.05T 32 1.55M 119G 63.8G 71.4G 72.6M 5.07T 2.77T 3.13T 64 762K 78.7G 45.6G 49.0G 71.5M 7.45T 4.25T 4.57T 128 264K 26.3G 18.3G 19.3G 44.8M 4.49T 3.25T 3.41T 256 57.5K 4.21G 2.28G 2.58G 18.2M 1.30T 704G 805G 512 9.25K 436M 216M 277M 6.38M 299G 144G 186G 1K 2.96K 116M 56.8M 76.5M 4.10M 166G 81.4G 109G 2K 1.15K 56.9M 27.1M 34.7M 3.26M 163G 76.0G 97.6G 4K 618 16.6M 3.10M 7.65M 3.27M 85.0G 17.0G 41.5G 8K 169 7.36M 3.11M 4.25M 1.89M 81.4G 33.2G 46.4G 16K 156 3.54M 948K 2.07M 3.42M 79.9G 20.2G 45.8G 32K 317 2.11M 763K 3.05M 13.8M 91.7G 32.1G 135G 64K 15 712K 32K 160K 1.26M 53.2G 2.44G 13.0G 128K 10 13.5K 8.50K 79.9K 1.60M 2.18G 1.37G 12.8G 256K 3 1.50K 1.50K 24.0K 926K 463M 463M 7.23G Total 354M 39.0T 28.2T 29.1T 848M 86.2T 59.5T 62.3T dedup = 2.14, compress = 1.45, copies = 1.05, dedup * compress / copies = 2.96 Not sure which zdb command to use to show the average block sizes in use, though. This is the off-site replication storage server for our backups systems, aggregating data from the three main backups servers (schools, non-schools, groupware). Each of those backups servers does an rsync of a remote Linux or FreeBSD server (65, 73, 1 resp) overnight, and then does a "zfs send" to push the data to this off-site server. The issue I noticed was during the zfs recv from the other 3 boxes. Would run fine without L2ARC devices, saturating the gigabit link between them. Would run fine with L2ARC devices enabled ... until the L2ARC usage neared 100%, then the l2arc_feed_thread would hit 100% CPU usage, and there would be 0 I/O to the pool. If I limited ARC to 64 GB, it would take longer to reach the "l2arc_feed_thread @ 100%; no I/O" issue. Turning l2arc_norw off, everything works. I've been running with the sysctl.conf settings shown before without any issues for over a week now. Full 124 GB ARC, 2x 64GB cache devices, L2ARC sitting at near 100% usage, and l2arc_feed_thread never goes above 50% CPU, usually around 20%. -- Freddie Cash fjwcash@gmail.com