From: Freddie Cash <fjwcash@gmail.com>
To: Brendan Gregg
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Date: Wed, 8 May 2013 15:22:46 -0700
Subject: Re: Strange slowdown when cache devices enabled in ZFS

On Wed, May 8, 2013 at 3:02 PM, Brendan Gregg wrote:

> On Wed, May 8, 2013 at 2:45 PM, Freddie Cash wrote:
>
>> On Wed, May 8, 2013 at 2:35 PM, Brendan Gregg wrote:
>>
>>> Freddie Cash wrote (Mon Apr 29 16:01:55 UTC 2013):
>>> |
>>> | The following settings in /etc/sysctl.conf prevent the "stalls"
>>> | completely,
>>> [...]
>>>
>>> To feed at 160 Mbytes/sec, with an 8 Kbyte recsize, you'll need at
>>> least 20,000 random read disk IOPS. How many spindles does that take?
>>> A lot. Do you have a lot?
>>
>> 45x 2 TB SATA hard drives, configured in raidz2 vdevs of 6 disks each
>> for a total of 7 vdevs (with a few spare disks). Plus 2x SSD for
>> log+OS and 2x SSD for cache.
>
> What's the max random read rate? I'd expect (7 vdevs, modern disks) it
> to be something like 1,000. What is your recsize? (Or, if it is tiny
> files, the average file size?)
>
> On the other hand, if it's caching streaming workloads, then do those
> 2 SSDs outperform 45 spindles?
>
> If you are getting 120 Mbytes/sec warmup, then I'm guessing it's either
> 128 Kbyte recsize random reads, or sequential.

There's 128 GB of RAM in the box, with arc_max set to 124 GB and
arc_meta_max set to 120 GB, and 16 CPU cores (2x 8-core CPUs at 2.0 GHz).

The recordsize property for the pool is left at the default (128 KB).
LZJB compression is enabled. Dedupe is enabled.

"zpool list" shows 76 TB of total storage space in the pool, with 29 TB
available (61% cap). "zfs list" shows just over 18 TB of actual usable
space left in the pool.
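As a rough sanity check on those numbers (a back-of-the-envelope estimate,
assuming the usual "entries times reported in-core size" arithmetic, not
something zdb prints itself), the dedup table alone needs on the order of
60 GB of ARC, using the entry counts and in-core sizes from the "zdb -DD"
output below:

  # rough estimate only: DDT entries * per-entry in-core size, with the
  # values taken from the zdb -DD summary lines below
  echo '(110879014 * 170 + 259870524 * 181) / (1024^3)' | bc -l
  # => ~61 (GiB), roughly half of the 124 GB arc_max
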
"zdb -DD" shows the following: DDT-sha256-zap-duplicate: 110879014 entries, size 557 on disk, 170 in core DDT-sha256-zap-unique: 259870524 entries, size 571 on disk, 181 in core DDT histogram (aggregated over all DDTs): bucket allocated referenced ______ ______________________________ ______________________________ refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE ------ ------ ----- ----- ----- ------ ----- ----- ----- 1 248M 27.2T 18.6T 19.3T 248M 27.2T 18.6T 19.3T 2 80.0M 9.07T 7.56T 7.72T 175M 19.8T 16.5T 16.9T 4 16.0M 1.80T 1.40T 1.44T 77.2M 8.67T 6.72T 6.91T 8 4.51M 498G 345G 358G 47.6M 5.13T 3.51T 3.65T 16 2.53M 293G 137G 146G 53.8M 6.09T 2.84T 3.05T 32 1.55M 119G 63.8G 71.4G 72.6M 5.07T 2.77T 3.13T 64 762K 78.7G 45.6G 49.0G 71.5M 7.45T 4.25T 4.57T 128 264K 26.3G 18.3G 19.3G 44.8M 4.49T 3.25T 3.41T 256 57.5K 4.21G 2.28G 2.58G 18.2M 1.30T 704G 805G 512 9.25K 436M 216M 277M 6.38M 299G 144G 186G 1K 2.96K 116M 56.8M 76.5M 4.10M 166G 81.4G 109G 2K 1.15K 56.9M 27.1M 34.7M 3.26M 163G 76.0G 97.6G 4K 618 16.6M 3.10M 7.65M 3.27M 85.0G 17.0G 41.5G 8K 169 7.36M 3.11M 4.25M 1.89M 81.4G 33.2G 46.4G 16K 156 3.54M 948K 2.07M 3.42M 79.9G 20.2G 45.8G 32K 317 2.11M 763K 3.05M 13.8M 91.7G 32.1G 135G 64K 15 712K 32K 160K 1.26M 53.2G 2.44G 13.0G 128K 10 13.5K 8.50K 79.9K 1.60M 2.18G 1.37G 12.8G 256K 3 1.50K 1.50K 24.0K 926K 463M 463M 7.23G Total 354M 39.0T 28.2T 29.1T 848M 86.2T 59.5T 62.3T dedup = 2.14, compress = 1.45, copies = 1.05, dedup * compress / copies = 2.96 Not sure which zdb command to use to show the average block sizes in use, though. This is the off-site replication storage server for our backups systems, aggregating data from the three main backups servers (schools, non-schools, groupware). Each of those backups servers does an rsync of a remote Linux or FreeBSD server (65, 73, 1 resp) overnight, and then does a "zfs send" to push the data to this off-site server. The issue I noticed was during the zfs recv from the other 3 boxes. Would run fine without L2ARC devices, saturating the gigabit link between them. Would run fine with L2ARC devices enabled ... until the L2ARC usage neared 100%, then the l2arc_feed_thread would hit 100% CPU usage, and there would be 0 I/O to the pool. If I limited ARC to 64 GB, it would take longer to reach the "l2arc_feed_thread @ 100%; no I/O" issue. Turning l2arc_norw off, everything works. I've been running with the sysctl.conf settings shown before without any issues for over a week now. Full 124 GB ARC, 2x 64GB cache devices, L2ARC sitting at near 100% usage, and l2arc_feed_thread never goes above 50% CPU, usually around 20%. -- Freddie Cash fjwcash@gmail.com