Date:      Thu, 14 Mar 2013 11:13:38 -0700
From:      Freddie Cash <fjwcash@gmail.com>
To:        FreeBSD Filesystems <freebsd-fs@freebsd.org>
Subject:   Strange slowdown when cache devices enabled in ZFS
Message-ID:  <CAOjFWZ6Q=Vs3P-kfGysLzSbw4CnfrJkMEka4AqfSrQJFZDP_qw@mail.gmail.com>

Three of the four storage systems are running this:
# uname -a
FreeBSD alphadrive.sd73.bc.ca 9.1-STABLE FreeBSD 9.1-STABLE #0 r245466M:
Fri Feb  1 09:38:24 PST 2013
root@alphadrive.sd73.bc.ca:/usr/obj/usr/src/sys/ZFSHOST
amd64

The fourth storage system is running this:
# uname -a
FreeBSD omegadrive.sd73.bc.ca 9.1-STABLE FreeBSD 9.1-STABLE #0 r247804M:
Mon Mar  4 10:27:26 PST 2013
root@omegadrive.sd73.bc.ca:/usr/obj/usr/src/sys/ZFSHOST
amd64

The last system additionally has the ZFS "deadman" patch merged in
manually (r247265 from -CURRENT).

All 4 systems exhibit the same symptoms:  if a cache device is enabled in
the pool, the l2arc_feed_thread of zfskern will spin until it takes up 100%
of a CPU core, at which point all I/O to the pool stops.  "zpool iostat 1"
and "zpool iostat -v 1" show 0 reads and 0 writes to the pool.  "gstat -I
1s -f gpt" shows 0 activity to the pool disks.
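
For reference, this is roughly how I watch it happen (the zpool/gstat
commands are the ones quoted above; the top invocation is just to watch
the kernel threads):

# top -SH
# zpool iostat -v 1
# gstat -I 1s -f gpt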

If I remove the cache device from the pool, I/O starts up right away
(although it takes several minutes for the remove operation to complete).
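
For clarity, the removal is just the standard command, along the lines of
(pool name and GPT label here are placeholders):

# zpool remove tank gpt/cache0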

During the "0 I/O" periods, any attempt to access the pool hangs.  CTRL+T
on the hung process shows it waiting on either spa_namespace_lock or one
of the tx->tx_* condition variables (the one used when trying to write a
transaction to disk).  And it stays like that until the cache device is
removed.
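
If kernel stacks would help, something along these lines should capture
them the next time it wedges (pid taken from whatever CTRL+T reports,
plus the kernel threads to catch l2arc_feed_thread):

# procstat -kk <pid>
# procstat -kk 0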

Hardware is almost the same in all 4 boxes:

3x storage boxes:
alphadrive:
    SuperMicro H8DGi-F motherboard
    AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
    64 GB of DDR3 ECC SDRAM
    32 GB SSD for the OS and cache device (GPT partitioned)
    24x 2.0 TB WD and Seagate SATA hard drives (4x 6-drive raidz2 vdevs)
    SuperMicro AOC-USAS-8i SATA controller using mpt driver
    SuperMicro 4U chassis

betadrive:
    SuperMicro H8DGi-F motherboard
    AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
    48 GB of DDR3 ECC SDRAM
    32 GB SSD for the OS and cache device (GPT partitioned)
    16x 2.0 TB WD and Seagate SATA hard drives (3x 5-drive raidz2 vdevs + spare)
    SuperMicro AOC-USAS2-8i SATA controller using mps driver
    SuperMicro 3U chassis

zuludrive:
    SuperMicro H8DGi-F motherboard
    AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
    32 GB of DDR3 ECC SDRAM
    32 GB SSD for the OS and cache device (GPT partitioned)
    24x 2.0 TB WD and Seagate SATA hard drives (4x 6-drive raidz2 vdevs)
    SuperMicro AOC-USAS2-8i SATA controller using mps driver
    SuperMicro 836 chassis


1x storage box:
omegadrive:
    SuperMicro H8DG6-F motherboard
    2x AMD Opteron 6128 CPU (8 cores at 2.0 GHz; 16 cores total)
    128 GB of DDR3 ECC SDRAM
    2x 60 GB SSD for the OS (gmirror'd) and log devices (ZFS mirror)
    2x 120 GB SSD for cache devices
    45x 2.0 TB WD and Seagate SATA hard drives (7x 6-drive raidz2 vdevs + 3 spares)
    LSI 9211-8e SAS controllers using mps driver
    Onboard LSI 2008 SATA controller using mps driver for OS/log/cache
    SuperMicro 4U JBOD chassis
    SuperMicro 2U chassis for motherboard/OS

alphadrive, betadrive, and omegadrive all have dedup and lzjb compression
enabled.
zuludrive has lzjb compression enabled (no dedup).

alpha/beta/zulu do rsync backups every night from various local and remote
Linux and FreeBSD boxes, then zfs send the resulting snapshots to omegadrive
during the day.  The "0 I/O" periods occur most often, and most quickly, on
omegadrive while receiving snapshots, but they eventually occur on all
systems during the rsync runs.
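
The replication step itself is nothing fancy, roughly this (snapshot, pool,
and host names are just placeholders):

# zfs snapshot -r tank@nightly
# zfs send -R -i tank@nightly-prev tank@nightly | ssh omegadrive zfs recv -d backup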

Things I've tried:
  - limiting ARC to only 32 GB on each system
  - limiting L2ARC to 30 GB on each system
  - enabling the "deadman" patch in case it was I/O requests being lost by
the drives/controllers
  - changing primarycache between all and metadata
  - increasing arc_meta_limit to just shy of arc_max
  - removing cache devices completely
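
For reference, the ARC-related knobs among those, with example values
("tank" is a placeholder):

/boot/loader.conf:
    # 32 GB ARC cap, with arc_meta_limit just shy of it
    vfs.zfs.arc_max="34359738368"
    vfs.zfs.arc_meta_limit="34000000000"

at runtime:
    # zfs set primarycache=metadata tank   (and back to primarycache=all)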

So far, only the last option works.  Without L2ARC, the systems are 100%
stable, and can push 200 MB/s of rsync writes and just shy of 500 MB/s of
ZFS recv in bursts (the gigabit link stays saturated; continuous writes
usually hover around 50-80 MB/s).

I'm baffled.  An L2ARC is supposed to make things faster, especially when
using dedup, since the DDT can be cached in it.
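
For anyone wondering how big the DDTs are on these pools, the usual way to
check is something like ("tank" again a placeholder):

# zdb -DD tank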

-- 
Freddie Cash
fjwcash@gmail.com


