Date:      Wed, 8 May 2013 14:46:52 -0700
From:      Brendan Gregg <brendan.gregg@joyent.com>
To:        freebsd-fs@freebsd.org
Subject:   Re: Strange slowdown when cache devices enabled in ZFS
Message-ID:  <CA%2BXzFFi4BYcVCzainaHzn3=32hdY24dPn2Aky2CstLh_T56orQ@mail.gmail.com>
In-Reply-To: <CA%2BXzFFgG%2BJs2w%2BHJFXXd=opsdnR7Z0n1ThPPtMM1qFsPg-dsaQ@mail.gmail.com>
References:  <CA%2BXzFFgG%2BJs2w%2BHJFXXd=opsdnR7Z0n1ThPPtMM1qFsPg-dsaQ@mail.gmail.com>

On Wed, May 8, 2013 at 2:35 PM, Brendan Gregg <brendan.gregg@joyent.com> wrote:

> Freddie Cash wrote (Mon Apr 29 16:01:55 UTC 2013):
> |
> | The following settings in /etc/sysctl.conf prevent the "stalls"
> completely,
> | even when the L2ARC devices are 100% full and all RAM is wired into the
> | ARC.  Been running without issues for 5 days now:
> |
> | vfs.zfs.l2arc_norw=0                # Default is 1
> | vfs.zfs.l2arc_feed_again=0          # Default is 1
> | vfs.zfs.l2arc_noprefetch=0          # Default is 0
> | vfs.zfs.l2arc_feed_min_ms=1000      # Default is 200
> | vfs.zfs.l2arc_write_boost=320000000 # Default is 8 MBps
> | vfs.zfs.l2arc_write_max=160000000   # Default is 8 MBps
> |
> | With these settings, I'm also able to expand the ARC to use the full 128 GB
> | of RAM in the biggest box, and to use both L2ARC devices (60 GB in total).
> | And, can set primarycache and secondarycache to all (the default) instead
> | of just metadata.
> |[...]
>
> The thread earlier described a 100% CPU-bound l2arc_feed_thread, which
> could be caused by these settings:
>
> vfs.zfs.l2arc_write_boost=320000000 # Default is 8 MBps
> vfs.zfs.l2arc_write_max=160000000   # Default is 8 MBps
>
> If I'm reading that correctly, it's increasing the write max and boost to
> be 160 Mbytes and 320 Mbytes. To satisfy these, the L2ARC must scan memory
> from the tail of the ARC lists, lists which may be composed of tiny buffers
> (eg, 8k). Increasing that scan 20 fold could saturate a CPU. And, if it
> doesn't find many bytes to write out, then it will rescan the same buffers
> on the next interval, wasting CPU cycles.
>
> I understand the intent was probably to warm up the L2ARC faster. There is
> no easy way to do this: you are bounded by the throughput of random reads
> from the pool disks.
>
> Random read workloads usually have a 4 - 16 Kbyte record size. The l2arc
> feed thread can't eat uncached data faster than the random reads can be
> read from disk. Therefore, at 8 Kbytes, you need at least 1,000 random read
> disk IOPS to achieve a rate of 8 Mbytes from the ARC list tails, which, for
> rotational disks performing roughly 100 random IOPS (use a different rate
> if you like), means about a dozen disks - depending on the ZFS RAID config.
> All to feed at 8 Mbytes/sec. This is why 8 Mbytes/sec (plus the boost) is
> the default.
>
> To feed at 160 Mbytes/sec, with an 8 Kbyte recsize, you'll need at least
> 20,000 random read disk IOPS. How many spindles does that take? A lot. Do
> you have a lot?
>
> I wanted to point this out because the warm up problem isn't the
> l2arc_feed_thread (that it scans, how far it scans, whether it rescans,
> etc) – it's the input to the system.
>
> ...
>
> I just noticed that the https://wiki.freebsd.org/ZFSTuningGuide writes:
>
> "
> vfs.zfs.l2arc_write_max
>
> vfs.zfs.l2arc_write_boost
>
> The former value sets the runtime max that data will be loaded into L2ARC.
> The latter can be used to accelerate the loading of a freshly booted
> system. For a device capable of 400MB/sec reasonable values might be 200MB
> and 380MB respectively. Note that the same caveats apply about these
> sysctls and pool imports as the previous one. Setting these values properly
> is the difference between an L2ARC subsystem that can take days to heat up
> versus one that heats up in minutes.
> "
>
> This advice seems a little unwise: you could tune the feed rates that high
> – if you have enough spindles to feed it – but I think for most people this
> will waste CPU cycles failing to find buffers to cache. Can the author
> please double check?
>

Sorry - just noticed that vfs.zfs.l2arc_noprefetch=0 was also set, and, the
guide recommends that. What I described was for the default of 1, where
only random reads feed the L2ARC. Streaming workloads can feed it much
quicker, so, you can increase the feed rate if either you have a lot of
spindles, or, are caching streaming workloads – both providing the
throughput desired.
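The spindle arithmetic quoted above can be made concrete with a quick back-of-the-envelope sketch. The 8 Kbyte record size and 100 random IOPS per rotational disk are the illustrative round figures from the message, not fixed properties of ZFS:

```python
# Back-of-the-envelope: how many random-read disk IOPS (and roughly how
# many spindles) a given L2ARC feed rate implies, per the reasoning above.

def spindles_needed(feed_bytes_per_sec, record_size=8_000, iops_per_disk=100):
    """Return (required random-read IOPS, approximate disk count)."""
    iops = feed_bytes_per_sec / record_size
    return iops, iops / iops_per_disk

# Default l2arc_write_max of ~8 Mbytes/sec: about a dozen disks.
print(spindles_needed(8_000_000))    # → (1000.0, 10.0)

# Tuned up to 160 Mbytes/sec: 20x the IOPS, so 20x the spindles.
print(spindles_needed(160_000_000))  # → (20000.0, 200.0)
```

Swap in your own record size and per-disk IOPS; the point is that the required spindle count scales linearly with the feed rate.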

Back when the L2ARC was developed, the SSD max throughput (around 200
Mbytes/sec) could not compete with the pool disks (say, 12 x 180
Mbytes/sec), so it didn't make sense to cache sequential workloads in the
L2ARC. It's another subtlety that the ZFSTuningGuide might want to explain:
your pool disks might already be very good at streaming workloads – better
than the L2ARC – and so you want to leave sequential workloads to them.
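The streaming comparison in the last paragraph is simple arithmetic. The 12-disk pool and per-device rates are the era-specific illustrative figures from the text:

```python
# Sequential (streaming) throughput: even a modest pool of spinning disks
# outruns a single L2ARC SSD of that era, so caching streaming workloads
# on the SSD would only slow them down.
pool_streaming = 12 * 180_000_000   # 12 disks x ~180 Mbytes/sec each
ssd_streaming = 200_000_000         # one L2ARC device, ~200 Mbytes/sec

print(pool_streaming // ssd_streaming)  # → 10 (pool ~10x faster)
```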

Brendan

-- 
Brendan Gregg, Joyent                      http://dtrace.org/blogs/brendan


