Date:      Tue, 18 Mar 2014 12:19:32 -0500
From:      Karl Denninger <karl@denninger.net>
To:        avg@FreeBSD.org
Cc:        freebsd-fs@freebsd.org
Subject:   Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Message-ID:  <53288024.2060005@denninger.net>
In-Reply-To: <201403181520.s2IFK1M3069036@freefall.freebsd.org>
References:  <201403181520.s2IFK1M3069036@freefall.freebsd.org>

On 3/18/2014 10:20 AM, Andriy Gapon wrote:
> The following reply was made to PR kern/187594; it has been noted by GNATS.
>
> From: Andriy Gapon <avg@FreeBSD.org>
> To: bug-followup@FreeBSD.org, karl@fs.denninger.net
> Cc:
> Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
> Date: Tue, 18 Mar 2014 17:15:05 +0200
>
>   Karl Denninger <karl@fs.denninger.net> wrote:
>   > ZFS can be convinced to engage in pathological behavior due to a bad
>   > low-memory test in arc.c
>   >
>   > The offending file is at
>   > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c; it allegedly
>   > checks for 25% free memory, and if it is less asks for the cache to shrink.
>   >
>   > (snippet from arc.c around line 2494 of arc.c in 10-STABLE; path
>   > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>   >
>   > #else /* !sun */
>   >         if (kmem_used() > (kmem_size() * 3) / 4)
>   >                 return (1);
>   > #endif /* sun */
>   >
>   > Unfortunately these two functions do not return what the authors thought
>   > they did. It's clear what they're trying to do from the Solaris-specific
>   > code up above this test.
>
>   No, these functions do return what the authors think they do.
>   The check is for KVA usage (kernel virtual address space), not for physical memory.
I understand, but that's nonsensical in the context of the Solaris
code.  "lotsfree" is *not* a declaration of free kvm space, it's a
declaration of when the system has "lots" of free *physical* memory.

Further it makes no sense at all to allow the ARC cache to force things
into virtual (e.g. swap-space backed) memory.  But that's the behavior
that has been observed, and it fits with the code as originally written.

>
>   > The result is that the cache only shrinks when vm_paging_needed() tests
>   > true, but by that time the system is in serious memory trouble and by
>
>   No, it is not.
>   The description and numbers here are a little bit outdated but they should give
>   an idea of how paging works in general:
>   https://wiki.freebsd.org/AvgPageoutAlgorithm
>
>   > triggering only there it actually drives the system further into paging,
>
>   How does ARC eviction drive the system further into paging?
1. System gets low on physical memory but the ARC cache is looking at
available kvm (of which there is plenty.)  The ARC cache continues to
expand.

2. vm_paging_needed() returns true and the system begins to page off to
the swap.  At the same time the ARC cache is pared down because
arc_reclaim_needed has returned "1".

3. As the ARC cache shrinks and paging occurs vm_paging_needed() returns
false.  Paging out ceases but inactive pages remain on the swap.  They
are not recalled until and unless they are scheduled to execute.
Arc_reclaim_needed again returns "0".

4. The hold-down timer expires in the ARC cache code ("arc_grow_retry",
declared as 60 seconds) and the ARC cache begins to expand again.

Go back to #2 until the system's performance starts to deteriorate badly
enough due to the paging that you notice it, which occurs when something
that is actually consuming CPU time has to be called in from swap.
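
(If you want to watch this cycle on a live system, a quick-and-dirty
userland watcher like the one below will show the ARC size climbing while
free pages fall and swap-outs tick upward.  The sysctl names --
kstat.zfs.misc.arcstats.size, vm.stats.vm.v_free_count,
vm.stats.vm.v_swappgsout -- are the ones I believe 9.x/10.x export; treat
it as a sketch and verify them on your own box.)

/* arcwatch.c -- sketch only: cc -o arcwatch arcwatch.c */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* fetch a sysctl that may be exported as either a 32- or 64-bit counter */
static uint64_t
counter(const char *name)
{
        uint64_t v64 = 0;
        uint32_t v32 = 0;
        size_t len;

        len = sizeof(v64);
        if (sysctlbyname(name, &v64, &len, NULL, 0) == 0 && len == sizeof(v64))
                return (v64);
        len = sizeof(v32);
        if (sysctlbyname(name, &v32, &len, NULL, 0) == 0)
                return (v32);
        return (0);
}

int
main(void)
{
        for (;;) {
                printf("arc=%juMB free_pages=%ju swap_pgs_out=%ju\n",
                    (uintmax_t)(counter("kstat.zfs.misc.arcstats.size") >> 20),
                    (uintmax_t)counter("vm.stats.vm.v_free_count"),
                    (uintmax_t)counter("vm.stats.vm.v_swappgsout"));
                sleep(5);
        }
}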

This is consistent with what I and others have observed on both 9.2 and
10.0; the ARC will expand until it hits the maximum configured even at
the expense of forcing pages onto the swap.  In this specific machine's
case left to defaults it will grab nearly all physical memory (over 20GB
of 24) and wire it down.

Limiting arc_max to 16GB sorta fixes it.  I say "sorta" because it turns
out that 16GB is still too much for the workload; it prevents the
pathological behavior where system "stalls" happen, but only in the
extreme.  It turns out that with the patch in, my ARC cache stabilizes at
about 13.5GB during the busiest part of the day, growing to about 16GB
off-hours.

One of the problems with just limiting it in /boot/loader.conf is that
you have to guess and the system doesn't reasonably adapt to changing
memory loads.  The code is clearly intended to do that but it doesn't
end up working that way in practice.
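
For reference, that static limit is just the stock loader tunable, e.g.:

# /boot/loader.conf -- fixed ARC cap; the number has to be guessed up front
vfs.zfs.arc_max="17179869184"   # 16GB, expressed in bytes
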
>
>   > because the pager will not recall pages from the swap until they are next
>   > executed. This leads the ARC to try to fill in all the available RAM even
>   > though pages have been pushed off onto swap. Not good.
>
>   Unused physical memory is a waste.  It is true that ARC tries to use as much of
>   memory as it is allowed.  The same applies to the page cache (Active, Inactive).
>   Memory management is a dynamic system and there are a few competing agents.
>
That's true.  However, what the stock code does is force working set out
of memory and into the swap.  The ideal situation is one in which there
is no free memory because cache has sized itself to consume everything
*not* necessary for the working set of the processes that are running.
Unfortunately we cannot determine this presciently because a new process
may come along and we do not necessarily know for how long a process
that is blocked on an event will remain blocked (e.g. something waiting
on network I/O, etc.)

However, it is my contention that you do not want to evict a process
that is scheduled to run (or is going to be) in favor of disk cache
because you're defeating yourself by doing so.  The point of the disk
cache is to avoid going to the physical disk for I/O, but if you page
something you have ditched a physical I/O for data in favor of having to
go to physical disk *twice* -- first to write the paged-out data to
swap, and then to retrieve it when it is to be executed.  This also
appears to be consistent with what is present for Solaris machines.

From the Sun code:

#ifdef sun
        /*
         * take 'desfree' extra pages, so we reclaim sooner, rather than later
         */
        extra = desfree;

        /*
         * check that we're out of range of the pageout scanner.  It starts to
         * schedule paging if freemem is less than lotsfree and needfree.
         * lotsfree is the high-water mark for pageout, and needfree is the
         * number of needed free pages.  We add extra pages here to make sure
         * the scanner doesn't start up while we're freeing memory.
         */
        if (freemem < lotsfree + needfree + extra)
                return (1);

        /*
         * check to make sure that swapfs has enough space so that anon
         * reservations can still succeed. anon_resvmem() checks that the
         * availrmem is greater than swapfs_minfree, and the number of reserved
         * swap pages.  We also add a bit of extra here just to prevent
         * circumstances from getting really dire.
         */
        if (availrmem < swapfs_minfree + swapfs_reserve + extra)
                return (1);

"freemem" is not virtual memory, it's actual memory.  "Lotsfree" is the=20
point where the system considers free RAM to be "ample"; "needfree" is=20
the "desperation" point and "extra" is the margin (presumably for image=20
activation.)

The base code on FreeBSD doesn't look at physical memory at all; it
looks at kvm space instead.
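
If that branch were written to mirror the Solaris intent it would test
the page counters instead; roughly something like this, sketched against
the 10.x vmmeter globals (illustration only -- not the code that's
actually there, and not my patch verbatim):

#else   /* !sun */
        /*
         * Sketch: compare free physical pages against the pagedaemon's
         * free target -- the rough analogue of Solaris' "freemem <
         * lotsfree + needfree" -- using the 10.x struct vmmeter fields
         * rather than kernel virtual address space.
         */
        if (cnt.v_free_count < cnt.v_free_target)
                return (1);
#endif  /* sun */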

>   It is hard to correctly tune that system using a large hammer such as your
>   patch.  I believe that with your patch ARC will get shrunk to its minimum size
>   in due time.  Active + Inactive will grow to use the memory that you are denying
>   to ARC driving Free below a threshold, which will reduce ARC.  Repeated enough
>   times this will drive ARC to its minimum.
I disagree both in design theory and based on the empirical evidence of
actual operation.

First, I don't (ever) want to give memory to the ARC cache that
otherwise would go to "active", because any time I do that I'm going to
force two page events, which is double the amount of I/O I would take on
a cache *miss*, and even with the ARC at minimum I get a reasonable hit
percentage.  If I therefore prefer ARC over "active" pages I am going to
take *at least* a 200% penalty on physical I/O and if I get an 80% hit
ratio with the ARC at a minimum the penalty is closer to 800%!

For inactive pages it's a bit more complicated as those may not be
reactivated.  However, I am trusting FreeBSD's VM subsystem to demote
those that are unlikely to be reactivated to the cache bucket and then
to "free", where they are able to be re-used.  This is consistent with
what I actually see on a running system -- the "inact" bucket is
typically fairly large (often on a busy machine close to that of
"active") but pages demoted to "cache" don't stay there long - they
either get re-promoted back up or they are freed and go on the free list.

The only time I see "inact" get out of control is when there's a kernel
memory leak somewhere (such as what I ran into the other day with the
in-kernel NAT subsystem on 10-STABLE.)  But that's a bug and if it
happens you're going to get bit anyway.

For example right now on one of my very busy systems with 24GB of
installed RAM and many terabytes of storage across three ZFS pools I'm
seeing 17GB wired of which 13.5GB is ARC cache.  That's the adaptive
figure it currently is running at, with a maximum of 22.3GB and a minimum
of 2.79GB (8:1 ratio.)  The remainder is wired down for other reasons
(there's a fairly large Postgres server running on that box, among other
things, and it has a big shared buffer declaration -- that's most of the
difference.)  Cache hit efficiency is currently 97.8%.

Active is 2.26G right now, and inactive is 2.09G.  Both are stable.
Overnight inactive will drop to about 1.1GB while active will not change
all that much since most of it is postgres and the middleware that talks to
it along with apache, which leaves most of its processes present even
when they go idle.  Peak load times are about right now (mid-day), and
again when the system is running backups nightly.

Cache is 7448, in other words, insignificant.  Free memory is 2.6G.

The tunable is set to 10%, which is almost exactly what free memory is.
I find that when the system gets under 1G free, transient image
activation can drive it into paging and performance starts to suffer for
my particular workload.

>
>   Also, there are a few technical problems with the patch:
>   - you don't need to use sysctl interface in kernel, the values you need are
>   available directly, just take a look at e.g. implementation of vm_paging_needed()
That's easily fixed.  I will look at it.
>   - similarly, querying vfs.zfs.arc_freepage_percent_target value via
>   kernel_sysctlbyname is just bogus; you can use percent_target directly
I did not know if during setup of the OID the value was copied (and thus
you had to reference it later on) or the entry simply took the pointer
and stashed that.  Easily corrected.
>   - you don't need to sum various page counters to get a total count, there is
>   v_page_count
>
Fair enough as well.
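
For the record, here's roughly what I take the corrected form to be -- a
sketch only, using the 10-STABLE globals, not the revised patch itself:

#include <sys/param.h>
#include <sys/vmmeter.h>        /* extern struct vmmeter cnt */

/* the variable behind the vfs.zfs.arc_freepage_percent_target OID;
   read directly instead of via kernel_sysctlbyname() */
static u_int arc_freepage_percent_target = 25;

/* hypothetical helper name, for illustration */
static int
arc_freepages_low(void)
{
        u_int free_target;

        /* cnt.v_page_count is the total -- no summing of individual counters */
        free_target = (u_int)(((uint64_t)cnt.v_page_count *
            arc_freepage_percent_target) / 100);

        return (cnt.v_free_count < free_target);
}
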
>   Lastly, can you try to test reverting your patch and instead setting
>   vm.lowmem_period=0 ?
>
Yes.  By default it's 10; I have not tampered with that default.

Let me do a bit of work and I'll post back with a revised patch. Perhaps
a tunable for percentage free + a free reserve that is a "floor"?  The
problem with that is where to put the defaults.  One option would be to
grab total size at init time and compute something similar to what
"lotsfree" is for Solaris, allowing that to be tuned with the percentage
if desired.  I selected 25% because that's what the original test was
expressing and it should be reasonable for modest RAM configurations.
It's clearly too high for moderately large (or huge) memory machines
unless they have a lot of RAM-hungry processes running on them.

The percentage test, however, is an easy knob to twist that is unlikely
to severely harm you if you dial it too far in either direction; anyone
setting it to zero obviously knows what they're getting into, and if you
crank it too high all you end up doing is limiting the ARC to the
minimum value.
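
Roughly the shape I have in mind, extending the sketch a few paragraphs
up -- the names and the 3% constant are placeholders for illustration,
not the revised patch:

#include <sys/param.h>
#include <sys/vmmeter.h>        /* extern struct vmmeter cnt */

static u_int arc_freepage_percent_target = 10;  /* the tunable knob */
static u_int arc_freepage_reserve;              /* hypothetical absolute floor */

/* called once from ARC init: derive a "lotsfree"-like floor from
   installed memory; the 1/32 (~3%) here is a placeholder -- picking
   the right default is exactly the open question above */
static void
arc_freepage_target_init(void)
{
        arc_freepage_reserve = cnt.v_page_count / 32;
}

/* low-memory test: percentage of total pages, never below the floor */
static int
arc_freepages_low(void)
{
        u_int free_target;

        free_target = (u_int)(((uint64_t)cnt.v_page_count *
            arc_freepage_percent_target) / 100);
        if (free_target < arc_freepage_reserve)
                free_target = arc_freepage_reserve;

        return (cnt.v_free_count < free_target);
}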

--
-- Karl
karl@denninger.net