Date:      Wed, 19 Mar 2014 13:09:23 -0500
From:      Karl Denninger <karl@denninger.net>
To:        freebsd-fs@freebsd.org
Subject:   Fwd: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Message-ID:  <5329DD53.2020308@denninger.net>
In-Reply-To: <5329DBF2.6060008@denninger.net>
References:  <5329DBF2.6060008@denninger.net>

CC'ing the list on my PR followup; forgot to include it when submitted.

-------- Original Message --------
Subject: 	Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Date: 	Wed, 19 Mar 2014 13:03:30 -0500
From: 	Karl Denninger <karl@denninger.net>
To: 	bug-followup@FreeBSD.org, karl@fs.denninger.net



The 20% invasion of the first-level paging regime looks too aggressive
under very heavy load.  I have changed my system here to 10% (during
runtime) and obtain a materially-better response profile.

At 20% the system will still occasionally page recently-used executable
code to disk before cache is released, which is undesirable.  10% looks
better but may STILL be too aggressive (in other words, 5% might be
"just right").
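
To put those percentages in perspective, here is a minimal userland sketch of how many pages each setting reserves.  It mirrors the integer arithmetic the patch uses for the percentage term, (vmtotal / 100) * percent_target.  The helper name reserve_pages and the example machine (16 GiB of RAM, 4 KiB pages) are assumptions chosen for illustration; neither appears in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): pages reserved by a given
 * percentage, using the same integer arithmetic as the patch's
 * (vmtotal / 100) * percent_target term.
 */
static uint64_t
reserve_pages(uint64_t total_pages, unsigned percent)
{
	assert(percent <= 100);	/* a reservation above 100% is meaningless */
	return ((total_pages / 100) * percent);
}

/* Assumed example machine: 16 GiB of RAM with 4 KiB pages. */
static const uint64_t example_total_pages = (16ULL << 30) / 4096;
```

Under these assumptions, reserve_pages(example_total_pages, 20) is 838860 pages (about 3.2 GiB), 10 gives 419430 pages (about 1.6 GiB), and 5 gives 209715 pages (about 0.8 GiB).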

Being able to tune this in real time is a BIG help!

Adjusted patch follows (only a couple of lines have changed)

*** arc.c.original	Thu Mar 13 09:18:48 2014
--- arc.c	Wed Mar 19 13:01:48 2014
***************
*** 18,23 ****
--- 18,99 ----
     *
     * CDDL HEADER END
     */
+
+ /* Karl Denninger (karl@denninger.net), 3/18/2014, FreeBSD-specific
+  *
+  * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
+  * the ARC cache to be pared down.  The reason for the change is that the
+  * apparent attempted algorithm is to start evicting ARC cache when free
+  * pages fall below 25% of installed RAM.  This maps reasonably well to how
+  * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told
+  * to pare down.
+  *
+  * The problem is that on FreeBSD machines the system doesn't appear to be
+  * getting what the authors of the original code thought they were looking at
+  * with its test -- or at least not what Solaris did -- and as a result that
+  * test never triggers.  That leaves the only reclaim trigger as the "paging
+  * needed" status flag, and by the time that trips the system is already
+  * in low-memory trouble.  This can lead to severe pathological behavior
+  * under the following scenario:
+  * - The system starts to page and ARC is evicted.
+  * - The system stops paging as ARC's eviction drops wired RAM a bit.
+  * - ARC starts increasing its allocation again, and wired memory grows.
+  * - A new image is activated, and the system once again attempts to page.
+  * - ARC starts to be evicted again.
+  * - Back to #2
+  *
+  * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
+  * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
+  * else needs it.  That would be ok if we evicted cache when required.
+  *
+  * Unfortunately the system can get into a state where it never
+  * manages to page anything of materiality back in, as if there is active
+  * I/O the ARC will start grabbing space once again as soon as the memory
+  * contention state drops.  For this reason the "paging is occurring" flag
+  * should be the **last resort** condition for ARC eviction; you want to
+  * (as Solaris does) start when there is material free RAM left BUT the
+  * vm system thinks it needs to be active to steal pages back in the attempt
+  * to never get into the condition where you're potentially paging off
+  * executables in favor of leaving disk cache allocated.
+  *
+  * To fix this we change how we look at low memory, declaring two new
+  * runtime tunables.
+  *
+  * The new sysctls are:
+  * vfs.zfs.arc_freepages (free pages required to call RAM "sufficient")
+  * vfs.zfs.arc_freepage_percent (additional reservation percentage, default 0)
+  *
+  * vfs.zfs.arc_freepages is initialized from vm.stats.vm.v_free_target,
+  * less 10% if we find that it is zero.  Note that vm.stats.vm.v_free_target
+  * is not initialized at boot -- the system has to be running first, so we
+  * cannot initialize this in arc_init.  So we check during runtime; this
+  * also allows the user to return to defaults by setting it to zero.
+  *
+  * This should ensure that we allow the VM system to steal pages first,
+  * but pare the cache before we suspend processes attempting to get more
+  * memory, thereby avoiding "stalls."  You can set this higher if you wish,
+  * or force a specific percentage reservation as well, but doing so may
+  * cause the cache to pare back while the VM system remains willing to
+  * allow "inactive" pages to accumulate.  The challenge is that image
+  * activation can force things into the page space on a repeated basis
+  * if you allow this level to be too small (the above pathological
+  * behavior); the defaults should avoid that behavior but the sysctls
+  * are exposed should your workload require adjustment.
+  *
+  * If we're using this check for low memory we are replacing the previous
+  * ones, including the oddball "random" reclaim that appears to fire far
+  * more often than it should.  We still trigger if the system pages.
+  *
+  * If you turn on NEWRECLAIM_DEBUG then the kernel will print on the console
+  * status messages when the reclaim status trips on and off, along with the
+  * page count aggregate that triggered it (and the free space) for each
+  * event.
+  */
+
+ #define	NEWRECLAIM
+ #undef	NEWRECLAIM_DEBUG
+
+
    /*
     * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
     * Copyright (c) 2013 by Delphix. All rights reserved.
***************
*** 139,144 ****
--- 215,226 ----
    
    #include <vm/vm_pageout.h>
    
+ #ifdef	NEWRECLAIM
+ #ifdef	__FreeBSD__
+ #include <sys/sysctl.h>
+ #endif
+ #endif	/* NEWRECLAIM */
+
    #ifdef illumos
    #ifndef _KERNEL
    /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
***************
*** 203,218 ****
--- 285,320 ----
    int zfs_arc_shrink_shift = 0;
    int zfs_arc_p_min_shift = 0;
    int zfs_disable_dup_eviction = 0;
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ static	int freepages = 0;	/* This much memory is considered critical */
+ static	int percent_target = 0;	/* Additionally reserve "X" percent free RAM */
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
    
    TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
    TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
    TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ TUNABLE_INT("vfs.zfs.arc_freepages", &freepages);
+ TUNABLE_INT("vfs.zfs.arc_freepage_percent", &percent_target);
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
    SYSCTL_DECL(_vfs_zfs);
    SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
        "Maximum ARC size");
    SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
        "Minimum ARC size");
    
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepages, CTLFLAG_RWTUN, &freepages, 0, "ARC Free RAM Pages Required");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
    /*
     * Note that buffers can be in one of 6 states:
     *	ARC_anon	- anonymous (discussed below)
***************
*** 2438,2443 ****
--- 2540,2557 ----
    {
    
    #ifdef _KERNEL
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ 	u_int	vmfree = 0;
+ 	u_int	vmtotal = 0;
+ 	size_t	vmsize;
+ #ifdef	NEWRECLAIM_DEBUG
+ 	static	int	xval = -1;
+ 	static	int	oldpercent = 0;
+ 	static	int	oldfreepages = 0;
+ #endif	/* NEWRECLAIM_DEBUG */
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
    
    	if (needfree)
    		return (1);
***************
*** 2476,2481 ****
--- 2590,2596 ----
    		return (1);
    
    #if defined(__i386)
+
    	/*
    	 * If we're on an i386 platform, it's possible that we'll exhaust the
    	 * kernel heap space before we ever run out of available physical
***************
*** 2492,2502 ****
    		return (1);
    #endif
    #else	/* !sun */
    	if (kmem_used() > (kmem_size() * 3) / 4)
    		return (1);
    #endif	/* sun */
    
- #else
    	if (spa_get_random(100) == 0)
    		return (1);
    #endif
--- 2607,2680 ----
    		return (1);
    #endif
    #else	/* !sun */
+
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ /*
+  * Implement the new tunable free RAM algorithm.  We check the free pages
+  * against the minimum specified target and the percentage that should be
+  * free.  If we're low we ask for ARC cache shrinkage.  If this is defined
+  * on a FreeBSD system the older checks are not performed.
+  *
+  * Check first to see if we need to init freepages, then test.
+  */
+ 	if (!freepages) {		/* If zero then (re)init */
+ 		vmsize = sizeof(vmtotal);
+ 		kernel_sysctlbyname(curthread, "vm.stats.vm.v_free_target", &vmtotal, &vmsize, NULL, 0, NULL, 0);
+ 		freepages = vmtotal - (vmtotal / 10);
+ #ifdef	NEWRECLAIM_DEBUG
+ 		printf("ZFS ARC: Default vfs.zfs.arc_freepages to [%u] [%u less 10%%]\n", freepages, vmtotal);
+ #endif	/* NEWRECLAIM_DEBUG */
+ 	}
+
+ 	vmsize = sizeof(vmtotal);
+ 	kernel_sysctlbyname(curthread, "vm.stats.vm.v_page_count", &vmtotal, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(vmfree);
+ 	kernel_sysctlbyname(curthread, "vm.stats.vm.v_free_count", &vmfree, &vmsize, NULL, 0, NULL, 0);
+ #ifdef	NEWRECLAIM_DEBUG
+ 	if (percent_target != oldpercent) {
+ 		printf("ZFS ARC: Reservation percent change to [%d], [%d] pages, [%d] free\n", percent_target, vmtotal, vmfree);
+ 		oldpercent = percent_target;
+ 	}
+ 	if (freepages != oldfreepages) {
+ 		printf("ZFS ARC: Low RAM page change to [%d], [%d] pages, [%d] free\n", freepages, vmtotal, vmfree);
+ 		oldfreepages = freepages;
+ 	}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 	if (!vmtotal) {
+ 		vmtotal = 1;	/* Protect against divide by zero */
+ 				/* (should be impossible, but...) */
+ 	}
+ /*
+  * Now figure out how much free RAM we require to call the ARC cache status
+  * "ok".  Add the percentage specified of the total to the base requirement.
+  */
+
+ 	if (vmfree < freepages + ((vmtotal / 100) * percent_target)) {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 1) {
+ 			printf("ZFS ARC: RECLAIM total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), freepages, percent_target);
+ 			xval = 1;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		return(1);
+ 	} else {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 0) {
+ 			printf("ZFS ARC: NORMAL total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), freepages, percent_target);
+ 			xval = 0;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		return(0);
+ 	}
+
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
    	if (kmem_used() > (kmem_size() * 3) / 4)
    		return (1);
    #endif	/* sun */
    
    	if (spa_get_random(100) == 0)
    		return (1);
    #endif
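
The decision made by the final hunk can be exercised outside the kernel.  What follows is a userland sketch under stated assumptions, not the patch itself: the names default_freepages and reclaim_needed are hypothetical, the VM counters are passed in as plain arguments instead of being read with kernel_sysctlbyname, and the NEWRECLAIM_DEBUG printing is left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the (re)init step: when the tunable is zero,
 * it defaults to v_free_target less 10%.
 */
static unsigned
default_freepages(unsigned v_free_target)
{
	return (v_free_target - (v_free_target / 10));
}

/*
 * Hypothetical stand-in for the threshold test: ask for ARC shrinkage
 * when free pages drop below the base reservation plus the requested
 * percentage of all pages.
 */
static bool
reclaim_needed(unsigned vmfree, unsigned vmtotal, unsigned freepages,
    unsigned percent_target)
{
	assert(percent_target <= 100);	/* sketch-only sanity check */
	return (vmfree < freepages + ((vmtotal / 100) * percent_target));
}
```

As an example with assumed numbers: for a v_free_target of 10000 pages, default_freepages returns 9000.  On a machine with 1000000 pages and percent_target set to 10, the threshold is 9000 + 100000 = 109000 pages, so reclaim_needed is true at 108999 free pages and false at 109000.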


-- 
-- Karl
karl@denninger.net









