From owner-freebsd-fs@FreeBSD.ORG Mon Mar 24 11:50:02 2014
Date: Mon, 24 Mar 2014 11:50:01 GMT
Message-Id: <201403241150.s2OBo1Oc029493@freefall.freebsd.org>
To: freebsd-fs@FreeBSD.org
From: Karl Denninger
Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Reply-To: Karl Denninger
List-Id: Filesystems

The following reply was made to PR kern/187594; it has been noted by GNATS.

From: Karl Denninger
To: bug-followup@FreeBSD.org, karl@fs.denninger.net
Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Date: Mon, 24 Mar 2014 06:41:16 -0500

Update:

1. Patch is still good against the latest arc.c change (associated with new
   flags on the pool).
2. Change the default low memory warning threshold for the ARC to
   cnt.v_free_target; no margin.  This appears to provide the best
   performance and does not cause problems with inact pages or other
   misbehavior on my test systems.
3. Expose the return flag (arc_shrink_needed) so if you care to watch it
   for some reason, you can.

*** arc.c.original	Sun Mar 23 14:56:01 2014
--- arc.c	Sun Mar 23 15:12:15 2014
***************
*** 18,23 ****
--- 18,95 ----
   *
   * CDDL HEADER END
   */
+
+ /* Karl Denninger (karl@denninger.net), 3/20/2014, FreeBSD-specific
+  *
+  * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
+  * the ARC cache to be pared down.  The reason for the change is that the
+  * apparent attempted algorithm is to start evicting ARC cache when free
+  * pages fall below 25% of installed RAM.  This maps reasonably well to how
+  * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told
+  * to pare down.
+  *
+  * The problem is that on FreeBSD machines the system doesn't appear to be
+  * getting what the authors of the original code thought they were looking at
+  * with its test -- or at least not what Solaris did -- and as a result that
+  * test never triggers.  That leaves the only reclaim trigger as the "paging
+  * needed" status flag, and by the time that trips the system is already
+  * in low-memory trouble.  This can lead to severe pathological behavior
+  * under the following scenario:
+  * - The system starts to page and ARC is evicted.
+  * - The system stops paging as ARC's eviction drops wired RAM a bit.
+  * - ARC starts increasing its allocation again, and wired memory grows.
+  * - A new image is activated, and the system once again attempts to page.
+  * - ARC starts to be evicted again.
+  * - Back to #2
+  *
+  * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
+  * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
+  * else needs it.  That would be ok if we evicted cache when required.
+  *
+  * Unfortunately the system can get into a state where it never
+  * manages to page anything of materiality back in, as if there is active
+  * I/O the ARC will start grabbing space once again as soon as the memory
+  * contention state drops.  For this reason the "paging is occurring" flag
+  * should be the **last resort** condition for ARC eviction; you want to
+  * (as Solaris does) start when there is material free RAM left BUT the
+  * vm system thinks it needs to be active to steal pages back in the attempt
+  * to never get into the condition where you're potentially paging off
+  * executables in favor of leaving disk cache allocated.
+  *
+  * To fix this we change how we look at low memory, declaring two new
+  * runtime tunables and one status.
+  *
+  * The new sysctls are:
+  * vfs.zfs.arc_freepages (free pages required to call RAM "sufficient")
+  * vfs.zfs.arc_freepage_percent (additional reservation percentage, default 0)
+  * vfs.zfs.arc_shrink_needed (shows "1" if we're asking for shrinking the ARC)
+  *
+  * vfs.zfs.arc_freepages is initialized from vm.v_free_target.
+  * This should ensure that we allow the VM system to steal pages,
+  * but pare the cache before we suspend processes attempting to get more
+  * memory, thereby avoiding "stalls."  You can set this higher if you wish,
+  * or force a specific percentage reservation as well, but doing so may
+  * cause the cache to pare back while the VM system remains willing to
+  * allow "inactive" pages to accumulate.  The challenge is that image
+  * activation can force things into the page space on a repeated basis
+  * if you allow this level to be too small (the above pathological
+  * behavior); the defaults should avoid that behavior but the sysctls
+  * are exposed should your workload require adjustment.
+  *
+  * If we're using this check for low memory we are replacing the previous
+  * ones, including the oddball "random" reclaim that appears to fire far
+  * more often than it should.  We still trigger if the system pages.
+  *
+  * If you turn on NEWRECLAIM_DEBUG then the kernel will print on the console
+  * status messages when the reclaim status trips on and off, along with the
+  * page count aggregate that triggered it (and the free space) for each
+  * event.
+  */
+
+ #define	NEWRECLAIM
+ #undef	NEWRECLAIM_DEBUG
+
+
  /*
   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
   * Copyright (c) 2013 by Delphix. All rights reserved.
***************
*** 139,144 ****
--- 211,223 ----

  #include

+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ #include
+ #include
+ #endif
+ #endif	/* NEWRECLAIM */
+
  #ifdef illumos
  #ifndef _KERNEL
  /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
***************
*** 203,218 ****
--- 282,320 ----
  int zfs_arc_shrink_shift = 0;
  int zfs_arc_p_min_shift = 0;
  int zfs_disable_dup_eviction = 0;
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ static int freepages = 0;	/* This much memory is considered critical */
+ static int percent_target = 0;	/* Additionally reserve "X" percent free RAM */
+ static int shrink_needed = 0;	/* Shrinkage of ARC cache needed? */
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */

  TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
  TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
  TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ TUNABLE_INT("vfs.zfs.arc_freepages", &freepages);
+ TUNABLE_INT("vfs.zfs.arc_freepage_percent", &percent_target);
+ TUNABLE_INT("vfs.zfs.arc_shrink_needed", &shrink_needed);
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
  SYSCTL_DECL(_vfs_zfs);
  SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
      "Maximum ARC size");
  SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
      "Minimum ARC size");

+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepages, CTLFLAG_RWTUN, &freepages, 0, "ARC Free RAM Pages Required");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_shrink_needed, CTLFLAG_RD, &shrink_needed, 0, "ARC Memory Constrained (0 = no, 1 = yes)");
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
  /*
   * Note that buffers can be in one of 6 states:
   *	ARC_anon	- anonymous (discussed below)
***************
*** 2438,2443 ****
--- 2540,2550 ----
  {

  #ifdef _KERNEL
+ #ifdef NEWRECLAIM_DEBUG
+ 	static	int	xval = -1;
+ 	static	int	oldpercent = 0;
+ 	static	int	oldfreepages = 0;
+ #endif	/* NEWRECLAIM_DEBUG */

  	if (needfree)
  		return (1);
***************
*** 2476,2481 ****
--- 2583,2589 ----
  		return (1);

  #if defined(__i386)
+
  	/*
  	 * If we're on an i386 platform, it's possible that we'll exhaust the
  	 * kernel heap space before we ever run out of available physical
***************
*** 2492,2502 ****
  		return (1);
  #endif
  #else	/* !sun */
  	if (kmem_used() > (kmem_size() * 3) / 4)
  		return (1);
  #endif	/* sun */

- #else
  	if (spa_get_random(100) == 0)
  		return (1);
  #endif
--- 2600,2664 ----
  		return (1);
  #endif
  #else	/* !sun */
+
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ /*
+  * Implement the new tunable free RAM algorithm.  We check the free pages
+  * against the minimum specified target and the percentage that should be
+  * free.  If we're low we ask for ARC cache shrinkage.  If this is defined
+  * on a FreeBSD system the older checks are not performed.
+  *
+  * Check first to see if we need to init freepages, then test.
+  */
+ 	if (!freepages) {	/* If zero then (re)init */
+ 		freepages = cnt.v_free_target;
+ #ifdef NEWRECLAIM_DEBUG
+ 		printf("ZFS ARC: Default vfs.zfs.arc_freepages to [%u]\n", freepages);
+ #endif	/* NEWRECLAIM_DEBUG */
+ 	}
+ #ifdef NEWRECLAIM_DEBUG
+ 	if (percent_target != oldpercent) {
+ 		printf("ZFS ARC: Reservation percent change to [%d], [%d] pages, [%d] free\n", percent_target, cnt.v_page_count, cnt.v_free_count);
+ 		oldpercent = percent_target;
+ 	}
+ 	if (freepages != oldfreepages) {
+ 		printf("ZFS ARC: Low RAM page change to [%d], [%d] pages, [%d] free\n", freepages, cnt.v_page_count, cnt.v_free_count);
+ 		oldfreepages = freepages;
+ 	}
+ #endif	/* NEWRECLAIM_DEBUG */
+ /*
+  * Now figure out how much free RAM we require to call the ARC cache status
+  * "ok".  Add the percentage specified of the total to the base requirement.
+  */
+
+ 	if (cnt.v_free_count < (freepages + ((cnt.v_page_count / 100) * percent_target))) {
+ #ifdef NEWRECLAIM_DEBUG
+ 		if (xval != 1) {
+ 			printf("ZFS ARC: RECLAIM total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", cnt.v_page_count, cnt.v_free_count, ((cnt.v_free_count * 100) / cnt.v_page_count), freepages, percent_target);
+ 			xval = 1;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		shrink_needed = 1;
+ 		return(1);
+ 	} else {
+ #ifdef NEWRECLAIM_DEBUG
+ 		if (xval != 0) {
+ 			printf("ZFS ARC: NORMAL total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", cnt.v_page_count, cnt.v_free_count, ((cnt.v_free_count * 100) / cnt.v_page_count), freepages, percent_target);
+ 			xval = 0;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		shrink_needed = 0;
+ 		return(0);
+ 	}
+
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
  	if (kmem_used() > (kmem_size() * 3) / 4)
  		return (1);
  #endif	/* sun */

  	if (spa_get_random(100) == 0)
  		return (1);
  #endif

-- 
-- Karl
karl@denninger.net
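
For anyone who wants to try the threshold arithmetic outside the kernel, what follows is a minimal userland sketch of the low-memory test the patch adds. It is an illustration only, not part of the patch: struct vm_counts and the functions arc_free_threshold() and arc_shrink_check() are hypothetical names invented here, standing in for the kernel's cnt.v_page_count, cnt.v_free_count and cnt.v_free_target counters. One deliberate difference: the patch stores v_free_target into freepages once when it finds it zero, while this sketch simply treats a zero freepages as "use the free target" on every call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userland stand-in for the kernel page counters the patch
 * reads (cnt.v_page_count, cnt.v_free_count, cnt.v_free_target).
 */
struct vm_counts {
	uint32_t page_count;	/* total pages of RAM */
	uint32_t free_count;	/* pages currently free */
	uint32_t free_target;	/* VM system's free page target */
};

/*
 * Number of free pages required before RAM is called "sufficient":
 * the base requirement plus percent_target percent of all pages.
 * As in the patch, the total is divided by 100 first, so the
 * percentage term rounds down to a multiple of percent_target.
 */
static uint64_t
arc_free_threshold(const struct vm_counts *vm, uint32_t freepages,
    uint32_t percent_target)
{
	assert(percent_target <= 100);

	/* Assumption of this sketch: zero means "default to the free target". */
	if (freepages == 0)
		freepages = vm->free_target;

	return ((uint64_t)freepages +
	    (uint64_t)(vm->page_count / 100) * percent_target);
}

/* True when the ARC should be asked to shrink. */
static bool
arc_shrink_check(const struct vm_counts *vm, uint32_t freepages,
    uint32_t percent_target)
{
	return (vm->free_count <
	    arc_free_threshold(vm, freepages, percent_target));
}
```

With 1,000,000 total pages, 50,000 free and a free target of 40,000, the defaults (freepages 0, percent 0) give a threshold of 40,000 pages and no shrink request; setting the percentage to 5 raises the threshold to 90,000 pages and the check trips.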