Date:      Thu, 27 Mar 2014 11:27:03 -0700
From:      <dteske@FreeBSD.org>
To:        "'Karl Denninger'" <karl@denninger.net>
Cc:        dteske@FreeBSD.org, freebsd-stable@freebsd.org
Subject:   RE: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Message-ID:  <012301cf49ea$27dfff80$779ffe80$@FreeBSD.org>
In-Reply-To: <53341106.4060101@denninger.net>
References:  <201403261230.s2QCU3vI095105@freefall.freebsd.org> <8659e58b9fabd9f553c8be5da5dc61fd@mail.mikej.com> <53341106.4060101@denninger.net>




> -----Original Message-----
> From: Karl Denninger [mailto:karl@denninger.net]
> Sent: Thursday, March 27, 2014 4:53 AM
> To: freebsd-fs@freebsd.org; freebsd-stable@freebsd.org
> Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
> 
> On 3/27/2014 4:11 AM, mikej wrote:
> > I've been running the latest patch now on r263711 and want to give it
> > a +1
> >
> > No ZFS knobs set and I must go out of my way to have my system swap.
> >
> > I hope this patch gets a much wider review and can be put into the
> > tree permanently.
> >
> > Karl, thanks for the working on this.
> >
> > Regards,
> >
> > Michael Jung
> No problem; I was being driven insane by the stalls and related bad
> behavior... and there's that old saw about complaining about something
> without proposing a fix for it (I've done it!) being "less than optimum"
> so.... :-)
> 
> Hopefully wider review (and, if the general consensus is similar to what
> I've seen here and what you're reporting as well, inclusion in the
> codebase) will come.
> 
> On my sandbox system I have to get truly abusive before I can get the
> system to swap now, but that load is synthetic and we all know what
> sometimes happens when you try to extrapolate from synthetic loads to
> real production ones.
> 

We (Vicor) are currently putting your patch through the wringer for stable/8
in an effort to mass-deploy it to hundreds of servers (dozens of which rely
on ZFS in production, several of which have been negatively impacted by the
current ARC strategy -- tasks that used to finish in 6 hours or less now take
longer than a day because they get swapped out under ARC pressure).

We're very excited about your patch; we expect a kernel running with it to
start deployment in mid-April and to be fully deployed by mid-May.


> What really has my attention is the impact on systems running live
> production loads.
> 

Lots of those, but it will take a little time to trickle out to the
production machines. Part of the delay was waiting to see when your patch
would stop changing ;D (all good changes, btw... like getting rid of sysctl
usage from within the kernel). I believe the last thing I merged into our
test lab was from March 24th -- and it changed yet again on March 26th, so
I've got another iteration to churn through before we can even start testing
in the test lab (smiles).

NB: The patch violates style(9), so I've been maintaining a modified version
of your patch for our internal use. I've attached the modified Mar 24th
patch, which applies against stable/8. Also, it's quite annoying to have to
decode your contextual diff while translating it for style(9) conformance.
If you switch to unified diff, please also pass -p so that diff emits
function tags and reviewers can tell which function each hunk lands in --
merging into stable/8 was unpleasant without that additional context. In my
attached stable/8 patch, you can see what I'm referring to at the onset of
each hunk (see the example below).
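
For example, something along these lines (the output file name here is just
an illustration):

  cd /usr/src
  diff -up sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c.orig \
      sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c > newreclaim.diff

With -p, each hunk header carries the enclosing function, e.g.
"@@ -2360,8 +2465,12 @@ static int needfree = 0;" as in my attached patch.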

ASIDE: It's no big deal here because your patch touches only one file, but
it's almost always preferable to generate the patch with full paths to each
file (e.g., generate the patch from the top of the tree *or* edit the
patch-file header afterward to reflect the full paths).
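
In other words, headers like the ones in my attached patch:

  --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c.orig
  +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c

rather than a bare "--- arc.c.orig".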

(smiles -- sorry for picking nits)


> It has entirely changed the character of those machines, working
> equally-well for both pure ZFS machines and mixed UFS/ZFS systems. One
> of these systems that gets pounded on pretty good and has a
> moderately-large configuration (~10TB of storage, 2 Xeon quad-core
> processors and 24GB of RAM serving a combination of Samba users
> internally, a decently-large Postgres installation supporting an
> externally-facing web forum and blog application, email and similar
> things) has been completely transformed from being "frequently
> challenged" by its workload to literally loafing 90%+ of the day. DBMS
> response times have seen their standard deviation drop by an order of
> magnitude with best-response times down for one of the most-common
> query
> sequences (~30 separate ops) from ~180ms to ~140.
> 

This is most excellent. I can't wait to get it into production! Like yours,
the machines we have that are struggling are:

a. beefy (24-48 cores, 24-48GB of RAM, 6-12TB of ZFS)
b. using a combination of UFS and ZFS simultaneously

> This particular machine has a separate pool for the system itself (root,
> usr and var) which was formerly UFS because it had to be in order to
> avoid the worst of the "stall" bad behavior.  It also has two other
> pools on it, one for read-nearly-only data sets that are comprised of
> very large files that are almost archival in character and a second that
> has the system's "working set" on it.  The latter has a separate intent
> log; I had a cache SSD drive on it as well but have recently dropped
> that as with these changes it no longer produces a material improvement
> in performance.  I'm frankly not sure the intent log is helping any more
> either but I've yet to drop it and instrument the results -- it used to
> be *necessary* to avoid nasty problems during busy periods.
> [snip]
> 
> At present, coming off the overnight that has an activity spike for
> routine in-house backup activity from connected PCs but is otherwise the
> "low point" of activity shows 1GB of free memory, an "auto-tuned" amount
> of 12.9GB of ARC cache (with a maximum size of 22.3) and inactive pages
> have remained stable.  Wired memory is almost 19GB with Postgres using a
> sizable chunk of it.  Cache efficiency is claimed to be 98.9% (!)
> That'll go down somewhat over the day but during the busiest part of the
> day it remains well into the 90s which I'm sure has a heck of a lot to
> do with the performance improvements....
> 
> Cross-posted over to -STABLE in the hope of expanding review and testing
> by others.
> 

I need to produce a new cleaned-up patch from your March 26th changes.
Hopefully the stream of changes is complete... or should I wait?
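
(For anyone else who wants to try the attached stable/8 version, a minimal
sketch -- assuming a stock stable/8 checkout at /usr/src and the attachment
saved there:

  cd /usr/src
  patch -p0 < "karld.zfs_arc_newreclaim(cleaned).stable8patch.txt"

then rebuild and reinstall the kernel as usual.)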

NB: Cross-posting is generally frowned upon. Please create a separate post
for each list next time.
-- 
Devin


[Attachment: karld.zfs_arc_newreclaim(cleaned).stable8patch.txt]

--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c.orig	2014-03-08 04:54:47.000000000 -0800
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c	2014-03-13 00:00:39.000000000 -0700
@@ -18,6 +18,78 @@
  *
  * CDDL HEADER END
  */
+
+/* Karl Denninger (karl@denninger.net), 3/20/2014, FreeBSD-specific
+ *
+ * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
+ * the ARC cache to be pared down. The reason for the change is that the
+ * apparent attempted algorithm is to start evicting ARC cache when free
+ * pages fall below 25% of installed RAM. This maps reasonably well to how
+ * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told
+ * to pare down.
+ *
+ * The problem is that on FreeBSD machines the system doesn't appear to be
+ * getting what the authors of the original code thought they were looking at
+ * with its test -- or at least not what Solaris did -- and as a result that
+ * test never triggers. That leaves the only reclaim trigger as the "paging
+ * needed" status flag, and by the time that trips the system is already
+ * in low-memory trouble. This can lead to severe pathological behavior
+ * under the following scenario:
+ * - The system starts to page and ARC is evicted.
+ * - The system stops paging as ARC's eviction drops wired RAM a bit.
+ * - ARC starts increasing its allocation again, and wired memory grows.
+ * - A new image is activated, and the system once again attempts to page.
+ * - ARC starts to be evicted again.
+ * - Back to #2
+ *
+ * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
+ * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
+ * else needs it. That would be ok if we evicted cache when required.
+ *
+ * Unfortunately the system can get into a state where it never
+ * manages to page anything of materiality back in, as if there is active
+ * I/O the ARC will start grabbing space once again as soon as the memory
+ * contention state drops. For this reason the "paging is occurring" flag
+ * should be the **last resort** condition for ARC eviction; you want to
+ * (as Solaris does) start when there is material free RAM left BUT the
+ * vm system thinks it needs to be active to steal pages back in the attempt
+ * to never get into the condition where you're potentially paging off
+ * executables in favor of leaving disk cache allocated.
+ *
+ * To fix this we change how we look at low memory, declaring three new
+ * runtime tunables.
+ *
+ * The new sysctls are:
+ * vfs.zfs.arc_freepages (free pages required to call RAM "sufficient")
+ * vfs.zfs.arc_freepage_percent (additional reservation percentage, default 0)
+ * vfs.zfs.arc_shrink_needed (shows "1" if we're asking for shrinking the ARC)
+ *
+ * vfs.zfs.arc_freepages is initialized from vm.v_free_target,
+ * This should insure that we allow the VM system to steal pages,
+ * but pare the cache before we suspend processes attempting to get more
+ * memory, thereby avoiding "stalls." You can set this higher if you wish,
+ * or force a specific percentage reservation as well, but doing so may
+ * cause the cache to pare back while the VM system remains willing to
+ * allow "inactive" pages to accumulate. The challenge is that image
+ * activation can force things into the page space on a repeated basis
+ * if you allow this level to be too small (the above pathological
+ * behavior); the defaults should avoid that behavior but the sysctls
+ * are exposed should your workload require adjustment.
+ *
+ * If we're using this check for low memory we are replacing the previous
+ * ones, including the oddball "random" reclaim that appears to fire far
+ * more often than it should. We still trigger if the system pages.
+ *
+ * If you turn on NEWRECLAIM_DEBUG then the kernel will print on the console
+ * status messages when the reclaim status trips on and off, along with the
+ * page count aggregate that triggered it (and the free space) for each
+ * event.
+ */
+
+#define NEWRECLAIM
+#undef NEWRECLAIM_DEBUG
+
+
 /*
  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  * Copyright 2011 Nexenta Systems, Inc.  All rights reserved.
@@ -136,6 +208,13 @@
 
 #include <vm/vm_pageout.h>
 
+#ifdef NEWRECLAIM
+#ifdef __FreeBSD__
+#include <sys/sysctl.h>
+#include <sys/vmmeter.h>
+#endif
+#endif /* NEWRECLAIM */
+
 #ifdef illumos
 #ifndef _KERNEL
 /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
@@ -193,16 +272,42 @@ int zfs_arc_grow_retry = 0;
 int zfs_arc_shrink_shift = 0;
 int zfs_arc_p_min_shift = 0;
 int zfs_disable_dup_eviction = 0;
+#ifdef NEWRECLAIM
+#ifdef __FreeBSD__
+static int freepages = 0; /* This much memory is considered critical */
+static int percent_target = 0; /* Additionally reserve "X" percent free RAM */
+static int shrink_needed = 0; /* Shrinkage of ARC cache needed? */
+#endif /* __FreeBSD__ */
+#endif /* NEWRECLAIM */
 
 TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
 TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
 TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
+#ifdef NEWRECLAIM
+#ifdef __FreeBSD__
+TUNABLE_INT("vfs.zfs.arc_freepages", &freepages);
+TUNABLE_INT("vfs.zfs.arc_freepage_percent", &percent_target);
+TUNABLE_INT("vfs.zfs.arc_shrink_needed", &shrink_needed);
+#endif /* __FreeBSD__ */
+#endif /* NEWRECLAIM */
+
 SYSCTL_DECL(_vfs_zfs);
 SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
     "Maximum ARC size");
 SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
     "Minimum ARC size");
 
+#ifdef NEWRECLAIM
+#ifdef __FreeBSD__
+SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepages, CTLFLAG_RWTUN, &freepages, 0,
+    "ARC Free RAM Pages Required");
+SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent, CTLFLAG_RWTUN,
+    &percent_target, 0, "ARC Free RAM Target percentage");
+SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_shrink_needed, CTLFLAG_RD, &shrink_needed,
+    0, "ARC Memory Constrained (0 = no, 1 = yes)");
+#endif /* __FreeBSD__ */
+#endif /* NEWRECLAIM */
+
 /*
  * Note that buffers can be in one of 6 states:
  *	ARC_anon	- anonymous (discussed below)
@@ -2360,8 +2465,12 @@ static int needfree = 0;
 static int
 arc_reclaim_needed(void)
 {
-
 #ifdef _KERNEL
+#ifdef NEWRECLAIM_DEBUG
+	static int xval = -1;
+	static int oldpercent = 0;
+	static int oldfreepages = 0;
+#endif /* NEWRECLAIM_DEBUG */
 
 	if (needfree)
 		return (1);
@@ -2400,6 +2509,7 @@ arc_reclaim_needed(void)
 		return (1);
 
 #if defined(__i386)
+
 	/*
 	 * If we're on an i386 platform, it's possible that we'll exhaust the
 	 * kernel heap space before we ever run out of available physical
@@ -2416,11 +2526,79 @@ arc_reclaim_needed(void)
 		return (1);
 #endif
 #else	/* !sun */
+
+#ifdef NEWRECLAIM
+#ifdef __FreeBSD__
+	/*
+	 * Implement the new tunable free RAM algorithm. We check the free
+	 * pages against the minimum specified target and the percentage that
+	 * should be free. If we're low we ask for ARC cache shrinkage. If this
+	 * is defined on a FreeBSD system the older checks are not performed.
+	 *
+	 * Check first to see if we need to init freepages, then test.
+	 */
+	if (!freepages) { /* If zero then (re)init */
+		freepages = cnt.v_free_target;
+#ifdef NEWRECLAIM_DEBUG
+		printf("ZFS ARC: Default vfs.zfs.arc_freepages to [%u]\n",
+		    freepages);
+#endif /* NEWRECLAIM_DEBUG */
+	}
+#ifdef NEWRECLAIM_DEBUG
+	if (percent_target != oldpercent) {
+		printf("ZFS ARC: Reservation percent change to [%d], [%d] "
+		    "pages, [%d] free\n", percent_target, cnt.v_page_count,
+		    cnt.v_free_count);
+		oldpercent = percent_target;
+	}
+	if (freepages != oldfreepages) {
+		printf("ZFS ARC: Low RAM page change to [%d], [%d] pages, "
+		    "[%d] free\n", freepages, cnt.v_page_count,
+		    cnt.v_free_count);
+		oldfreepages = freepages;
+	}
+#endif /* NEWRECLAIM_DEBUG */
+	/*
+	 * Now figure out how much free RAM we require to call the ARC cache
+	 * status "ok". Add the percentage specified of the total to the base
+	 * requirement.
+	 */
+
+	if (cnt.v_free_count < (freepages + ((cnt.v_page_count / 100) *
+	    percent_target))) {
+#ifdef NEWRECLAIM_DEBUG
+		if (xval != 1) {
+			printf("ZFS ARC: RECLAIM total %u, free %u, free pct "
+			    "(%u), reserved (%u), target pct (%u)\n",
+			    cnt.v_page_count, cnt.v_free_count,
+			    ((cnt.v_free_count * 100) / cnt.v_page_count),
+			    freepages, percent_target);
+			xval = 1;
+		}
+#endif /* NEWRECLAIM_DEBUG */
+		shrink_needed = 1;
+		return(1);
+	} else {
+#ifdef NEWRECLAIM_DEBUG
+		if (xval != 0) {
+			printf("ZFS ARC: NORMAL total %u, free %u, free pct "
+			    "(%u), reserved (%u), target pct (%u)\n",
+			    cnt.v_page_count, cnt.v_free_count,
+			    ((cnt.v_free_count * 100) / cnt.v_page_count),
+			    freepages, percent_target);
+			xval = 0;
+		}
+#endif /* NEWRECLAIM_DEBUG */
+		shrink_needed = 0;
+		return(0);
+	}
+#endif /* __FreeBSD__ */
+#endif /* NEWRECLAIM */
+
 	if (kmem_used() > (kmem_size() * 3) / 4)
 		return (1);
 #endif	/* sun */
 
-#else
 	if (spa_get_random(100) == 0)
 		return (1);
 #endif
