Date:      Thu, 30 Nov 2017 21:32:13 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Larry McVoy <lm@mcvoy.com>
Cc:        Warner Losh <imp@bsdimp.com>, Scott Long <scottl@netflix.com>, Kevin Bowling <kbowling@llnw.com>, Drew Gallatin <gallatin@netflix.com>, "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
Subject:   Re: small patch for pageout. Comments?
Message-ID:  <20171130193213.GF2272@kib.kiev.ua>
In-Reply-To: <20171130184923.GA30262@mcvoy.com>
References:  <20171130173424.GA811@mcvoy.com> <CANCZdfqL9ZsKTfFi+vsCTh3yaNjtwaYYY3fvivdbNybBnujawg@mail.gmail.com> <20171130184923.GA30262@mcvoy.com>

On Thu, Nov 30, 2017 at 10:49:23AM -0800, Larry McVoy wrote:
> On Thu, Nov 30, 2017 at 11:37:35AM -0700, Warner Losh wrote:
> > On Thu, Nov 30, 2017 at 10:34 AM, Larry McVoy <lm@mcvoy.com> wrote:
> > 
> > > In a recent numa meeting that Scott called, Jeff suggested a small
> > > patch to the pageout daemon (included below).
> > >
> > > It's rather dramatic the difference it makes for me.  If I arrange to
> > > thrash the crap out of memory, without this patch the kernel is so
> > > borked with all the processes in disk wait that I can't kill them,
> > > I can't reboot, my only option is to power off.
> > >
> > > With the patch there is still some borkage, the kernel is randomly
> > > killing processes because of out of mem, it should kill one of my
> > > processes that is causing the problem but it doesn't, it killed
> > > random stuff like dhclient, getty (logged me out), etc.
> > >
> > > But the system is responsive.
> > >
> > > What the patch does is say "if we have more than one core, don't sleep
> > > in pageout, just keep running until we freed enough mem".
> > >
> > > Comments?
> > >
> > 
> > Just to confirm why this patch works.
> > 
> > For UP systems, we have to pause here to allow work to complete, otherwise
> > we can't switch to their threads to complete the I/Os. For MP, however, we
> > can continue to schedule more work because that work can be completed on
> > other CPUs. This parallelism greatly increases the pageout rate, allowing
> > the system to keep up better when some ass-hat process (or processes) is
> > thrashing memory.
> 
> Yep.
> 
> > I'm pretty sure the UP case was also designed to not flood the lower layers
> > with work, starving other consumers. Does this result in undue flooding, and
> > would we get better results if we could schedule up to the right amount of
> > work rather than flooding in the MP case?
> 
> I dunno if there is a "right amount".  I could make it a little smarter by
> keeping track of how many pages we freed and sleeping if we freed none in a
> scan (which seems really unlikely).
> 
> All I know for sure is that without this you can lock up the system to
> the point it takes a power cycle to unwedge it.  With this the system
> is responsive.
> 
> Rather than worrying about the smartness, I'd argue this is an improvement,
> ship it, and then I can go look at how the system decides to kill processes
> (because that's currently busted).

The issue with the current OOM is that it is too conservative and hard
to tune automatically. It tries to measure the pagedaemon's progress and
cautiously steps back if some forward progress is made. It is apparently
hard to make it more aggressive without reintroducing the too-eager
behaviour of the previous OOM, where small machines with little or no
swap got OOM triggered without much justification.

I have a patch that complements the pagedaemon progress detection
with a simple but apparently effective detection of page-fault-time
allocation stalls. It worked for me on some relatively large machines
(~64-128G) with too many multi-gig processes and no swap. Such load
made the machines catatonic: OOM eventually triggered, but only after
a long reshuffling of pages from the active to the inactive queues, or
after hand-tuning vm.pageout_oom_seq (lowering it, with the risk of
allowing false positives). With the patch, they recovered in 30 seconds
(this is configurable).
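For reference, the two knobs the patch adds are boot-time tunables and runtime sysctls (CTLFLAG_RWTUN), and the 30-second figure above is attempts x wait = 3 x 10 with the defaults. Illustrative /etc/sysctl.conf settings (values are examples only) to make the stall detection trip sooner:

```
# Retry the failed page allocation at most 2 times (default 3)
vm.pfault_oom_attempts=2
# Wait 5 seconds for free pages per retry (default 10)
vm.pfault_oom_wait=5
```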

diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
index ece496407c2..560983c632e 100644
--- a/sys/vm/vm_fault.c
+++ b/sys/vm/vm_fault.c
@@ -134,6 +134,16 @@ static void vm_fault_dontneed(const struct faultstate *fs, vm_offset_t vaddr,
 static void vm_fault_prefault(const struct faultstate *fs, vm_offset_t addra,
 	    int backward, int forward);
 
+static int vm_pfault_oom_attempts = 3;
+SYSCTL_INT(_vm, OID_AUTO, pfault_oom_attempts, CTLFLAG_RWTUN,
+    &vm_pfault_oom_attempts, 0,
+    "Number of page allocation attempts in fault handler before OOM");
+
+static int vm_pfault_oom_wait = 10;
+SYSCTL_INT(_vm, OID_AUTO, pfault_oom_wait, CTLFLAG_RWTUN,
+    &vm_pfault_oom_wait, 0,
+    "Number of seconds to wait for free pages before fault retry");
+
 static inline void
 release_page(struct faultstate *fs)
 {
@@ -531,7 +541,7 @@ vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type,
 	vm_pindex_t retry_pindex;
 	vm_prot_t prot, retry_prot;
 	int ahead, alloc_req, behind, cluster_offset, error, era, faultcount;
-	int locked, nera, result, rv;
+	int locked, nera, oom, result, rv;
 	u_char behavior;
 	boolean_t wired;	/* Passed by reference. */
 	bool dead, hardfault, is_first_object_locked;
@@ -543,6 +553,8 @@ vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type,
 	hardfault = false;
 
 RetryFault:;
+	oom = 0;
+RetryFault_oom:;
 
 	/*
 	 * Find the backing store object and offset into it to begin the
@@ -787,7 +799,17 @@ RetryFault:;
 			}
 			if (fs.m == NULL) {
 				unlock_and_deallocate(&fs);
-				VM_WAITPFAULT;
+				vm_waitpfault(vm_pfault_oom_wait);
+				if (vm_pfault_oom_attempts < 0 ||
+				    oom < vm_pfault_oom_attempts) {
+					oom++;
+					goto RetryFault_oom;
+				}
+				if (bootverbose)
+					printf(
+	"proc %d (%s) failed to alloc page on fault, starting OOM\n",
+					    curproc->p_pid, curproc->p_comm);
+				vm_pageout_oom(VM_OOM_MEM);
 				goto RetryFault;
 			}
 		}
diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
index 4711af9d16d..67c8ddd4b92 100644
--- a/sys/vm/vm_page.c
+++ b/sys/vm/vm_page.c
@@ -2724,7 +2724,7 @@ vm_page_alloc_fail(vm_object_t object, int req)
  *	  this balance without careful testing first.
  */
 void
-vm_waitpfault(void)
+vm_waitpfault(int wp)
 {
 
 	mtx_lock(&vm_page_queue_free_mtx);
@@ -2734,7 +2734,7 @@ vm_waitpfault(void)
 	}
 	vm_pages_needed = true;
 	msleep(&vm_cnt.v_free_count, &vm_page_queue_free_mtx, PDROP | PUSER,
-	    "pfault", 0);
+	    "pfault", wp * hz);
 }
 
 struct vm_pagequeue *
diff --git a/sys/vm/vm_pageout.h b/sys/vm/vm_pageout.h
index 2cdb1492fab..bf09d7142d0 100644
--- a/sys/vm/vm_pageout.h
+++ b/sys/vm/vm_pageout.h
@@ -96,11 +96,10 @@ extern bool vm_pages_needed;
  *	Signal pageout-daemon and wait for it.
  */
 
-extern void pagedaemon_wakeup(void);
+void pagedaemon_wakeup(void);
 #define VM_WAIT vm_wait()
-#define VM_WAITPFAULT vm_waitpfault()
-extern void vm_wait(void);
-extern void vm_waitpfault(void);
+void vm_wait(void);
+void vm_waitpfault(int wp);
 
 #ifdef _KERNEL
 int vm_pageout_flush(vm_page_t *, int, int, int, int *, boolean_t *);



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20171130193213.GF2272>