Date:      Fri, 8 Dec 2017 12:15:43 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Larry McVoy <lm@mcvoy.com>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: OOM problem?
Message-ID:  <20171208101543.GC2272@kib.kiev.ua>
In-Reply-To: <20171208011430.GA16016@mcvoy.com>
References:  <20171208011430.GA16016@mcvoy.com>

On Thu, Dec 07, 2017 at 05:14:30PM -0800, Larry McVoy wrote:
> Hi hackers,
> 
> I've been playing around on a box that Netflix loaned me, I'm thinking
> about novel ways to deal with NUMA issues.
> 
> I ran into a problem with the kernel and wanted to check in and see if
> anyone cares (I've got a couple of different ways it could be fixed,
> but if no one cares I'll drop it).  It's an ugly problem in that when
> it happens your only recourse is to power cycle the machine; you
> can't kill off the processes causing the problem.
> 
> I was trying to create benchmarks that would show what the system could do
> if you locked things down to different NUMA domains (BTW, the NUMA stuff
> is a complete red herring, the problem I'm about to describe happens if
> NUMA support isn't enabled).
> 
> The machine is running 12.0-CURRENT (FreeBSD 12.0-CURRENT #13 ce7b9882181)
> with a few diffs I did for debugging and a tweak to the pageout daemon
> suggested by Jeff.  It is a machine with 256GB of RAM, configured with no
> swap space (that detail is important).
> 
> I created a set of 10 processes that malloced 25GB each and read it
> repeatedly.  That was enough memory pressure to use up all of free mem.
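A minimal sketch of the kind of pressure generator described above, for
reference; the size and the page-touch loop are my guesses, not the actual
benchmark code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a large anonymous region and keep re-reading it so the
 * pages stay referenced, which creates steady memory pressure.
 * SIZE stands in for the ~25GB used per process in the test.
 */
#define SIZE    (25UL * 1024 * 1024 * 1024)
#define PAGE    4096

int
main(void)
{
        unsigned char *p;
        volatile unsigned long sum = 0;
        size_t i;

        p = malloc(SIZE);
        if (p == NULL) {
                perror("malloc");
                return (1);
        }
        memset(p, 1, SIZE);             /* fault every page in */
        for (;;) {
                for (i = 0; i < SIZE; i += PAGE)
                        sum += p[i];    /* touch one byte per page */
        }
        /* NOTREACHED */
}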
> 
> Here is the problem.  All of these "misbehaved" (by using lots of RAM)
> processes go to sleep, I believe in vm_wait().  They are all waiting
> for more RAM, so the pageout daemon is kicked, but to no avail: all the
> RAM is tied up in the processes that want more RAM.  The pageout daemon
> kicks out what it can, but it quickly gets to the point where it scans
> everything and finds nothing (I know this because I added debugging to
> show that's what it is doing).
> 
> The OOM code kicks in and it behaves poorly.  It doesn't kill any of
> the big processes; those are all sleeping without PCATCH set, so they
> are skipped.
What is the proof for this statement?

A process waiting for a page in the fault handler must receive the page
to get out of the handler, even if the system is in OOM.  The OOM code and
vm_fault() coordinate to make the fault handler notice that the process
was killed, in which case the page is requested with the highest priority
(i.e. the page allocator is allowed to reach deep into the system reserve)
to let the process reach the AST as fast as possible.  At the
kernel/user boundary it is then killed.

See the P_KILLED() checks in vm_fault().
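Schematically it looks like this (a simplified sketch from memory, not the
exact code; fs.object/fs.pindex stand for the faulting object and page
index):

        /*
         * Simplified: a process already marked for death by the OOM
         * killer gets its page from the system reserve, so that it can
         * reach the kernel/user boundary and be reaped quickly instead
         * of waiting behind everybody else for free pages.
         */
        alloc_req = P_KILLED(curproc) ? VM_ALLOC_SYSTEM : VM_ALLOC_NORMAL;
        fs.m = vm_page_alloc(fs.object, fs.pindex, alloc_req);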

I do not remember the OOM scan ever skipping uninterruptible sleeps, and
code inspection confirms my memory: it does not skip them.

> The OOM code starts killing off anything it can find; it was
> killing getty, ssh, bash, dhclient.  One buglet is that, in my opinion,
> it finds stuff to kill that it probably shouldn't.  Anything that init
> will respawn is fine; anything that would not be respawned should be
> run as not killable.  Seems like an audit of those processes might be
> in order.
> 
> I know that you'll ask: why no swap?  Just add swap and the problem
> goes away.  Does it?  I don't think so; that's just kicking the can
> down the road.  If we add 256GB of swap, now we have a 512GB bag to
> fill; fill that and I think we're right back where we started.
> 
> What are the ideas for fixing it?  I've got two.  I think the first
> one is a bit hard to get right and I'm not sure if the second one will
> work (sorry, it's been a long time since I was a kernel hack, like SunOS
> 4.x long time).
> 
> A) Don't allocate more memory than you have.  This problem exists simply
>    because the system allowed malloc to return more space than the
>    system had.  If the system kept track of all the memory it has (RAM
>    plus swap), then when a process asked for an allocation that pushed it
>    over that limit, the allocation could fail.  It's yet another globally
>    locked thing (though Jeff's NUMA stuff may make that better); you
>    have to keep track of allocations and frees (as in on exit(2), not
>    free(3)), which is why I think this approach is detail oriented.
>    Probably the right way, but it has to be done carefully and someone has
>    to care enough to keep watching that this doesn't get broken.
This behaviour can be requested by disabling overcommit; see tuning(7).
The code might have rotted since the time it was done, because this feature
is often asked for, but rarely used for real.
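A quick way to see the difference from userland (a sketch only; the meaning
of the individual vm.overcommit bits is described in tuning(7), I do not
restate it from memory here):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Print the current overcommit policy and try one huge allocation.
 * With overcommit disabled, the allocation itself is refused instead
 * of being granted and then pushing the machine into OOM when touched.
 */
int
main(void)
{
        int oc;
        size_t len = sizeof(oc);
        void *p;

        if (sysctlbyname("vm.overcommit", &oc, &len, NULL, 0) == -1)
                perror("sysctlbyname");
        else
                printf("vm.overcommit = %d\n", oc);

        p = malloc(300UL * 1024 * 1024 * 1024); /* more than RAM+swap here */
        if (p == NULL)
                printf("allocation refused up front\n");
        else
                printf("allocation granted; the pressure comes later, "
                    "when the pages are touched\n");
        free(p);
        return (0);
}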

> 
> B) Sleep with PCATCH; if that doesn't work, loop: sleep for a period,
>    wake up, and see if you have been signaled.  I'm rusty enough that I
>    don't remember whether msleep() with PCATCH will catch signals or not
>    (I don't remember an msleep(); that might be a BSD thing and not a
>    SunOS thing).  But whatever: either it catches signals, or you replace
>    that sleep with a loop that sleeps for a second or so, wakes up, looks
>    to see if it's been signaled and if so dies, and otherwise goes back to
>    sleep waiting for pageout and/or OOM to free some memory.
Not exactly this, but something close, was done by the patch I provided to
you already.
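For illustration only, the shape of such a loop would be roughly the
following; this is not the patch, and the wait channel and lock names are
placeholders:

        /*
         * Instead of sleeping uninterruptibly until the pageout daemon
         * frees something, wake up periodically and bail out if a signal
         * is pending or the process has been killed.
         */
        while (vm_page_count_severe()) {
                error = msleep(&vm_pages_needed, &free_lock, PVM | PCATCH,
                    "vmwait", hz);
                if (error == EINTR || error == ERESTART || P_KILLED(curproc))
                        return (error != 0 ? error : EINTR);
        }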

> 
> I kinda like B better because it seems harder for that approach to bit rot.
> I'm wondering if anyone cares about this problem.  If no, fine.  If yes,
> I can cons up a test case and hand that off to someone who wants to fix
> the problem.  If no one wants to fix it, I'll give it a try, but I'd like
> feedback on the above approaches; I'm not interested in going down a
> rathole for no good reason.
> 
> Thanks,
> 
> --lm
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"


