Date:      Fri, 8 Dec 2017 17:34:29 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Larry McVoy <lm@mcvoy.com>
Cc:        freebsd-hackers@freebsd.org, pho@freebsd.org
Subject:   Re: OOM problem?
Message-ID:  <20171208153429.GJ2272@kib.kiev.ua>
In-Reply-To: <20171208150121.GH16028@mcvoy.com>
References:  <20171208011430.GA16016@mcvoy.com> <20171208101543.GC2272@kib.kiev.ua> <20171208150121.GH16028@mcvoy.com>

On Fri, Dec 08, 2017 at 07:01:21AM -0800, Larry McVoy wrote:
> On Fri, Dec 08, 2017 at 12:15:43PM +0200, Konstantin Belousov wrote:
> > > The OOM code kicks in and it behaves poorly.  It doesn't kill any of
> > > the big processes, those are all sleeping without PCATCH on so they are
> > > skipped.
> > What is the proof for this statement ?
> 
> I let the system run overnight trying to find more memory and it never
> killed any of the big processes.
> 
> I am able to log in and kill -9 would not kill them.
The wait channel of the stuck process and its kernel backtrace are the
first things to look at.

> 
> I tried a reboot and that hung.
> 
> It took a power cycle to get the machine back.
> 
> I've done this multiple times and always get the same result.
> 
> > A process waiting for a page in the fault handler must receive the page
> > to get out of the handler, even if the system is in OOM.  
> 
> I may be confusing you because this is not the normal page fault on a file
> code path (at least I think it is not).  The process is indeed faulting
> in pages but they are pages that were allocated via whatever malloc calls
> these days (in SunOS it mmapped /dev/zero, before that it was sbrk(2),
> I dunno what FreeBSD does, I couldn't find malloc in src/lib, I see that
> it's jemalloc but /usr/src/lib/libc/stdlib/jemalloc has no files?)
Backtrace would answer this question easily.

> 
> I think we are landing in vm_wait() but I can put some debugging in there
> and confirm that if that helps.
There is a special version of vm_wait(), vm_waitpfault(), added
initially to make it easy to distinguish page faults waiting for a page
from other unsatisfied page allocations by the name of the wait channel.

> 
> > > A) Don't allocate more mem than you have.  This problem exists simply
> > >    because the system allowed malloc to return more space than the
> > >    system had.  If the system kept track of all the mem it has (ram
> > >    plus swap) and when processes asked for an allocation that pushed it
> > >    over that limit, fail that allocation.  It's yet another globally
> > >    locked thing (though Jeff's NUMA stuff may make that better), you
> > >    have to keep track of allocations and frees (as in on exit(2) not
> > >    free(3)), that's why I think it's detail oriented to do it this way.
> > >    Probably the right way but has to be done carefully and someone has
> > >    to care enough to keep watching that this doesn't get broken.
> > This behaviour can be requested by disabling overcommit.   See tuning(7).
> > The code might rot from the time it was done, because this feature often
> > asked for, but rarely used for real.
> 
> Seems like that should be on by default, no?
Of course not. Both program authors and users are accustomed to
overcommit. I.e., programs freely allocate huge UVA (user virtual
address space) but limit actual (faulted in) memory usage, and do
fork(2) while owning huge virtual allocations. This is common behaviour
for language runtimes with gc, but other programs also do this.

> 
> > > B) Sleep with PCATCH, if that doesn't work, loop sleeping for a period, 
> > >    wake up and see if you are signaled.  I'm rusty enough that I don't
> > >    remember if msleep() with PCATCH will catch signals or not (I don't
> > >    remember a msleep(), that might be a BSD thing and not a SunOS thing).
> > >    But whatever, either it catches signals or you replace that sleep with
> > >    a loop that sleeps for a second or so, wakes up and looks to see if it's
> > >    been signaled and if so dies, else goes back to sleep waiting for pageout
> > >    and/or OOM to free some mem.
> > Not exactly this, but something close, was done by the patch I provided to
> > you already.
> 
> I need to double check but I'm pretty sure I'm running with your patch at
> least some version of it.  Doesn't help.  Would it help if I packaged up
> a test case?  Right now I'm using something like this:
> 
>     cd LMbench2+/src
>     for i in 1 2 3 4 5 6 7 8 9 0
>     do	../bin/*/lat_mem_rd 25g 4096 &
>     done
> 
> but I could make something simpler.  I'm willing to keep pushing on this
> if that's helpful but if you'd prefer to debug it yourself I can package
> up a test case.  Should probably do that anyway.
Yes, a reproduction case and the machine parameters needed to reproduce
it would allow me to see the system state and do additional experiments.
Please send the scripts to me and Peter Holm (pho, whom I have Cc:ed).

On Fri, Dec 08, 2017 at 07:03:33AM -0800, Larry McVoy wrote:
> On Fri, Dec 08, 2017 at 12:16:58PM +0200, Konstantin Belousov wrote:
> > On Fri, Dec 08, 2017 at 08:18:21AM +0000, Johannes Lundberg wrote:
> > > Regarding potential oom overhaul. Personally I like the idea of an oom
> > > signal. The idea comes from iOS where applications get a callback when
> > > system memory is low and they're given a chance to free unused
> > > resources or resources that can easily be recreated, before getting
> > > killed completely.
> > The OOM signal is a topic which was discussed to death many times before.
> > The summary is that it does not work, because you need to provide pages
> > for userspace to be able to handle the signal.
> 
> Just for the record, what I was proposing wasn't as ambitious as what 
> Johannes suggested (while I like his idea it's "weird" and it's unlikely
> that Firefox et al would use it unless we got Linux to have the same 
> thing).
> 
> I was just suggesting that processes sleeping in vm_wait() wake up once
> in a while to respect signals, as in, if I kill -9 that process I want it
> to go away.  Currently, it doesn't.
This cannot work.  Currently vm_fault() must either call pmap_enter() to
install a pte into the page table, pointing to the proper page, or
return an error.  An error must be returned only for an actual cause,
i.e. we should not return a code (similar to EFAULT, but it is a Mach
error, not an errno) when we have some transient problem unrelated to
the process address map.

The reason is that vm_fault() handles not only page faults from userspace,
but also kernel accesses.  The caller of vm_fault() might not be the
trap() routine which handles faults, but some other kernel code like
uiomove(9) called from a subsystem.  In other words, a signal might be
impossible to deliver (e.g. by terminating the process) in the context
which called vm_fault().

So even if we detect a signal in vm_waitpfault(), we still must allocate
the page. And if we must allocate it, there is no point in checking for
signals.  We already speed up the allocation when it is noted that the
process was killed.
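
The contract described above can be shown with a userland toy model.
This is only a sketch: struct toy_vm, toy_wait() and toy_fault() are
invented names, nothing here is kernel code, and the speed-up for
killed processes is not modelled. It shows only that a fault either
ends with a page or fails for a real address error, and that a pending
kill is not an exit condition from the wait.

```c
#include <stdbool.h>

/* Hypothetical toy state; none of these names exist in the kernel. */
struct toy_vm {
	int  free_pages;     /* pages available right now */
	int  reclaim_after;  /* wait rounds until one page is freed */
	bool killed;         /* SIGKILL pending for the faulting process */
};

enum toy_result { TOY_SUCCESS, TOY_BAD_ADDRESS };

/* Stand-in for one sleep in vm_waitpfault(). */
static void
toy_wait(struct toy_vm *vm)
{
	if (vm->reclaim_after > 0 && --vm->reclaim_after == 0)
		vm->free_pages++;
}

/*
 * Toy fault handler.  The caller must set reclaim_after > 0 when
 * free_pages == 0, otherwise this model loops forever.
 */
enum toy_result
toy_fault(struct toy_vm *vm, bool address_valid, int *rounds_waited)
{
	*rounds_waited = 0;

	/* An error is returned only for an actual cause. */
	if (!address_valid)
		return (TOY_BAD_ADDRESS);

	/* vm->killed is deliberately not checked: the page is still needed. */
	while (vm->free_pages == 0) {
		toy_wait(vm);
		(*rounds_waited)++;
	}

	vm->free_pages--;	/* page allocated and "mapped" */
	return (TOY_SUCCESS);
}
```

In this model a killed process with no free pages still waits the full
three rounds when reclaim_after is 3, because the caller of the fault
handler may be kernel code that cannot simply be abandoned.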


